MPoL-dev / MPoL

A flexible Python platform for Regularized Maximum Likelihood imaging
https://mpol-dev.github.io/MPoL/
MIT License

"Image cube contained negative pixel values" while using SimpleNet #167

Closed · briannazawadzki closed this issue 1 year ago

briannazawadzki commented 1 year ago

I am initializing a model using the precomposed SimpleNet. Sometimes, when training a model using entropy regularization, I get the following error:

Traceback (most recent call last):
  File "train_and_image.py", line 68, in <module>
    loss_val, loss_track = train(model, dataset, optimizer, config, device=device, writer=writer)
  File "/common_functions.py", line 45, in train
    + config["entropy"] * losses.entropy(sky_cube, config["prior_intensity"])
  File "/mpol_venv/lib/python3.8/site-packages/mpol/losses.py", line 180, in entropy
    assert (cube >= 0.0).all(), "image cube contained negative pixel values"
AssertionError: image cube contained negative pixel values

This did NOT happen when I set lambda_ent = 2e-1, but DID happen when I made the small change to lambda_ent = 3e-1 (or any value higher than that).
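For orientation, here is a hypothetical sketch of the configuration dictionary implied by the traceback and the values quoted in this thread; the mapping of lambda_ent to the "entropy" key is an assumption, and any other keys would be purely illustrative:

```python
# Hypothetical training config. Only the "entropy" and "prior_intensity" keys
# appear in the traceback above; the values shown are the ones quoted in this thread.
config = {
    "entropy": 3e-1,          # lambda_ent: 2e-1 did not trigger the assertion, 3e-1 did
    "prior_intensity": 1e-7,  # entropy prior intensity (quoted later in the thread)
}
```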

I do not expect any negative-valued pixels in the image cube because the model is the precomposed SimpleNet module, initialized as model = precomposed.SimpleNet(coords=coords, nchan=dataset.nchan).

SimpleNet is a simple network that chains together a BaseCube, ImageCube, and FourierCube. Softplus, the default pixel mapping in the BaseCube, maps any real input to a strictly positive value. Nevertheless, negative-valued pixels seem to be appearing in the image cube: when losses.entropy(sky_cube, config["prior_intensity"]) is called, we get AssertionError: image cube contained negative pixel values.
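As a quick standalone check (not part of the original report), the positivity of the softplus mapping itself is easy to verify with torch directly; this uses torch.nn.functional.softplus rather than the BaseCube, so it only demonstrates the mapping, not MPoL's implementation:

```python
import torch

# softplus(x) = log(1 + exp(x)) maps any real input to a strictly positive
# value, so the pixel mapping alone should never produce negative pixels.
base = torch.linspace(-400.0, 10.0, 1000, dtype=torch.float64)
image = torch.nn.functional.softplus(base)

print(image.min().item())        # tiny but positive for strongly negative inputs
print(bool((image >= 0).all()))  # True
```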

iancze commented 1 year ago

This is confusing! For debugging purposes, could you wrap this loop in a try/except block and, when the error occurs, save the sky cube to a .npy file and plot it up for analysis? Gradient images would be interesting too, and the value of the base cube might help generate ideas about what's going wrong.

I'm sure there are smarter ways to debug using pdb, too.
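A rough sketch of that suggestion, reusing the loop variables that appear later in this thread (model, dataset, optimizer, config, model.icube.sky_cube); the iteration-count key, the nll_gridded data term, and the file names are assumptions, not MPoL specifics:

```python
import numpy as np
from mpol import losses

try:
    for i in range(config["iterations"]):  # iteration-count key is illustrative
        optimizer.zero_grad()
        vis = model.forward()
        sky_cube = model.icube.sky_cube

        loss = (
            losses.nll_gridded(vis, dataset)  # data term (name assumed)
            + config["entropy"] * losses.entropy(sky_cube, config["prior_intensity"])
        )
        loss.backward()
        optimizer.step()
except AssertionError:
    # Dump the offending sky cube for offline plotting and analysis.
    np.save("failed_sky_cube.npy", sky_cube.detach().cpu().numpy())
    # The base cube and gradient images could be saved the same way
    # (the attribute names depend on how SimpleNet exposes them).
    raise
```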

briannazawadzki commented 1 year ago

I'm learning that the issue is not negative pixel values but rather NaNs that appear after a certain number of iterations. In this case, if I run the loop for 3754 iterations the process completes just fine, but we do not converge on a minimum loss value. Instead, the loss just plummets lower and lower in a roughly linear fashion, going to negative values:

[Screenshot, 2023-02-25: loss vs. iteration, decreasing roughly linearly into negative values]

Not only is that strange behavior for the loss, but the pixel values of the sky cube also plummet to extremely small values. In this case, the minimum pixel value is 3.52278e-161 and the maximum is 3.8696112e-159. That doesn't make much sense, especially given that the entropy prior intensity is set to 1e-7.

This is the skycube tensor for that image:

failing on iteration 3755
sky_cube
tensor([[[4.2437e-161, 5.6202e-161, 5.5336e-161,  ..., 7.3562e-161,
          4.9498e-161, 4.4924e-161],
         [5.0311e-161, 6.9431e-161, 7.1965e-161,  ..., 7.2579e-161,
          5.8591e-161, 5.1970e-161],
         [4.4973e-161, 6.2928e-161, 6.7296e-161,  ..., 6.2333e-161,
          5.3283e-161, 4.5684e-161],
         ...,
         [6.5779e-161, 6.5825e-161, 5.3526e-161,  ..., 6.2036e-161,
          7.1018e-161, 6.2487e-161],
         [5.9396e-161, 6.7429e-161, 5.9226e-161,  ..., 7.2868e-161,
          7.1517e-161, 5.6393e-161],
         [4.5438e-161, 5.3383e-161, 4.7825e-161,  ..., 6.0216e-161,
          5.0424e-161, 3.9240e-161]]], device='cuda:0', dtype=torch.float64,
       grad_fn=<FlipBackward0>)

That's before calling

        optimizer.zero_grad()
        vis = model.forward()
        sky_cube = model.icube.sky_cube

so that is the absolute last time the sky cube has non-NaN values. Then, after calling the above, we get:

failing on iteration 3755
sky_cube
tensor([[[nan, nan, nan,  ..., nan, nan, nan],
         [nan, nan, nan,  ..., nan, nan, nan],
         [nan, nan, nan,  ..., nan, nan, nan],
         ...,
         [nan, nan, nan,  ..., nan, nan, nan],
         [nan, nan, nan,  ..., nan, nan, nan],
         [nan, nan, nan,  ..., nan, nan, nan]]], device='cuda:0',
       dtype=torch.float64, grad_fn=<FlipBackward0>)

I'm not sure what to make of this.

iancze commented 1 year ago

Thanks for the updates here. I have a few thoughts, questions, and potential things to try; they are quoted individually (and answered) in the reply below.

briannazawadzki commented 1 year ago

Alright, I've done a bit more digging to see if I can get to the bottom of this.

> To be clear about what's going on before you get to iteration 3754: the loss keeps decreasing and the sky_cube values keep getting smaller, but the images themselves don't have NaNs anywhere, and those only appear at iteration 3755?

Yup. I've generated a pretty comprehensive output file which verifies this.

> The fact that you have such a small number (1e-161) in the ImageCube makes me wonder what values the BaseCube has at this iteration, and whether this is close to the limit of what can be stored in a double-precision float. I wouldn't think it would be an issue if we needed to store very small values (they could always go to 0), but maybe I'm missing something. Just so I understand: the image you posted on the right has min/max values of 3.52278e-161 / 3.8696112e-159? So the image still shows something like disk emission, just with its dynamic range squashed tremendously?

Correct. I also considered that the sudden switch to NaNs might be caused by some computational-precision issue. But even if that's what's causing the NaNs, we still don't know what's squashing these min/max values. At the iteration before failing, the BaseCube values range from -370.32 to -364.90. At the fail point, both the BaseCube and the ImageCube are all NaNs.
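Those numbers are at least mutually consistent: for very negative inputs, softplus(x) = log(1 + exp(x)) ≈ exp(x), and exp(-370) is of order 1e-161. A quick check (not from the thread), assuming the default softplus mapping:

```python
import torch

# BaseCube values near -370 map through softplus to image values of order
# exp(-370) ~ 1e-161, the same scale as the ImageCube min/max quoted above.
base_vals = torch.tensor([-370.32, -364.90], dtype=torch.float64)
print(torch.exp(base_vals))                     # ~[1.5e-161, 3.4e-159]
print(torch.nn.functional.softplus(base_vals))  # essentially the same values
```

So the squashed ImageCube appears to be just the softplus image of a BaseCube that has drifted to around -370; the open question is why the optimizer keeps pushing it there.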

> Have you switched/updated your source version of MPoL to include the commits from last week on the DataAverager and DirtyImager? I'm just wondering if there might be an issue stemming from the change in the likelihood relative to the other loss functions (I wouldn't think this would matter, but you never know...).

I updated to see if this would fix the issue, but unfortunately it didn't.

> Regarding your comment about the entropy prior intensity: I agree that it seems bizarre that an image with maximum intensities of 1e-159 would yield a lower loss-function value than, say, one with uniform values of 1e-7. Outside of any optimization loop, can you verify what the nll loss, the entropy loss, and their sum are for the last image without NaNs and for a blank one of 1e-7?

For the last image without nans, the nll loss is 1.5563 and the entropy loss is -105.7719. The sum (i.e. total loss) is -104.2156. I can work on getting those values for a blank cube of 1e-7.
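A minimal sketch of that standalone check for the entropy term, using the losses.entropy(cube, prior_intensity) signature visible in the traceback above; the cube shape is illustrative, and the nll term is omitted because it requires the model visibilities:

```python
import torch
from mpol import losses

prior_intensity = 1e-7
nchan, npix = 1, 512  # illustrative; use the real cube dimensions

# Entropy term alone, outside any optimization loop, for two uniform test cubes.
blank_cube = torch.full((nchan, npix, npix), 1e-7, dtype=torch.float64)
tiny_cube = torch.full((nchan, npix, npix), 4e-161, dtype=torch.float64)

print(losses.entropy(blank_cube, prior_intensity))  # cube equal to the prior intensity
print(losses.entropy(tiny_cube, prior_intensity))   # cube at the squashed scale seen above
```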

> Is there anything weird about the value (its type: int, string, etc.) that you are using for the entropy prior intensity? To be safe, you could specify it as a torch tensor, since that's how it would be used inside the entropy function.

I've been specifying it in the same way I always have, but I've also tried changing it up, and that hasn't had any effect.

> You are just running with the entropy regularizer, right? No other regularizers?

Correct.

iancze commented 1 year ago

Closed by #179. I think the summary is that @iancze had originally implemented the total-flux prefactor normalization incorrectly. In the EHT IV paper (which we claimed to follow!), this prefactor is a fixed estimate of the total flux. In our original implementation, it was instead a normalization recalculated from the total flux of the current image. Apparently this led to instabilities when optimizing, including exploding gradients. PR #179 updated the entropy loss function to match EHT IV (with a fixed, constant prefactor), and so far this issue seems to be resolved.
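To make the distinction concrete, here is an illustrative (not verbatim) comparison of the two normalizations; the exact functional form of mpol.losses.entropy may differ, but the key point is whether the total-flux prefactor is held fixed or recomputed from the current image:

```python
import torch

def entropy_loss_fixed(cube, prior_intensity, total_flux):
    """EHT-IV-style, as described above: the prefactor is a fixed,
    user-supplied estimate of the total flux."""
    return (1.0 / total_flux) * torch.sum(cube * torch.log(cube / prior_intensity))

def entropy_loss_recomputed(cube, prior_intensity):
    """The original (problematic) variant: the prefactor is recomputed from the
    current image, so shrinking the whole image also rescales the normalization."""
    return (1.0 / torch.sum(cube)) * torch.sum(cube * torch.log(cube / prior_intensity))
```

Under this illustrative form, a uniform cube of intensity I gives a recomputed-prefactor loss of log(I / prior_intensity), which decreases without bound as I shrinks; that is consistent with the runaway negative loss and the ~1e-161 pixel values reported earlier, and holding the prefactor fixed removes that degeneracy.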