bethgelab / game-of-noise

Trained model weights, training and evaluation code from the paper "A simple way to make neural networks robust against diverse image corruptions"
https://arxiv.org/abs/2001.06057
MIT License

Weights and biases of noise generator turn into NaN in some epoch #4

Closed · ruitian12 closed this issue 3 years ago

ruitian12 commented 3 years ago

Hi! Thanks for your impressive work. I've been running the code with a slightly adjusted configuration. The results so far are in line with the paper, but in rare cases training fails because the weights and biases of the noise generator turn into NaN. I want to make sure whether my own training settings triggered the NaN. Have you ever encountered the same failure in your runs?

EvgeniaAR commented 3 years ago

Thanks for reaching out :).

In fact, YES, we have encountered NaNs in the weights, but only after we tried combining several generators. The problem is the following: PyTorch ImageNet training uses 224x224 crops, and some of them are completely white, i.e. have the maximum value in all three color channels. In other words, the cropped image contains only ones and not a single entry that is not a 1. If the generator then happens to produce a distribution purely in the positive range, we try to add even more positive values to the image and fix the noise to a certain perturbation size here: https://github.com/bethgelab/game-of-noise/blob/c5e3bc92bfc1cacf7b86e807111e779a8e2b8612/utils.py#L81. So "eta" becomes 0 there, and taking the square root of that zero produces the NaN here: https://github.com/bethgelab/game-of-noise/blob/c5e3bc92bfc1cacf7b86e807111e779a8e2b8612/evaluation_utils.py#L38. This then results in the NaNs in the weights.

The fix that I implemented in my current code base (I should probably push it to the official one) is the following: crops of a single solid color (all white or all black) don't make sense in any case. What would be the label of a completely white crop? So we remove them on the fly. This is generally a "problem" in ImageNet training that we usually don't notice: the image itself is not white, only the 224x224 crop is.

The problem occurs very rarely because the noise generator rarely produces a distribution that lies entirely in the positive or entirely in the negative range. Once we tried training several generators at once, it started occurring frequently, because the different distributions were being pushed apart and we ended up with one generator producing only positive values and another producing only negative values. This is when we noticed the issue in the first place.

The fix (which removed the issue):


 targets = targets[images.flatten(1).var(dim=1) > 0]
 images = images[images.flatten(1).var(dim=1) > 0]
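As a rough sketch of where this goes (not the exact code in my repo; the drop_constant_crops helper name and the assumption that images is an NCHW batch and targets a 1-D label tensor are just for illustration), the filter can be wrapped like this:

import torch

def drop_constant_crops(images: torch.Tensor, targets: torch.Tensor):
    # keep only samples whose crop has non-zero pixel variance,
    # i.e. discard completely white or completely black crops
    mask = images.flatten(1).var(dim=1) > 0
    return images[mask], targets[mask]

# illustrative usage inside the training loop, right after loading a batch:
# images, targets = drop_constant_crops(images, targets)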
ruitian12 commented 3 years ago

Thanks for your fast & detailed reply!

I tried to construct a NaN in fix_perturbation_size based on the case you mentioned above (the input x0 has ones everywhere, i.e. is completely white, and delta_img is entirely positive). However, although I get zero for eta2, eta does not turn into NaN after torch.sqrt(eta2). My procedure is shown below.

[screenshot: reproduction attempt in fix_perturbation_size with an all-ones x0 and a purely positive delta_img]
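In essence, my check was the following (a standalone reconstruction, not the exact code inside fix_perturbation_size; the variable names follow the discussion above, and the way eta2 is computed here is only my assumption):

import torch

x0 = torch.ones(1, 3, 224, 224)      # completely white crop
delta_img = torch.rand_like(x0)      # purely positive perturbation
# clipping x0 + delta_img back into [0, 1] removes the entire perturbation,
# so its squared norm ("eta2") is exactly zero
eta2 = ((x0 + delta_img).clamp(0, 1) - x0).flatten(1).pow(2).sum(dim=1)
eta = torch.sqrt(eta2)
print(eta2.item(), torch.isnan(eta).any().item())  # prints: 0.0 False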

That said, fix_perturbation_size may have little to do with the NaN in the generator weights. I printed the intermediate layers of the noise generator during training and noticed that the output of the first conv layer already contained NaN.

https://github.com/bethgelab/game-of-noise/blob/c5e3bc92bfc1cacf7b86e807111e779a8e2b8612/Networks/generators.py#L42

In fact, the NaN appeared in the weights of the convolutional layer first, which led to NaN in delta_img; the output of fix_perturbation_size and the loss then turned into NaN as a consequence. https://github.com/bethgelab/game-of-noise/blob/c5e3bc92bfc1cacf7b86e807111e779a8e2b8612/utils.py#L80

Thus I am wondering whether any other factors might contribute to the NaN in the weights.

Thanks! :)

EvgeniaAR commented 3 years ago

Ok, I see. This was half a year ago, and I might be misremembering some details. I definitely remember why it happened, but I might be mixing up the exact line where the NaN occurred. The two lines I suggested above definitely fixed the NaN problem for me.

Considering other possible causes of NaNs: could you please use the Python debugger to see what the inputs and outputs look like when the conv layer turns into NaN? Try "import pdb" and then check for NaNs in the conv weights like this:

for name, param in noise_generator.named_parameters():
    if torch.sum(torch.isnan(param.data)) > 0:
        print('NaNs detected in', name)
        pdb.set_trace()

Then you could step through the code and print the current variables. This is how I found out that my problem was the one I described above. I haven't encountered other types of NaNs so far, but this is what I would do to track down the cause. In fact, you actually want to check the gradients for NaNs, since they appear there first:

# run this right after loss.backward(), while the gradients are populated
if torch.sum(torch.isnan(noise_generator.conv_2d_1.weight.grad)) > 0:
    pdb.set_trace()
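As a side note (this is plain PyTorch, nothing specific to our code): autograd's anomaly detection can also point to the backward operation that first produces a NaN gradient, at the cost of slower training:

import torch

# enable once before the training loop; backward() will then raise an error
# at the operation whose gradient first becomes NaN
torch.autograd.set_detect_anomaly(True)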

What I then did was save the current input batch and the currently sampled noise distribution, and then do the noise addition manually. The NaN occurred just like it did during training, and I could fix it from there. Sorry I can't be of more help; please let me know how it goes.
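Roughly, that workflow could look like the sketch below (the file name, the dump_batch/replay helpers, and the assumption that the perturbed image is simply images + delta_img clamped to [0, 1] are illustrative, not taken from the repo):

import torch

def dump_batch(images, delta_img, path='nan_batch.pt'):
    # call this from the training loop once a NaN check fires
    torch.save({'images': images.detach().cpu(),
                'delta_img': delta_img.detach().cpu()}, path)

def replay(path='nan_batch.pt'):
    # reload the offending batch offline and redo the noise addition step by step
    batch = torch.load(path)
    images, delta_img = batch['images'], batch['delta_img']
    return (images + delta_img).clamp(0, 1)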

ruitian12 commented 3 years ago

Since the NaN occurs only rarely, I will try your solution and apply the fix the next time it appears. Thanks for your great work & useful help all the way.