andreas128 / SRFlow

Official SRFlow training code: Super-Resolution using Normalizing Flow in PyTorch

Meet "NaN" Problem. #2

Open · Alan-xw opened this issue 3 years ago

Alan-xw commented 3 years ago

Hey, I implemented SRFlow based on the paper and the Glow source code in PyTorch, but I run into a "NaN" problem during testing. I feed the test LR image and the generated z_samples into the SRFlow reverse net, but it produces "NaN" values. Have you ever met this issue?

Looking forward to your reply!

eridgd commented 3 years ago

@Alan-xw could you please post a link to your code?

andreas128 commented 3 years ago

Where do the NaNs occur in the network? As the network architecture is bijective, even the untrained network should be invertible.

Try to predict the latent vectors (z) for an HR-LR pair and then feed the z-LR pair to the reverse network. If this does not give you the HR image, reduce the network until this is the case.
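
For example, a minimal version of that check could look like this (the `model.forward` / `model.reverse` interface below is only a placeholder; adapt the call signatures to your implementation):

```python
import torch

@torch.no_grad()
def check_invertibility(model, hr, lr, atol=1e-4):
    """Encode an HR-LR pair to z, decode it back, and report the error.

    `model.forward` / `model.reverse` are assumed interfaces, not the
    actual SRFlow API.
    """
    model.eval()
    z, _ = model.forward(hr, lr)      # HR -> latent
    hr_rec = model.reverse(z, lr)     # latent -> HR
    err = (hr - hr_rec).abs().max().item()
    print(f"max abs reconstruction error: {err:.3e}")
    return err < atol
```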

Alan-xw commented 3 years ago

Hey, I use a generated z and an LR test image to produce its SR version for evaluation during training. The NaNs occur in the SR results, causing black rectangular regions in the images.

Another issue:

I use the exp function in the AffineCouplingLayer and AffineInjectorLayer as described in the SRFlow paper. But when I start training the network, I get errors like "RuntimeError: 'DivBackward0' returned nan values in its 1th output." I checked this error and found it was caused by the exp function. If you have met this issue, how did you solve it?
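
For reference, the affine step I'm describing has roughly this structure (the names here are mine, not the paper's); the reverse-direction division is where the 'DivBackward0' gradient NaNs seem to point, and a small eps there is one possible guard:

```python
import torch

def affine_step(h, log_scale, shift, reverse=False, eps=1e-6):
    """Illustrative affine coupling/injector transform.

    In the reverse direction the division by the scale is where NaN
    gradients can originate if the scale collapses towards zero; eps is
    one possible (crude) guard.
    """
    scale = torch.exp(log_scale)
    if not reverse:
        out = scale * h + shift
        logdet = log_scale.flatten(1).sum(dim=1)    # per-sample sum over elements
    else:
        out = (h - shift) / (scale + eps)
        logdet = -log_scale.flatten(1).sum(dim=1)
    return out, logdet
```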

andreas128 commented 3 years ago

You could try the following:

1. Decode the z-LR pair with z sampled at different standard deviations and check when the NaNs appear.
2. Add a small number to the divisions to protect against divide-by-zero.
3. Print the mean absolute value after each layer in debug mode to locate where the values blow up.

Does that help? You can also drag and drop images here.

Alan-xw commented 3 years ago

(screenshot of the per-layer debug output)

As you can see in the picture, the NaNs most often occur in f_s_exp and f_b, and also in h (since h depends on f_s_exp and f_b), at the deeper levels of SRFlow, e.g. level 2, 15th ConditionalFlowStep. I printed the mean absolute value after each layer in debug mode, as you suggested, and found that once there is a large value (possibly > 1) in the encoding network's output feature u or the intermediate feature h, the NaNs appear after repeated multiplications and additions.
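
In case it is useful to anyone else, the per-layer check can be done with plain forward hooks; a minimal sketch (no SRFlow-specific API assumed):

```python
import torch
import torch.nn as nn

def attach_stat_hooks(model: nn.Module):
    """Print the mean absolute value of every module's output tensor."""
    handles = []

    def make_hook(name):
        def hook(_module, _inputs, output):
            if torch.is_tensor(output):
                val = output.abs().mean().item()
                flag = "  <-- non-finite!" if not torch.isfinite(output).all() else ""
                print(f"{name:60s} mean|out| = {val:.4e}{flag}")
        return hook

    for name, module in model.named_modules():
        handles.append(module.register_forward_hook(make_hook(name)))
    return handles  # call h.remove() on each handle when done
```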

The black rectangular regions look like the patterns shown below. (screenshot of SR outputs with black rectangular artifacts)

andreas128 commented 3 years ago

Did you try those experiments from the previous answer?

Adding noise might also help. Did you try adding noise as described in the Appendix?

(screenshot of the relevant Appendix paragraph on adding noise)

Looking forward to seeing your results!

Alan-xw commented 3 years ago

Thanks for your advice! I have tried those experiments:

1. I decoded the z-LR pair with z of standard deviation [0, 0.1, 0.2, ..., 1.0, 1.1] (see the sketch at the end of this comment). When the standard deviation is below 0.5, the black regions disappear; however, the image is blurry and lacks detail.
2. I also added a small number to the division to protect against divide-by-zero, but in the training stage "RuntimeError: 'DivBackward0' returned nan values in its 1th output." still occurs.

Regarding adding noise: I have added the noise to the HR image as you mentioned. Should the pixel values of the input HR be in [0, 255], [-0.5, 0.5], or [0, 1]?
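
The sweep from point 1 looks roughly like this in code; `model.reverse` is just a placeholder for the actual decoding call in my implementation:

```python
import torch

@torch.no_grad()
def sweep_temperature(model, lr, z_shape, taus=(0.0, 0.1, 0.2, 0.5, 0.8, 1.0, 1.1)):
    """Decode the same LR input with z drawn at several standard deviations.

    Low tau tends to give blurry but artifact-free output; tau close to 1
    is where the black NaN regions typically show up.
    """
    results = {}
    for tau in taus:
        z = tau * torch.randn(z_shape, device=lr.device)
        sr = model.reverse(z, lr)                    # placeholder API
        n_bad = (~torch.isfinite(sr)).sum().item()
        results[tau] = (sr, n_bad)
        print(f"tau={tau:.1f}  non-finite pixels: {n_bad}")
    return results
```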

andreas128 commented 3 years ago

Sorry for the late reply. We have good news! We were allowed to publish the model code.

Does this help you? Any thoughts on improving it?

Alan-xw commented 3 years ago

Thanks for sharing the code. I have solved this issue and trained SRFlow well; my implementation is slightly different from the source code. I will dig deeper and run some experiments on the source code later.

Thanks again!

andreas128 commented 3 years ago

Great, feel free to reach out anytime!

mxtsai commented 3 years ago

Hi @Alan-xw! I'm also experiencing an issue where I get NaNs after training for around 10k iterations, and I also get the black rectangles like the ones you showed.

I've added uniform noise [0, 1/256) to the input (which is scaled and shifted to values in [-0.5, 0.5)).
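
Concretely, my dequantization step looks roughly like this (standard Glow-style preprocessing; I'm not claiming it matches the official SRFlow code exactly):

```python
import torch

def dequantize(img_uint8: torch.Tensor, n_bits: int = 8) -> torch.Tensor:
    """Map an 8-bit image to [-0.5, 0.5) with uniform dequantization noise."""
    n_bins = 2 ** n_bits
    x = img_uint8.float() / n_bins           # [0, 1)
    x = x + torch.rand_like(x) / n_bins      # add U[0, 1/256) noise
    return x - 0.5                           # shift to [-0.5, 0.5)
```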

May I know what you did to solve the NaN problem during training? And have you figured out what was causing it?

(I'm also using code that is slightly different from the source code)

Thanks in advance!

neonbjb commented 3 years ago

I'm also seeing the NaNs when training the model from the code in this repo. They only appear when computing the flow chain in 'reverse' direction for me. 'Normal' is stable and trains well. Feeding z values generated from a 'normal' pass into a 'reverse' pass reproduces the HR input, as expected.

Using a suggestion above from @andreas128, I tracked the issue down to the affine coupling operation here. I think there is a natural mathematical instability: if the scale gets too small in one layer, the scales in later layers get multiplicatively smaller and smaller, and since these scales are used as divisors in the 'reverse' pass, the result eventually overflows to inf.

I don't have any good ideas for fixing this, because I think the root cause is that the network simply isn't capable of 'using' some of the Gaussian vectors I'm providing as z values. By this I mean: the NLL loss trains the flow network to split an HQ image into an LQ embedding and a z value that is indistinguishable from a Gaussian, but does that necessarily mean that every possible Gaussian vector can be combined with any arbitrary LQ embedding? This is one aspect of training flow networks I don't quite understand.

I suspect that as I continue training, the network will learn to "use" a broader spectrum of z latents and this problem will diminish or disappear completely. It certainly appears to be doing so a bit already at 12k steps. Still, it would be nice to know whether the authors or anyone else training these networks sees this issue early on.

Here's an example batch of images from my side. (screenshot: black boxes are areas with NaN) Not quite sure what causes the noise.

mxtsai commented 3 years ago

@neonbjb From my earlier experiments (I haven't checked in my recent ones), the scales seemed to be at reasonable values. I was able to train my model more stably after lowering the learning rate to 2e-4. Have you plotted the log-prior (log_p) term and the log-determinant term during training? I observed that my model produced better results when the log-determinant term decreased and the log_p term increased during training.
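
For reference, logging those two terms alongside the bits-per-dimension loss can be done like this (assuming the model returns the Gaussian prior term `log_p` and the accumulated `logdet` separately, as Glow-style code usually does):

```python
import math
import torch

def nll_bits_per_dim(log_p: torch.Tensor, logdet: torch.Tensor, num_pixels: int):
    """Negative log-likelihood in bits/dim from the prior term and the
    accumulated log-determinant. Both inputs are per-sample tensors."""
    nll = -(log_p + logdet) / (num_pixels * math.log(2.0))
    return nll.mean(), log_p.mean().item(), logdet.mean().item()

# usage (illustrative shapes):
# loss, lp, ld = nll_bits_per_dim(log_p, logdet, num_pixels=3 * H * W)
# print(f"nll={loss.item():.3f} bits/dim  log_p={lp:.1f}  logdet={ld:.1f}")
```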

neonbjb commented 3 years ago

In case anyone runs into my problem: it appears to be just part of the training process. These artifacts do seem to reduce over time as the network trains. I suppose this intuitively makes sense; I'm just surprised no one else had run into it.

LyWangPX commented 3 years ago

In my case, I did not look at the Glow source code (which I regret) and built my own implementation, so I do not have the regularizing terms like dividing by log2 or something similar. In such a setup, the coupling layers, especially where f is implemented, are the most likely place to hit NaNs early in training. I more or less fixed it by adding a clip. As training goes on, everything calms down.
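
The clip I mean is essentially this (the bound is a value I picked empirically, not something from the official code):

```python
import torch

def clipped_scale(raw_log_scale: torch.Tensor, bound: float = 3.0) -> torch.Tensor:
    """Clamp the predicted log-scale before exponentiating so the affine
    coupling cannot blow up early in training. The bound is empirical."""
    return torch.exp(torch.clamp(raw_log_scale, -bound, bound))
```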

Alan-xw commented 3 years ago

> In case anyone runs into my problem: it appears to be just part of the training process. These artifacts do seem to reduce over time as the network trains. I suppose this intuitively makes sense; I'm just surprised no one else had run into it.

I have tested the images with the public testing code and found that at t = 1 the NaNs still exist in the SR results. The main reason, I think, is that the latent distribution produced by the model still does not match the standard Gaussian distribution.

neonbjb commented 3 years ago

> In case anyone runs into my problem: it appears to be just part of the training process. These artifacts do seem to reduce over time as the network trains. I suppose this intuitively makes sense; I'm just surprised no one else had run into it.

> I have tested the images with the public testing code and found that at t = 1 the NaNs still exist in the SR results. The main reason, I think, is that the latent distribution produced by the model still does not match the standard Gaussian distribution.

+1, I've also seen it with the pretrained model.

I think your statement could be elaborated: the model does not generalize to a standard Gaussian for all possible input LR images. The NaN areas of the image often correspond to unusual regions that don't occur frequently in natural images.

This is actually a rather neat result in and of itself, but it makes using this for real SR challenging.

avinash31d commented 3 years ago

Hi @mxtsai, can you please give some more details on "I observe that my model was producing better results when the log-determinant term decreases and the log-p term increases during training"?

What is the possible range of log_p values? What is the possible range of logdet values?

avinash31d commented 3 years ago

Hi @andreas128,

I noticed a few differences between the code and the paper. Can you please shed some light on them?

1. In the paper, actnorm and the affine layers apply the scaling first and then the shift, whereas in the code it is the other way around (see the sketch at the end of this comment).
2. I also found some uses of exp() switched between the calculation of the value and of the log-determinant compared to the paper.

Could this be a possible reason for the artifacts shown above?
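
For reference, here is how I understand the two orderings for point 1 (an illustrative sketch, not the repository's code); if I'm reading it right, the shift does not enter the Jacobian, so the log-determinant should be the same either way as long as the inverse matches the chosen order:

```python
import torch

def actnorm_scale_then_shift(x, log_s, b):
    """Paper-style ordering: y = exp(log_s) * x + b (x is NCHW, log_s/b are per-channel)."""
    y = torch.exp(log_s) * x + b
    logdet = log_s.sum() * x.shape[2] * x.shape[3]
    return y, logdet

def actnorm_shift_then_scale(x, log_s, b):
    """Code-style ordering: y = exp(log_s) * (x + b)."""
    y = torch.exp(log_s) * (x + b)
    logdet = log_s.sum() * x.shape[2] * x.shape[3]   # shift does not affect the Jacobian
    return y, logdet
```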

Pierresips commented 2 years ago

Hello.

On my side, the NaN artifact appeared in an even simpler setting. I cloned the git repo and ran it on the DIV2K dataset. The issue shows up occasionally at the 4x zoom ratio, and always in the version with the higher variance. I also tried the 8x case with the pretrained model, and in that case all images show the NaN issue. I'm a bit surprised by this. Since the issue seems related to division by zero or possibly to numerical precision (based on the previous comments; I have not dug into it yet), could it also be related to the hardware? I meet the specified requirements; my GPU is an RTX 3090.

Zuomy826 commented 2 years ago

@andreas128 Hello, I added a new loss in optimize_parameters: an L2 loss between SR and GT (with reverse set to true). However, the nll returned by this function becomes NaN at about 10k iterations, and the error "svd_cuda: the updating process of SBDSDC did not converge (error: 11)" appears. When I test the model saved at iteration 10k, most of the results look fine, but there are still some black blocks on one or two of the pictures. (screenshots of the results and of the optimize_parameters code) I also noticed in your optimize_parameters code that neither opt_get(self.opt, ['train', 'weight_fl']) nor weight_l1 = opt_get(self.opt, ['train', 'weight_l1']) is defined in the config, so the code path highlighted in green in the screenshot is never executed. So I wonder: is it the newly added loss that caused this result, and how can I solve it?

Looking forward to your reply. Best wishes! Zuo
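
For context, the way I'm currently combining the losses is roughly like this (the weight name and helper below are mine, just to illustrate; only the NLL is used when the weight is not set in the config):

```python
import torch.nn.functional as F

def total_loss(nll, sr=None, gt=None, weight_l2=None):
    """Flow NLL plus an optional weighted L2 term between SR and GT.

    If weight_l2 is None (i.e. not defined in the config), only the NLL is
    used and no reverse pass is needed for the loss, which also avoids
    backpropagating through the potentially unstable reverse divisions.
    """
    loss = nll.mean()
    if weight_l2 is not None and weight_l2 > 0:
        loss = loss + weight_l2 * F.mse_loss(sr, gt)
    return loss
```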

ph0316 commented 1 year ago

Hello, I also encountered this problem (https://github.com/andreas128/SRFlow/issues/2#issuecomment-1080469157). How can I solve it?
