Closed: William4553 closed this issue 2 years ago
Thank you for your interest in our work! I checked the released code and made sure that the {division, log, sqrt} operations already add an eps. You can also try clamping the inputs of the pow and sqrt operations to a non-negative range, i.e., sine = torch.sqrt((1.0 - torch.pow(cosine.clamp(0.0, 1.0), 2)).clamp(0.0001, 1.0)) at line 152 of utils.py, because torch.sqrt(x) returns nan when x < 0. If that does not address your problem, you can try torch.nn.utils.clip_grad_norm_ to clip the gradient norm of the model parameters. If these suggestions solve your problem, I hope you can let me know so that I can update the code, thanks!!
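To make both suggestions concrete, here is a minimal sketch; the tensors, model, and optimizer below are placeholders, not the actual code around line 152 of utils.py:

```python
import torch
import torch.nn as nn

# Suggestion 1: keep the argument of torch.sqrt strictly positive, so it can
# neither go negative (which gives nan) nor hit zero (where the gradient blows up).
cosine = torch.rand(8, 10)  # placeholder for the cosine tensor in utils.py
sine = torch.sqrt((1.0 - torch.pow(cosine.clamp(0.0, 1.0), 2)).clamp(0.0001, 1.0))

# Suggestion 2: clip the gradient norm of the model parameters before each
# optimizer step to guard against occasional exploding gradients.
model = nn.Linear(10, 2)                          # placeholder model
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)
loss = model(torch.randn(4, 10)).pow(2).mean()    # placeholder loss
loss.backward()
torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=10.0)
optimizer.step()
```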
Alternatively, you can try clamping the input of torch.sqrt() with a larger floor, e.g., sine = torch.sqrt((1.0 - torch.pow(cosine.clamp(0.0, 1.0), 2)).clamp(0.1, 1.0)), because the gradient of torch.sqrt(x) is 1 / (2 * torch.sqrt(x)): at x = 0.0001 the gradient is 50, while at x = 0.1 it is only about 1.6.
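A quick autograd check (standalone, not repo code) shows how much the clamp floor changes the gradient of the sqrt:

```python
import torch

# Gradient of sqrt(x) is 1 / (2 * sqrt(x)); a tiny floor still allows a
# large gradient, while a larger floor keeps it modest.
for floor in (0.0001, 0.1):
    x = torch.tensor(floor, requires_grad=True)
    torch.sqrt(x).backward()
    print(f"x = {floor}: grad = {x.grad.item():.2f}")
# x = 0.0001: grad = 50.00
# x = 0.1: grad = 1.58
```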
Thanks for the response and for the advice. I found that line 35 of loss.py is causing my problem. When I remove the weight parameter from loss = F.cross_entropy(input=pred, target=label, weight=weight), I no longer get nan, but with the weight parameter present the loss comes out as nan.
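One possible mechanism (just a guess, and the shapes below are made up rather than taken from loss.py): with the default reduction='mean', F.cross_entropy divides by the sum of the weights of the targets that appear in the batch, so if those weights are all zero the result is 0/0 = nan:

```python
import torch
import torch.nn.functional as F

pred = torch.randn(4, 3)                  # placeholder logits
label = torch.zeros(4, dtype=torch.long)  # every target is class 0
weight = torch.tensor([0.0, 1.0, 1.0])    # class 0 has zero weight

# The weighted mean divides by the sum of the selected class weights,
# which is zero here, so the loss evaluates to nan.
print(F.cross_entropy(input=pred, target=label, weight=weight))  # tensor(nan)
```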
Thanks for your reply! I will check the code again.
Thanks to Feng Wei from Zhejiang University. We found that the training instability may be due to the exceptionally small (close-to-zero) weights of the background classes.
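If that is indeed the cause, one simple guard (sketched below; the eps value and renormalization are illustrative, not the repo's implementation) is to put a floor on the class weights before passing them to the loss:

```python
import torch

# Hypothetical class-weight vector with a near-zero background weight.
weight = torch.tensor([1e-8, 0.7, 1.3])

# Clamp the weights away from zero and renormalize so the average weight
# stays at 1; this prevents any class from zeroing out the weighted loss.
eps = 1e-3
weight = weight.clamp(min=eps)
weight = weight / weight.sum() * len(weight)
print(weight)
```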
Hi, thanks for the work. I have a question about the training process. When I run the training, the first 25 iterations seem to work fine, but on the 26th iteration and afterwards, I am only getting nan for all of the numerical values (for loss_seg_src_aux, loss_dice_src_aux, etc.). This seems to be caused by the model itself predicting nan. Also, the testing process works fine, so it is only the training process that I am having the issue with. Do you have any idea what might be causing this issue or how I can resolve it? Thank you for the help.
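For anyone hitting a similar nan partway through training, a generic way to locate where it first appears (not specific to this repo; check_finite is a hypothetical helper) is PyTorch's anomaly detection plus explicit finiteness checks in the training loop:

```python
import torch

# Raise an error at the exact backward op that produces nan/inf
# (slow, so enable it only while debugging).
torch.autograd.set_detect_anomaly(True)

def check_finite(name, tensor):
    # Flag the first tensor that contains nan or inf in the forward pass.
    if not torch.isfinite(tensor).all():
        raise RuntimeError(f"{name} contains nan/inf")

# Usage inside the training loop (names are placeholders):
# check_finite("images", images)
# pred = model(images)
# check_finite("pred", pred)
# check_finite("loss", loss)
```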