Closed: William4553 closed this issue 2 years ago
Thank you for your interest in our work! I checked the released code and made sure that the {division, log, sqrt} operations already add an eps. You can also try clamping the inputs of the pow and sqrt operations to a non-negative range, i.e., sine = torch.sqrt((1.0 - torch.pow(cosine.clamp(0.0, 1.0), 2)).clamp(0.0001, 1.0)) at line 152 of utils.py, because torch.sqrt(x) returns nan when x < 0. If that does not address your problem, you can try torch.nn.utils.clip_grad_norm_ to clip the gradient norm of the model parameters. If these suggestions solve your problem, I hope you can let me know so that I can update the code, thanks!!
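To make both suggestions concrete, here is a minimal sketch; the tensors, model, and optimizer below are placeholders, not the actual code around line 152 of utils.py:

```python
import torch
import torch.nn as nn

# Suggestion 1: keep the argument of torch.sqrt strictly positive, so it can
# neither go negative (which gives nan) nor hit zero (where the gradient blows up).
cosine = torch.rand(8, 10)  # placeholder for the cosine tensor in utils.py
sine = torch.sqrt((1.0 - torch.pow(cosine.clamp(0.0, 1.0), 2)).clamp(0.0001, 1.0))

# Suggestion 2: clip the gradient norm of the model parameters before each
# optimizer step to guard against occasional exploding gradients.
model = nn.Linear(10, 2)                          # placeholder model
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)
loss = model(torch.randn(4, 10)).pow(2).mean()    # placeholder loss
loss.backward()
torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=10.0)
optimizer.step()
```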
Alternatively, you can try clamping the input of torch.sqrt() with a larger floor, e.g., sine = torch.sqrt((1.0 - torch.pow(cosine.clamp(0.0, 1.0), 2)).clamp(0.1, 1.0)), because the gradient of torch.sqrt(x) is 1 / (2 * torch.sqrt(x)): at x = 0.0001 the gradient is 50, while at x = 0.1 it is only about 1.6.
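A quick autograd check (standalone, not repo code) shows how much the clamp floor changes the gradient of the sqrt:

```python
import torch

# Gradient of sqrt(x) is 1 / (2 * sqrt(x)); a tiny floor still allows a
# large gradient, while a larger floor keeps it modest.
for floor in (0.0001, 0.1):
    x = torch.tensor(floor, requires_grad=True)
    torch.sqrt(x).backward()
    print(f"x = {floor}: grad = {x.grad.item():.2f}")
# x = 0.0001: grad = 50.00
# x = 0.1: grad = 1.58
```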
Thanks for the response and for the advice. I found that line 35 of loss.py is causing my problem. When I remove the weight parameter from loss = F.cross_entropy(input=pred, target=label, weight=weight), I no longer get nan, but with the weight parameter present the loss comes out as nan.
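One possible mechanism (just a guess, and the shapes below are made up rather than taken from loss.py): with the default reduction='mean', F.cross_entropy divides by the sum of the weights of the targets that appear in the batch, so if those weights are all zero the result is 0/0 = nan:

```python
import torch
import torch.nn.functional as F

pred = torch.randn(4, 3)                  # placeholder logits
label = torch.zeros(4, dtype=torch.long)  # every target is class 0
weight = torch.tensor([0.0, 1.0, 1.0])    # class 0 has zero weight

# The weighted mean divides by the sum of the selected class weights,
# which is zero here, so the loss evaluates to nan.
print(F.cross_entropy(input=pred, target=label, weight=weight))  # tensor(nan)
```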
Thanks for your reply! I will check the code again.
Thanks to Feng Wei from Zhejiang University. We found that the training instability may be due to the exceptionally small (close-to-zero) weights of the background classes.
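If that is indeed the cause, one simple guard (sketched below; the eps value and renormalization are illustrative, not the repo's implementation) is to put a floor on the class weights before passing them to the loss:

```python
import torch

# Hypothetical class-weight vector with a near-zero background weight.
weight = torch.tensor([1e-8, 0.7, 1.3])

# Clamp the weights away from zero and renormalize so the average weight
# stays at 1; this prevents any class from zeroing out the weighted loss.
eps = 1e-3
weight = weight.clamp(min=eps)
weight = weight / weight.sum() * len(weight)
print(weight)
```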
Hi, thanks for the work. I have a question about the training process. When I run the training, the first 25 iterations seem to work fine, but on the 26th iteration and afterwards, I am only getting nan for all of the numerical values (for loss_seg_src_aux, loss_dice_src_aux, etc.). This seems to be caused by the model itself predicting nan. Also, the testing process works fine, so it is only the training process that I am having the issue with. Do you have any idea what might be causing this issue or how I can resolve it? Thank you for the help.
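For anyone hitting a similar nan partway through training, a generic way to locate where it first appears (not specific to this repo; check_finite is a hypothetical helper) is PyTorch's anomaly detection plus explicit finiteness checks in the training loop:

```python
import torch

# Raise an error at the exact backward op that produces nan/inf
# (slow, so enable it only while debugging).
torch.autograd.set_detect_anomaly(True)

def check_finite(name, tensor):
    # Flag the first tensor that contains nan or inf in the forward pass.
    if not torch.isfinite(tensor).all():
        raise RuntimeError(f"{name} contains nan/inf")

# Usage inside the training loop (names are placeholders):
# check_finite("images", images)
# pred = model(images)
# check_finite("pred", pred)
# check_finite("loss", loss)
```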