Project-MONAI / research-contributions

Implementations of recent research prototypes/demonstrations using MONAI.
https://monai.io/
Apache License 2.0

WARNING:root:NaN or Inf found in input tensor when training UNETR network. #27

Closed. chuxiang93 closed this issue 2 years ago.

chuxiang93 commented 2 years ago

Describe the bug

NaN popped up while I was training the UNETR network: WARNING:root:NaN or Inf found in input tensor.

To Reproduce

Steps to reproduce the behavior:

  1. Go to 'UNETR/BTCV'
  2. Run the command: CUDA_VISIBLE_DEVICES=1,2,3 python main.py --distributed --feature_size=32 --batch_size=4 --logdir=unetr_test --optim_lr=1e-3 --lrschedule=warmup_cosine --infer_overlap=0.5 --save_checkpoint --data_dir=/data/BTCV/Abdomen/RawData/Training/ --workers=12

Screenshots

[image]


ahatamiz commented 2 years ago

Hi @mjq93

I am sorry you are facing this difficulty. We never ran into NaNs while training UNETR. The NaN is most likely caused by DiceCELoss, when the values in the Dice denominator become very small.
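For reference, here is a sketch of the per-channel soft Dice term that DiceCELoss combines with cross-entropy, based on my reading of MONAI's DiceLoss (check the documentation for your MONAI version). Here p_i are the softmax probabilities, g_i the one-hot labels, and s_nr, s_dr stand for smooth_nr and smooth_dr:

```latex
\mathcal{L}_{\mathrm{Dice}} = 1 - \frac{2\sum_i p_i g_i + s_{nr}}{\sum_i p_i^{k} + \sum_i g_i^{k} + s_{dr}},
\qquad
k = \begin{cases} 2 & \text{if squared\_pred=True} \\ 1 & \text{if squared\_pred=False} \end{cases}
```

When a class is absent from a training patch, every g_i is 0, so the denominator depends only on the predictions plus s_dr.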

One quick fix is to set squared_pred=False in DiceCELoss. This keeps the predictions, which lie between 0 and 1, from becoming even smaller when they are squared, although we observed slightly better performance when squaring the predictions.
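To see why squaring hurts, here is a small pure-Python sketch with made-up probabilities (not BTCV data) for a channel whose organ is absent from the patch, so the one-hot target is all zeros. The helper name dice_denominator is hypothetical; it only mirrors the squared_pred option described above.

```python
def dice_denominator(pred, target, squared_pred):
    """Denominator of the soft Dice ratio for one channel (without smooth_dr).

    With squared_pred=True each term is squared before summing; with
    squared_pred=False the raw values are summed.
    """
    if squared_pred:
        return sum(p * p for p in pred) + sum(g * g for g in target)
    return sum(pred) + sum(target)


# Illustrative softmax probabilities for an organ channel that is absent
# from this patch, so the one-hot target is all zeros.
pred = [0.02, 0.05, 0.01, 0.03]
target = [0.0, 0.0, 0.0, 0.0]

linear = dice_denominator(pred, target, squared_pred=False)
squared = dice_denominator(pred, target, squared_pred=True)
print(f"squared_pred=False: {linear:.4f}")   # prints 0.1100
print(f"squared_pred=True:  {squared:.4f}")  # prints 0.0039
```

Squaring values below 1 shrinks this denominator by more than an order of magnitude, which brings it much closer to zero.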

Another fix is to pass a larger value, such as smooth_dr=1e-5, to prevent division by zero. I believe the first fix will work out of the box and is the preferred one.
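A minimal sketch of how smooth_dr guards the ratio, assuming the degenerate case where the organ is absent and the predictions have underflowed to exactly 0 (for example in low precision). The function soft_dice_loss is a hypothetical pure-Python stand-in, not MONAI's implementation; it returns NaN for 0/0 to mimic a tensor division, since plain Python floats would raise ZeroDivisionError instead.

```python
import math


def soft_dice_loss(pred, target, smooth_nr=0.0, smooth_dr=0.0):
    """Hypothetical per-channel soft Dice loss (not MONAI's code):
    1 - (2*sum(p*g) + smooth_nr) / (sum(p) + sum(g) + smooth_dr).
    """
    numerator = 2.0 * sum(p * g for p, g in zip(pred, target)) + smooth_nr
    denominator = sum(pred) + sum(target) + smooth_dr
    if denominator == 0.0:
        # A tensor division would give 0/0 -> nan and x/0 -> inf (1 - inf -> -inf).
        return math.nan if numerator == 0.0 else -math.inf
    return 1.0 - numerator / denominator


# Organ absent from the patch and predictions that underflowed to exactly 0.
pred = [0.0, 0.0, 0.0]
target = [0.0, 0.0, 0.0]

print(soft_dice_loss(pred, target))                  # nan
print(soft_dice_loss(pred, target, smooth_dr=1e-5))  # 1.0
```

Setting smooth_nr to the same value as smooth_dr would instead give a loss of 0.0 for this empty-versus-empty case; which behaviour is preferable is a separate design choice.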

Thanks

liuyanice commented 2 years ago

@mjq93 I ran into the same issue: the loss became NaN during training. Neither of the two fixes worked for me. How did you solve it in the end?

liuyanice commented 2 years ago

Thank you, I have solved it.