MIC-DKFZ / nnUNet

train loss nan, val loss nan and dice is 0 #2290

Open Malitha123 opened 5 months ago

Malitha123 commented 5 months ago

I am trying to segment the heart (LV, MYO and RV) in my 200 NIfTI images. When I use 2D segmentation, the training loss and validation loss become nan and the Dice score is 0. This only happens with 2D; 3D segmentation works well.

2024-06-12 18:20:21.593942: Unable to plot network architecture: 
2024-06-12 18:20:21.599466: module 'torch.onnx' has no attribute '_optimize_trace' 
2024-06-12 18:20:21.620680:  
2024-06-12 18:20:21.625154: Epoch 0 
2024-06-12 18:20:21.630196: Current learning rate: 0.01 
2024-06-12 18:26:16.488273: train_loss 0.3602 
2024-06-12 18:26:16.493741: val_loss 0.3339 
2024-06-12 18:26:16.497657: Pseudo dice [0.0, 0.0, 0.0] 
2024-06-12 18:26:16.502149: Epoch time: 354.87 s 
2024-06-12 18:26:16.507174: Yayy! New best EMA pseudo Dice: 0.0 
2024-06-12 18:26:18.638051:  
2024-06-12 18:26:18.641770: Epoch 1 
2024-06-12 18:26:18.646975: Current learning rate: 0.00999 
2024-06-12 18:31:37.105687: train_loss 0.1213 
2024-06-12 18:31:37.113853: val_loss 0.4789 
2024-06-12 18:31:37.118266: Pseudo dice [0.0, 0.0, 0.0] 
2024-06-12 18:31:37.122863: Epoch time: 318.47 s 
2024-06-12 18:31:38.829792:  
2024-06-12 18:31:38.834197: Epoch 2 
2024-06-12 18:31:38.840446: Current learning rate: 0.00998 
2024-06-12 18:36:38.060880: train_loss 0.5986 
2024-06-12 18:36:38.067308: val_loss 3.6355 
2024-06-12 18:36:38.073064: Pseudo dice [0.0, 0.0, 0.0] 
2024-06-12 18:36:38.077461: Epoch time: 299.23 s 
2024-06-12 18:36:39.684194:  
2024-06-12 18:36:39.690887: Epoch 3 
2024-06-12 18:36:39.698113: Current learning rate: 0.00997 
2024-06-12 18:41:48.443093: train_loss 0.8558 
2024-06-12 18:41:48.449760: val_loss 3.0473 
2024-06-12 18:41:48.455287: Pseudo dice [0.0, 0.0, 0.0] 
2024-06-12 18:41:48.460392: Epoch time: 308.76 s 
2024-06-12 18:41:49.830043:  
2024-06-12 18:41:49.834860: Epoch 4 
2024-06-12 18:41:49.840931: Current learning rate: 0.00996 
2024-06-12 18:47:00.747054: train_loss 1.2692 
2024-06-12 18:47:00.779041: val_loss 4.5155 
2024-06-12 18:47:00.800542: Pseudo dice [0.0, 0.0, 0.0] 
2024-06-12 18:47:00.823451: Epoch time: 310.92 s 
2024-06-12 18:47:02.595355:  
2024-06-12 18:47:02.599971: Epoch 5 
2024-06-12 18:47:02.607214: Current learning rate: 0.00995 
2024-06-12 18:52:00.977607: train_loss 1.3166 
2024-06-12 18:52:00.983934: val_loss 2.9056 
2024-06-12 18:52:00.988332: Pseudo dice [0.0, 0.0, 0.0] 
2024-06-12 18:52:00.993186: Epoch time: 298.38 s 
2024-06-12 18:52:02.361344:  
2024-06-12 18:52:02.367609: Epoch 6 
2024-06-12 18:52:02.372590: Current learning rate: 0.00995 
2024-06-12 18:57:03.786279: train_loss nan 
2024-06-12 18:57:03.792261: val_loss nan 
2024-06-12 18:57:03.796917: Pseudo dice [0.0, 0.0, 0.0] 
2024-06-12 18:57:03.801277: Epoch time: 301.43 s 
2024-06-12 18:57:05.161590:  
2024-06-12 18:57:05.166871: Epoch 7 
2024-06-12 18:57:05.171144: Current learning rate: 0.00994 
2024-06-12 19:02:09.025165: train_loss nan 
2024-06-12 19:02:09.032341: val_loss nan 
2024-06-12 19:02:09.037521: Pseudo dice [0.0, 0.0, 0.0] 
2024-06-12 19:02:09.043633: Epoch time: 303.87 s 
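
A log like the one above can be scanned automatically to find the epoch where a loss first stops being finite. The following is a minimal sketch, not part of nnU-Net: the function name `first_nonfinite_epoch` and both regular expressions are assumptions based only on the line format shown in this log.

```python
import math
import re

# Assumed line formats, taken from the log excerpt in this thread:
#   "2024-06-12 18:57:03.786279: train_loss nan"
#   "2024-06-12 18:57:05.166871: Epoch 7"
LOSS_RE = re.compile(r"(train_loss|val_loss)\s+(\S+)")
EPOCH_RE = re.compile(r"Epoch (\d+)\s*$")


def first_nonfinite_epoch(log_text):
    """Return (epoch, loss_name) for the first nan/inf loss, or None."""
    epoch = None
    for line in log_text.splitlines():
        epoch_match = EPOCH_RE.search(line)
        if epoch_match:
            # "Epoch time: ..." lines do not match because of the digit requirement.
            epoch = int(epoch_match.group(1))
            continue
        loss_match = LOSS_RE.search(line)
        if loss_match:
            try:
                value = float(loss_match.group(2))  # float("nan") parses fine
            except ValueError:
                continue  # skip anything that is not a number
            if not math.isfinite(value):
                return epoch, loss_match.group(1)
    return None
```

Applied to the log above, this would point at epoch 6, which is also where the validation loss had already been climbing for several epochs while the pseudo Dice stayed at 0.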

JackRio commented 5 months ago

A few things to check: First, the field of view (patch size) you are passing when using 2D. Second, if you have lots of empty voxels, try the oversampling trainer for 2D. Alternatively, reduce the patch size instead of using the median size.
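
To judge whether empty voxels are a concern, one rough check is the fraction of 2D slices in a label volume that contain no foreground at all. The sketch below assumes the segmentation is already loaded as a NumPy array with background labelled 0 and that slices are taken along axis 0. Both are assumptions; the axis nnU-Net actually slices along depends on its preprocessing. The helper name `empty_slice_fraction` is hypothetical.

```python
import numpy as np


def empty_slice_fraction(labels, axis=0):
    """Fraction of 2D slices along `axis` with no foreground (label > 0)."""
    labels = np.asarray(labels)
    # Reduce over every axis except the slicing axis.
    other_axes = tuple(i for i in range(labels.ndim) if i != axis)
    has_foreground = (labels > 0).any(axis=other_axes)
    return 1.0 - float(has_foreground.mean())


# Toy volume: 4 slices, only one of which contains any labelled voxels.
volume = np.zeros((4, 8, 8), dtype=np.uint8)
volume[1, 2:4, 2:4] = 1
print(empty_slice_fraction(volume))
```

A high fraction would suggest that many 2D patches contain only background, which is the situation the oversampling suggestion targets.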

Malitha123 commented 5 months ago

I tried this, but the issue is the same.

mrokuss commented 5 months ago

Could you update or re-create your environment with the newest nnU-Net version and dependencies, and then try rerunning with the nnUNetTrainerDiceCELoss_noSmooth trainer? Hope this helps!
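
For reference, selecting a non-default trainer could look roughly like the sketch below. It assumes nnU-Net v2, where the `nnUNetv2_train` command takes the dataset, configuration, and fold as positional arguments and a trainer class name via `-tr`. `DATASET_NAME_OR_ID` is a placeholder, since the dataset used in this thread is not stated, and `build_train_command` is a hypothetical helper, not part of nnU-Net.

```python
import shlex


def build_train_command(dataset, configuration="2d", fold=0,
                        trainer="nnUNetTrainerDiceCELoss_noSmooth"):
    """Assemble the argument list for an nnU-Net v2 training run."""
    return ["nnUNetv2_train", str(dataset), configuration, str(fold),
            "-tr", trainer]


# Placeholder dataset; replace with your own dataset name or ID.
cmd = build_train_command("DATASET_NAME_OR_ID")
print(shlex.join(cmd))

# To actually launch training (requires nnU-Net to be installed):
# import subprocess
# subprocess.run(cmd, check=True)
```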