MIC-DKFZ / nnUNet

Apache License 2.0

NaN in train_loss and val_loss, 0.0 in dice #1612

Closed VonbatenBach closed 1 year ago

VonbatenBach commented 1 year ago

Hi, I have a problem with training nnUNetv2. When I run training on my preprocessed dataset (any fold), I get the following output in the first epoch (and in every epoch after it):

2023-08-09 14:55:25.642011: Epoch 0 
2023-08-09 14:55:25.642085: Current learning rate: 0.01 
2023-08-09 15:17:51.409852: train_loss nan 
2023-08-09 15:17:51.440985: val_loss nan 
2023-08-09 15:17:51.441061: Pseudo dice [0.0, 0.0] 
2023-08-09 15:17:51.452639: Epoch time: 1345.72 s 
2023-08-09 15:17:51.456612: Yayy! New best EMA pseudo Dice: 0.0 
2023-08-09 15:18:00.922120:  
2023-08-09 15:18:00.922225: Epoch 1 
2023-08-09 15:18:00.922357: Current learning rate: 0.00999 

training_log_2023_8_9_14_55_06.txt

I've already checked whether there are any NaNs in my original data - there are none. My data format is .png. I work with single-channel (1 modality) images and 2 segmentation labels - they are numbered "0, 1, 2", with 0 being the background.

I preprocessed the data using nnUNetv2_plan_and_preprocess -d DATASET_ID --verify_dataset_integrity. It ran without any problems. After inspecting my original data (btw, I think the masks are properly made: the data type is uint8 and each mask contains two unique values, (0, 1) or (0, 2)), I checked the files located in nnUNet_preprocessed. I haven't found any NaNs there either. There was, however, one abnormality: when I was checking the *_seg.npy files (the segmentation masks which I guess nnUNet uses in training), some of them (maybe 5% or less) had an additional label "-1". I thought that maybe this was the source of my problems, so I modified the affected files so that they have labels in the range "0, 1, 2" again. This did not help, so now I really don't know what to do. I can share my dataset via Google Drive if it would help.
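
For reference, a quick way to repeat this kind of check on the preprocessed files is a short script like the one below; the folder layout and glob pattern are assumptions based on a typical nnUNet_preprocessed directory and may need adjusting:

import glob
import numpy as np

# assumed layout - adjust the glob pattern to your own nnUNet_preprocessed dataset folder
for path in glob.glob("nnUNet_preprocessed/Dataset*/*/*_seg.npy"):
    seg = np.load(path)
    # a NaN check only makes sense for floating-point arrays
    if np.issubdtype(seg.dtype, np.floating) and np.isnan(seg).any():
        print(f"NaNs found in {path}")
    labels = np.unique(seg)
    # -1 is written by the preprocessing itself; 0, 1, 2 are the dataset labels
    if not set(labels.tolist()) <= {-1, 0, 1, 2}:
        print(f"unexpected labels {labels} in {path}")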

constantinulrich commented 1 year ago

Hi, during preprocessing nnU-Net generates -1 for non-zero areas.

During data augmentation, the -1 is replaced by 0 again, so you don't need to change anything.

Please try the nnUNetTrainerDiceCELoss_noSmooth trainer instead of the default one. Sometimes it already solves the NaN problem. Btw, your epoch time looks way too high...
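
For anyone wondering how to select it: alternative trainers are passed to nnUNetv2_train via the -tr option, so the suggestion above would be run roughly like this (dataset ID, configuration and fold are placeholders):

nnUNetv2_train DATASET_ID 2d 0 -tr nnUNetTrainerDiceCELoss_noSmooth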

VonbatenBach commented 1 year ago

I've just tried nnUNetTrainerDiceCELoss_noSmooth (and several others). But it's still exactly the same.

My epoch time is really high, that's true. I'm guessing it might be just because I have a slow GPU - GTX 1660 Super. Do you think that epoch time should be lower regardless?

VonbatenBach commented 1 year ago

So far, I've tried the following:

None of these worked. What could be the problem? I've heard that such behaviour may be caused by reduced-precision mode, when the FP16 format is used instead of a typical 32-bit float. I don't know, however, whether (and where) nnUNet uses such things.
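
On the FP16 point: half precision overflows just above 65504, so intermediate values that are fine in FP32 can become inf and then NaN. A tiny standalone illustration (not nnUNet code):

import torch

x = torch.tensor([300.0], dtype=torch.float16)
print(x * x)          # inf: 90000 overflows the float16 range (max ~65504)
print(x * x - x * x)  # nan: inf - inf is undefined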

VonbatenBach commented 1 year ago

Another interesting thing: I changed num_iterations_per_epoch and num_val_iterations_per_epoch to 1 (just to debug faster) and, although the 0th epoch took 100 s, the following ones took only 6 s. I attach a log file. There was also an error at some point:

.../nnUNetTrainer.py:970: RuntimeWarning: invalid value encountered in scalar divide
    global_dc_per_class = [i for i in [2 * i / (2 * i + j + k) for i, j, k in

training_log_2023_8_10_12_00_01.txt
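
That RuntimeWarning fits the NaN picture: the expression in the log is the per-class dice 2*tp / (2*tp + fp + fn), which becomes 0/0 = NaN whenever a class has no true positives, false positives or false negatives across the validation iterations. A minimal numpy reproduction of just that arithmetic:

import numpy as np

tp, fp, fn = np.float64(0), np.float64(0), np.float64(0)
dice = 2 * tp / (2 * tp + fp + fn)  # 0/0 -> nan (RuntimeWarning: invalid value encountered)
print(dice)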

constantinulrich commented 1 year ago

Hi, nnUNet uses mixed precision, which means FP16 for most operations, but with default settings it is well tested and should not cause NaNs. Could you send me the debug.json and your plans file?

VonbatenBach commented 1 year ago

Sure, I attach these files from a run with nnUNetTrainerDiceCELoss_noSmooth. I had to convert them to txt, because GitHub doesn't accept JSON files. plans.json.txt debug.json.txt

constantinulrich commented 1 year ago

Okay, thanks. So far I can't find anything. I will try to reproduce this error. Is it possible to share the data? Thanks

VonbatenBach commented 1 year ago

I uploaded it on Google Drive: https://drive.google.com/drive/folders/104JMYB84x4diR43l_alLWRpvMx9wMa9Z?usp=sharing Thanks for help

Overflowu7 commented 1 year ago

Have you solved the problem, bro? I ran into the same problem because of my own network and loss function.

VonbatenBach commented 1 year ago

No, the problem hasn't been solved yet.

VonbatenBach commented 1 year ago

Okay, thanks. So far I can't find anything. I will try to reproduce this error. Is it possible to share the data? Thanks

Hi, were you able to test out that data?

constantinulrich commented 1 year ago

Hi, I just downloaded your data and did the default preprocessing and training. I do not get any NaNs...
You could try to pull the newest version and redo the preprocessing. Maybe there are some corrupted files.
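
In practice that would mean something like upgrading the package and re-running the preprocessing; the exact install command depends on whether nnU-Net was installed from PyPI or from a local clone (the package name below is an assumption for a PyPI install):

pip install --upgrade nnunetv2
nnUNetv2_plan_and_preprocess -d DATASET_ID --verify_dataset_integrity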

Overflowu7 commented 1 year ago

Okay, thanks. So far I can't find anything. I will try to reproduce this error. Is it possible to share the data? Thanks

Hi, were you able to test out that data?

I will try it. Have you ever had a problem with the pseudo dice being very low when this issue occurs?

constantinulrich commented 1 year ago

@VonbatenBach could you try a different setup? Maybe your hardware is causing the problems.

Overflowu7 commented 1 year ago

@VonbatenBach could you try a different setup? Maybe your hardware is causing the problems.

My data gets a nice dice in nnUNet, and I want to change something in the model without deep supervision. But when I replace the nnUNet default dice and model with my own dice and model, the pseudo dice = 0 appears. I'm really confused about that.

I got some information: "Be sure to softmax first, then extract the second layer (i.e. the target layer), and do not change the input preds themselves - be sure to get a copy. Because there is an external evaluation mask that follows the same process, if you extract the second layer before the sigmoid here, it will lead to an error in the external evaluation mask, and the pseudo dice will always be 0." But the problem still occurs. Here is my dice:

import torch

def my_get_dice_loss(preds: torch.Tensor, target: torch.Tensor):
    # softmax over the channel dimension, then take the foreground channel as a new tensor
    # (the input preds are not modified in place)
    pred = torch.softmax(preds, dim=1)[:, 1, :, :, :].unsqueeze(1)

    inter = pred * target
    union = pred + target
    # pred: B x 1 x D x H x W, target: B x 1 x D x H x W
    if len(pred.shape) == 5 and len(target.shape) == 5:
        # sum over the three spatial dimensions, keeping batch and channel
        inter = inter.sum(dim=2).sum(dim=2).sum(dim=2)
        union = union.sum(dim=2).sum(dim=2).sum(dim=2)
    # smoothed soft dice loss, one value per sample
    dice_loss = 1 - 2 * (inter + 1) / (union + 2)
    return dice_loss
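
A quick sanity check of this loss on random tensors, with made-up shapes, might look like the sketch below. Note that the function returns one value per sample rather than a scalar, so a reduction such as .mean() is needed before calling backward():

import torch

preds = torch.randn(2, 2, 16, 16, 16, requires_grad=True)  # logits: batch 2, 2 classes, 16x16x16 patch
target = torch.randint(0, 2, (2, 1, 16, 16, 16)).float()   # binary foreground mask

loss = my_get_dice_loss(preds, target)
print(loss.shape, torch.isfinite(loss).all())  # torch.Size([2, 1]), no NaN/inf expected
loss.mean().backward()                         # reduce to a scalar before backprop
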
2345678y3 commented 7 months ago

Okay, thanks. So far I can't find anything. I will try to reproduce this error. Is it possible to share the data? Thanks

Have you solved the problem?

2345678y3 commented 7 months ago

Hi, during preprocessing nnU-Net generates -1 for non-zero areas.

During data augmentation, the -1 is replaced by 0 again, so you don't need to change anything.

Please try the nnUNetTrainerDiceCELoss_noSmooth trainer instead of the default one. Sometimes it already solves the NaN problem. Btw, your epoch time looks way too high...


How can I use nnUNetTrainerDiceCELoss_noSmooth? Thanks.