Closed VonbatenBach closed 1 year ago
Hi, during preprocessing nnU-Net generates -1 for non-zero areas.
During data augmentation the -1 is replaced by 0 again, so there is no need to change anything.
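Just to illustrate the remapping being described, here is a toy sketch (not nnU-Net's actual augmentation code, the array is made up):

    import numpy as np

    # hypothetical preprocessed segmentation where -1 is the placeholder label
    seg = np.array([[-1, -1, 0], [0, 1, 2]], dtype=np.int8)

    # during augmentation the -1 placeholder is mapped back to background (0)
    seg_for_training = np.where(seg == -1, 0, seg)
    print(np.unique(seg_for_training))  # [0 1 2]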
Please try the nnUNetTrainerDiceCELoss_noSmooth trainer instead of the default one. Sometimes that already solves the NaN problem. Btw, your epoch time looks way too high...
I've just tried nnUNetTrainerDiceCELoss_noSmooth (and several others). But it's still exactly the same.
My epoch time is really high, that's true. I'm guessing it might be just because I have a slow GPU - GTX 1660 Super. Do you think that epoch time should be lower regardless?
So far, I've tried the following:
None of these worked. What could be the problem? I've heard that such behaviour may be caused by a reduced-precision mode, where the FP16 format is used instead of the typical 32-bit float. I don't know, however, if (and where) nnU-Net uses such things.
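For context, "mixed precision" in PyTorch usually means the AMP pattern below. This is a minimal generic sketch (not nnU-Net's actual trainer code; the model and names are mine), and the isnan check is a convenient place to catch the problem early:

    import torch

    model = torch.nn.Conv2d(1, 3, 3, padding=1).cuda()    # stand-in for the real network
    optimizer = torch.optim.SGD(model.parameters(), lr=1e-2)
    scaler = torch.cuda.amp.GradScaler()                  # keeps FP16 gradients from underflowing

    def train_step(data, target, loss_fn):
        optimizer.zero_grad(set_to_none=True)
        with torch.autocast(device_type='cuda', dtype=torch.float16):  # most ops run in FP16 here
            output = model(data)
            loss = loss_fn(output, target)
        if torch.isnan(loss):                             # cheap debugging hook for the NaN issue
            raise RuntimeError("loss became NaN")
        scaler.scale(loss).backward()
        scaler.step(optimizer)
        scaler.update()
        return loss.detach()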
Another interesting thing: I changed num_iterations_per_epoch and num_val_iterations_per_epoch to 1 (just to debug faster), and although the 0th epoch took 100 s, the following ones took only 6 s. I attach a log file. There was also an error at some point:
.../nnUNetTrainer.py:970: RuntimeWarning: invalid value encountered in scalar divide global_dc_per_class = [i for i in [2 * i / (2 * i + j + k) for i, j, k in
training_log_2023_8_10_12_00_01.txt
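That warning comes from the global Dice computation shown in the traceback, which has the form 2*i / (2*i + j + k) (reading i, j, k as TP/FP/FN-style counts). If a class never appears in the few sampled validation batches, all three counts are 0 and the division is 0/0, which yields NaN. A minimal reproduction of just the arithmetic (hypothetical counts, not nnU-Net code):

    import numpy as np

    tp, fp, fn = np.float64(0), np.float64(0), np.float64(0)  # class never seen in the sampled batches
    dc = 2 * tp / (2 * tp + fp + fn)   # emits RuntimeWarning: invalid value encountered in divide
    print(dc)                          # nan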
Hi, nnU-Net uses mixed precision, which for most operations means FP16, but with default settings it is well tested and should not cause NaNs. Could you send me the debug.json and your plans file?
Sure, I attach these files from a run with nnUNetTrainerDiceCELoss_noSmooth. I had to convert them into txt, because github doesn't accept json files. plans.json.txt debug.json.txt
Okay, thanks. So far I can't find anything. I will try to reproduce this error. Is it possible to share the data? Thanks
I uploaded it on Google Drive: https://drive.google.com/drive/folders/104JMYB84x4diR43l_alLWRpvMx9wMa9Z?usp=sharing Thanks for help
Have you solved the problem, bro? I ran into the same problem because of my own network and loss function.
No, the problem hasn't been solved yet.
Hi, were you able to test out that data?
Hi, I just downloaded your data and did the default preprocessing and training. I do not get any NaNs...
You could try to pull the newest version and redo the preprocessing. Maybe there are some corrupted files.
I will try it. Have you ever had a problem with Pseudo dice being very low when this problem occurs?
@VonbatenBach could you try a different setup? Maybe your hardware is causing the problems.
My data get a nice Dice in nnU-Net, and I want to change something in the model without deep supervision. But when I replace nnU-Net's default dice and model with my own dice and model, Pseudo dice = 0 appears. I'm very confused about that.
I got some information: "Be sure to softmax first, then extract the second layer (i.e. the target layer), and do not change the input preds themselves, be sure to get a copy. Because there is an external evaluation mask that follows the same process, if you extract the second layer before sigmoid here, it will lead to an error in the external evaluation mask, and the pseudo dice will always be 0." But the problem still occurs. Here is my dice:

    import torch

    def my_get_dice_loss(preds: torch.Tensor, target: torch.Tensor):
        # softmax over the class dimension, keep only the foreground channel -> (B, 1, D, H, W)
        pred = torch.softmax(preds, dim=1)[:, 1, :, :, :].unsqueeze(1)
        inter = pred * target
        union = pred + target
        # preds: (B, C, D, H, W), target: (B, 1, D, H, W)
        if (len(pred.shape) == 5) and (len(target.shape) == 5):
            # sum over the spatial dimensions, keeping batch and channel
            inter = inter.sum(dim=2).sum(dim=2).sum(dim=2)
            union = union.sum(dim=2).sum(dim=2).sum(dim=2)
        dice_loss = 1 - 2 * (inter + 1) / (union + 2)
        # dice_loss = DiceCELoss(
        #     to_onehot_y=False, softmax=False, squared_pred=False, smooth_nr=1e-5, smooth_dr=1e-5
        # )
        # result = dice_loss(pred, target)
        return dice_loss
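As a quick sanity check (dummy shapes of my own choosing, not from the original post), the function above can be probed with random tensors; note that it returns a per-sample value rather than a scalar, and that any leftover -1 "outside" labels in the target leak into the overlap and union sums:

    import torch

    preds = torch.randn(2, 2, 8, 8, 8)                      # (B, C=2, D, H, W) raw logits
    target = torch.randint(0, 2, (2, 1, 8, 8, 8)).float()   # binary foreground mask in {0, 1}

    loss = my_get_dice_loss(preds, target)
    print(loss.shape)                                        # torch.Size([2, 1]) -- not reduced to a scalar

    bad_target = target.clone()
    bad_target[bad_target == 0] = -1                         # -1 placeholder left in the target
    print(my_get_dice_loss(preds, bad_target).mean())        # distorted by the negative values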
Have you solved the problem?
How can I use nnUNetTrainerDiceCELoss_noSmooth? Thanks
Hi, I have a problem with training nnUNetv2. When I run my preprocessed dataset (any fold), in the first epoch (as well as in the next ones) I get the following output:
training_log_2023_8_9_14_55_06.txt
I've already checked whether there are any NaNs in my original data - there are none. My data format is .png. I work with single-modality/channel pictures and 2 segmentation labels - they are numbered "0, 1, 2", with 0 being the background.
I preprocessed the data using nnUNetv2_plan_and_preprocess -d DATASET_ID --verify_dataset_integrity. It ran without any problems. After inspecting my original data (btw, I think the masks are properly made; the data type is uint8 and each mask has two unique values: (0,1) or (0,2)), I checked the files located in nnUNet_preprocessed. I haven't found any NaNs there either. There was, however, one abnormality: when I was checking the *_seg.npy files (the segmentation masks which I guess nnU-Net uses in training), some of them (maybe 5% or less) had an additional label "-1". I thought that maybe this was the source of my problems, so I modified the affected files so that they have labels in the range "0, 1, 2" again. This did not help, so now I really don't know what to do. I can share my dataset via Google Drive if it would help.
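For anyone wanting to repeat that check, a small script along these lines (the folder path and dataset name are placeholders, adjust to your own nnUNet_preprocessed layout) lists the unique labels per preprocessed segmentation and flags NaNs:

    import glob
    import numpy as np

    # hypothetical path: point this at your nnUNet_preprocessed dataset folder
    for path in sorted(glob.glob("nnUNet_preprocessed/DatasetXXX_Name/**/*_seg.npy", recursive=True)):
        seg = np.load(path)
        has_nan = np.issubdtype(seg.dtype, np.floating) and np.isnan(seg).any()
        print(path, seg.dtype, np.unique(seg), "<- contains NaN" if has_nan else "")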