MIC-DKFZ / nnUNet

low dice #298

Closed · Hhhhhhhzf closed this issue 4 years ago

Hhhhhhhzf commented 4 years ago

Hi Fabian, I have run into a problem: the Dice stays too low (only about 0.1) even though training has reached epoch 200.

[screenshot: training progress]

I tried the example task Task04_Hippocampus before; it ran well and reached a Dice of 0.90. My dataset comes from CADA-AS (Cerebral Aneurysm Segmentation), as the picture below shows. PS: the filenames were reformatted before training.

[screenshot: dataset files]

And I ran these commands:

    python nnUNet_convert_decathlon_task.py -i MY_FOLDER
    python nnUNet_plan_and_preprocess.py -t 1 --verify_dataset_integrity
    python run_training.py 3d_fullres nnUNetTrainerV2 Task_Name 0

I can't figure out what the underlying problem is, so I need your help. Thank you very much!

Best, zfh

Hhhhhhhzf commented 4 years ago

When I run fold 5, the Dice gets higher, about 0.6; however, it is still not stable.

FabianIsensee commented 4 years ago

Hi, I don't know this dataset. I believe one of my colleagues applied nnU-Net to it and that went quite well. Maybe he can help out, @mibaumgartner? Best, Fabian

mibaumgartner commented 4 years ago

Hi, I ran nnU-Net (3d fullres) on the first task of the challenge (the data and annotations should be the same though :) ).

Fold 0 was slightly tricky because SGD with high momentum did not work out in my run (this resulted in a low Dice during the first few epochs and 0 until epoch 100). This case is also covered in the code: it 'restarts' the training with a lower momentum, which works much better for this fold (it takes around 20 epochs [so epoch 120 in the training] to reach a Dice >0.80). Towards the end of the training it should be >0.90.
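
For reference, the safeguard described here behaves roughly like the following sketch. This is illustrative only, not the actual nnU-Net implementation; the function name and signature are made up:

    import torch

    def maybe_lower_momentum(network: torch.nn.Module,
                             optimizer: torch.optim.SGD,
                             epoch: int,
                             mean_foreground_dice: float) -> None:
        # If the foreground Dice is still 0 at epoch 100, assume the default
        # momentum (0.99) is too high for this dataset: reinitialize the
        # network weights and continue training with momentum 0.95.
        if epoch == 100 and mean_foreground_dice == 0:
            for module in network.modules():
                if hasattr(module, "reset_parameters"):
                    module.reset_parameters()  # reinitialize layer weights
            for group in optimizer.param_groups:
                group["momentum"] = 0.95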

@Hhhhhhhzf Could you please provide the output of the log file at epochs 99, 100 and 101?

Hhhhhhhzf commented 4 years ago

Thank you @FabianIsensee @mibaumgartner. This is my log file:

[screenshot: training log around epoch 100]

FabianIsensee commented 4 years ago

Seems like your Dice was not bad enough to be considered for the restart :-) Did you let the training finish?

Your train loss is very low, so it seems like it is training properly. Really interesting problem.

Have you tried `nnUNetTrainerV2_momentum095`?
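
For anyone landing here: an alternative trainer class is selected through the same training command as above; the task name and fold below are placeholders:

    python run_training.py 3d_fullres nnUNetTrainerV2_momentum095 TASK_NAME 0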

Best, Fabian

mibaumgartner commented 4 years ago

Just checked the loss in my runs and it looks quite different (good catch @FabianIsensee ^^)

Around Epoch 100

2020-06-19 00:24:56.069414: 
epoch:  99 
2020-06-19 00:28:11.925901: train loss : -0.3417 
2020-06-19 00:28:29.115911: validation loss: -0.3713 
2020-06-19 00:28:29.122903: Average global foreground Dice: [0.0] 
2020-06-19 00:28:29.128674: (interpret this as an estimate for the Dice of the different classes. This is not exact.) 
2020-06-19 00:28:30.262686: lr: 0.009095 
2020-06-19 00:28:30.267665: saving scheduled checkpoint file... 
2020-06-19 00:28:30.384209: saving checkpoint... 
2020-06-19 00:28:30.966592: done, saving took 0.69 seconds 
2020-06-19 00:28:31.004621: done 
2020-06-19 00:28:31.010293: This epoch took 214.934983 s

2020-06-19 00:28:31.015848: 
epoch:  100 
2020-06-19 00:31:46.725742: train loss : -0.3656 
2020-06-19 00:32:03.906664: validation loss: -0.3831 
2020-06-19 00:32:03.913099: Average global foreground Dice: [0.0] 
2020-06-19 00:32:03.918790: (interpret this as an estimate for the Dice of the different classes. This is not exact.) 
2020-06-19 00:32:05.050953: lr: 0.009086 
2020-06-19 00:32:05.059271: At epoch 100, the mean foreground Dice was 0. This can be caused by a too high momentum. High momentum (0.99) is good for datasets where it works, but sometimes causes issues such as this one. Momentum has now been reduced to 0.95 and network weights have been reinitialized 
2020-06-19 00:32:05.064525: This epoch took 214.043097 s

2020-06-19 00:32:05.069352: 
epoch:  101 
2020-06-19 00:35:20.960552: train loss : -0.0557 
2020-06-19 00:35:38.139105: validation loss: -0.1227 
2020-06-19 00:35:38.144560: Average global foreground Dice: [0.22831230592952412] 
2020-06-19 00:35:38.149693: (interpret this as an estimate for the Dice of the different classes. This is not exact.) 
2020-06-19 00:35:39.274402: lr: 0.009077 
2020-06-19 00:35:39.280252: This epoch took 214.206069 s

End of training

2020-06-21 05:50:58.656016: 
epoch:  996 
2020-06-21 05:54:14.633822: train loss : -0.7910 
2020-06-21 05:54:31.794951: validation loss: -0.7501 
2020-06-21 05:54:31.803046: Average global foreground Dice: [0.9625213401203669] 
2020-06-21 05:54:31.808912: (interpret this as an estimate for the Dice of the different classes. This is not exact.) 
2020-06-21 05:54:33.005782: lr: 5.4e-05 
2020-06-21 05:54:33.011025: This epoch took 214.349874 s

2020-06-21 05:54:33.015804: 
epoch:  997 
2020-06-21 05:57:48.806877: train loss : -0.7712 
2020-06-21 05:58:06.014040: validation loss: -0.7651 
2020-06-21 05:58:06.021035: Average global foreground Dice: [0.9624232414211945] 
2020-06-21 05:58:06.025922: (interpret this as an estimate for the Dice of the different classes. This is not exact.) 
2020-06-21 05:58:07.202935: lr: 3.7e-05 
2020-06-21 05:58:07.319246: saving checkpoint... 
2020-06-21 05:58:08.007002: done, saving took 0.80 seconds 
2020-06-21 05:58:08.052021: This epoch took 215.030863 s

2020-06-21 05:58:08.056761: 
epoch:  998 
2020-06-21 06:01:23.827067: train loss : -0.7674 
2020-06-21 06:01:41.026193: validation loss: -0.7399 
2020-06-21 06:01:41.034431: Average global foreground Dice: [0.9532719028205342] 
2020-06-21 06:01:41.040042: (interpret this as an estimate for the Dice of the different classes. This is not exact.) 
2020-06-21 06:01:42.167833: lr: 2e-05 
2020-06-21 06:01:42.173170: This epoch took 214.111457 s

Hhhhhhhzf commented 4 years ago

I see... Maybe I was too impatient: my max epoch was set to 200, so training ended at epoch 200. Let me try again and get back to you. Thank you very much!

FabianIsensee commented 4 years ago

Any update on this? Does the training recover? Given the training loss in your case, that would honestly surprise me. Maybe you have an error in the dataset preparation? Would @mibaumgartner be willing to share his dataset conversion code (you can do a pull request if you like, or push it directly to our internal master)?

Hhhhhhhzf commented 4 years ago

Thank you for your attention, @FabianIsensee. It is still training. As for a possible error in the dataset: I used the same dataset for fold 5 and it ran well. Does that mean the dataset preparation is OK?

FabianIsensee commented 4 years ago

Hi, I don't know what is going on and I don't have the dataset at hand. If @mibaumgartner provides his conversion code I can have a look, but right now this would take too much time. Your training loss is almost as low as it can get, so I don't think the Dice will recover. As to what the problem could be, I don't know. Michael's training values (posted above) look much more reasonable. What you can do in the meantime is also try nnUNetTrainerV2_momentum09 as the trainer class. Best, Fabian

mibaumgartner commented 4 years ago

Unfortunately, I exported everything from a different format, so my script won't work for nnU-Net.

Hhhhhhhzf commented 4 years ago

@FabianIsensee I am sorry, I have been very busy these days. I have tried nnUNetTrainerV2_momentum09 as my trainer class; however, it still ran badly. Here is the picture:

[screenshot: training progress]

I don't know what the key problem is. It seems that I need to learn more about deep learning during my postgraduate studies and then tackle it. Best.

Hhhhhhhzf commented 4 years ago

Details:

epoch:  995 
2020-09-02 03:20:37.179147: train loss : -0.2678 
2020-09-02 03:21:02.593212: validation loss: 0.3344 
2020-09-02 03:21:02.594055: Average global foreground Dice: [0.05224742860037623] 
2020-09-02 03:21:02.594140: (interpret this as an estimate for the Dice of the different classes. This is not exact.) 
2020-09-02 03:21:03.529673: lr: 6.9e-05 
2020-09-02 03:21:03.529899: This epoch took 397.650507 s

2020-09-02 03:21:03.529969: 
epoch:  996 
Gradient overflow.  Skipping step, loss scaler 0 reducing loss scale to 262144.0 
2020-09-02 03:27:14.936336: train loss : -0.2767 
2020-09-02 03:27:40.383571: validation loss: 0.2705 
2020-09-02 03:27:40.384414: Average global foreground Dice: [0.06883198757239095] 
2020-09-02 03:27:40.384500: (interpret this as an estimate for the Dice of the different classes. This is not exact.) 
2020-09-02 03:27:41.318646: lr: 5.4e-05 
2020-09-02 03:27:41.318860: This epoch took 397.788825 s

2020-09-02 03:27:41.318974: 
epoch:  997 
2020-09-02 03:33:52.429279: train loss : -0.2604 
2020-09-02 03:34:17.837051: validation loss: 0.4319 
2020-09-02 03:34:17.837967: Average global foreground Dice: [0.038072757293865916] 
2020-09-02 03:34:17.838058: (interpret this as an estimate for the Dice of the different classes. This is not exact.) 
2020-09-02 03:34:18.780670: lr: 3.7e-05 
2020-09-02 03:34:18.780899: This epoch took 397.461854 s

2020-09-02 03:34:18.780972: 
epoch:  998 
2020-09-02 03:40:29.885709: train loss : -0.2775 
2020-09-02 03:40:55.299693: validation loss: 0.3375 
2020-09-02 03:40:55.300536: Average global foreground Dice: [0.05676913047925399] 
2020-09-02 03:40:55.300725: (interpret this as an estimate for the Dice of the different classes. This is not exact.) 
2020-09-02 03:40:56.245049: lr: 2e-05 
2020-09-02 03:40:56.245270: This epoch took 397.464229 s

2020-09-02 03:40:56.245341: 
epoch:  999 
2020-09-02 03:47:06.995399: train loss : -0.2699 
2020-09-02 03:47:32.548230: validation loss: 0.3940 
2020-09-02 03:47:32.549515: Average global foreground Dice: [0.04414821365246885] 
2020-09-02 03:47:32.549712: (interpret this as an estimate for the Dice of the different classes. This is not exact.) 
2020-09-02 03:47:33.640711: lr: 0.0 
2020-09-02 03:47:33.640894: saving scheduled checkpoint file... 
2020-09-02 03:47:33.724259: saving checkpoint... 
2020-09-02 03:47:37.348058: done, saving took 3.71 seconds 
2020-09-02 03:47:37.364856: done 
2020-09-02 03:47:37.365111: This epoch took 401.119699 s

2020-09-02 03:47:37.476116: saving checkpoint... 
2020-09-02 03:47:37.761225: done, saving took 0.40 seconds 
/home/hezhenfeng/local/nnUNet/nnunet/training/network_training/nnUNetTrainer.py:660: RuntimeWarning: invalid value encountered in double_scalars 
  global_dc_per_class = [i for i in [2 * i / (2 * i + j + k) for i, j, k in ...
Segmentation_006 (2, 220, 256, 256) 
debug: mirroring True mirror_axes (0, 1, 2) 
step_size: 0.5 
do mirror: True 
data shape: (1, 220, 256, 256) 
patch size: [128 128 128] 
steps (x, y, and z): [[0, 46, 92], [0, 64, 128], [0, 64, 128]] 
number of tiles: 27 
computing Gaussian 
prediction done 
Segmentation_010 (2, 220, 256, 256) 
debug: mirroring True mirror_axes (0, 1, 2) 
step_size: 0.5 
do mirror: True 
data shape: (1, 220, 256, 256) 
patch size: [128 128 128] 
steps (x, y, and z): [[0, 46, 92], [0, 64, 128], [0, 64, 128]] 
number of tiles: 27 
using precomputed Gaussian 
prediction done 
2020-09-02 03:49:28.163554: finished prediction 
2020-09-02 03:49:28.164000: evaluation of raw predictions 
2020-09-02 03:49:30.000625: determining postprocessing 
/home/hezhenfeng/local/nnUNet/nnunet/evaluation/evaluator.py:381: RuntimeWarning: Mean of empty slice 
  all_scores["mean"][label][score] = float(np.nanmean(all_scores["mean"][label][score])) 
Foreground vs background before: nan after: nan 
1 before: 0.0685116698921833 after: 0.06743782807293305 
2 before: nan after: nan 
3 before: nan after: nan 
done 
for which classes: [] min_object_sizes None 
done 
force_separate_z: None interpolation order: 3 
no resampling necessary 
force_separate_z: None interpolation order: 3 
no resampling necessary
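
A note on the nan values above: the global Dice per class is computed with the formula printed in the first RuntimeWarning, 2·TP / (2·TP + FP + FN). Classes 2 and 3 are never present in the labels and never predicted, so the denominator is 0 and the result is nan, which is what triggers both warnings. A minimal sketch of that computation (the counts below are made up for illustration):

    import numpy as np

    def global_dice_per_class(tp, fp, fn):
        # Global foreground Dice per class: 2*TP / (2*TP + FP + FN).
        # 0/0 yields nan for classes that never occur and are never predicted,
        # which is exactly what produces the RuntimeWarning in the log above.
        tp, fp, fn = map(np.asarray, (tp, fp, fn))
        with np.errstate(invalid="ignore"):
            return 2 * tp / (2 * tp + fp + fn)

    print(global_dice_per_class([120, 0, 0], [800, 0, 0], [2500, 0, 0]))
    # -> approximately [0.0678  nan  nan]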

mibaumgartner commented 4 years ago

Did you prepare the dataset with a Python script? If you like, I could have a look at it :) I feel like there is something off with the data :)

Hhhhhhhzf commented 4 years ago

My script only changes the names of the original files, to something like TASKNAME_001.nii.gz (a sketch of such a script is below).
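
A renaming script of this kind might look roughly like the following sketch. The CADA source filenames (`AXXX_orig.nii.gz`, `AXXX_masks.nii.gz`) and folder layout are assumptions; note that this version copies the AXXX_masks.nii.gz files, which turns out to matter below:

    import shutil
    from pathlib import Path

    src = Path("CADA")  # original download (assumed layout)
    dst = Path("nnUNet_raw_data/Task001_Segmentation")
    (dst / "imagesTr").mkdir(parents=True, exist_ok=True)
    (dst / "labelsTr").mkdir(parents=True, exist_ok=True)

    for i, img in enumerate(sorted(src.glob("*_orig.nii.gz")), start=1):
        case = img.name.replace("_orig.nii.gz", "")
        # images need the _0000 modality suffix, labels do not
        shutil.copy(img, dst / "imagesTr" / f"Segmentation_{i:03d}_0000.nii.gz")
        shutil.copy(src / f"{case}_masks.nii.gz",
                    dst / "labelsTr" / f"Segmentation_{i:03d}.nii.gz")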

[screenshot: renamed dataset files]

Hhhhhhhzf commented 4 years ago

What confused me was that the Dice value was about 0.9 when I used the command:

    python run_training.py 3d_fullres nnUNetTrainerV2 Task001_Segmentation 5

Here is the log:

2020-08-21 07:10:12.817226: 
epoch:  144 
2020-08-21 07:16:20.552735: train loss : -0.5603 
2020-08-21 07:16:45.832160: validation loss: -0.5135 
2020-08-21 07:16:45.833006: Average global foreground Dice: [0.9271051271454822, 0.236983842010772, 0.0] 
2020-08-21 07:16:45.833125: (interpret this as an estimate for the Dice of the different classes. This is not exact.) 
2020-08-21 07:16:46.633925: lr: 0.000468 
2020-08-21 07:16:46.634120: This epoch took 393.816834 s

2020-08-21 07:16:46.634187: 
epoch:  145 
2020-08-21 07:22:54.057799: train loss : -0.5511 
2020-08-21 07:23:19.394303: validation loss: -0.5233 
2020-08-21 07:23:19.395116: Average global foreground Dice: [0.9001284406804405, 0.21346504885643725, 0.0] 
2020-08-21 07:23:19.395198: (interpret this as an estimate for the Dice of the different classes. This is not exact.) 
2020-08-21 07:23:20.141460: lr: 0.000383 
2020-08-21 07:23:20.141664: This epoch took 393.507415 s

2020-08-21 07:23:20.141766: 
epoch:  146 
2020-08-21 07:29:27.737806: train loss : -0.5545 
2020-08-21 07:29:53.152463: validation loss: -0.5187 
2020-08-21 07:29:53.153816: Average global foreground Dice: [0.8579525824299628, 0.16746909564085882, 0.0] 
2020-08-21 07:29:53.153906: (interpret this as an estimate for the Dice of the different classes. This is not exact.) 
2020-08-21 07:29:54.006086: lr: 0.000296 
2020-08-21 07:29:54.006355: This epoch took 393.864521 s

2020-08-21 07:29:54.006427: 
epoch:  147 
2020-08-21 07:36:01.248874: train loss : -0.5542 
2020-08-21 07:36:26.715225: validation loss: -0.5354 
2020-08-21 07:36:26.716350: Average global foreground Dice: [0.9051032543905919, 0.3808604356483437, 0.0] 
2020-08-21 07:36:26.716586: (interpret this as an estimate for the Dice of the different classes. This is not exact.) 
2020-08-21 07:36:27.498340: lr: 0.000205 
2020-08-21 07:36:27.498562: This epoch took 393.492071 s

2020-08-21 07:36:27.498670: 
epoch:  148 
2020-08-21 07:42:34.574312: train loss : -0.5554 
2020-08-21 07:42:59.945459: validation loss: -0.5234 
2020-08-21 07:42:59.946088: Average global foreground Dice: [0.8981972920696325, 0.32827004219409284, 0.0] 
2020-08-21 07:42:59.946168: (interpret this as an estimate for the Dice of the different classes. This is not exact.) 
2020-08-21 07:43:00.683812: lr: 0.00011 
2020-08-21 07:43:00.684023: This epoch took 393.185289 s

2020-08-21 07:43:00.684091: 
epoch:  149 
2020-08-21 07:49:07.817071: train loss : -0.5578 
2020-08-21 07:49:33.031224: validation loss: -0.5377 
2020-08-21 07:49:33.031980: Average global foreground Dice: [0.9249320755183208, 0.2328559027777778, 0.0] 
2020-08-21 07:49:33.032063: (interpret this as an estimate for the Dice of the different classes. This is not exact.) 
2020-08-21 07:49:33.801314: lr: 0.0 
2020-08-21 07:49:33.801486: saving scheduled checkpoint file... 
2020-08-21 07:49:33.877927: saving checkpoint... 
2020-08-21 07:49:36.168287: done, saving took 2.37 seconds 
2020-08-21 07:49:36.181818: done 
2020-08-21 07:49:36.181983: This epoch took 395.497816 s

2020-08-21 07:49:36.268898: saving checkpoint... 
2020-08-21 07:49:36.548893: done, saving took 0.37 seconds 
/home/hezhenfeng/local/nnUNet/nnunet/training/network_training/nnUNetTrainer.py:660: RuntimeWarning: invalid value encountered in double_scalars 
  global_dc_per_class = [i for i in [2 * i / (2 * i + j + k) for i, j, k in ...
Segmentation_006 (2, 220, 256, 256) 
debug: mirroring True mirror_axes (0, 1, 2) 
step_size: 0.5 
do mirror: True 
data shape: (1, 220, 256, 256) 
patch size: [128 128 128] 
steps (x, y, and z): [[0, 46, 92], [0, 64, 128], [0, 64, 128]] 
number of tiles: 27 
computing Gaussian 
prediction done

The data source was the same...

mibaumgartner commented 4 years ago

Are those files located in nnUNet_raw_data/TASK/imagesTr? Your directories should look like this (if you use TASKNAME as a prefix for your files):

    imagesTr
        TASKNAME_001_0000.nii.gz
        TASKNAME_002_0000.nii.gz
        ...
    labelsTr
        TASKNAME_001.nii.gz
        TASKNAME_002.nii.gz
        ...

Note the _0000 ending for the images (this is important) and make sure to use the AXXX_masks.nii.gz and NOT the AXXX_labeled_masks.nii.gz files from the original folder (the latter are instance segmentations; using them would explain a failed training).
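
A quick way to catch this mistake is to inspect the label values: a semantic mask should contain only the classes declared in dataset.json, whereas an instance segmentation contains one integer per aneurysm. A minimal check with nibabel (the filename is a placeholder):

    import nibabel as nib
    import numpy as np

    seg = nib.load("labelsTr/TASKNAME_001.nii.gz").get_fdata()
    # A semantic mask yields e.g. [0. 1.]; an instance segmentation
    # (AXXX_labeled_masks.nii.gz) yields [0. 1. 2. 3. ...] instead.
    print(np.unique(seg))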

Hhhhhhhzf commented 4 years ago

My directories and filenames are OK; however, I used the AXXX_labeled_masks.nii.gz files... I will try it again now. Thank you!

Hhhhhhhzf commented 4 years ago

@mibaumgartner Your advice was spot on and my problem is solved. When I used the AXXX_masks.nii.gz files, the Dice value reached about 0.95. Done. Thank you very much!

FabianIsensee commented 4 years ago

Glad to hear it works now!