Closed Hhhhhhhzf closed 4 years ago
When I run fold 5, the Dice gets higher, about 0.6. However, it is still not steady.
Hi, I don't know this dataset. I believe one of my colleagues applied nnU-Net to it and that went quite well. Maybe he can help out @mibaumgartner ? Best, Fabian
Hi, I ran nnU-Net (3d fullres) on the first task of the challenge (the data and annotations should be the same though :) ).
Fold 0 was slightly tricky because SGD with high momentum did not work out in my run (this resulted in a low Dice during the first few epochs and 0 until epoch 100). This case is also covered in the code: it 'restarts' the training with a lower momentum, which works much better for this fold (it takes around 20 epochs [so epoch 120 in the training] to reach a Dice >0.80). Towards the end of the training it should be >0.90.
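The restart heuristic described above can be sketched roughly as follows. This is a minimal sketch with illustrative names, not the actual nnU-Net trainer code; in nnU-Net the check lives inside the trainer class and also reinitializes the network weights:

```python
def maybe_lower_momentum(epoch, mean_foreground_dice, hyperparams):
    """If the foreground Dice is still 0 at epoch 100, assume the high
    momentum (0.99) destabilized SGD: lower it to 0.95 and signal the
    caller to reinitialize the network weights and continue training."""
    if epoch == 100 and mean_foreground_dice == 0.0:
        hyperparams["momentum"] = 0.95
        return True  # caller reinitializes weights (the "restart")
    return False

# example: a run whose Dice is stuck at 0 when epoch 100 is reached
hp = {"momentum": 0.99}
restarted = maybe_lower_momentum(100, 0.0, hp)
print(restarted, hp["momentum"])  # True 0.95
```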
@Hhhhhhhzf Could you please provide the output of the log file at epoch (99, 100, 101)?
Thank you @FabianIsensee @mibaumgartner. This is my log file:
Seems like your dice was not bad enough to be considered for the restart :-) Did you let the training finish?
Your train loss is very low, so it seems like it is training properly. Really interesting problem.
Have you tried nnUNetTrainerV2_momentum095?
Best, Fabian
Just checked the loss in my runs and it looks quite different (good catch @FabianIsensee ^^)
Around Epoch 100
2020-06-19 00:24:56.069414:
epoch: 99
2020-06-19 00:28:11.925901: train loss : -0.3417
2020-06-19 00:28:29.115911: validation loss: -0.3713
2020-06-19 00:28:29.122903: Average global foreground Dice: [0.0]
2020-06-19 00:28:29.128674: (interpret this as an estimate for the Dice of the different classes. This is not exact.)
2020-06-19 00:28:30.262686: lr: 0.009095
2020-06-19 00:28:30.267665: saving scheduled checkpoint file...
2020-06-19 00:28:30.384209: saving checkpoint...
2020-06-19 00:28:30.966592: done, saving took 0.69 seconds
2020-06-19 00:28:31.004621: done
2020-06-19 00:28:31.010293: This epoch took 214.934983 s
2020-06-19 00:28:31.015848:
epoch: 100
2020-06-19 00:31:46.725742: train loss : -0.3656
2020-06-19 00:32:03.906664: validation loss: -0.3831
2020-06-19 00:32:03.913099: Average global foreground Dice: [0.0]
2020-06-19 00:32:03.918790: (interpret this as an estimate for the Dice of the different classes. This is not exact.)
2020-06-19 00:32:05.050953: lr: 0.009086
2020-06-19 00:32:05.059271: At epoch 100, the mean foreground Dice was 0. This can be caused by a too high momentum. High momentum (0.99) is good for datasets where it works, but sometimes causes issues such as this one. Momentum has now been reduced to 0.95 and network weights have been reinitialized
2020-06-19 00:32:05.064525: This epoch took 214.043097 s
2020-06-19 00:32:05.069352:
epoch: 101
2020-06-19 00:35:20.960552: train loss : -0.0557
2020-06-19 00:35:38.139105: validation loss: -0.1227
2020-06-19 00:35:38.144560: Average global foreground Dice: [0.22831230592952412]
2020-06-19 00:35:38.149693: (interpret this as an estimate for the Dice of the different classes. This is not exact.)
2020-06-19 00:35:39.274402: lr: 0.009077
2020-06-19 00:35:39.280252: This epoch took 214.206069 s
End of training
2020-06-21 05:50:58.656016:
epoch: 996
2020-06-21 05:54:14.633822: train loss : -0.7910
2020-06-21 05:54:31.794951: validation loss: -0.7501
2020-06-21 05:54:31.803046: Average global foreground Dice: [0.9625213401203669]
2020-06-21 05:54:31.808912: (interpret this as an estimate for the Dice of the different classes. This is not exact.)
2020-06-21 05:54:33.005782: lr: 5.4e-05
2020-06-21 05:54:33.011025: This epoch took 214.349874 s
2020-06-21 05:54:33.015804:
epoch: 997
2020-06-21 05:57:48.806877: train loss : -0.7712
2020-06-21 05:58:06.014040: validation loss: -0.7651
2020-06-21 05:58:06.021035: Average global foreground Dice: [0.9624232414211945]
2020-06-21 05:58:06.025922: (interpret this as an estimate for the Dice of the different classes. This is not exact.)
2020-06-21 05:58:07.202935: lr: 3.7e-05
2020-06-21 05:58:07.319246: saving checkpoint...
2020-06-21 05:58:08.007002: done, saving took 0.80 seconds
2020-06-21 05:58:08.052021: This epoch took 215.030863 s
2020-06-21 05:58:08.056761:
epoch: 998
2020-06-21 06:01:23.827067: train loss : -0.7674
2020-06-21 06:01:41.026193: validation loss: -0.7399
2020-06-21 06:01:41.034431: Average global foreground Dice: [0.9532719028205342]
2020-06-21 06:01:41.040042: (interpret this as an estimate for the Dice of the different classes. This is not exact.)
2020-06-21 06:01:42.167833: lr: 2e-05
2020-06-21 06:01:42.173170: This epoch took 214.111457 s
I see... maybe I was too hasty... my max epoch was 200, so it ended at epoch 200. Let me try again and get back to you. Thank you very much!
Any update on this? Does the training recover? From the training loss in your case that would honestly surprise me. Maybe you have an error in the dataset preparation? Would @mibaumgartner be willing to share his dataset conversion code (you can do a pull request if you like or directly push it to our internal master)?
Thank you for your attention, @FabianIsensee. It is still training now. As for an error in the dataset: I used the same dataset on fold 5 and it ran well. So does that mean the dataset preparation is OK?
Hi,
I don't know what is going on and I don't have the dataset at hand. If @mibaumgartner provides his conversion code I can have a look, but right now this would take too much time. Your training loss is almost perfectly low, so I don't think it will recover. As to what the problem could be, I don't know. Michael's training values (he posted above) look much more reasonable.
What you can do in the meantime is also try nnUNetTrainerV2_momentum09 as the trainer class.
Best,
Fabian
Unfortunately, I exported everything from a different format so the script won't work for nnunet.
@FabianIsensee I am so sorry, I have been too busy these days. I have tried nnUNetTrainerV2_momentum09 as my trainer class; however, it ran badly. Here is the picture.
I don't know what the key problem is. It seems that I need to learn more about deep learning during my postgraduate period and then deal with it. Best.
detail
epoch: 995
2020-09-02 03:20:37.179147: train loss : -0.2678
2020-09-02 03:21:02.593212: validation loss: 0.3344
2020-09-02 03:21:02.594055: Average global foreground Dice: [0.05224742860037623]
2020-09-02 03:21:02.594140: (interpret this as an estimate for the Dice of the different classes. This is not exact.)
2020-09-02 03:21:03.529673: lr: 6.9e-05
2020-09-02 03:21:03.529899: This epoch took 397.650507 s
2020-09-02 03:21:03.529969:
epoch: 996
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 262144.0
2020-09-02 03:27:14.936336: train loss : -0.2767
2020-09-02 03:27:40.383571: validation loss: 0.2705
2020-09-02 03:27:40.384414: Average global foreground Dice: [0.06883198757239095]
2020-09-02 03:27:40.384500: (interpret this as an estimate for the Dice of the different classes. This is not exact.)
2020-09-02 03:27:41.318646: lr: 5.4e-05
2020-09-02 03:27:41.318860: This epoch took 397.788825 s
2020-09-02 03:27:41.318974:
epoch: 997
2020-09-02 03:33:52.429279: train loss : -0.2604
2020-09-02 03:34:17.837051: validation loss: 0.4319
2020-09-02 03:34:17.837967: Average global foreground Dice: [0.038072757293865916]
2020-09-02 03:34:17.838058: (interpret this as an estimate for the Dice of the different classes. This is not exact.)
2020-09-02 03:34:18.780670: lr: 3.7e-05
2020-09-02 03:34:18.780899: This epoch took 397.461854 s
2020-09-02 03:34:18.780972:
epoch: 998
2020-09-02 03:40:29.885709: train loss : -0.2775
2020-09-02 03:40:55.299693: validation loss: 0.3375
2020-09-02 03:40:55.300536: Average global foreground Dice: [0.05676913047925399]
2020-09-02 03:40:55.300725: (interpret this as an estimate for the Dice of the different classes. This is not exact.)
2020-09-02 03:40:56.245049: lr: 2e-05
2020-09-02 03:40:56.245270: This epoch took 397.464229 s
2020-09-02 03:40:56.245341:
epoch: 999
2020-09-02 03:47:06.995399: train loss : -0.2699
2020-09-02 03:47:32.548230: validation loss: 0.3940
2020-09-02 03:47:32.549515: Average global foreground Dice: [0.04414821365246885]
2020-09-02 03:47:32.549712: (interpret this as an estimate for the Dice of the different classes. This is not exact.)
2020-09-02 03:47:33.640711: lr: 0.0
2020-09-02 03:47:33.640894: saving scheduled checkpoint file...
2020-09-02 03:47:33.724259: saving checkpoint...
2020-09-02 03:47:37.348058: done, saving took 3.71 seconds
2020-09-02 03:47:37.364856: done
2020-09-02 03:47:37.365111: This epoch took 401.119699 s
2020-09-02 03:47:37.476116: saving checkpoint...
2020-09-02 03:47:37.761225: done, saving took 0.40 seconds
/home/hezhenfeng/local/nnUNet/nnunet/training/network_training/nnUNetTrainer.py:660: RuntimeWarning: invalid value encountered in double_scalars
  global_dc_per_class = [i for i in [2 * i / (2 * i + j + k) for i, j, k in
Segmentation_006 (2, 220, 256, 256)
debug: mirroring True mirror_axes (0, 1, 2)
step_size: 0.5
do mirror: True
data shape: (1, 220, 256, 256)
patch size: [128 128 128]
steps (x, y, and z): [[0, 46, 92], [0, 64, 128], [0, 64, 128]]
number of tiles: 27
computing Gaussian
prediction done
Segmentation_010 (2, 220, 256, 256)
debug: mirroring True mirror_axes (0, 1, 2)
step_size: 0.5
do mirror: True
data shape: (1, 220, 256, 256)
patch size: [128 128 128]
steps (x, y, and z): [[0, 46, 92], [0, 64, 128], [0, 64, 128]]
number of tiles: 27
using precomputed Gaussian
prediction done
2020-09-02 03:49:28.163554: finished prediction
2020-09-02 03:49:28.164000: evaluation of raw predictions
2020-09-02 03:49:30.000625: determining postprocessing
/home/hezhenfeng/local/nnUNet/nnunet/evaluation/evaluator.py:381: RuntimeWarning: Mean of empty slice
  all_scores["mean"][label][score] = float(np.nanmean(all_scores["mean"][label][score]))
Foreground vs background before: nan after: nan
1 before: 0.0685116698921833 after: 0.06743782807293305
2 before: nan after: nan
3 before: nan after: nan
done
for which classes: []
min_object_sizes None
done
force_separate_z: None interpolation order: 3
no resampling necessary
force_separate_z: None interpolation order: 3
no resampling necessary
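As an aside, the RuntimeWarning in the log above comes from the global per-class Dice estimate, 2*tp / (2*tp + fp + fn): when a class never occurs, the denominator is 0 and the result is NaN (hence the nan entries for classes 2 and 3). A minimal sketch of that computation (function name is illustrative, not nnU-Net's actual code):

```python
import math

def global_dice_per_class(tp, fp, fn):
    """Global Dice per class from accumulated true positives, false
    positives and false negatives; NaN when a class is entirely absent."""
    dices = []
    for i, j, k in zip(tp, fp, fn):
        denom = 2 * i + j + k
        dices.append(2 * i / denom if denom > 0 else math.nan)
    return dices

# class 0 present, class 1 absent everywhere -> [~0.87, nan]
print(global_dice_per_class([50, 0], [10, 0], [5, 0]))
```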
Did you prepare the dataset with a Python script? If you like, I could have a look at it :) I feel like there is something off with the data :)
My script only changes the names of the original data, e.g. to TASKNAME_001.nii.gz.
What confused me was that the Dice value was about 0.9 when I used the command: python run_training.py 3d_fullres nnUNetTrainerV2 Task001_Segmentation 5
here is the log:
2020-08-21 07:10:12.817226:
epoch: 144
2020-08-21 07:16:20.552735: train loss : -0.5603
2020-08-21 07:16:45.832160: validation loss: -0.5135
2020-08-21 07:16:45.833006: Average global foreground Dice: [0.9271051271454822, 0.236983842010772, 0.0]
2020-08-21 07:16:45.833125: (interpret this as an estimate for the Dice of the different classes. This is not exact.)
2020-08-21 07:16:46.633925: lr: 0.000468
2020-08-21 07:16:46.634120: This epoch took 393.816834 s
2020-08-21 07:16:46.634187:
epoch: 145
2020-08-21 07:22:54.057799: train loss : -0.5511
2020-08-21 07:23:19.394303: validation loss: -0.5233
2020-08-21 07:23:19.395116: Average global foreground Dice: [0.9001284406804405, 0.21346504885643725, 0.0]
2020-08-21 07:23:19.395198: (interpret this as an estimate for the Dice of the different classes. This is not exact.)
2020-08-21 07:23:20.141460: lr: 0.000383
2020-08-21 07:23:20.141664: This epoch took 393.507415 s
2020-08-21 07:23:20.141766:
epoch: 146
2020-08-21 07:29:27.737806: train loss : -0.5545
2020-08-21 07:29:53.152463: validation loss: -0.5187
2020-08-21 07:29:53.153816: Average global foreground Dice: [0.8579525824299628, 0.16746909564085882, 0.0]
2020-08-21 07:29:53.153906: (interpret this as an estimate for the Dice of the different classes. This is not exact.)
2020-08-21 07:29:54.006086: lr: 0.000296
2020-08-21 07:29:54.006355: This epoch took 393.864521 s
2020-08-21 07:29:54.006427:
epoch: 147
2020-08-21 07:36:01.248874: train loss : -0.5542
2020-08-21 07:36:26.715225: validation loss: -0.5354
2020-08-21 07:36:26.716350: Average global foreground Dice: [0.9051032543905919, 0.3808604356483437, 0.0]
2020-08-21 07:36:26.716586: (interpret this as an estimate for the Dice of the different classes. This is not exact.)
2020-08-21 07:36:27.498340: lr: 0.000205
2020-08-21 07:36:27.498562: This epoch took 393.492071 s
2020-08-21 07:36:27.498670:
epoch: 148
2020-08-21 07:42:34.574312: train loss : -0.5554
2020-08-21 07:42:59.945459: validation loss: -0.5234
2020-08-21 07:42:59.946088: Average global foreground Dice: [0.8981972920696325, 0.32827004219409284, 0.0]
2020-08-21 07:42:59.946168: (interpret this as an estimate for the Dice of the different classes. This is not exact.)
2020-08-21 07:43:00.683812: lr: 0.00011
2020-08-21 07:43:00.684023: This epoch took 393.185289 s
2020-08-21 07:43:00.684091:
epoch: 149
2020-08-21 07:49:07.817071: train loss : -0.5578
2020-08-21 07:49:33.031224: validation loss: -0.5377
2020-08-21 07:49:33.031980: Average global foreground Dice: [0.9249320755183208, 0.2328559027777778, 0.0]
2020-08-21 07:49:33.032063: (interpret this as an estimate for the Dice of the different classes. This is not exact.)
2020-08-21 07:49:33.801314: lr: 0.0
2020-08-21 07:49:33.801486: saving scheduled checkpoint file...
2020-08-21 07:49:33.877927: saving checkpoint...
2020-08-21 07:49:36.168287: done, saving took 2.37 seconds
2020-08-21 07:49:36.181818: done
2020-08-21 07:49:36.181983: This epoch took 395.497816 s
2020-08-21 07:49:36.268898: saving checkpoint...
2020-08-21 07:49:36.548893: done, saving took 0.37 seconds
/home/hezhenfeng/local/nnUNet/nnunet/training/network_training/nnUNetTrainer.py:660: RuntimeWarning: invalid value encountered in double_scalars
  global_dc_per_class = [i for i in [2 * i / (2 * i + j + k) for i, j, k in
Segmentation_006 (2, 220, 256, 256)
debug: mirroring True mirror_axes (0, 1, 2)
step_size: 0.5
do mirror: True
data shape: (1, 220, 256, 256)
patch size: [128 128 128]
steps (x, y, and z): [[0, 46, 92], [0, 64, 128], [0, 64, 128]]
number of tiles: 27
computing Gaussian
prediction done
The data source was the same...
Are those files located in nnUNet_raw_data/TASK/imagesTr?
Your directories should look like this (if you use TASKNAME as a prefix for your files):
imagesTr
TASKNAME_001_0000.nii.gz
TASKNAME_002_0000.nii.gz
...
labelsTr
TASKNAME_001.nii.gz
TASKNAME_002.nii.gz
...
Note the _0000 ending for the images (this is important), and make sure to use the AXXX_masks.nii.gz and NOT the AXXX_labeled_masks.nii.gz from the original folder (the latter are instance segmentations; your training probably failed if you used those).
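A minimal conversion sketch for the layout above (function and argument names are hypothetical; only the _0000 suffix and the masks-vs-labeled_masks distinction come from this thread):

```python
import shutil
from pathlib import Path

def add_case(image_file, mask_file, task_dir, case_num, taskname="TASKNAME"):
    """Copy one case into nnU-Net's expected raw-data layout.

    mask_file should be the binary AXXX_masks.nii.gz, NOT the instance
    segmentation AXXX_labeled_masks.nii.gz."""
    task_dir = Path(task_dir)
    (task_dir / "imagesTr").mkdir(parents=True, exist_ok=True)
    (task_dir / "labelsTr").mkdir(parents=True, exist_ok=True)
    stem = f"{taskname}_{case_num:03d}"
    # images carry the _0000 modality suffix, labels do not
    shutil.copy(image_file, task_dir / "imagesTr" / f"{stem}_0000.nii.gz")
    shutil.copy(mask_file, task_dir / "labelsTr" / f"{stem}.nii.gz")
```

For a single-modality dataset like this one, every image gets exactly one _0000 file; multi-modality datasets would add _0001, _0002, and so on.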
My directories and filenames are OK; however, I used the AXXX_labeled_masks.nii.gz...
I will try it again now.
Thank you.
@mibaumgartner Your advice was spot on and my problem is solved. When I used the AXXX_masks.nii.gz, the Dice value reached about 0.95.
Done.
Thank you very much!
Glad to hear it works now!
Hi Fabian, I have come across this problem: the Dice is too low (only about 0.1) even though the epoch count reaches 200.
I tried the example task, Task04_Hippocampus, before. It ran well, with a Dice of 0.90. My dataset comes from CADA-AS: Cerebral Aneurysm Segmentation, as the picture below shows. PS: the filenames were formatted before training.
And I ran these commands:
python nnUNet_convert_decathlon_task.py -i MY_FOLDER
python nnUNet_plan_and_preprocess.py -t 1 --verify_dataset_integrity
python run_training.py 3d_fullres nnUNetTrainerV2 Task_Name 0
Now I can't solve this problem and don't know what the key issue is, so I need your help. Thank you very much!
Best, zfh