MIC-DKFZ / nnUNet

Apache License 2.0

About training nnUNet in Docker #938

Closed: angryfish9527 closed this issue 2 years ago

angryfish9527 commented 2 years ago

I trained the model in Docker, but I get the following error.

2022-02-09 10:29:56.246803: epoch: 452
Traceback (most recent call last):
  File "/home/anaconda3/envs/torchenv/bin/nnUNet_train", line 33, in <module>
    sys.exit(load_entry_point('nnunet', 'console_scripts', 'nnUNet_train')())
  File "/home/linfuliang/efficientnnUNet-master/nnunet/run/run_training.py", line 177, in main
    trainer.run_training()
  File "/home/linfuliang/efficientnnUNet-master/nnunet/training/network_training/nnUNetTrainerV2.py", line 438, in run_training
    ret = super().run_training()
  File "/home/linfuliang/efficientnnUNet-master/nnunet/training/network_training/nnUNetTrainer.py", line 314, in run_training
    super(nnUNetTrainer, self).run_training()
  File "/home/linfuliang/efficientnnUNet-master/nnunet/training/network_training/network_trainer.py", line 463, in run_training
    l = self.run_iteration(self.tr_gen, True)
  File "/home/linfuliang/efficientnnUNet-master/nnunet/training/network_training/nnUNetTrainerV2.py", line 229, in run_iteration
    data_dict = next(data_generator)
  File "/home/anaconda3/envs/torchenv/lib/python3.8/site-packages/batchgenerators/dataloading/multi_threaded_augmenter.py", line 206, in __next__
    item = self.__get_next_item()
  File "/home/anaconda3/envs/torchenv/lib/python3.8/site-packages/batchgenerators/dataloading/multi_threaded_augmenter.py", line 190, in __get_next_item
    raise RuntimeError("MultiThreadedAugmenter.abort_event was set, something went wrong. Maybe one of "
RuntimeError: MultiThreadedAugmenter.abort_event was set, something went wrong. Maybe one of your workers crashed. This is not the actual error message! Look further up your stdout to see what caused the error. Please also check whether your RAM was full

Exception in thread Thread-2:
Traceback (most recent call last):
  File "/home/anaconda3/envs/torchenv/lib/python3.8/threading.py", line 932, in _bootstrap_inner
    self.run()
  File "/home/anaconda3/envs/torchenv/lib/python3.8/threading.py", line 870, in run
    self._target(*self._args, **self._kwargs)
  File "/home/anaconda3/envs/torchenv/lib/python3.8/site-packages/batchgenerators/dataloading/multi_threaded_augmenter.py", line 92, in results_loop
    raise RuntimeError("Abort event was set. So someone died and we should end this madness. \nIMPORTANT: "
RuntimeError: Abort event was set. So someone died and we should end this madness.
IMPORTANT: This is not the actual error message! Look further up to see what caused the error. Please also check whether your RAM was full

I can think of a few possible reasons for this error: 1. I modified Generic_UNet; 2. the RAM available to my Docker container is too small; 3. there is a problem with my dataset.

Can you give me some advice?
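The RAM hypothesis (2.) can be checked from inside the running container with standard tools. A sketch, Linux-only; note that unless --shm-size is set, Docker caps /dev/shm at 64 MB, which background worker processes can exhaust:

```shell
# Shared memory available to the container (Docker default: 64 MB)
df -h /dev/shm

# Total and currently available RAM as seen from inside the container
grep -E 'MemTotal|MemAvailable' /proc/meminfo
```

If /dev/shm shows the 64M default, that is a likely culprit for silently dying data-loading workers.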

FabianIsensee commented 2 years ago

Hey guys, I would really like to help you, but you keep sending me text outputs without the actual error messages...

FabianIsensee commented 2 years ago

This is what a log including an error message looks like:

(/home/fabian/pytorch-v1.8.1_cuda-11.3_cudnn-8.2.0.53/conda_env) fabian@Fabian:~$ nnUNet_train 3d_fullres nnUNetTrainerV2 4 0

Please cite the following paper when using nnUNet:

Isensee, F., Jaeger, P.F., Kohl, S.A.A. et al. "nnU-Net: a self-configuring method for deep learning-based biomedical image segmentation." Nat Methods (2020). https://doi.org/10.1038/s41592-020-01008-z

If you have questions or suggestions, feel free to open an issue at https://github.com/MIC-DKFZ/nnUNet

###############################################
I am running the following nnUNet: 3d_fullres
My trainer class is: <class 'nnunet.training.network_training.nnUNetTrainerV2.nnUNetTrainerV2'>
For that I will be using the following configuration:
num_classes: 2
modalities: {0: 'MRI'}
use_mask_for_norm OrderedDict([(0, False)])
keep_only_largest_region None
min_region_size_per_class None
min_size_per_class None
normalization_schemes OrderedDict([(0, 'nonCT')])
stages...

stage: 0 {'batch_size': 9, 'num_pool_per_axis': [3, 3, 3], 'patch_size': array([40, 56, 40]), 'median_patient_size_in_voxels': array([36, 50, 35]), 'current_spacing': array([1., 1., 1.]), 'original_spacing': array([1., 1., 1.]), 'do_dummy_2D_data_aug': False, 'pool_op_kernel_sizes': [[2, 2, 2], [2, 2, 2], [2, 2, 2]], 'conv_kernel_sizes': [[3, 3, 3], [3, 3, 3], [3, 3, 3], [3, 3, 3]]}

I am using stage 0 from these plans I am using sample dice + CE loss

I am using data from this folder: /media/fabian/data/nnUNet_preprocessed/Task004_Hippocampus/nnUNetData_plans_v2.1
###############################################
loading dataset
loading all case properties
2022-02-10 16:01:22.362138: Using splits from existing split file: /media/fabian/data/nnUNet_preprocessed/Task004_Hippocampus/splits_final.pkl
2022-02-10 16:01:22.363088: The split file contains 5 splits.
2022-02-10 16:01:22.363122: Desired fold for training: 0
2022-02-10 16:01:22.363148: This split has 208 training and 52 validation cases.
unpacking dataset
done
2022-02-10 16:01:23.196144: lr was set to: 0.01
using pin_memory on device 0
using pin_memory on device 0
2022-02-10 16:01:27.231644: epoch: 0
2022-02-10 16:01:42.655267: train loss : -0.2253
2022-02-10 16:01:43.517231: validation loss: -0.4970
2022-02-10 16:01:43.517495: Average global foreground Dice: [0.2783, 0.6396]
2022-02-10 16:01:43.517551: (interpret this as an estimate for the Dice of the different classes. This is not exact.)
2022-02-10 16:01:43.770697: lr was set to: 0.009991
2022-02-10 16:01:43.770777: This epoch took 16.538879 s

2022-02-10 16:01:43.770808: epoch: 1
Exception in background worker 9: [Errno 2] No such file or directory: '/media/fabian/data/nnUNet_preprocessed/Task004_Hippocampus/nnUNetData_plans_v2.1_stage0/hippocampus_003.npz'
Traceback (most recent call last):
  File "/home/fabian/git_repos/dldabg/batchgenerators/dataloading/multi_threaded_augmenter.py", line 46, in producer
    item = next(data_loader)
  File "/home/fabian/git_repos/dldabg/batchgenerators/dataloading/data_loader.py", line 126, in __next__
    return self.generate_train_batch()
  File "/home/fabian/git_repos/nnunet/nnunet/training/dataloading/dataset_loading.py", line 248, in generate_train_batch
    case_all_data = np.load(self._data[i]['data_file'])['data']
  File "/home/fabian/pytorch-v1.8.1_cuda-11.3_cudnn-8.2.0.53/conda_env/lib/python3.9/site-packages/numpy/lib/npyio.py", line 407, in load
    fid = stack.enter_context(open(os_fspath(file), "rb"))
FileNotFoundError: [Errno 2] No such file or directory: '/media/fabian/data/nnUNet_preprocessed/Task004_Hippocampus/nnUNetData_plans_v2.1_stage0/hippocampus_003.npz'
Traceback (most recent call last):
  File "/home/fabian/pytorch-v1.8.1_cuda-11.3_cudnn-8.2.0.53/conda_env/bin/nnUNet_train", line 33, in <module>
    sys.exit(load_entry_point('nnunet', 'console_scripts', 'nnUNet_train')())
  File "/home/fabian/git_repos/nnunet/nnunet/run/run_training.py", line 180, in main
    trainer.run_training()
  File "/home/fabian/git_repos/nnunet/nnunet/training/network_training/nnUNetTrainerV2.py", line 441, in run_training
    ret = super().run_training()
  File "/home/fabian/git_repos/nnunet/nnunet/training/network_training/nnUNetTrainer.py", line 317, in run_training
    super(nnUNetTrainer, self).run_training()
  File "/home/fabian/git_repos/nnunet/nnunet/training/network_training/network_trainer.py", line 456, in run_training
    l = self.run_iteration(self.tr_gen, True)
  File "/home/fabian/git_repos/nnunet/nnunet/training/network_training/nnUNetTrainerV2.py", line 233, in run_iteration
    data_dict = next(data_generator)
  File "/home/fabian/git_repos/dldabg/batchgenerators/dataloading/multi_threaded_augmenter.py", line 206, in __next__
    item = self.__get_next_item()
  File "/home/fabian/git_repos/dldabg/batchgenerators/dataloading/multi_threaded_augmenter.py", line 190, in __get_next_item
    raise RuntimeError("MultiThreadedAugmenter.abort_event was set, something went wrong. Maybe one of "
RuntimeError: MultiThreadedAugmenter.abort_event was set, something went wrong. Maybe one of your workers crashed. This is not the actual error message! Look further up your stdout to see what caused the error. Please also check whether your RAM was full

There is a clear error message in there: FileNotFoundError: [Errno 2] No such file or directory: '/media/fabian/data/nnUNet_preprocessed/Task004_Hippocampus/nnUNetData_plans_v2.1_stage0/hippocampus_003.npz', which tells me what the problem is.
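This search can be mechanized: the real exception line in a Python log starts with the exception class name, so a grep over the saved stdout surfaces it. A sketch; the file name training.log is a placeholder for wherever stdout was redirected, and the demo below runs on an inlined miniature log:

```shell
# On a real run: grep -nE '^[A-Za-z_.]+Error' training.log | head -n 1
# Demo on a miniature inlined log:
printf '%s\n' \
  '2022-02-11 09:57:55.650088: Average global foreground Dice: [0.8885]' \
  "FileNotFoundError: [Errno 2] No such file or directory: 'hippocampus_003.npz'" \
  'RuntimeError: MultiThreadedAugmenter.abort_event was set' \
| grep -nE '^[A-Za-z_.]+Error' | head -n 1
```

The first hit is usually the root cause; the RuntimeError from MultiThreadedAugmenter is only the secondary "a worker died" message.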

angryfish9527 commented 2 years ago

nohup: ignoring input

Please cite the following paper when using nnUNet:

Isensee, F., Jaeger, P.F., Kohl, S.A.A. et al. "nnU-Net: a self-configuring method for deep learning-based biomedical image segmentation." Nat Methods (2020). https://doi.org/10.1038/s41592-020-01008-z

If you have questions or suggestions, feel free to open an issue at https://github.com/MIC-DKFZ/nnUNet

###############################################
I am running the following nnUNet: 3d_fullres
My trainer class is: <class 'nnunet.training.network_training.nnUNetTrainerV2.nnUNetTrainerV2'>
For that I will be using the following configuration:
num_classes: 1
modalities: {0: 'CT'}
use_mask_for_norm OrderedDict([(0, False)])
keep_only_largest_region None
min_region_size_per_class None
min_size_per_class None
normalization_schemes OrderedDict([(0, 'CT')])
stages...

stage: 0 {'batch_size': 2, 'num_pool_per_axis': [4, 5, 5], 'patch_size': array([ 96, 160, 160]), 'median_patient_size_in_voxels': array([134, 268, 268]), 'current_spacing': array([1.19335406, 0.9323074 , 0.9323074 ]), 'original_spacing': array([0.625 , 0.48828101, 0.48828101]), 'do_dummy_2D_data_aug': False, 'pool_op_kernel_sizes': [[2, 2, 2], [2, 2, 2], [2, 2, 2], [2, 2, 2], [1, 2, 2]], 'conv_kernel_sizes': [[3, 3, 3], [3, 3, 3], [3, 3, 3], [3, 3, 3], [3, 3, 3], [3, 3, 3]]}

stage: 1 {'batch_size': 2, 'num_pool_per_axis': [4, 5, 5], 'patch_size': array([ 96, 160, 160]), 'median_patient_size_in_voxels': array([256, 512, 512]), 'current_spacing': array([0.625 , 0.48828101, 0.48828101]), 'original_spacing': array([0.625 , 0.48828101, 0.48828101]), 'do_dummy_2D_data_aug': False, 'pool_op_kernel_sizes': [[2, 2, 2], [2, 2, 2], [2, 2, 2], [2, 2, 2], [1, 2, 2]], 'conv_kernel_sizes': [[3, 3, 3], [3, 3, 3], [3, 3, 3], [3, 3, 3], [3, 3, 3], [3, 3, 3]]}

I am using stage 1 from these plans I am using batch dice + CE loss

I am using data from this folder: /data3/fanggang/cta_vessel/Task305_1000_calandvesselcombine/nnUNetData_plans_v2.1
###############################################
loading dataset
loading all case properties
2022-02-11 09:17:50.082067: Using splits from existing split file: /data3/fanggang/cta_vessel/Task305_1000_calandvesselcombine/splits_final.pkl
2022-02-11 09:17:50.085059: The split file contains 5 splits.
2022-02-11 09:17:50.085175: Desired fold for training: 0
2022-02-11 09:17:50.085224: This split has 718 training and 180 validation cases.
unpacking dataset
done
2022-02-11 09:17:58.257029: lr: 0.01
using pin_memory on device 0
using pin_memory on device 0

2022-02-11 09:18:09.152690: epoch: 0 2022-02-11 09:20:29.082275: train loss : -0.4332 2022-02-11 09:20:40.825708: validation loss: -0.6338 2022-02-11 09:20:40.826378: Average global foreground Dice: [0.8078] 2022-02-11 09:20:40.826494: (interpret this as an estimate for the Dice of the different classes. This is not exact.) 2022-02-11 09:20:41.913556: lr: 0.009991 2022-02-11 09:20:41.913980: This epoch took 152.761253 s

2022-02-11 09:20:41.914099: epoch: 1 2022-02-11 09:22:47.964180: train loss : -0.5639 2022-02-11 09:23:00.741184: validation loss: -0.6749 2022-02-11 09:23:00.741957: Average global foreground Dice: [0.8421] 2022-02-11 09:23:00.742047: (interpret this as an estimate for the Dice of the different classes. This is not exact.) 2022-02-11 09:23:01.992512: lr: 0.009982 2022-02-11 09:23:02.072473: saving checkpoint... 2022-02-11 09:23:02.451307: done, saving took 0.46 seconds 2022-02-11 09:23:02.499585: This epoch took 140.585366 s

2022-02-11 09:23:02.499779: epoch: 2 2022-02-11 09:25:24.599138: train loss : -0.5912 2022-02-11 09:25:38.511476: validation loss: -0.7486 2022-02-11 09:25:38.512246: Average global foreground Dice: [0.8761] 2022-02-11 09:25:38.512336: (interpret this as an estimate for the Dice of the different classes. This is not exact.) 2022-02-11 09:25:39.639386: lr: 0.009973 2022-02-11 09:25:39.699288: saving checkpoint... 2022-02-11 09:25:40.068100: done, saving took 0.43 seconds 2022-02-11 09:25:40.110016: This epoch took 157.610054 s

2022-02-11 09:25:40.110149: epoch: 3 2022-02-11 09:27:45.079270: train loss : -0.5849 2022-02-11 09:27:55.140415: validation loss: -0.6940 2022-02-11 09:27:55.140990: Average global foreground Dice: [0.8462] 2022-02-11 09:27:55.141077: (interpret this as an estimate for the Dice of the different classes. This is not exact.) 2022-02-11 09:27:56.281115: lr: 0.009964 2022-02-11 09:27:56.351244: saving checkpoint... 2022-02-11 09:27:56.750372: done, saving took 0.47 seconds 2022-02-11 09:27:56.809391: This epoch took 136.699201 s

2022-02-11 09:27:56.809584: epoch: 4 2022-02-11 09:30:01.757490: train loss : -0.6012 2022-02-11 09:30:12.265868: validation loss: -0.6806 2022-02-11 09:30:12.266482: Average global foreground Dice: [0.8184] 2022-02-11 09:30:12.266568: (interpret this as an estimate for the Dice of the different classes. This is not exact.) 2022-02-11 09:30:13.403486: lr: 0.009955 2022-02-11 09:30:13.403669: This epoch took 136.594050 s

2022-02-11 09:30:13.403702: epoch: 5 2022-02-11 09:32:29.224652: train loss : -0.6192 2022-02-11 09:32:43.045056: validation loss: -0.6567 2022-02-11 09:32:43.045667: Average global foreground Dice: [0.762] 2022-02-11 09:32:43.045728: (interpret this as an estimate for the Dice of the different classes. This is not exact.) 2022-02-11 09:32:44.206762: lr: 0.009946 2022-02-11 09:32:44.207016: This epoch took 150.803281 s

2022-02-11 09:32:44.207069: epoch: 6 2022-02-11 09:34:49.625515: train loss : -0.6584 2022-02-11 09:34:59.688008: validation loss: -0.7226 2022-02-11 09:34:59.688808: Average global foreground Dice: [0.8837] 2022-02-11 09:34:59.688904: (interpret this as an estimate for the Dice of the different classes. This is not exact.) 2022-02-11 09:35:00.857095: lr: 0.009937 2022-02-11 09:35:00.917268: saving checkpoint... 2022-02-11 09:35:01.290960: done, saving took 0.43 seconds 2022-02-11 09:35:01.343246: This epoch took 137.136132 s

2022-02-11 09:35:01.343388: epoch: 7 2022-02-11 09:37:06.546855: train loss : -0.6505 2022-02-11 09:37:18.553614: validation loss: -0.6722 2022-02-11 09:37:18.554376: Average global foreground Dice: [0.8203] 2022-02-11 09:37:18.554436: (interpret this as an estimate for the Dice of the different classes. This is not exact.) 2022-02-11 09:37:19.897903: lr: 0.009928 2022-02-11 09:37:19.898163: This epoch took 138.554727 s

2022-02-11 09:37:19.898196: epoch: 8 2022-02-11 09:39:24.112771: train loss : -0.6745 2022-02-11 09:39:36.050548: validation loss: -0.7191 2022-02-11 09:39:36.051413: Average global foreground Dice: [0.8537] 2022-02-11 09:39:36.051760: (interpret this as an estimate for the Dice of the different classes. This is not exact.) 2022-02-11 09:39:37.252870: lr: 0.009919 2022-02-11 09:39:37.284860: saving checkpoint... 2022-02-11 09:39:37.615195: done, saving took 0.36 seconds 2022-02-11 09:39:37.645676: This epoch took 137.747446 s

2022-02-11 09:39:37.646020: epoch: 9 2022-02-11 09:41:42.229952: train loss : -0.6922 2022-02-11 09:41:55.654238: validation loss: -0.7442 2022-02-11 09:41:55.654968: Average global foreground Dice: [0.8655] 2022-02-11 09:41:55.655629: (interpret this as an estimate for the Dice of the different classes. This is not exact.) 2022-02-11 09:41:56.829160: lr: 0.00991 2022-02-11 09:41:56.866686: saving checkpoint... 2022-02-11 09:41:57.284900: done, saving took 0.46 seconds 2022-02-11 09:41:57.333479: This epoch took 139.687385 s

2022-02-11 09:41:57.333617: epoch: 10 2022-02-11 09:44:01.653314: train loss : -0.7023 2022-02-11 09:44:11.779436: validation loss: -0.7840 2022-02-11 09:44:11.780008: Average global foreground Dice: [0.9031] 2022-02-11 09:44:11.780066: (interpret this as an estimate for the Dice of the different classes. This is not exact.) 2022-02-11 09:44:12.930392: lr: 0.009901 2022-02-11 09:44:12.978687: saving checkpoint... 2022-02-11 09:44:13.332479: done, saving took 0.40 seconds 2022-02-11 09:44:13.373284: This epoch took 136.039630 s

2022-02-11 09:44:13.373482: epoch: 11 2022-02-11 09:46:17.589069: train loss : -0.7147 2022-02-11 09:46:29.035957: validation loss: -0.6877 2022-02-11 09:46:29.036994: Average global foreground Dice: [0.8762] 2022-02-11 09:46:29.037508: (interpret this as an estimate for the Dice of the different classes. This is not exact.) 2022-02-11 09:46:30.351375: lr: 0.009892 2022-02-11 09:46:30.446559: saving checkpoint... 2022-02-11 09:46:31.133463: done, saving took 0.78 seconds 2022-02-11 09:46:31.171409: This epoch took 137.797883 s

2022-02-11 09:46:31.171607: epoch: 12 2022-02-11 09:48:37.859023: train loss : -0.7078 2022-02-11 09:48:47.846456: validation loss: -0.6971 2022-02-11 09:48:47.846957: Average global foreground Dice: [0.8366] 2022-02-11 09:48:47.847012: (interpret this as an estimate for the Dice of the different classes. This is not exact.) 2022-02-11 09:48:49.039341: lr: 0.009883 2022-02-11 09:48:49.039597: This epoch took 137.867954 s

2022-02-11 09:48:49.039667: epoch: 13 2022-02-11 09:50:54.730199: train loss : -0.7057 2022-02-11 09:51:06.086955: validation loss: -0.6657 2022-02-11 09:51:06.087431: Average global foreground Dice: [0.8214] 2022-02-11 09:51:06.087485: (interpret this as an estimate for the Dice of the different classes. This is not exact.) 2022-02-11 09:51:07.423136: lr: 0.009874 2022-02-11 09:51:07.423356: This epoch took 138.383636 s

2022-02-11 09:51:07.423391: epoch: 14 2022-02-11 09:53:11.971455: train loss : -0.7152 2022-02-11 09:53:21.868807: validation loss: -0.7693 2022-02-11 09:53:21.869602: Average global foreground Dice: [0.8595] 2022-02-11 09:53:21.869758: (interpret this as an estimate for the Dice of the different classes. This is not exact.) 2022-02-11 09:53:23.029884: lr: 0.009865 2022-02-11 09:53:23.030070: This epoch took 135.606647 s

2022-02-11 09:53:23.030102: epoch: 15 2022-02-11 09:55:28.176704: train loss : -0.7549 2022-02-11 09:55:39.543091: validation loss: -0.7525 2022-02-11 09:55:39.543669: Average global foreground Dice: [0.8705] 2022-02-11 09:55:39.543753: (interpret this as an estimate for the Dice of the different classes. This is not exact.) 2022-02-11 09:55:40.712478: lr: 0.009856 2022-02-11 09:55:40.754651: saving checkpoint... 2022-02-11 09:55:41.025641: done, saving took 0.31 seconds 2022-02-11 09:55:41.064976: This epoch took 138.034821 s

2022-02-11 09:55:41.065511: epoch: 16
2022-02-11 09:57:45.812597: train loss : -0.7268
2022-02-11 09:57:55.649310: validation loss: -0.7379
2022-02-11 09:57:55.650088: Average global foreground Dice: [0.8885]
2022-02-11 09:57:55.650669: (interpret this as an estimate for the Dice of the different classes. This is not exact.)
/home/linfuliang/nnUNet-master/nnunet/training/network_training/nnUNetTrainerV2.py:254: FutureWarning: Non-finite norm encountered in torch.nn.utils.clip_grad_norm_; continuing anyway. Note that the default behavior will change in a future release to error out if a non-finite total norm is encountered. At that point, setting error_if_nonfinite=false will be required to retain the old behavior.
  torch.nn.utils.clip_grad_norm_(self.network.parameters(), 12)
Exception in thread Thread-4:
Traceback (most recent call last):
  File "/home/anaconda3/envs/nnUNet/lib/python3.8/threading.py", line 932, in _bootstrap_inner
    self.run()
  File "/home/anaconda3/envs/nnUNet/lib/python3.8/threading.py", line 870, in run
    self._target(*self._args, **self._kwargs)
  File "/home/anaconda3/envs/nnUNet/lib/python3.8/site-packages/batchgenerators/dataloading/multi_threaded_augmenter.py", line 92, in results_loop
    raise RuntimeError("Abort event was set. So someone died and we should end this madness. \nIMPORTANT: "
RuntimeError: Abort event was set. So someone died and we should end this madness.
IMPORTANT: This is not the actual error message! Look further up to see what caused the error. Please also check whether your RAM was full
2022-02-11 09:57:56.817736: lr: 0.009847
2022-02-11 09:57:56.855858: saving checkpoint...
2022-02-11 09:57:57.217045: done, saving took 0.40 seconds
2022-02-11 09:57:57.259936: This epoch took 136.194038 s

2022-02-11 09:57:57.260094: epoch: 17 2022-02-11 10:00:02.490452: train loss : -0.7698 2022-02-11 10:00:13.650408: validation loss: -0.7660 2022-02-11 10:00:13.651247: Average global foreground Dice: [0.9079] 2022-02-11 10:00:13.651390: (interpret this as an estimate for the Dice of the different classes. This is not exact.) 2022-02-11 10:00:14.822755: lr: 0.009838 2022-02-11 10:00:14.867948: saving checkpoint... 2022-02-11 10:00:15.156437: done, saving took 0.33 seconds 2022-02-11 10:00:15.194987: This epoch took 137.934852 s

2022-02-11 10:00:15.195269: epoch: 18 2022-02-11 10:02:20.056442: train loss : -0.7564 2022-02-11 10:02:30.151413: validation loss: -0.7768 2022-02-11 10:02:30.151976: Average global foreground Dice: [0.9002] 2022-02-11 10:02:30.152059: (interpret this as an estimate for the Dice of the different classes. This is not exact.) 2022-02-11 10:02:31.315049: lr: 0.009829 2022-02-11 10:02:31.356308: saving checkpoint... 2022-02-11 10:02:31.813298: done, saving took 0.50 seconds 2022-02-11 10:02:31.853316: This epoch took 136.657974 s

2022-02-11 10:02:31.853536: epoch: 19 2022-02-11 10:04:35.643142: train loss : -0.7363 2022-02-11 10:04:46.299122: validation loss: -0.7499 2022-02-11 10:04:46.299602: Average global foreground Dice: [0.8811] 2022-02-11 10:04:46.299661: (interpret this as an estimate for the Dice of the different classes. This is not exact.) 2022-02-11 10:04:47.462837: lr: 0.00982 2022-02-11 10:04:47.508592: saving checkpoint... 2022-02-11 10:04:47.808908: done, saving took 0.35 seconds 2022-02-11 10:04:47.842048: This epoch took 135.988467 s

2022-02-11 10:04:47.843226: epoch: 20 2022-02-11 10:06:51.569437: train loss : -0.7655 2022-02-11 10:07:01.440178: validation loss: -0.7671 2022-02-11 10:07:01.440955: Average global foreground Dice: [0.8821] 2022-02-11 10:07:01.441039: (interpret this as an estimate for the Dice of the different classes. This is not exact.) 2022-02-11 10:07:02.631546: lr: 0.009811 2022-02-11 10:07:02.668007: saving checkpoint... 2022-02-11 10:07:03.011033: done, saving took 0.38 seconds 2022-02-11 10:07:03.044968: This epoch took 135.201592 s

2022-02-11 10:07:03.045224: epoch: 21 2022-02-11 10:09:06.672966: train loss : -0.7875 2022-02-11 10:09:16.461107: validation loss: -0.8057 2022-02-11 10:09:16.461621: Average global foreground Dice: [0.9128] 2022-02-11 10:09:16.461686: (interpret this as an estimate for the Dice of the different classes. This is not exact.) 2022-02-11 10:09:17.630290: lr: 0.009802 2022-02-11 10:09:17.675798: saving checkpoint... 2022-02-11 10:09:18.018679: done, saving took 0.39 seconds 2022-02-11 10:09:18.057406: This epoch took 135.012054 s

2022-02-11 10:09:18.057591: epoch: 22 2022-02-11 10:11:21.612100: train loss : -0.7341 2022-02-11 10:11:33.679884: validation loss: -0.7715 2022-02-11 10:11:33.680546: Average global foreground Dice: [0.8959] 2022-02-11 10:11:33.680601: (interpret this as an estimate for the Dice of the different classes. This is not exact.) 2022-02-11 10:11:34.839550: lr: 0.009793 2022-02-11 10:11:34.874500: saving checkpoint... 2022-02-11 10:11:35.154852: done, saving took 0.32 seconds 2022-02-11 10:11:35.199019: This epoch took 137.141391 s

2022-02-11 10:11:35.199210: epoch: 23 2022-02-11 10:13:38.863709: train loss : -0.7605 2022-02-11 10:13:51.783009: validation loss: -0.8026 2022-02-11 10:13:51.783546: Average global foreground Dice: [0.9209] 2022-02-11 10:13:51.783602: (interpret this as an estimate for the Dice of the different classes. This is not exact.) 2022-02-11 10:13:52.943713: lr: 0.009784 2022-02-11 10:13:53.003130: saving checkpoint... 2022-02-11 10:13:53.310054: done, saving took 0.37 seconds 2022-02-11 10:13:53.350605: This epoch took 138.151310 s

2022-02-11 10:13:53.350756: epoch: 24 2022-02-11 10:15:57.912349: train loss : -0.7733 2022-02-11 10:16:07.962949: validation loss: -0.7798 2022-02-11 10:16:07.963555: Average global foreground Dice: [0.8859] 2022-02-11 10:16:07.963631: (interpret this as an estimate for the Dice of the different classes. This is not exact.) 2022-02-11 10:16:09.137930: lr: 0.009775 2022-02-11 10:16:09.198899: saving checkpoint... 2022-02-11 10:16:09.667645: done, saving took 0.53 seconds 2022-02-11 10:16:09.707963: This epoch took 136.357024 s

2022-02-11 10:16:09.708138: epoch: 25 2022-02-11 10:18:13.527353: train loss : -0.7936 2022-02-11 10:18:23.450829: validation loss: -0.8471 2022-02-11 10:18:23.451422: Average global foreground Dice: [0.9425] 2022-02-11 10:18:23.451495: (interpret this as an estimate for the Dice of the different classes. This is not exact.) 2022-02-11 10:18:24.628978: lr: 0.009766 2022-02-11 10:18:24.656310: saving checkpoint... 2022-02-11 10:18:24.917256: done, saving took 0.29 seconds 2022-02-11 10:18:24.942403: This epoch took 135.234227 s

2022-02-11 10:18:24.942856: epoch: 26 2022-02-11 10:20:28.458615: train loss : -0.7629 2022-02-11 10:20:38.425654: validation loss: -0.7592 2022-02-11 10:20:38.426527: Average global foreground Dice: [0.8703] 2022-02-11 10:20:38.426879: (interpret this as an estimate for the Dice of the different classes. This is not exact.) 2022-02-11 10:20:39.596311: lr: 0.009757 2022-02-11 10:20:39.596771: This epoch took 134.653772 s

2022-02-11 10:20:39.596849: epoch: 27 2022-02-11 10:22:43.416006: train loss : -0.7908 2022-02-11 10:22:54.883610: validation loss: -0.7993 2022-02-11 10:22:54.884174: Average global foreground Dice: [0.9001] 2022-02-11 10:22:54.884259: (interpret this as an estimate for the Dice of the different classes. This is not exact.) 2022-02-11 10:22:56.060037: lr: 0.009748 2022-02-11 10:22:56.105804: saving checkpoint... 2022-02-11 10:22:56.378557: done, saving took 0.32 seconds 2022-02-11 10:22:56.413364: This epoch took 136.816443 s

2022-02-11 10:22:56.413565: epoch: 28 2022-02-11 10:24:59.971015: train loss : -0.7849 2022-02-11 10:25:11.606017: validation loss: -0.8104 2022-02-11 10:25:11.606717: Average global foreground Dice: [0.9239] 2022-02-11 10:25:11.606846: (interpret this as an estimate for the Dice of the different classes. This is not exact.) 2022-02-11 10:25:12.765421: lr: 0.009739 2022-02-11 10:25:12.804247: saving checkpoint... 2022-02-11 10:25:13.134019: done, saving took 0.37 seconds 2022-02-11 10:25:13.171309: This epoch took 136.757682 s

2022-02-11 10:25:13.171520: epoch: 29 2022-02-11 10:27:16.768724: train loss : -0.7886 2022-02-11 10:27:28.673146: validation loss: -0.8252 2022-02-11 10:27:28.674062: Average global foreground Dice: [0.928] 2022-02-11 10:27:28.674335: (interpret this as an estimate for the Dice of the different classes. This is not exact.) 2022-02-11 10:27:29.854883: lr: 0.00973 2022-02-11 10:27:29.921238: saving checkpoint... 2022-02-11 10:27:30.302154: done, saving took 0.45 seconds 2022-02-11 10:27:30.347545: This epoch took 137.175985 s

2022-02-11 10:27:30.347724: epoch: 30
Traceback (most recent call last):
  File "/home/anaconda3/envs/nnUNet/bin/nnUNet_train", line 33, in <module>
    sys.exit(load_entry_point('nnunet', 'console_scripts', 'nnUNet_train')())
  File "/home/linfuliang/nnUNet-master/nnunet/run/run_training.py", line 181, in main
    trainer.run_training()
  File "/home/linfuliang/nnUNet-master/nnunet/training/network_training/nnUNetTrainerV2.py", line 440, in run_training
    ret = super().run_training()
  File "/home/linfuliang/nnUNet-master/nnunet/training/network_training/nnUNetTrainer.py", line 317, in run_training
    super(nnUNetTrainer, self).run_training()
  File "/home/linfuliang/nnUNet-master/nnunet/training/network_training/network_trainer.py", line 456, in run_training
    l = self.run_iteration(self.tr_gen, True)
  File "/home/linfuliang/nnUNet-master/nnunet/training/network_training/nnUNetTrainerV2.py", line 232, in run_iteration
    data_dict = next(data_generator)
  File "/home/anaconda3/envs/nnUNet/lib/python3.8/site-packages/batchgenerators/dataloading/multi_threaded_augmenter.py", line 206, in __next__
    item = self.__get_next_item()
  File "/home/anaconda3/envs/nnUNet/lib/python3.8/site-packages/batchgenerators/dataloading/multi_threaded_augmenter.py", line 190, in __get_next_item
    raise RuntimeError("MultiThreadedAugmenter.abort_event was set, something went wrong. Maybe one of "
RuntimeError: MultiThreadedAugmenter.abort_event was set, something went wrong. Maybe one of your workers crashed. This is not the actual error message! Look further up your stdout to see what caused the error. Please also check whether your RAM was full

Exception in thread Thread-5:
Traceback (most recent call last):
  File "/home/anaconda3/envs/nnUNet/lib/python3.8/threading.py", line 932, in _bootstrap_inner
    self.run()
  File "/home/anaconda3/envs/nnUNet/lib/python3.8/threading.py", line 870, in run
    self._target(*self._args, **self._kwargs)
  File "/home/anaconda3/envs/nnUNet/lib/python3.8/site-packages/batchgenerators/dataloading/multi_threaded_augmenter.py", line 92, in results_loop
    raise RuntimeError("Abort event was set. So someone died and we should end this madness. \nIMPORTANT: "
RuntimeError: Abort event was set. So someone died and we should end this madness.
IMPORTANT: This is not the actual error message! Look further up to see what caused the error. Please also check whether your RAM was full

This is the complete log of my training run with nnU-Net; there are no other errors reported in it.

mertyergin commented 2 years ago

Hi,

Docker's default shared memory size is not enough for multiprocess preprocessing. If you didn't set the --shm-size argument, I recommend running Docker with --shm-size=24gb.
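For reference, a minimal invocation might look like the following. The image name, mount paths, and task ID are placeholders; the essential part is the --shm-size flag, which raises Docker's 64 MB /dev/shm default so that the data-loading workers don't crash:

```shell
# Placeholder image and paths; only --shm-size is the point of this sketch.
docker run --gpus all --shm-size=24g \
    -v /path/to/nnUNet_raw_data_base:/workspace/nnUNet_raw_data_base \
    -v /path/to/nnUNet_preprocessed:/workspace/nnUNet_preprocessed \
    my-nnunet-image \
    nnUNet_train 3d_fullres nnUNetTrainerV2 305 0
```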

FabianIsensee commented 2 years ago

@duytq99 can you please verify that this problem also happens with the default trainer? RuntimeError: cuDNN error: CUDNN_STATUS_INTERNAL_ERROR is super generic, and I don't know how to solve it, especially not if you are using custom code. The default nnU-Net will probably work. If not, you probably have a problem with your PyTorch setup.

duytq99 commented 2 years ago

@FabianIsensee Hello. I solved the problem. I had made a mistake in the custom model: its final output shape was not compatible with nnUNet (a typo in num_classes). The default trainer worked well, and so does the custom one now. Thank you so much!

FabianIsensee commented 2 years ago

glad to hear it works now