MIC-DKFZ / nnUNet


RuntimeError: One or more background workers are no longer alive. Exiting. Please check the print statements above for the actual error message #1898

Closed · omaruus99 closed this issue 9 months ago

omaruus99 commented 10 months ago

Hello @FabianIsensee, I get this error when I run training inside a Docker container:

root@cc5d09285d9b:/nnunet# nnUNetv2_train 004 3d_fullres 0 -tr nnUNetTrainer
Using device: cuda:0

#######################################################################
Please cite the following paper when using nnU-Net:
Isensee, F., Jaeger, P. F., Kohl, S. A., Petersen, J., & Maier-Hein, K. H. (2021). nnU-Net: a self-configuring method for deep learning-based biomedical image segmentation. Nature methods, 18(2), 203-211.
#######################################################################

This is the configuration used by this training: Configuration name: 3d_fullres {'data_identifier': 'nnUNetPlans_3d_fullres', 'preprocessor_name': 'DefaultPreprocessor', 'batch_size': 9, 'patch_size': [40, 56, 40], 'median_image_size_in_voxels': [36.0, 50.0, 35.0], 'spacing': [1.0, 1.0, 1.0], 'normalization_schemes': ['ZScoreNormalization'], 'use_mask_for_norm': [False], 'UNet_class_name': 'PlainConvUNet', 'UNet_base_num_features': 32, 'n_conv_per_stage_encoder': [2, 2, 2, 2], 'n_conv_per_stage_decoder': [2, 2, 2], 'num_pool_per_axis': [3, 3, 3], 'pool_op_kernel_sizes': [[1, 1, 1], [2, 2, 2], [2, 2, 2], [2, 2, 2]], 'conv_kernel_sizes': [[3, 3, 3], [3, 3, 3], [3, 3, 3], [3, 3, 3]], 'unet_max_num_features': 320, 'resampling_fn_data': 'resample_data_or_seg_to_shape', 'resampling_fn_seg': 'resample_data_or_seg_to_shape', 'resampling_fn_data_kwargs': {'is_seg': False, 'order': 3, 'order_z': 0, 'force_separate_z': None}, 'resampling_fn_seg_kwargs': {'is_seg': True, 'order': 1, 'order_z': 0, 'force_separate_z': None}, 'resampling_fn_probabilities': 'resample_data_or_seg_to_shape', 'resampling_fn_probabilities_kwargs': {'is_seg': False, 'order': 1, 'order_z': 0, 'force_separate_z': None}, 'batch_dice': False}

These are the global plan.json settings: {'dataset_name': 'Dataset004_Hippocampus', 'plans_name': 'nnUNetPlans', 'original_median_spacing_after_transp': [1.0, 1.0, 1.0], 'original_median_shape_after_transp': [36, 50, 35], 'image_reader_writer': 'SimpleITKIO', 'transpose_forward': [0, 1, 2], 'transpose_backward': [0, 1, 2], 'experiment_planner_used': 'ExperimentPlanner', 'label_manager': 'LabelManager', 'foreground_intensity_properties_per_channel': {'0': {'max': 486420.21875, 'mean': 22360.326171875, 'median': 362.88250732421875, 'min': 0.0, 'percentile_00_5': 28.0, 'percentile_99_5': 277682.03125, 'std': 60656.1328125}}}

2024-01-16 09:46:10.106258: unpacking dataset...
2024-01-16 09:46:22.449987: unpacking done...
2024-01-16 09:46:22.480390: do_dummy_2d_data_aug: False
2024-01-16 09:46:22.521797: Creating new 5-fold cross-validation split...
2024-01-16 09:46:22.581434: Desired fold for training: 0
2024-01-16 09:46:22.603892: This split has 208 training and 52 validation cases.
2024-01-16 09:46:23.022059: Unable to plot network architecture:
2024-01-16 09:46:23.042216: No module named 'hiddenlayer'
2024-01-16 09:46:23.131182:
2024-01-16 09:46:23.149663: Epoch 0
2024-01-16 09:46:23.171872: Current learning rate: 0.01
using pin_memory on device 0
Exception in thread Thread-5:
Traceback (most recent call last):
  File "/usr/local/lib/python3.9/threading.py", line 980, in _bootstrap_inner
    self.run()
  File "/usr/local/lib/python3.9/threading.py", line 917, in run
    self._target(*self._args, **self._kwargs)
  File "/usr/local/lib/python3.9/site-packages/batchgenerators/dataloading/nondet_multi_threaded_augmenter.py", line 125, in results_loop
    raise e
  File "/usr/local/lib/python3.9/site-packages/batchgenerators/dataloading/nondet_multi_threaded_augmenter.py", line 103, in results_loop
    raise RuntimeError("One or more background workers are no longer alive. Exiting. Please check the "
RuntimeError: One or more background workers are no longer alive. Exiting. Please check the print statements above for the actual error message
Traceback (most recent call last):
  File "/usr/local/bin/nnUNetv2_train", line 8, in <module>
    sys.exit(run_training_entry())
  File "/nnunet/nnunetv2/run/run_training.py", line 268, in run_training_entry
    run_training(args.dataset_name_or_id, args.configuration, args.fold, args.tr, args.p, args.pretrained_weights,
  File "/nnunet/nnunetv2/run/run_training.py", line 204, in run_training
    nnunet_trainer.run_training()
  File "/nnunet/nnunetv2/training/nnUNetTrainer/nnUNetTrainer.py", line 1275, in run_training
    train_outputs.append(self.train_step(next(self.dataloader_train)))
  File "/usr/local/lib/python3.9/site-packages/batchgenerators/dataloading/nondet_multi_threaded_augmenter.py", line 196, in __next__
    item = self.__get_next_item()
  File "/usr/local/lib/python3.9/site-packages/batchgenerators/dataloading/nondet_multi_threaded_augmenter.py", line 181, in __get_next_item
    raise RuntimeError("One or more background workers are no longer alive. Exiting. Please check the "
RuntimeError: One or more background workers are no longer alive. Exiting. Please check the print statements above for the actual error message

For information: I get the same error when I use: OMP_NUM_THREADS=1 nnUNetv2_train 004 3d_fullres 0 -tr nnUNetTrainer

TaWald commented 9 months ago

Does this only happen inside the Docker container and not on your local machine? It definitely looks like your dataloaders are unable to load the images properly. The actual error in multiprocessing can be very obfuscated. I would recommend replacing the multi-processed dataloader with a SingleThreadedAugmenter, as it will give both of us more information about what the reason is here.
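For reference, a minimal sketch of that debugging approach (the names are placeholders, not the exact nnU-Net internals): `dl_train` and `train_transforms` stand for the data loader and the transform that the trainer would normally hand to the multi-threaded augmenter.

```python
# Debugging sketch: run loading/augmentation in the main process so the real
# exception surfaces with a full traceback instead of killing a background worker.
from batchgenerators.dataloading.single_threaded_augmenter import SingleThreadedAugmenter

def debug_one_batch(dl_train, train_transforms):
    gen = SingleThreadedAugmenter(dl_train, train_transforms)
    batch = next(gen)  # any I/O, memory, or transform error is raised right here
    print({k: getattr(v, "shape", type(v)) for k, v in batch.items()})
```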

FabianIsensee commented 9 months ago

Have you added --ipc=host or increased the shared memory amount?

revanb88 commented 9 months ago

Hi, I resolved the above issue by using -c 2d (if you are using 2D images), adding this argument during preprocessing, e.g.: nnUNetv2_plan_and_preprocess -d 001 --verify_dataset_integrity -c 2d

hdnminh commented 9 months ago

Hi @revanb88

Do you know why we should add "-c" before 2D?

revanb88 commented 9 months ago

Hi @hdnminh, the -d argument represents the list of dataset IDs; 001 is the dataset ID. Check this command: nnUNetv2_plan_and_preprocess --help

hdnminh commented 9 months ago

Hi @revanb88, sorry, that was a typo. I meant to ask about "-c", not "-d".

omaruus99 commented 9 months ago

@FabianIsensee @TaWald @hdnminh @revanb88 I've found a solution: start training with this command: nnUNet_n_proc_DA=0 nnUNetv2_train 004 3d_fullres 0 -tr nnUNetTrainer -device cuda

FabianIsensee commented 9 months ago

This is not a solution. It will destroy your training speed because this will run data augmentation as part of the main python process as opposed to background workers. You seem to have a problem with spawning/using those. Can you please get back to my question about shared memory size and --ipc=host when using docker?

omaruus99 commented 9 months ago

@FabianIsensee Yes, it works with --ipc=host, and the training speed is optimal :)
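For later readers: a quick way to confirm that the container's shared memory was the bottleneck is to check /dev/shm from inside the container; Docker's default of 64 MB is far too small for several data-augmentation workers. A small illustrative check (not part of nnU-Net):

```python
# Check how much shared memory the background workers can actually use.
import shutil

total, used, free = shutil.disk_usage("/dev/shm")
print(f"/dev/shm: total {total / 2**20:.0f} MiB, free {free / 2**20:.0f} MiB")
# With the 64 MiB Docker default the workers die silently; --ipc=host or a
# larger --shm-size makes this number match the host's shared memory.
```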

simonansm commented 5 months ago

Hi, I'm having the same issue in an Anaconda virtual environment, and I have checked that the space occupied by the environment isn't excessive. Besides the problem mentioned above, I'm on an Apple M1 without NVIDIA/CUDA, so my only choice for -device would be mps, but it is not supported (see also the device check sketched after the log below).

############################
INFO: You are using the old nnU-Net default plans. We have updated our recommendations. Please consider using those instead! Read more here: https://github.com/MIC-DKFZ/nnUNet/blob/master/documentation/resenc_presets.md
############################

Using device: cuda:0
/opt/anaconda3/lib/python3.11/site-packages/torch/cuda/amp/grad_scaler.py:126: UserWarning: torch.cuda.amp.GradScaler is enabled, but CUDA is not available. Disabling.
  warnings.warn(

#######################################################################
Please cite the following paper when using nnU-Net:
Isensee, F., Jaeger, P. F., Kohl, S. A., Petersen, J., & Maier-Hein, K. H. (2021). nnU-Net: a self-configuring method for deep learning-based biomedical image segmentation. Nature methods, 18(2), 203-211.
#######################################################################

2024-05-19 03:51:16.831520: do_dummy_2d_data_aug: False
2024-05-19 03:51:16.833231: Using splits from existing split file: /Users/simonansm/nnUNet/nnUNetFrame/DATASET/nnUNet_preprocessed/Dataset001_BrainTumour/splits_final.json
2024-05-19 03:51:16.833469: The split file contains 5 splits.
2024-05-19 03:51:16.833508: Desired fold for training: 0
2024-05-19 03:51:16.833551: This split has 387 training and 97 validation cases.
/Users/simonansm/nnUNet/nnunetv2/training/dataloading/data_loader_2d.py:107: UserWarning: Cannot set number of intraop threads after parallel work has started or after set_num_threads call when using native parallel backend (Triggered internally at /Users/runner/work/_temp/anaconda/conda-bld/pytorch_1711403251597/work/aten/src/ATen/ParallelNative.cpp:228.)
  torch.set_num_threads(torch_nthreads)
[the warning above is repeated once per data-loading worker]
Traceback (most recent call last):
  File "/opt/anaconda3/bin/nnUNetv2_train", line 8, in <module>
    sys.exit(run_training_entry())
  File "/Users/simonansm/nnUNet/nnunetv2/run/run_training.py", line 275, in run_training_entry
    run_training(args.dataset_name_or_id, args.configuration, args.fold, args.tr, args.p, args.pretrained_weights,
  File "/Users/simonansm/nnUNet/nnunetv2/run/run_training.py", line 211, in run_training
    nnunet_trainer.run_training()
  File "/Users/simonansm/nnUNet/nnunetv2/training/nnUNetTrainer/nnUNetTrainer.py", line 1338, in run_training
    self.on_train_start()
  File "/Users/simonansm/nnUNet/nnunetv2/training/nnUNetTrainer/nnUNetTrainer.py", line 885, in on_train_start
    self.initialize()
  File "/Users/simonansm/nnUNet/nnunetv2/training/nnUNetTrainer/nnUNetTrainer.py", line 217, in initialize
    ).to(self.device)
  File "/opt/anaconda3/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1152, in to
    return self._apply(convert)
  File "/opt/anaconda3/lib/python3.11/site-packages/torch/nn/modules/module.py", line 802, in _apply
    module._apply(fn)
  File "/opt/anaconda3/lib/python3.11/site-packages/torch/nn/modules/module.py", line 802, in _apply
    module._apply(fn)
  File "/opt/anaconda3/lib/python3.11/site-packages/torch/nn/modules/module.py", line 802, in _apply
    module._apply(fn)
  [Previous line repeated 4 more times]
  File "/opt/anaconda3/lib/python3.11/site-packages/torch/nn/modules/module.py", line 825, in _apply
    param_applied = fn(param)
  File "/opt/anaconda3/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1150, in convert
    return t.to(device, dtype if t.is_floating_point() or t.is_complex() else None, non_blocking)
  File "/opt/anaconda3/lib/python3.11/site-packages/torch/cuda/__init__.py", line 293, in _lazy_init
    raise AssertionError("Torch not compiled with CUDA enabled")
AssertionError: Torch not compiled with CUDA enabled
Exception in thread Thread-2 (results_loop):
Traceback (most recent call last):
  File "/opt/anaconda3/lib/python3.11/threading.py", line 1045, in _bootstrap_inner
    self.run()
  File "/opt/anaconda3/lib/python3.11/threading.py", line 982, in run
    self._target(*self._args, **self._kwargs)
  File "/opt/anaconda3/lib/python3.11/site-packages/batchgenerators/dataloading/nondet_multi_threaded_augmenter.py", line 125, in results_loop
    raise e
  File "/opt/anaconda3/lib/python3.11/site-packages/batchgenerators/dataloading/nondet_multi_threaded_augmenter.py", line 103, in results_loop
    raise RuntimeError("One or more background workers are no longer alive. Exiting. Please check the "
RuntimeError: One or more background workers are no longer alive. Exiting. Please check the print statements above for the actual error message
Exception in thread Thread-1 (results_loop):
Traceback (most recent call last):
  File "/opt/anaconda3/lib/python3.11/threading.py", line 1045, in _bootstrap_inner
    self.run()
  File "/opt/anaconda3/lib/python3.11/threading.py", line 982, in run
    self._target(*self._args, **self._kwargs)
  File "/opt/anaconda3/lib/python3.11/site-packages/batchgenerators/dataloading/nondet_multi_threaded_augmenter.py", line 125, in results_loop
    raise e
  File "/opt/anaconda3/lib/python3.11/site-packages/batchgenerators/dataloading/nondet_multi_threaded_augmenter.py", line 103, in results_loop
    raise RuntimeError("One or more background workers are no longer alive. Exiting. Please check the "
RuntimeError: One or more background workers are no longer alive. Exiting. Please check the print statements above for the actual error message

Any suggested fix for these two problems? I would really appreciate it. @FabianIsensee
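Regarding the Apple M1 part: the traceback above ends in "Torch not compiled with CUDA enabled", so `-device cuda` cannot work with that PyTorch build. An illustrative check (assuming a reasonably recent PyTorch) of what the installed build actually offers before picking the -device argument:

```python
# Report which device backends this PyTorch build can use.
import torch

print("CUDA available:", torch.cuda.is_available())         # False on Apple Silicon
print("MPS available:", torch.backends.mps.is_available())  # True only if the build has Metal support
device = "cuda" if torch.cuda.is_available() else "mps" if torch.backends.mps.is_available() else "cpu"
print("Value to pass to nnUNetv2_train via -device:", device)
```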

mathinfoia commented 5 months ago

Hi, I'm running into the same problem. Does anyone have a solution? Thanks in advance: CUDA_VISIBLE_DEVICES=7 nnUNetv2_train 100 3d_fullres 3 --npz --c

############################
INFO: You are using the old nnU-Net default plans. We have updated our recommendations. Please consider using those instead! Read more here: https://github.com/MIC-DKFZ/nnUNet/blob/master/documentation/resenc_presets.md
############################

Using device: cuda:0
/opt/conda/lib/python3.10/site-packages/torch/amp/grad_scaler.py:131: UserWarning: torch.cuda.amp.GradScaler is enabled, but CUDA is not available. Disabling.
  warnings.warn(

#######################################################################
Please cite the following paper when using nnU-Net:
Isensee, F., Jaeger, P. F., Kohl, S. A., Petersen, J., & Maier-Hein, K. H. (2021). nnU-Net: a self-configuring method for deep learning-based biomedical image segmentation. Nature methods, 18(2), 203-211.
#######################################################################

WARNING: Cannot continue training because there seems to be no checkpoint available to continue from. Starting a new training...
2024-05-22 07:20:50.363845: do_dummy_2d_data_aug: False
2024-05-22 07:20:50.378673: Using splits from existing split file: /home/pyuser/data/nnUNet_preprocessed/Dataset100_Autopet/splits_final.json
2024-05-22 07:20:50.381199: The split file contains 5 splits.
2024-05-22 07:20:50.381271: Desired fold for training: 3
2024-05-22 07:20:50.381316: This split has 1291 training and 323 validation cases.
Traceback (most recent call last):
  File "/opt/conda/lib/python3.10/multiprocessing/queues.py", line 244, in _feed
    obj = _ForkingPickler.dumps(obj)
  File "/opt/conda/lib/python3.10/multiprocessing/reduction.py", line 51, in dumps
    cls(buf, protocol).dump(obj)
  File "/opt/conda/lib/python3.10/site-packages/torch/multiprocessing/reductions.py", line 568, in reduce_storage
    fd, size = storage._share_fd_cpu_()
  File "/opt/conda/lib/python3.10/site-packages/torch/storage.py", line 304, in wrapper
    return fn(self, *args, **kwargs)
  File "/opt/conda/lib/python3.10/site-packages/torch/storage.py", line 374, in _share_fd_cpu_
    return super()._share_fd_cpu_(*args, **kwargs)
RuntimeError: unable to write to file : No space left on device (28)
Exception in thread Thread-1 (results_loop):
Traceback (most recent call last):
  File "/opt/conda/lib/python3.10/threading.py", line 1016, in _bootstrap_inner
    self.run()
  File "/opt/conda/lib/python3.10/threading.py", line 953, in run
    self._target(*self._args, **self._kwargs)
  File "/opt/conda/lib/python3.10/site-packages/batchgenerators/dataloading/nondet_multi_threaded_augmenter.py", line 125, in results_loop
    raise e
  File "/opt/conda/lib/python3.10/site-packages/batchgenerators/dataloading/nondet_multi_threaded_augmenter.py", line 103, in results_loop
    raise RuntimeError("One or more background workers are no longer alive. Exiting. Please check the "
RuntimeError: One or more background workers are no longer alive. Exiting. Please check the print statements above for the actual error message
Exception in thread Thread-2 (results_loop):
Traceback (most recent call last):
  File "/opt/conda/lib/python3.10/threading.py", line 1016, in _bootstrap_inner
    self.run()
  File "/opt/conda/lib/python3.10/threading.py", line 953, in run
    self._target(*self._args, **self._kwargs)
  File "/opt/conda/lib/python3.10/site-packages/batchgenerators/dataloading/nondet_multi_threaded_augmenter.py", line 125, in results_loop
    raise e
  File "/opt/conda/lib/python3.10/site-packages/batchgenerators/dataloading/nondet_multi_threaded_augmenter.py", line 103, in results_loop
    raise RuntimeError("One or more background workers are no longer alive. Exiting. Please check the "
RuntimeError: One or more background workers are no longer alive. Exiting. Please check the print statements above for the actual error message
Traceback (most recent call last):
  File "/opt/conda/bin/nnUNetv2_train", line 8, in <module>
    sys.exit(run_training_entry())
  File "/home/pyuser/wkdir/nnUNet/nnunetv2/run/run_training.py", line 275, in run_training_entry
    run_training(args.dataset_name_or_id, args.configuration, args.fold, args.tr, args.p, args.pretrained_weights,
  File "/home/pyuser/wkdir/nnUNet/nnunetv2/run/run_training.py", line 211, in run_training
    nnunet_trainer.run_training()
  File "/home/pyuser/wkdir/nnUNet/nnunetv2/training/nnUNetTrainer/nnUNetTrainer.py", line 1338, in run_training
    self.on_train_start()
  File "/home/pyuser/wkdir/nnUNet/nnunetv2/training/nnUNetTrainer/nnUNetTrainer.py", line 882, in on_train_start
    self.dataloader_train, self.dataloader_val = self.get_dataloaders()
  File "/home/pyuser/wkdir/nnUNet/nnunetv2/training/nnUNetTrainer/nnUNetTrainer.py", line 676, in get_dataloaders
    _ = next(mt_gen_val)
  File "/opt/conda/lib/python3.10/site-packages/batchgenerators/dataloading/nondet_multi_threaded_augmenter.py", line 196, in __next__
    item = self.__get_next_item()
  File "/opt/conda/lib/python3.10/site-packages/batchgenerators/dataloading/nondet_multi_threaded_augmenter.py", line 181, in __get_next_item
    raise RuntimeError("One or more background workers are no longer alive. Exiting. Please check the "
RuntimeError: One or more background workers are no longer alive. Exiting. Please check the print statements above for the actual error message

tommydino93 commented 5 months ago

Hi, thanks for the great package! I'm also facing this issue (a RuntimeError: Triton Error) when training from the PyCharm terminal in a conda environment. Specifically, when running:

$ CUDA_VISIBLE_DEVICES=1 nnUNetv2_train 200 2d 0 -p nnUNetResEncUNetMPlans

I also tried adding OMP_NUM_THREADS=1 before CUDA_VISIBLE_DEVICES=1 as suggested in other posts but the problem persists.

Any idea what I could try? Thanks a lot in advance!

Here's the traceback:

Using device: cuda:0

#######################################################################
Please cite the following paper when using nnU-Net:
Isensee, F., Jaeger, P. F., Kohl, S. A., Petersen, J., & Maier-Hein, K. H. (2021). nnU-Net: a self-configuring method for deep learning-based biomedical image segmentation. Nature methods, 18(2), 203-211.
#######################################################################

2024-06-03 14:59:56.620792: do_dummy_2d_data_aug: False
2024-06-03 14:59:56.621422: Using splits from existing split file: /ssd/tdinoto/CVSnet_v3_TDN/CVSnet_v4_TDN_nnUnet/nnUNet_preprocessed/Dataset200_CVSNet/splits_final.json
2024-06-03 14:59:56.621546: The split file contains 5 splits.
2024-06-03 14:59:56.621580: Desired fold for training: 0
2024-06-03 14:59:56.621608: This split has 105 training and 27 validation cases.
using pin_memory on device 0
using pin_memory on device 0
2024-06-03 15:00:02.816137: Using torch.compile...
/home/users/tdinoto/miniconda3/envs/nnunet_2/lib/python3.11/site-packages/torch/optim/lr_scheduler.py:28: UserWarning: The verbose parameter is deprecated. Please use get_last_lr() to access the learning rate.
  warnings.warn("The verbose parameter is deprecated. Please use get_last_lr() "

This is the configuration used by this training:
Configuration name: 3d_fullres
 {'data_identifier': 'nnUNetPlans_3d_fullres', 'preprocessor_name': 'DefaultPreprocessor', 'batch_size': 2, 'patch_size': [112, 128, 160], 'median_image_size_in_voxels': [245.0, 253.0, 324.0], 'spacing': [0.5500007271766663, 0.5357142686843872, 0.5357142686843872], 'normalization_schemes': ['ZScoreNormalization'], 'use_mask_for_norm': [True], 'resampling_fn_data': 'resample_data_or_seg_to_shape', 'resampling_fn_seg': 'resample_data_or_seg_to_shape', 'resampling_fn_data_kwargs': {'is_seg': False, 'order': 3, 'order_z': 0, 'force_separate_z': None}, 'resampling_fn_seg_kwargs': {'is_seg': True, 'order': 1, 'order_z': 0, 'force_separate_z': None}, 'resampling_fn_probabilities': 'resample_data_or_seg_to_shape', 'resampling_fn_probabilities_kwargs': {'is_seg': False, 'order': 1, 'order_z': 0, 'force_separate_z': None}, 'architecture': {'network_class_name': 'dynamic_network_architectures.architectures.unet.ResidualEncoderUNet', 'arch_kwargs': {'n_stages': 6, 'features_per_stage': [32, 64, 128, 256, 320, 320], 'conv_op': 'torch.nn.modules.conv.Conv3d', 'kernel_sizes': [[3, 3, 3], [3, 3, 3], [3, 3, 3], [3, 3, 3], [3, 3, 3], [3, 3, 3]], 'strides': [[1, 1, 1], [2, 2, 2], [2, 2, 2], [2, 2, 2], [2, 2, 2], [1, 2, 2]], 'n_blocks_per_stage': [1, 3, 4, 6, 6, 6], 'n_conv_per_stage_decoder': [1, 1, 1, 1, 1], 'conv_bias': True, 'norm_op': 'torch.nn.modules.instancenorm.InstanceNorm3d', 'norm_op_kwargs': {'eps': 1e-05, 'affine': True}, 'dropout_op': None, 'dropout_op_kwargs': None, 'nonlin': 'torch.nn.LeakyReLU', 'nonlin_kwargs': {'inplace': True}, 'deep_supervision': True}, '_kw_requires_import': ['conv_op', 'norm_op', 'dropout_op', 'nonlin']}, 'batch_dice': True} 

These are the global plan.json settings:
 {'dataset_name': 'Dataset200_CVSNet', 'plans_name': 'nnUNetResEncUNetMPlans', 'original_median_spacing_after_transp': [0.5500007271766663, 0.5357142686843872, 0.5357142686843872], 'original_median_shape_after_transp': [245, 253, 324], 'image_reader_writer': 'SimpleITKIO', 'transpose_forward': [2, 0, 1], 'transpose_backward': [1, 2, 0], 'experiment_planner_used': 'nnUNetPlannerResEncM', 'label_manager': 'LabelManager', 'foreground_intensity_properties_per_channel': {'0': {'max': 4483513319424.0, 'mean': 29516924928.0, 'median': 23683.400390625, 'min': 0.0, 'percentile_00_5': 7712.2470703125, 'percentile_99_5': 2136903647232.0, 'std': 228257710080.0}}} 

2024-06-03 15:00:04.229930: unpacking dataset...
2024-06-03 15:00:16.017693: unpacking done...
2024-06-03 15:00:16.019192: Unable to plot network architecture: nnUNet_compile is enabled!
2024-06-03 15:00:16.030707: 
2024-06-03 15:00:16.031127: Epoch 0
2024-06-03 15:00:16.031347: Current learning rate: 0.01
Traceback (most recent call last):
  File "/home/users/tdinoto/miniconda3/envs/nnunet_2/bin/nnUNetv2_train", line 8, in <module>
    sys.exit(run_training_entry())
             ^^^^^^^^^^^^^^^^^^^^
  File "/home/users/tdinoto/miniconda3/envs/nnunet_2/nnUNet/nnunetv2/run/run_training.py", line 275, in run_training_entry
    run_training(args.dataset_name_or_id, args.configuration, args.fold, args.tr, args.p, args.pretrained_weights,
  File "/home/users/tdinoto/miniconda3/envs/nnunet_2/nnUNet/nnunetv2/run/run_training.py", line 211, in run_training
    nnunet_trainer.run_training()
  File "/home/users/tdinoto/miniconda3/envs/nnunet_2/nnUNet/nnunetv2/training/nnUNetTrainer/nnUNetTrainer.py", line 1370, in run_training
    train_outputs.append(self.train_step(next(self.dataloader_train)))
                         ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/users/tdinoto/miniconda3/envs/nnunet_2/nnUNet/nnunetv2/training/nnUNetTrainer/nnUNetTrainer.py", line 994, in train_step
    output = self.network(data)
             ^^^^^^^^^^^^^^^^^^
  File "/home/users/tdinoto/miniconda3/envs/nnunet_2/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/users/tdinoto/miniconda3/envs/nnunet_2/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl
    return forward_call(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/users/tdinoto/miniconda3/envs/nnunet_2/lib/python3.11/site-packages/torch/_dynamo/eval_frame.py", line 451, in _fn
    return fn(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^
  File "/home/users/tdinoto/miniconda3/envs/nnunet_2/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/users/tdinoto/miniconda3/envs/nnunet_2/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl
    return forward_call(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/users/tdinoto/miniconda3/envs/nnunet_2/lib/python3.11/site-packages/torch/_dynamo/convert_frame.py", line 921, in catch_errors
    return callback(frame, cache_entry, hooks, frame_state, skip=1)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/users/tdinoto/miniconda3/envs/nnunet_2/lib/python3.11/site-packages/torch/_dynamo/convert_frame.py", line 786, in _convert_frame
    result = inner_convert(
             ^^^^^^^^^^^^^^
  File "/home/users/tdinoto/miniconda3/envs/nnunet_2/lib/python3.11/site-packages/torch/_dynamo/convert_frame.py", line 400, in _convert_frame_assert
    return _compile(
           ^^^^^^^^^
  File "/home/users/tdinoto/miniconda3/envs/nnunet_2/lib/python3.11/contextlib.py", line 81, in inner
    return func(*args, **kwds)
           ^^^^^^^^^^^^^^^^^^^
  File "/home/users/tdinoto/miniconda3/envs/nnunet_2/lib/python3.11/site-packages/torch/_dynamo/convert_frame.py", line 676, in _compile
    guarded_code = compile_inner(code, one_graph, hooks, transform)
                   ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/users/tdinoto/miniconda3/envs/nnunet_2/lib/python3.11/site-packages/torch/_dynamo/utils.py", line 262, in time_wrapper
    r = func(*args, **kwargs)
        ^^^^^^^^^^^^^^^^^^^^^
  File "/home/users/tdinoto/miniconda3/envs/nnunet_2/lib/python3.11/site-packages/torch/_dynamo/convert_frame.py", line 535, in compile_inner
    out_code = transform_code_object(code, transform)
               ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/users/tdinoto/miniconda3/envs/nnunet_2/lib/python3.11/site-packages/torch/_dynamo/bytecode_transformation.py", line 1036, in transform_code_object
    transformations(instructions, code_options)
  File "/home/users/tdinoto/miniconda3/envs/nnunet_2/lib/python3.11/site-packages/torch/_dynamo/convert_frame.py", line 165, in _fn
    return fn(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^
  File "/home/users/tdinoto/miniconda3/envs/nnunet_2/lib/python3.11/site-packages/torch/_dynamo/convert_frame.py", line 500, in transform
    tracer.run()
  File "/home/users/tdinoto/miniconda3/envs/nnunet_2/lib/python3.11/site-packages/torch/_dynamo/symbolic_convert.py", line 2149, in run
    super().run()
  File "/home/users/tdinoto/miniconda3/envs/nnunet_2/lib/python3.11/site-packages/torch/_dynamo/symbolic_convert.py", line 810, in run
    and self.step()
        ^^^^^^^^^^^
  File "/home/users/tdinoto/miniconda3/envs/nnunet_2/lib/python3.11/site-packages/torch/_dynamo/symbolic_convert.py", line 773, in step
    getattr(self, inst.opname)(inst)
  File "/home/users/tdinoto/miniconda3/envs/nnunet_2/lib/python3.11/site-packages/torch/_dynamo/symbolic_convert.py", line 2268, in RETURN_VALUE
    self.output.compile_subgraph(
  File "/home/users/tdinoto/miniconda3/envs/nnunet_2/lib/python3.11/site-packages/torch/_dynamo/output_graph.py", line 991, in compile_subgraph
    self.compile_and_call_fx_graph(tx, pass2.graph_output_vars(), root)
  File "/home/users/tdinoto/miniconda3/envs/nnunet_2/lib/python3.11/contextlib.py", line 81, in inner
    return func(*args, **kwds)
           ^^^^^^^^^^^^^^^^^^^
  File "/home/users/tdinoto/miniconda3/envs/nnunet_2/lib/python3.11/site-packages/torch/_dynamo/output_graph.py", line 1168, in compile_and_call_fx_graph
    compiled_fn = self.call_user_compiler(gm)
                  ^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/users/tdinoto/miniconda3/envs/nnunet_2/lib/python3.11/site-packages/torch/_dynamo/utils.py", line 262, in time_wrapper
    r = func(*args, **kwargs)
        ^^^^^^^^^^^^^^^^^^^^^
  File "/home/users/tdinoto/miniconda3/envs/nnunet_2/lib/python3.11/site-packages/torch/_dynamo/output_graph.py", line 1241, in call_user_compiler
    raise BackendCompilerFailed(self.compiler_fn, e).with_traceback(
  File "/home/users/tdinoto/miniconda3/envs/nnunet_2/lib/python3.11/site-packages/torch/_dynamo/output_graph.py", line 1222, in call_user_compiler
    compiled_fn = compiler_fn(gm, self.example_inputs())
                  ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/users/tdinoto/miniconda3/envs/nnunet_2/lib/python3.11/site-packages/torch/_dynamo/repro/after_dynamo.py", line 117, in debug_wrapper
    compiled_gm = compiler_fn(gm, example_inputs)
                  ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/users/tdinoto/miniconda3/envs/nnunet_2/lib/python3.11/site-packages/torch/__init__.py", line 1729, in __call__
    return compile_fx(model_, inputs_, config_patches=self.config)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/users/tdinoto/miniconda3/envs/nnunet_2/lib/python3.11/contextlib.py", line 81, in inner
    return func(*args, **kwds)
           ^^^^^^^^^^^^^^^^^^^
  File "/home/users/tdinoto/miniconda3/envs/nnunet_2/lib/python3.11/site-packages/torch/_inductor/compile_fx.py", line 1330, in compile_fx
    return aot_autograd(
           ^^^^^^^^^^^^^
  File "/home/users/tdinoto/miniconda3/envs/nnunet_2/lib/python3.11/site-packages/torch/_dynamo/backends/common.py", line 58, in compiler_fn
    cg = aot_module_simplified(gm, example_inputs, **kwargs)
         ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/users/tdinoto/miniconda3/envs/nnunet_2/lib/python3.11/site-packages/torch/_functorch/aot_autograd.py", line 903, in aot_module_simplified
    compiled_fn = create_aot_dispatcher_function(
                  ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/users/tdinoto/miniconda3/envs/nnunet_2/lib/python3.11/site-packages/torch/_dynamo/utils.py", line 262, in time_wrapper
    r = func(*args, **kwargs)
        ^^^^^^^^^^^^^^^^^^^^^
  File "/home/users/tdinoto/miniconda3/envs/nnunet_2/lib/python3.11/site-packages/torch/_functorch/aot_autograd.py", line 628, in create_aot_dispatcher_function
    compiled_fn = compiler_fn(flat_fn, fake_flat_args, aot_config, fw_metadata=fw_metadata)
                  ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/users/tdinoto/miniconda3/envs/nnunet_2/lib/python3.11/site-packages/torch/_functorch/_aot_autograd/runtime_wrappers.py", line 443, in aot_wrapper_dedupe
    return compiler_fn(flat_fn, leaf_flat_args, aot_config, fw_metadata=fw_metadata)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/users/tdinoto/miniconda3/envs/nnunet_2/lib/python3.11/site-packages/torch/_functorch/_aot_autograd/runtime_wrappers.py", line 648, in aot_wrapper_synthetic_base
    return compiler_fn(flat_fn, flat_args, aot_config, fw_metadata=fw_metadata)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/users/tdinoto/miniconda3/envs/nnunet_2/lib/python3.11/site-packages/torch/_functorch/_aot_autograd/jit_compile_runtime_wrappers.py", line 352, in aot_dispatch_autograd
    compiled_fw_func = aot_config.fw_compiler(fw_module, adjusted_flat_args)
                       ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/users/tdinoto/miniconda3/envs/nnunet_2/lib/python3.11/site-packages/torch/_dynamo/utils.py", line 262, in time_wrapper
    r = func(*args, **kwargs)
        ^^^^^^^^^^^^^^^^^^^^^
  File "/home/users/tdinoto/miniconda3/envs/nnunet_2/lib/python3.11/site-packages/torch/_inductor/compile_fx.py", line 1257, in fw_compiler_base
    return inner_compile(
           ^^^^^^^^^^^^^^
  File "/home/users/tdinoto/miniconda3/envs/nnunet_2/lib/python3.11/site-packages/torch/_dynamo/repro/after_aot.py", line 83, in debug_wrapper
    inner_compiled_fn = compiler_fn(gm, example_inputs)
                        ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/users/tdinoto/miniconda3/envs/nnunet_2/lib/python3.11/site-packages/torch/_inductor/debug.py", line 304, in inner
    return fn(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^
  File "/home/users/tdinoto/miniconda3/envs/nnunet_2/lib/python3.11/contextlib.py", line 81, in inner
    return func(*args, **kwds)
           ^^^^^^^^^^^^^^^^^^^
  File "/home/users/tdinoto/miniconda3/envs/nnunet_2/lib/python3.11/contextlib.py", line 81, in inner
    return func(*args, **kwds)
           ^^^^^^^^^^^^^^^^^^^
  File "/home/users/tdinoto/miniconda3/envs/nnunet_2/lib/python3.11/site-packages/torch/_dynamo/utils.py", line 262, in time_wrapper
    r = func(*args, **kwargs)
        ^^^^^^^^^^^^^^^^^^^^^
  File "/home/users/tdinoto/miniconda3/envs/nnunet_2/lib/python3.11/site-packages/torch/_inductor/compile_fx.py", line 438, in compile_fx_inner
    compiled_graph = fx_codegen_and_compile(
                     ^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/users/tdinoto/miniconda3/envs/nnunet_2/lib/python3.11/site-packages/torch/_inductor/compile_fx.py", line 714, in fx_codegen_and_compile
    compiled_fn = graph.compile_to_fn()
                  ^^^^^^^^^^^^^^^^^^^^^
  File "/home/users/tdinoto/miniconda3/envs/nnunet_2/lib/python3.11/site-packages/torch/_inductor/graph.py", line 1307, in compile_to_fn
    return self.compile_to_module().call
           ^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/users/tdinoto/miniconda3/envs/nnunet_2/lib/python3.11/site-packages/torch/_dynamo/utils.py", line 262, in time_wrapper
    r = func(*args, **kwargs)
        ^^^^^^^^^^^^^^^^^^^^^
  File "/home/users/tdinoto/miniconda3/envs/nnunet_2/lib/python3.11/site-packages/torch/_inductor/graph.py", line 1254, in compile_to_module
    mod = PyCodeCache.load_by_key_path(
          ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/users/tdinoto/miniconda3/envs/nnunet_2/lib/python3.11/site-packages/torch/_inductor/codecache.py", line 2160, in load_by_key_path
    exec(code, mod.__dict__, mod.__dict__)
  File "/tmp/torchinductor_tdinoto/eb/cebzbumgl7mtlnmmhj5cb64h7ahumyshvewflwgwtz6mvm64qz3n.py", line 3327, in <module>
    async_compile.wait(globals())
  File "/home/users/tdinoto/miniconda3/envs/nnunet_2/lib/python3.11/site-packages/torch/_inductor/codecache.py", line 2715, in wait
    scope[key] = result.result()
                 ^^^^^^^^^^^^^^^
  File "/home/users/tdinoto/miniconda3/envs/nnunet_2/lib/python3.11/site-packages/torch/_inductor/codecache.py", line 2523, in result
    kernel = self.kernel = _load_kernel(self.kernel_name, self.source_code)
                           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/users/tdinoto/miniconda3/envs/nnunet_2/lib/python3.11/site-packages/torch/_inductor/codecache.py", line 2499, in _load_kernel
    kernel.precompile()
  File "/home/users/tdinoto/miniconda3/envs/nnunet_2/lib/python3.11/site-packages/torch/_inductor/triton_heuristics.py", line 208, in precompile
    compiled_binary, launcher = self._precompile_config(
                                ^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/users/tdinoto/miniconda3/envs/nnunet_2/lib/python3.11/site-packages/torch/_inductor/triton_heuristics.py", line 372, in _precompile_config
    binary._init_handles()
  File "/home/users/tdinoto/miniconda3/envs/nnunet_2/lib/python3.11/site-packages/triton/compiler/compiler.py", line 250, in _init_handles
    self.module, self.function, self.n_regs, self.n_spills = driver.utils.load_binary(
                                                             ^^^^^^^^^^^^^^^^^^^^^^^^^
torch._dynamo.exc.BackendCompilerFailed: backend='inductor' raised:
RuntimeError: Triton Error [CUDA]: device kernel image is invalid

Set TORCH_LOGS="+dynamo" and TORCHDYNAMO_VERBOSE=1 for more information

You can suppress this exception and fall back to eager by setting:
    import torch._dynamo
    torch._dynamo.config.suppress_errors = True

Exception in thread Thread-2 (results_loop):
Traceback (most recent call last):
  File "/home/users/tdinoto/miniconda3/envs/nnunet_2/lib/python3.11/threading.py", line 1045, in _bootstrap_inner
    self.run()
  File "/home/users/tdinoto/miniconda3/envs/nnunet_2/lib/python3.11/threading.py", line 982, in run
    self._target(*self._args, **self._kwargs)
  File "/home/users/tdinoto/miniconda3/envs/nnunet_2/lib/python3.11/site-packages/batchgenerators/dataloading/nondet_multi_threaded_augmenter.py", line 125, in results_loop
    raise e
  File "/home/users/tdinoto/miniconda3/envs/nnunet_2/lib/python3.11/site-packages/batchgenerators/dataloading/nondet_multi_threaded_augmenter.py", line 103, in results_loop
    raise RuntimeError("One or more background workers are no longer alive. Exiting. Please check the "
RuntimeError: One or more background workers are no longer alive. Exiting. Please check the print statements above for the actual error message

tommydino93 commented 5 months ago

Ah, after investigating a bit more I found a workaround that seems to do the job. If I run:

$ TORCHDYNAMO_DISABLE=1 OMP_NUM_THREADS=1 CUDA_VISIBLE_DEVICES=1 nnUNetv2_train 200 2d 0 -p nnUNetResEncUNetMPlans

training starts normally.

Hope this helps!
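For anyone debugging this further: the traceback above fails inside torch.compile/Triton, not in nnU-Net itself, and TORCHDYNAMO_DISABLE=1 simply bypasses torch.compile. A minimal standalone repro (a sketch, assuming a CUDA GPU is visible to the process) can tell you whether the PyTorch/Triton/driver combination itself is broken:

```python
# Minimal torch.compile smoke test, independent of nnU-Net.
import torch

@torch.compile
def toy(x):
    return torch.relu(x) * 2.0

if torch.cuda.is_available():
    # If this raises the same "Triton Error [CUDA]: device kernel image is invalid",
    # the problem is the environment (PyTorch/Triton/driver), not nnU-Net.
    print(toy(torch.randn(16, device="cuda")))
else:
    print("No CUDA device visible to this process.")
```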

fitzjalen commented 4 months ago

Same problem as omaruus99. But it looks like I can't use --ipc=host, because the model is deployed to the cloud and docker run is executed in the background. Are there any other options to fix this error? It only works with nnUNet_n_proc_DA=0, but then it takes about 10-13 minutes per epoch. @FabianIsensee
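Not an answer from the maintainers, just a sketch: when the container is started programmatically rather than by a docker run you control, the same effect can usually be achieved wherever the container is created, for example via the Docker SDK for Python (ipc_mode and shm_size are documented docker-py parameters; the image name, command, and size below are placeholders):

```python
# Illustrative: launch the training container with host IPC or a larger /dev/shm
# from Python instead of a hand-typed `docker run` (GPU flags omitted for brevity).
import docker

client = docker.from_env()
container = client.containers.run(
    "my-nnunet-image:latest",   # placeholder image name
    command="nnUNetv2_train 004 3d_fullres 0 -tr nnUNetTrainer",
    ipc_mode="host",            # equivalent of `docker run --ipc=host`
    # shm_size="8g",            # alternative: enlarge /dev/shm instead of host IPC
    detach=True,
)
print(container.id)
```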