Closed: omaruus99 closed this issue 9 months ago.
Does this only happen in the Docker container and not on your local machine? It definitely seems like you are unable to load the images properly with your dataloaders. Errors in multiprocessing are usually very obfuscated. I would recommend replacing the multiprocessed dataloader with a SingleThreadedAugmenter, as it will provide you and me with more information on what the reason is here.
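For reference, here is a minimal sketch of what I mean (assumptions: batchgenerators is installed, and the DummyLoader below is only a placeholder for whatever dataloader and transform your pipeline actually builds):

import numpy as np
from batchgenerators.dataloading.data_loader import SlimDataLoaderBase
from batchgenerators.dataloading.single_threaded_augmenter import SingleThreadedAugmenter

class DummyLoader(SlimDataLoaderBase):
    # Stand-in for the real dataloader; put the actual image loading code here.
    def generate_train_batch(self):
        return {'data': np.random.rand(2, 1, 64, 64).astype(np.float32),
                'seg': np.zeros((2, 1, 64, 64), dtype=np.int16)}

# SingleThreadedAugmenter runs loading and augmentation in the main process,
# so the real exception is raised directly instead of the generic
# "background workers are no longer alive" message.
gen = SingleThreadedAugmenter(DummyLoader(None, batch_size=2), None)  # no transform in this toy example
batch = next(gen)  # any failure in the loader/transform surfaces right here
print(batch['data'].shape)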
Have you added --ipc=host or increased the shared memory amount?
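For example, starting the container with something along these lines (the image name and the training command are placeholders):
docker run --gpus all --ipc=host your_nnunet_image nnUNetv2_train 004 3d_fullres 0
or, if sharing the host IPC namespace is not an option:
docker run --gpus all --shm-size=8g your_nnunet_image nnUNetv2_train 004 3d_fullres 0
PyTorch's dataloader workers exchange batches via shared memory (/dev/shm), which Docker caps at 64 MB by default, so either flag lifts that limit.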
Hi, the above issue was resolved by using -c 2d (if you are using 2D images); add this argument during preprocessing, e.g.: nnUNetv2_plan_and_preprocess -d 001 --verify_dataset_integrity -c 2d
Hi @revanb88
Do you know why we should add "-c" before 2D?
Hi @hdnminh, the -d argument represents the list of dataset IDs; 001 is the dataset ID. Check this command: nnUNetv2_plan_and_preprocess --help
Hi @revanb88, sorry, that was a typo on my part. I meant to ask about "-c", not "-d".
@FabianIsensee @TaWald @hdnminh @revanb88 I've found a solution: run this command to start training: nnUNet_n_proc_DA=0 nnUNetv2_train 004 3d_fullres 0 -tr nnUNetTrainer -device cuda
This is not a solution. It will destroy your training speed because data augmentation will run as part of the main Python process instead of in background workers. You seem to have a problem with spawning/using those. Can you please get back to my question about the shared memory size and --ipc=host when using Docker?
@FabianIsensee Yes, it works with --ipc=host, and the training speed is optimal :)
Hi, I'm having the same issue using an Anaconda virtual environment, and I have checked that the space occupied by the virtual env isn't unreasonable. Besides the problem mentioned above, it also seems that I'm on an M1 machine without NVIDIA/CUDA support, so my only choice of -device would be mps, but it is not supported.
############################ INFO: You are using the old nnU-Net default plans. We have updated our recommendations. Please consider using those instead! Read more here: https://github.com/MIC-DKFZ/nnUNet/blob/master/documentation/resenc_presets.md ############################
Using device: cuda:0 /opt/anaconda3/lib/python3.11/site-packages/torch/cuda/amp/grad_scaler.py:126: UserWarning: torch.cuda.amp.GradScaler is enabled, but CUDA is not available. Disabling. warnings.warn(
####################################################################### Please cite the following paper when using nnU-Net: Isensee, F., Jaeger, P. F., Kohl, S. A., Petersen, J., & Maier-Hein, K. H. (2021). nnU-Net: a self-configuring method for deep learning-based biomedical image segmentation. Nature methods, 18(2), 203-211. #######################################################################
2024-05-19 03:51:16.831520: do_dummy_2d_data_aug: False
2024-05-19 03:51:16.833231: Using splits from existing split file: /Users/simonansm/nnUNet/nnUNetFrame/DATASET/nnUNet_preprocessed/Dataset001_BrainTumour/splits_final.json
2024-05-19 03:51:16.833469: The split file contains 5 splits.
2024-05-19 03:51:16.833508: Desired fold for training: 0
2024-05-19 03:51:16.833551: This split has 387 training and 97 validation cases.
/Users/simonansm/nnUNet/nnunetv2/training/dataloading/data_loader_2d.py:107: UserWarning: Cannot set number of intraop threads after parallel work has started or after set_num_threads call when using native parallel backend (Triggered internally at /Users/runner/work/_temp/anaconda/conda-bld/pytorch_1711403251597/work/aten/src/ATen/ParallelNative.cpp:228.)
torch.set_num_threads(torch_nthreads)
[the two lines above are repeated once per dataloader worker]
Traceback (most recent call last):
File "/opt/anaconda3/bin/nnUNetv2_train", line 8, in
Any suggested fix for these two problems? I would really appreciate it. @FabianIsensee
Hi, I'm running into the same problem. Does anyone have a solution? Thanks in advance. CUDA_VISIBLE_DEVICES=7 nnUNetv2_train 100 3d_fullres 3 --npz --c
############################ INFO: You are using the old nnU-Net default plans. We have updated our recommendations. Please consider using those instead! Read more here: https://github.com/MIC-DKFZ/nnUNet/blob/master/documentation/resenc_presets.md ############################
Using device: cuda:0 /opt/conda/lib/python3.10/site-packages/torch/amp/grad_scaler.py:131: UserWarning: torch.cuda.amp.GradScaler is enabled, but CUDA is not available. Disabling. warnings.warn(
####################################################################### Please cite the following paper when using nnU-Net: Isensee, F., Jaeger, P. F., Kohl, S. A., Petersen, J., & Maier-Hein, K. H. (2021). nnU-Net: a self-configuring method for deep learning-based biomedical image segmentation. Nature methods, 18(2), 203-211. #######################################################################
WARNING: Cannot continue training because there seems to be no checkpoint available to continue from. Starting a new training...
2024-05-22 07:20:50.363845: do_dummy_2d_data_aug: False
2024-05-22 07:20:50.378673: Using splits from existing split file: /home/pyuser/data/nnUNet_preprocessed/Dataset100_Autopet/splits_final.json
2024-05-22 07:20:50.381199: The split file contains 5 splits.
2024-05-22 07:20:50.381271: Desired fold for training: 3
2024-05-22 07:20:50.381316: This split has 1291 training and 323 validation cases.
Traceback (most recent call last):
File "/opt/conda/lib/python3.10/multiprocessing/queues.py", line 244, in _feed
obj = _ForkingPickler.dumps(obj)
File "/opt/conda/lib/python3.10/multiprocessing/reduction.py", line 51, in dumps
cls(buf, protocol).dump(obj)
File "/opt/conda/lib/python3.10/site-packages/torch/multiprocessing/reductions.py", line 568, in reduce_storage
fd, size = storage._share_fd_cpu_()
File "/opt/conda/lib/python3.10/site-packages/torch/storage.py", line 304, in wrapper
return fn(self, *args, **kwargs)
File "/opt/conda/lib/python3.10/site-packages/torch/storage.py", line 374, in _share_fdcpu
return super()._share_fdcpu(*args, *kwargs)
RuntimeError: unable to write to file : No space left on device (28)
Exception in thread Thread-1 (results_loop):
Traceback (most recent call last):
File "/opt/conda/lib/python3.10/threading.py", line 1016, in _bootstrap_inner
self.run()
File "/opt/conda/lib/python3.10/threading.py", line 953, in run
self._target(*self._args, **self._kwargs)
File "/opt/conda/lib/python3.10/site-packages/batchgenerators/dataloading/nondet_multi_threaded_augmenter.py", line 125, in results_loop
raise e
File "/opt/conda/lib/python3.10/site-packages/batchgenerators/dataloading/nondet_multi_threaded_augmenter.py", line 103, in results_loop
raise RuntimeError("One or more background workers are no longer alive. Exiting. Please check the "
RuntimeError: One or more background workers are no longer alive. Exiting. Please check the print statements above for the actual error message
Exception in thread Thread-2 (results_loop):
Traceback (most recent call last):
File "/opt/conda/lib/python3.10/threading.py", line 1016, in _bootstrap_inner
self.run()
File "/opt/conda/lib/python3.10/threading.py", line 953, in run
self._target(*self._args, **self._kwargs)
File "/opt/conda/lib/python3.10/site-packages/batchgenerators/dataloading/nondet_multi_threaded_augmenter.py", line 125, in results_loop
raise e
File "/opt/conda/lib/python3.10/site-packages/batchgenerators/dataloading/nondet_multi_threaded_augmenter.py", line 103, in results_loop
raise RuntimeError("One or more background workers are no longer alive. Exiting. Please check the "
RuntimeError: One or more background workers are no longer alive. Exiting. Please check the print statements above for the actual error message
Traceback (most recent call last):
File "/opt/conda/bin/nnUNetv2_train", line 8, in
Hi, thanks for the great package! I'm also facing a RuntimeError: Triton Error when training from the PyCharm terminal with a conda environment. Specifically, when running:
$ CUDA_VISIBLE_DEVICES=1 nnUNetv2_train 200 2d 0 -p nnUNetResEncUNetMPlans
I also tried adding OMP_NUM_THREADS=1 before CUDA_VISIBLE_DEVICES=1, as suggested in other posts, but the problem persists.
Any idea what I could try? Thanks a lot in advance!
Here's the traceback:
Using device: cuda:0
#######################################################################
Please cite the following paper when using nnU-Net:
Isensee, F., Jaeger, P. F., Kohl, S. A., Petersen, J., & Maier-Hein, K. H. (2021). nnU-Net: a self-configuring method for deep learning-based biomedical image segmentation. Nature methods, 18(2), 203-211.
#######################################################################
2024-06-03 14:59:56.620792: do_dummy_2d_data_aug: False
2024-06-03 14:59:56.621422: Using splits from existing split file: /ssd/tdinoto/CVSnet_v3_TDN/CVSnet_v4_TDN_nnUnet/nnUNet_preprocessed/Dataset200_CVSNet/splits_final.json
2024-06-03 14:59:56.621546: The split file contains 5 splits.
2024-06-03 14:59:56.621580: Desired fold for training: 0
2024-06-03 14:59:56.621608: This split has 105 training and 27 validation cases.
using pin_memory on device 0
using pin_memory on device 0
2024-06-03 15:00:02.816137: Using torch.compile...
/home/users/tdinoto/miniconda3/envs/nnunet_2/lib/python3.11/site-packages/torch/optim/lr_scheduler.py:28: UserWarning: The verbose parameter is deprecated. Please use get_last_lr() to access the learning rate.
warnings.warn("The verbose parameter is deprecated. Please use get_last_lr() "
This is the configuration used by this training:
Configuration name: 3d_fullres
{'data_identifier': 'nnUNetPlans_3d_fullres', 'preprocessor_name': 'DefaultPreprocessor', 'batch_size': 2, 'patch_size': [112, 128, 160], 'median_image_size_in_voxels': [245.0, 253.0, 324.0], 'spacing': [0.5500007271766663, 0.5357142686843872, 0.5357142686843872], 'normalization_schemes': ['ZScoreNormalization'], 'use_mask_for_norm': [True], 'resampling_fn_data': 'resample_data_or_seg_to_shape', 'resampling_fn_seg': 'resample_data_or_seg_to_shape', 'resampling_fn_data_kwargs': {'is_seg': False, 'order': 3, 'order_z': 0, 'force_separate_z': None}, 'resampling_fn_seg_kwargs': {'is_seg': True, 'order': 1, 'order_z': 0, 'force_separate_z': None}, 'resampling_fn_probabilities': 'resample_data_or_seg_to_shape', 'resampling_fn_probabilities_kwargs': {'is_seg': False, 'order': 1, 'order_z': 0, 'force_separate_z': None}, 'architecture': {'network_class_name': 'dynamic_network_architectures.architectures.unet.ResidualEncoderUNet', 'arch_kwargs': {'n_stages': 6, 'features_per_stage': [32, 64, 128, 256, 320, 320], 'conv_op': 'torch.nn.modules.conv.Conv3d', 'kernel_sizes': [[3, 3, 3], [3, 3, 3], [3, 3, 3], [3, 3, 3], [3, 3, 3], [3, 3, 3]], 'strides': [[1, 1, 1], [2, 2, 2], [2, 2, 2], [2, 2, 2], [2, 2, 2], [1, 2, 2]], 'n_blocks_per_stage': [1, 3, 4, 6, 6, 6], 'n_conv_per_stage_decoder': [1, 1, 1, 1, 1], 'conv_bias': True, 'norm_op': 'torch.nn.modules.instancenorm.InstanceNorm3d', 'norm_op_kwargs': {'eps': 1e-05, 'affine': True}, 'dropout_op': None, 'dropout_op_kwargs': None, 'nonlin': 'torch.nn.LeakyReLU', 'nonlin_kwargs': {'inplace': True}, 'deep_supervision': True}, '_kw_requires_import': ['conv_op', 'norm_op', 'dropout_op', 'nonlin']}, 'batch_dice': True}
These are the global plan.json settings:
{'dataset_name': 'Dataset200_CVSNet', 'plans_name': 'nnUNetResEncUNetMPlans', 'original_median_spacing_after_transp': [0.5500007271766663, 0.5357142686843872, 0.5357142686843872], 'original_median_shape_after_transp': [245, 253, 324], 'image_reader_writer': 'SimpleITKIO', 'transpose_forward': [2, 0, 1], 'transpose_backward': [1, 2, 0], 'experiment_planner_used': 'nnUNetPlannerResEncM', 'label_manager': 'LabelManager', 'foreground_intensity_properties_per_channel': {'0': {'max': 4483513319424.0, 'mean': 29516924928.0, 'median': 23683.400390625, 'min': 0.0, 'percentile_00_5': 7712.2470703125, 'percentile_99_5': 2136903647232.0, 'std': 228257710080.0}}}
2024-06-03 15:00:04.229930: unpacking dataset...
2024-06-03 15:00:16.017693: unpacking done...
2024-06-03 15:00:16.019192: Unable to plot network architecture: nnUNet_compile is enabled!
2024-06-03 15:00:16.030707:
2024-06-03 15:00:16.031127: Epoch 0
2024-06-03 15:00:16.031347: Current learning rate: 0.01
Traceback (most recent call last):
File "/home/users/tdinoto/miniconda3/envs/nnunet_2/bin/nnUNetv2_train", line 8, in <module>
sys.exit(run_training_entry())
^^^^^^^^^^^^^^^^^^^^
File "/home/users/tdinoto/miniconda3/envs/nnunet_2/nnUNet/nnunetv2/run/run_training.py", line 275, in run_training_entry
run_training(args.dataset_name_or_id, args.configuration, args.fold, args.tr, args.p, args.pretrained_weights,
File "/home/users/tdinoto/miniconda3/envs/nnunet_2/nnUNet/nnunetv2/run/run_training.py", line 211, in run_training
nnunet_trainer.run_training()
File "/home/users/tdinoto/miniconda3/envs/nnunet_2/nnUNet/nnunetv2/training/nnUNetTrainer/nnUNetTrainer.py", line 1370, in run_training
train_outputs.append(self.train_step(next(self.dataloader_train)))
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/users/tdinoto/miniconda3/envs/nnunet_2/nnUNet/nnunetv2/training/nnUNetTrainer/nnUNetTrainer.py", line 994, in train_step
output = self.network(data)
^^^^^^^^^^^^^^^^^^
File "/home/users/tdinoto/miniconda3/envs/nnunet_2/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/users/tdinoto/miniconda3/envs/nnunet_2/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl
return forward_call(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/users/tdinoto/miniconda3/envs/nnunet_2/lib/python3.11/site-packages/torch/_dynamo/eval_frame.py", line 451, in _fn
return fn(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^
File "/home/users/tdinoto/miniconda3/envs/nnunet_2/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/users/tdinoto/miniconda3/envs/nnunet_2/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl
return forward_call(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/users/tdinoto/miniconda3/envs/nnunet_2/lib/python3.11/site-packages/torch/_dynamo/convert_frame.py", line 921, in catch_errors
return callback(frame, cache_entry, hooks, frame_state, skip=1)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/users/tdinoto/miniconda3/envs/nnunet_2/lib/python3.11/site-packages/torch/_dynamo/convert_frame.py", line 786, in _convert_frame
result = inner_convert(
^^^^^^^^^^^^^^
File "/home/users/tdinoto/miniconda3/envs/nnunet_2/lib/python3.11/site-packages/torch/_dynamo/convert_frame.py", line 400, in _convert_frame_assert
return _compile(
^^^^^^^^^
File "/home/users/tdinoto/miniconda3/envs/nnunet_2/lib/python3.11/contextlib.py", line 81, in inner
return func(*args, **kwds)
^^^^^^^^^^^^^^^^^^^
File "/home/users/tdinoto/miniconda3/envs/nnunet_2/lib/python3.11/site-packages/torch/_dynamo/convert_frame.py", line 676, in _compile
guarded_code = compile_inner(code, one_graph, hooks, transform)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/users/tdinoto/miniconda3/envs/nnunet_2/lib/python3.11/site-packages/torch/_dynamo/utils.py", line 262, in time_wrapper
r = func(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^
File "/home/users/tdinoto/miniconda3/envs/nnunet_2/lib/python3.11/site-packages/torch/_dynamo/convert_frame.py", line 535, in compile_inner
out_code = transform_code_object(code, transform)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/users/tdinoto/miniconda3/envs/nnunet_2/lib/python3.11/site-packages/torch/_dynamo/bytecode_transformation.py", line 1036, in transform_code_object
transformations(instructions, code_options)
File "/home/users/tdinoto/miniconda3/envs/nnunet_2/lib/python3.11/site-packages/torch/_dynamo/convert_frame.py", line 165, in _fn
return fn(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^
File "/home/users/tdinoto/miniconda3/envs/nnunet_2/lib/python3.11/site-packages/torch/_dynamo/convert_frame.py", line 500, in transform
tracer.run()
File "/home/users/tdinoto/miniconda3/envs/nnunet_2/lib/python3.11/site-packages/torch/_dynamo/symbolic_convert.py", line 2149, in run
super().run()
File "/home/users/tdinoto/miniconda3/envs/nnunet_2/lib/python3.11/site-packages/torch/_dynamo/symbolic_convert.py", line 810, in run
and self.step()
^^^^^^^^^^^
File "/home/users/tdinoto/miniconda3/envs/nnunet_2/lib/python3.11/site-packages/torch/_dynamo/symbolic_convert.py", line 773, in step
getattr(self, inst.opname)(inst)
File "/home/users/tdinoto/miniconda3/envs/nnunet_2/lib/python3.11/site-packages/torch/_dynamo/symbolic_convert.py", line 2268, in RETURN_VALUE
self.output.compile_subgraph(
File "/home/users/tdinoto/miniconda3/envs/nnunet_2/lib/python3.11/site-packages/torch/_dynamo/output_graph.py", line 991, in compile_subgraph
self.compile_and_call_fx_graph(tx, pass2.graph_output_vars(), root)
File "/home/users/tdinoto/miniconda3/envs/nnunet_2/lib/python3.11/contextlib.py", line 81, in inner
return func(*args, **kwds)
^^^^^^^^^^^^^^^^^^^
File "/home/users/tdinoto/miniconda3/envs/nnunet_2/lib/python3.11/site-packages/torch/_dynamo/output_graph.py", line 1168, in compile_and_call_fx_graph
compiled_fn = self.call_user_compiler(gm)
^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/users/tdinoto/miniconda3/envs/nnunet_2/lib/python3.11/site-packages/torch/_dynamo/utils.py", line 262, in time_wrapper
r = func(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^
File "/home/users/tdinoto/miniconda3/envs/nnunet_2/lib/python3.11/site-packages/torch/_dynamo/output_graph.py", line 1241, in call_user_compiler
raise BackendCompilerFailed(self.compiler_fn, e).with_traceback(
File "/home/users/tdinoto/miniconda3/envs/nnunet_2/lib/python3.11/site-packages/torch/_dynamo/output_graph.py", line 1222, in call_user_compiler
compiled_fn = compiler_fn(gm, self.example_inputs())
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/users/tdinoto/miniconda3/envs/nnunet_2/lib/python3.11/site-packages/torch/_dynamo/repro/after_dynamo.py", line 117, in debug_wrapper
compiled_gm = compiler_fn(gm, example_inputs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/users/tdinoto/miniconda3/envs/nnunet_2/lib/python3.11/site-packages/torch/__init__.py", line 1729, in __call__
return compile_fx(model_, inputs_, config_patches=self.config)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/users/tdinoto/miniconda3/envs/nnunet_2/lib/python3.11/contextlib.py", line 81, in inner
return func(*args, **kwds)
^^^^^^^^^^^^^^^^^^^
File "/home/users/tdinoto/miniconda3/envs/nnunet_2/lib/python3.11/site-packages/torch/_inductor/compile_fx.py", line 1330, in compile_fx
return aot_autograd(
^^^^^^^^^^^^^
File "/home/users/tdinoto/miniconda3/envs/nnunet_2/lib/python3.11/site-packages/torch/_dynamo/backends/common.py", line 58, in compiler_fn
cg = aot_module_simplified(gm, example_inputs, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/users/tdinoto/miniconda3/envs/nnunet_2/lib/python3.11/site-packages/torch/_functorch/aot_autograd.py", line 903, in aot_module_simplified
compiled_fn = create_aot_dispatcher_function(
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/users/tdinoto/miniconda3/envs/nnunet_2/lib/python3.11/site-packages/torch/_dynamo/utils.py", line 262, in time_wrapper
r = func(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^
File "/home/users/tdinoto/miniconda3/envs/nnunet_2/lib/python3.11/site-packages/torch/_functorch/aot_autograd.py", line 628, in create_aot_dispatcher_function
compiled_fn = compiler_fn(flat_fn, fake_flat_args, aot_config, fw_metadata=fw_metadata)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/users/tdinoto/miniconda3/envs/nnunet_2/lib/python3.11/site-packages/torch/_functorch/_aot_autograd/runtime_wrappers.py", line 443, in aot_wrapper_dedupe
return compiler_fn(flat_fn, leaf_flat_args, aot_config, fw_metadata=fw_metadata)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/users/tdinoto/miniconda3/envs/nnunet_2/lib/python3.11/site-packages/torch/_functorch/_aot_autograd/runtime_wrappers.py", line 648, in aot_wrapper_synthetic_base
return compiler_fn(flat_fn, flat_args, aot_config, fw_metadata=fw_metadata)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/users/tdinoto/miniconda3/envs/nnunet_2/lib/python3.11/site-packages/torch/_functorch/_aot_autograd/jit_compile_runtime_wrappers.py", line 352, in aot_dispatch_autograd
compiled_fw_func = aot_config.fw_compiler(fw_module, adjusted_flat_args)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/users/tdinoto/miniconda3/envs/nnunet_2/lib/python3.11/site-packages/torch/_dynamo/utils.py", line 262, in time_wrapper
r = func(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^
File "/home/users/tdinoto/miniconda3/envs/nnunet_2/lib/python3.11/site-packages/torch/_inductor/compile_fx.py", line 1257, in fw_compiler_base
return inner_compile(
^^^^^^^^^^^^^^
File "/home/users/tdinoto/miniconda3/envs/nnunet_2/lib/python3.11/site-packages/torch/_dynamo/repro/after_aot.py", line 83, in debug_wrapper
inner_compiled_fn = compiler_fn(gm, example_inputs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/users/tdinoto/miniconda3/envs/nnunet_2/lib/python3.11/site-packages/torch/_inductor/debug.py", line 304, in inner
return fn(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^
File "/home/users/tdinoto/miniconda3/envs/nnunet_2/lib/python3.11/contextlib.py", line 81, in inner
return func(*args, **kwds)
^^^^^^^^^^^^^^^^^^^
File "/home/users/tdinoto/miniconda3/envs/nnunet_2/lib/python3.11/contextlib.py", line 81, in inner
return func(*args, **kwds)
^^^^^^^^^^^^^^^^^^^
File "/home/users/tdinoto/miniconda3/envs/nnunet_2/lib/python3.11/site-packages/torch/_dynamo/utils.py", line 262, in time_wrapper
r = func(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^
File "/home/users/tdinoto/miniconda3/envs/nnunet_2/lib/python3.11/site-packages/torch/_inductor/compile_fx.py", line 438, in compile_fx_inner
compiled_graph = fx_codegen_and_compile(
^^^^^^^^^^^^^^^^^^^^^^^
File "/home/users/tdinoto/miniconda3/envs/nnunet_2/lib/python3.11/site-packages/torch/_inductor/compile_fx.py", line 714, in fx_codegen_and_compile
compiled_fn = graph.compile_to_fn()
^^^^^^^^^^^^^^^^^^^^^
File "/home/users/tdinoto/miniconda3/envs/nnunet_2/lib/python3.11/site-packages/torch/_inductor/graph.py", line 1307, in compile_to_fn
return self.compile_to_module().call
^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/users/tdinoto/miniconda3/envs/nnunet_2/lib/python3.11/site-packages/torch/_dynamo/utils.py", line 262, in time_wrapper
r = func(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^
File "/home/users/tdinoto/miniconda3/envs/nnunet_2/lib/python3.11/site-packages/torch/_inductor/graph.py", line 1254, in compile_to_module
mod = PyCodeCache.load_by_key_path(
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/users/tdinoto/miniconda3/envs/nnunet_2/lib/python3.11/site-packages/torch/_inductor/codecache.py", line 2160, in load_by_key_path
exec(code, mod.__dict__, mod.__dict__)
File "/tmp/torchinductor_tdinoto/eb/cebzbumgl7mtlnmmhj5cb64h7ahumyshvewflwgwtz6mvm64qz3n.py", line 3327, in <module>
async_compile.wait(globals())
File "/home/users/tdinoto/miniconda3/envs/nnunet_2/lib/python3.11/site-packages/torch/_inductor/codecache.py", line 2715, in wait
scope[key] = result.result()
^^^^^^^^^^^^^^^
File "/home/users/tdinoto/miniconda3/envs/nnunet_2/lib/python3.11/site-packages/torch/_inductor/codecache.py", line 2523, in result
kernel = self.kernel = _load_kernel(self.kernel_name, self.source_code)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/users/tdinoto/miniconda3/envs/nnunet_2/lib/python3.11/site-packages/torch/_inductor/codecache.py", line 2499, in _load_kernel
kernel.precompile()
File "/home/users/tdinoto/miniconda3/envs/nnunet_2/lib/python3.11/site-packages/torch/_inductor/triton_heuristics.py", line 208, in precompile
compiled_binary, launcher = self._precompile_config(
^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/users/tdinoto/miniconda3/envs/nnunet_2/lib/python3.11/site-packages/torch/_inductor/triton_heuristics.py", line 372, in _precompile_config
binary._init_handles()
File "/home/users/tdinoto/miniconda3/envs/nnunet_2/lib/python3.11/site-packages/triton/compiler/compiler.py", line 250, in _init_handles
self.module, self.function, self.n_regs, self.n_spills = driver.utils.load_binary(
^^^^^^^^^^^^^^^^^^^^^^^^^
torch._dynamo.exc.BackendCompilerFailed: backend='inductor' raised:
RuntimeError: Triton Error [CUDA]: device kernel image is invalid
Set TORCH_LOGS="+dynamo" and TORCHDYNAMO_VERBOSE=1 for more information
You can suppress this exception and fall back to eager by setting:
import torch._dynamo
torch._dynamo.config.suppress_errors = True
Exception in thread Thread-2 (results_loop):
Traceback (most recent call last):
File "/home/users/tdinoto/miniconda3/envs/nnunet_2/lib/python3.11/threading.py", line 1045, in _bootstrap_inner
self.run()
File "/home/users/tdinoto/miniconda3/envs/nnunet_2/lib/python3.11/threading.py", line 982, in run
self._target(*self._args, **self._kwargs)
File "/home/users/tdinoto/miniconda3/envs/nnunet_2/lib/python3.11/site-packages/batchgenerators/dataloading/nondet_multi_threaded_augmenter.py", line 125, in results_loop
raise e
File "/home/users/tdinoto/miniconda3/envs/nnunet_2/lib/python3.11/site-packages/batchgenerators/dataloading/nondet_multi_threaded_augmenter.py", line 103, in results_loop
raise RuntimeError("One or more background workers are no longer alive. Exiting. Please check the "
RuntimeError: One or more background workers are no longer alive. Exiting. Please check the print statements above for the actual error message
Ah, actually by investigating a bit more I found a workaround which seems to do the job. If I run:
$ TORCHDYNAMO_DISABLE=1 OMP_NUM_THREADS=1 CUDA_VISIBLE_DEVICES=1 nnUNetv2_train 200 2d 0 -p nnUNetResEncUNetMPlans
training starts normally.
Hope this helps!
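As a side note (this is an assumption on my part, not something I have verified): the log above shows "Unable to plot network architecture: nnUNet_compile is enabled!", which suggests nnU-Net itself reads an nnUNet_compile environment variable, so disabling compilation at that level might achieve the same thing without touching the dynamo variables, e.g.:
nnUNet_compile=f OMP_NUM_THREADS=1 CUDA_VISIBLE_DEVICES=1 nnUNetv2_train 200 2d 0 -p nnUNetResEncUNetMPlans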
Same problem as omaruus99. But it looks like I can't use --ipc=host, because the model is deployed to the cloud and docker run is executed in the background. Are there any other options to fix this error? It only works with nnUNet_n_proc_DA=0, but then it takes about 10-13 minutes per epoch.
@FabianIsensee
Hello @FabianIsensee, I get this error when I run training inside a docker container:
root@cc5d09285d9b:/nnunet# nnUNetv2_train 004 3d_fullres 0 -tr nnUNetTrainer
Using device: cuda:0
####################################################################### Please cite the following paper when using nnU-Net: Isensee, F., Jaeger, P. F., Kohl, S. A., Petersen, J., & Maier-Hein, K. H. (2021). nnU-Net: a self-configuring method for deep learning-based biomedical image segmentation. Nature methods, 18(2), 203-211. #######################################################################
This is the configuration used by this training: Configuration name: 3d_fullres {'data_identifier': 'nnUNetPlans_3d_fullres', 'preprocessor_name': 'DefaultPreprocessor', 'batch_size': 9, 'patch_size': [40, 56, 40], 'median_image_size_in_voxels': [36.0, 50.0, 35.0], 'spacing': [1.0, 1.0, 1.0], 'normalization_schemes': ['ZScoreNormalization'], 'use_mask_for_norm': [False], 'UNet_class_name': 'PlainConvUNet', 'UNet_base_num_features': 32, 'n_conv_per_stage_encoder': [2, 2, 2, 2], 'n_conv_per_stage_decoder': [2, 2, 2], 'num_pool_per_axis': [3, 3, 3], 'pool_op_kernel_sizes': [[1, 1, 1], [2, 2, 2], [2, 2, 2], [2, 2, 2]], 'conv_kernel_sizes': [[3, 3, 3], [3, 3, 3], [3, 3, 3], [3, 3, 3]], 'unet_max_num_features': 320, 'resampling_fn_data': 'resample_data_or_seg_to_shape', 'resampling_fn_seg': 'resample_data_or_seg_to_shape', 'resampling_fn_data_kwargs': {'is_seg': False, 'order': 3, 'order_z': 0, 'force_separate_z': None}, 'resampling_fn_seg_kwargs': {'is_seg': True, 'order': 1, 'order_z': 0, 'force_separate_z': None}, 'resampling_fn_probabilities': 'resample_data_or_seg_to_shape', 'resampling_fn_probabilities_kwargs': {'is_seg': False, 'order': 1, 'order_z': 0, 'force_separate_z': None}, 'batch_dice': False}
These are the global plan.json settings: {'dataset_name': 'Dataset004_Hippocampus', 'plans_name': 'nnUNetPlans', 'original_median_spacing_after_transp': [1.0, 1.0, 1.0], 'original_median_shape_after_transp': [36, 50, 35], 'image_reader_writer': 'SimpleITKIO', 'transpose_forward': [0, 1, 2], 'transpose_backward': [0, 1, 2], 'experiment_planner_used': 'ExperimentPlanner', 'label_manager': 'LabelManager', 'foreground_intensity_properties_per_channel': {'0': {'max': 486420.21875, 'mean': 22360.326171875, 'median': 362.88250732421875, 'min': 0.0, 'percentile_00_5': 28.0, 'percentile_99_5': 277682.03125, 'std': 60656.1328125}}}
2024-01-16 09:46:10.106258: unpacking dataset...
2024-01-16 09:46:22.449987: unpacking done...
2024-01-16 09:46:22.480390: do_dummy_2d_data_aug: False
2024-01-16 09:46:22.521797: Creating new 5-fold cross-validation split...
2024-01-16 09:46:22.581434: Desired fold for training: 0
2024-01-16 09:46:22.603892: This split has 208 training and 52 validation cases.
2024-01-16 09:46:23.022059: Unable to plot network architecture:
2024-01-16 09:46:23.042216: No module named 'hiddenlayer'
2024-01-16 09:46:23.131182:
2024-01-16 09:46:23.149663: Epoch 0
2024-01-16 09:46:23.171872: Current learning rate: 0.01
using pin_memory on device 0
Exception in thread Thread-5:
Traceback (most recent call last):
File "/usr/local/lib/python3.9/threading.py", line 980, in _bootstrap_inner
self.run()
File "/usr/local/lib/python3.9/threading.py", line 917, in run
self._target(*self._args, **self._kwargs)
File "/usr/local/lib/python3.9/site-packages/batchgenerators/dataloading/nondet_multi_threaded_augmenter.py", line 125, in results_loop
raise e
File "/usr/local/lib/python3.9/site-packages/batchgenerators/dataloading/nondet_multi_threaded_augmenter.py", line 103, in results_loop
raise RuntimeError("One or more background workers are no longer alive. Exiting. Please check the "
RuntimeError: One or more background workers are no longer alive. Exiting. Please check the print statements above for the actual error message
Traceback (most recent call last):
File "/usr/local/bin/nnUNetv2_train", line 8, in <module>
sys.exit(run_training_entry())
File "/nnunet/nnunetv2/run/run_training.py", line 268, in run_training_entry
run_training(args.dataset_name_or_id, args.configuration, args.fold, args.tr, args.p, args.pretrained_weights,
File "/nnunet/nnunetv2/run/run_training.py", line 204, in run_training
nnunet_trainer.run_training()
File "/nnunet/nnunetv2/training/nnUNetTrainer/nnUNetTrainer.py", line 1275, in run_training
train_outputs.append(self.train_step(next(self.dataloader_train)))
File "/usr/local/lib/python3.9/site-packages/batchgenerators/dataloading/nondet_multi_threaded_augmenter.py", line 196, in next
item = self.__get_next_item()
File "/usr/local/lib/python3.9/site-packages/batchgenerators/dataloading/nondet_multi_threaded_augmenter.py", line 181, in __get_next_item
raise RuntimeError("One or more background workers are no longer alive. Exiting. Please check the "
RuntimeError: One or more background workers are no longer alive. Exiting. Please check the print statements above for the actual error message
For information: I have the same error when I use: OMP_NUM_THREADS=1 nnUNetv2_train 004 3d_fullres 0 -tr nnUNetTrainer