MIC-DKFZ / nnUNet

Apache License 2.0

RuntimeError: cuDNN error: CUDNN_STATUS_EXECUTION_FAILED & RuntimeError: One or more background workers are no longer alive. Exiting. Please check the print statements above for the actual error message #1999

Closed zhaoawen closed 7 months ago

zhaoawen commented 7 months ago

Hi, I am a university student and I encountered this kind of problem during training, can you help me solve it?

nnUNetv2_train 040 3d_lowres 0 --npz
Using device: cuda:0

#######################################################################
Please cite the following paper when using nnU-Net:
Isensee, F., Jaeger, P. F., Kohl, S. A., Petersen, J., & Maier-Hein, K. H. (2021). nnU-Net: a self-configuring method for deep learning-based biomedical image segmentation. Nature methods, 18(2), 203-211.
#######################################################################

D:\software\python\envs\nnUNet\lib\site-packages\torch\optim\lr_scheduler.py:28: UserWarning: The verbose parameter is deprecated. Please use get_last_lr() to access the learning rate.
  warnings.warn("The verbose parameter is deprecated. Please use get_last_lr() "

This is the configuration used by this training: Configuration name: 3d_lowres {'data_identifier': 'nnUNetPlans_3d_lowres', 'preprocessor_name': 'DefaultPreprocessor', 'batch_size': 2, 'patch_size': [128, 128, 128], 'median_image_size_in_voxels': [204, 199, 199], 'spacing': [2.0118091537065514, 2.0117834028789936, 2.0117834028789936], 'normalization_schemes': ['CTNormalization'], 'use_mask_for_norm': [False], 'UNet_class_name': 'PlainConvUNet', 'UNet_base_num_features': 32, 'n_conv_per_stage_encoder': [2, 2, 2, 2, 2, 2], 'n_conv_per_stage_decoder': [2, 2, 2, 2, 2], 'num_pool_per_axis': [5, 5, 5], 'pool_op_kernel_sizes': [[1, 1, 1], [2, 2, 2], [2, 2, 2], [2, 2, 2], [2, 2, 2], [2, 2, 2]], 'conv_kernel_sizes': [[3, 3, 3], [3, 3, 3], [3, 3, 3], [3, 3, 3], [3, 3, 3], [3, 3, 3]], 'unet_max_num_features': 320, 'resampling_fn_data': 'resample_data_or_seg_to_shape', 'resampling_fn_seg': 'resample_data_or_seg_to_shape', 'resampling_fn_data_kwargs': {'is_seg': False, 'order': 3, 'order_z': 0, 'force_separate_z': None}, 'resampling_fn_seg_kwargs': {'is_seg': True, 'order': 1, 'order_z': 0, 'force_separate_z': None}, 'resampling_fn_probabilities': 'resample_data_or_seg_to_shape', 'resampling_fn_probabilities_kwargs': {'is_seg': False, 'order': 1, 'order_z': 0, 'force_separate_z': None}, 'batch_dice': False, 'next_stage': '3d_cascade_fullres'}

These are the global plan.json settings: {'dataset_name': 'Dataset040_KiTS', 'plans_name': 'nnUNetPlans', 'original_median_spacing_after_transp': [3.0, 0.78125, 0.78125], 'original_median_shape_after_transp': [108, 512, 512], 'image_reader_writer': 'SimpleITKIO', 'transpose_forward': [2, 0, 1], 'transpose_backward': [1, 2, 0], 'experiment_planner_used': 'ExperimentPlanner', 'label_manager': 'LabelManager', 'foreground_intensity_properties_per_channel': {'0': {'max': 3071.0, 'mean': 102.5714111328125, 'median': 103.0, 'min': -1015.0, 'percentile_00_5': -75.0, 'percentile_99_5': 295.0, 'std': 73.64986419677734}}}

2024-03-10 11:45:54.985133: unpacking dataset...
2024-03-10 11:45:55.586289: unpacking done...
2024-03-10 11:45:55.588306: do_dummy_2d_data_aug: False
2024-03-10 11:45:55.590337: Creating new 5-fold cross-validation split...
2024-03-10 11:45:55.594905: Desired fold for training: 0
2024-03-10 11:45:55.595949: This split has 84 training and 22 validation cases.
2024-03-10 11:45:55.743635: Unable to plot network architecture:
2024-03-10 11:45:55.745655: No module named 'IPython'
2024-03-10 11:45:55.880500:
2024-03-10 11:45:55.882542: Epoch 0
2024-03-10 11:45:55.884908: Current learning rate: 0.01
using pin_memory on device 0
Traceback (most recent call last):
  File "D:\software\python\envs\nnUNet\lib\runpy.py", line 197, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "D:\software\python\envs\nnUNet\lib\runpy.py", line 87, in _run_code
    exec(code, run_globals)
  File "D:\software\python\envs\nnUNet\Scripts\nnUNetv2_train.exe\__main__.py", line 7, in <module>
  File "C:\Users\25416\nnUNet\nnunetv2\run\run_training.py", line 268, in run_training_entry
    run_training(args.dataset_name_or_id, args.configuration, args.fold, args.tr, args.p, args.pretrained_weights,
  File "C:\Users\25416\nnUNet\nnunetv2\run\run_training.py", line 204, in run_training
    nnunet_trainer.run_training()
  File "C:\Users\25416\nnUNet\nnunetv2\training\nnUNetTrainer\nnUNetTrainer.py", line 1275, in run_training
    train_outputs.append(self.train_step(next(self.dataloader_train)))
  File "C:\Users\25416\nnUNet\nnunetv2\training\nnUNetTrainer\nnUNetTrainer.py", line 908, in train_step
    self.grad_scaler.scale(l).backward()
  File "D:\software\python\envs\nnUNet\lib\site-packages\torch\_tensor.py", line 522, in backward
    torch.autograd.backward(
  File "D:\software\python\envs\nnUNet\lib\site-packages\torch\autograd\__init__.py", line 266, in backward
    Variable._execution_engine.run_backward(  # Calls into the C++ engine to run the backward pass
RuntimeError: cuDNN error: CUDNN_STATUS_EXECUTION_FAILED
Exception in thread Thread-4:
Traceback (most recent call last):
  File "D:\software\python\envs\nnUNet\lib\threading.py", line 980, in _bootstrap_inner
    self.run()
  File "D:\software\python\envs\nnUNet\lib\threading.py", line 917, in run
    self._target(*self._args, **self._kwargs)
  File "D:\software\python\envs\nnUNet\lib\site-packages\batchgenerators\dataloading\nondet_multi_threaded_augmenter.py", line 125, in results_loop
    raise e
  File "D:\software\python\envs\nnUNet\lib\site-packages\batchgenerators\dataloading\nondet_multi_threaded_augmenter.py", line 103, in results_loop
    raise RuntimeError("One or more background workers are no longer alive. Exiting. Please check the "
RuntimeError: One or more background workers are no longer alive. Exiting. Please check the print statements above for the actual error message

Please help me. Thanks, much appreciated!

aymuos15 commented 7 months ago

What is your VRAM capacity?

This may be an OOM for which cuda/torch is throwing a vague error.
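One quick way to check is to look at how much VRAM the GPU actually has and how much of it is already in use. A minimal sketch (assuming a CUDA-enabled PyTorch install; device index 0 matches the cuda:0 in your log):

```python
import torch

# Rough VRAM check: report total and currently used memory on GPU 0.
# Assumes PyTorch was installed with CUDA support.
if torch.cuda.is_available():
    props = torch.cuda.get_device_properties(0)
    print(f"GPU: {props.name}")
    print(f"Total VRAM: {props.total_memory / 1024**3:.1f} GB")
    print(f"Allocated:  {torch.cuda.memory_allocated(0) / 1024**3:.2f} GB")
    print(f"Reserved:   {torch.cuda.memory_reserved(0) / 1024**3:.2f} GB")
else:
    print("No CUDA device visible to PyTorch.")
```

Watching nvidia-smi while the first epoch starts is another way to see whether memory fills up right before the crash.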

zhaoawen commented 7 months ago

What is your VRAM capacity?

This may be an OOM for which cuda/torch is throwing a vague error.

Thank you very much! I'm going to try training on Kaggle's cloud servers; wish me luck~

wacyfdyy commented 7 months ago

I'm also experiencing this issue. How can I fix it?

aymuos15 commented 7 months ago

@zhaoawen I remember Kaggle having a 12 GB VRAM limit (I don't know if they have a paid tier).

'For training a GPU with at least 10 GB (popular non-datacenter options are the RTX 2080ti, RTX 3080/3090 or RTX 4080/4090) is required. If using a GPU, it should have at least 4 GB of available (unused) VRAM.'

To be honest, it shouldn't work. But maybe your volumes are also too big? For a volume of size [108, 512, 512] you would need something along the lines of 40 GB of VRAM, and I do not think Kaggle will allow that on the free tier.

@wacyfdyy Unfortunately, the only fix here is getting access to more VRAM, unless you heavily lower the patch size.

Guideline on patch size: 'The patch size is optimized in conjunction with the batch size during the plan and preprocess step for optimal performance at the memory target (~10GB of VRAM). In the dataloader nnUNet extracts random patches of the data, where foreground oversampling is applied to ensure a decent coverage of foreground in the images (foreground can often be <1% of the total image, especially in 3d).'

From: https://github.com/MIC-DKFZ/nnUNet/issues/1975

(Again, CUDA errors are hard to debug. This comment and the one before are based on my experience, so I may be wrong!)
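If you do try lowering the patch size, one rough way to do it (just a sketch, not the official workflow; the file path is an assumption, and the layout follows the plans printout above, where each configuration stores its own patch_size) is to edit nnUNetPlans.json before restarting training:

```python
import json
from pathlib import Path

# Assumed location inside your nnUNet_preprocessed folder; adjust to your setup.
plans_path = Path("nnUNet_preprocessed/Dataset040_KiTS/nnUNetPlans.json")

plans = json.loads(plans_path.read_text())

# The printout above shows patch_size [128, 128, 128] for 3d_lowres.
# A smaller patch reduces VRAM usage; keep each axis divisible by 32
# (2**5, since this configuration pools 5 times per axis).
plans["configurations"]["3d_lowres"]["patch_size"] = [96, 96, 96]

plans_path.write_text(json.dumps(plans, indent=4))
```

Back up the original file first, and keep in mind that a smaller patch also means less context for the network, so accuracy may drop.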

ykirchhoff commented 7 months ago

Hi @zhaoawen and @wacyfdyy,

were you able to solve your issue by lowering the patch size or with any other modification? To me this doesn't look like a GPU OOM problem; that usually gives a pretty clear error message. And with 12GB on Kaggle, you shouldn't have problems with VRAM. Most of the time, this error from batchgenerators indicates RAM OOM, which can be solved by reducing the number of workers by setting nnUNet_n_proc_DA.
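For example, something along these lines (a minimal sketch; the value 4 is an arbitrary choice, lower it further if RAM is still exhausted):

```python
import os
import subprocess

# Launch training with fewer data augmentation workers to reduce RAM usage.
# nnUNet_n_proc_DA is read from the environment, so it has to be set before
# nnU-Net starts; 4 is just an example value.
env = os.environ.copy()
env["nnUNet_n_proc_DA"] = "4"

subprocess.run(["nnUNetv2_train", "040", "3d_lowres", "0", "--npz"], env=env, check=True)
```

Setting the variable in the shell before calling nnUNetv2_train (set on Windows, export on Linux) has the same effect.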

Best, Yannick

zhaoawen commented 7 months ago

Hi @zhaoawen and @wacyfdyy,

were you able to solve your issue by lowering the patch size or with any other modification? To me this doesn't look like a GPU OOM problem; that usually gives a pretty clear error message. And with 12GB on Kaggle, you shouldn't have problems with VRAM. Most of the time, this error from batchgenerators indicates RAM OOM, which can be solved by reducing the number of workers by setting nnUNet_n_proc_DA.

Best, Yannick

Thank you very much for your answer, which is very helpful.

zhaoawen commented 7 months ago

I'm also experiencing this issue. How can I fix it?

"Most of the time, this error from batchgenerators indicates RAM OOM, which can be solved by reducing the number of workers by setting nnUNet_n_proc_DA." Please try this method?

zhaoawen commented 7 months ago

@zhaoawen I remember Kaggle having a 12 GB VRAM limit (I don't know if they have a paid tier).

'For training a GPU with at least 10 GB (popular non-datacenter options are the RTX 2080ti, RTX 3080/3090 or RTX 4080/4090) is required. If using a GPU, it should have at least 4 GB of available (unused) VRAM.'

To be honest, it shouldn't work. But maybe your volumes are also too big? For a volume of size [108, 512, 512] you would need something along the lines of 40 GB of VRAM, and I do not think Kaggle will allow that on the free tier.

@wacyfdyy Unfortunately, the only fix here is getting access to more VRAM, unless you heavily lower the patch size.

Guideline on patch size: 'The patch size is optimized in conjunction with the batch size during the plan and preprocess step for optimal performance at the memory target (~10GB of VRAM). In the dataloader nnUNet extracts random patches of the data, where foreground oversampling is applied to ensure a decent coverage of foreground in the images (foreground can often be <1% of the total image, especially in 3d).'

From: #1975

(Again, CUDA errors are hard to debug. This comment and the one before are based on my experience, so I may be wrong!)

For now, my dataset is small and the Kaggle platform can meet my training requirements, but it has some limitations: only 30 hours of GPU acceleration per week. Thank you very much for your help.