MIC-DKFZ / nnUNet

Apache License 2.0

RuntimeError: One or more background workers are no longer alive. Exiting. Please check the print statements above for the actual error message #2238

Open Huijin13 opened 3 months ago

Huijin13 commented 3 months ago

I ran into the following problem:

2024-05-29 01:06:15.304361: do_dummy_2d_data_aug: False
2024-05-29 01:06:15.304548: Using splits from existing split file: nnUNet_preprocessed/Dataset059_10%HRF/splits_final.json
2024-05-29 01:06:15.304609: The split file contains 5 splits.
2024-05-29 01:06:15.304630: Desired fold for training: 1
2024-05-29 01:06:15.304647: This split has 4 training and 1 validation cases.
using pin_memory on device 0
Exception in thread Thread-1:
Traceback (most recent call last):
  File "/home/xly/anaconda3/envs/Sammed/lib/python3.9/threading.py", line 980, in _bootstrap_inner
    self.run()
  File "/home/xly/anaconda3/envs/Sammed/lib/python3.9/threading.py", line 917, in run
    self._target(*self._args, **self._kwargs)
  File "/home/xly/anaconda3/envs/Sammed/lib/python3.9/site-packages/batchgenerators/dataloading/nondet_multi_threaded_augmenter.py", line 125, in results_loop
    raise e
  File "/home/xly/anaconda3/envs/Sammed/lib/python3.9/site-packages/batchgenerators/dataloading/nondet_multi_threaded_augmenter.py", line 110, in results_loop
    item = pin_memory_of_all_eligible_items_in_dict(item)
  File "/home/xly/anaconda3/envs/Sammed/lib/python3.9/site-packages/batchgenerators/dataloading/nondet_multi_threaded_augmenter.py", line 80, in pin_memory_of_all_eligible_items_in_dict
    result_dict[k] = result_dict[k].pin_memory()
RuntimeError: CUDA error: out of memory
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
Compile with TORCH_USE_CUDA_DSA to enable device-side assertions.

Traceback (most recent call last):
  File "/home/xly/anaconda3/envs/Sammed/bin/nnUNetv2_train", line 8, in <module>
    sys.exit(run_training_entry())
  File "/home/xly/anaconda3/envs/Sammed/lib/python3.9/site-packages/nnunetv2/run/run_training.py", line 274, in run_training_entry
    run_training(args.dataset_name_or_id, args.configuration, args.fold, args.tr, args.p, args.pretrained_weights,
  File "/home/xly/anaconda3/envs/Sammed/lib/python3.9/site-packages/nnunetv2/run/run_training.py", line 210, in run_training
    nnunet_trainer.run_training()
  File "/home/xly/anaconda3/envs/Sammed/lib/python3.9/site-packages/nnunetv2/training/nnUNetTrainer/nnUNetTrainer.py", line 1287, in run_training
    self.on_train_start()
  File "/home/xly/anaconda3/envs/Sammed/lib/python3.9/site-packages/nnunetv2/training/nnUNetTrainer/nnUNetTrainer.py", line 831, in on_train_start
    self.dataloader_train, self.dataloader_val = self.get_dataloaders()
  File "/home/xly/anaconda3/envs/Sammed/lib/python3.9/site-packages/nnunetv2/training/nnUNetTrainer/nnUNetTrainer.py", line 655, in get_dataloaders
    _ = next(mt_gen_train)
  File "/home/xly/anaconda3/envs/Sammed/lib/python3.9/site-packages/batchgenerators/dataloading/nondet_multi_threaded_augmenter.py", line 196, in __next__
    item = self.__get_next_item()
  File "/home/xly/anaconda3/envs/Sammed/lib/python3.9/site-packages/batchgenerators/dataloading/nondet_multi_threaded_augmenter.py", line 181, in __get_next_item
    raise RuntimeError("One or more background workers are no longer alive. Exiting. Please check the "
RuntimeError: One or more background workers are no longer alive. Exiting. Please check the print statements above for the actual error message

I tried to change the self.num_epochs variable in nnUNetTrainer but failed, so I changed it back. Now I get the error above, even though the training ran fine before I touched that variable.
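For reference, the usual way to change the epoch count without editing the installed package is to subclass the trainer and select it with the -tr flag. The sketch below only illustrates that pattern: the class name nnUNetTrainer_50epochs is made up, the constructor arguments are forwarded untouched because the exact signature can differ between nnU-Net versions, and the custom class typically has to live under nnunetv2.training.nnUNetTrainer so the CLI can discover it.

```python
from nnunetv2.training.nnUNetTrainer.nnUNetTrainer import nnUNetTrainer


class nnUNetTrainer_50epochs(nnUNetTrainer):
    """Hypothetical custom trainer that only changes the epoch count."""

    def __init__(self, *args, **kwargs):
        # Forward all constructor arguments unchanged; nothing about the
        # version-specific signature is assumed here.
        super().__init__(*args, **kwargs)
        # Override the default training schedule (1000 epochs in nnU-Net v2).
        self.num_epochs = 50
```

It would then be selected at the command line with something like `nnUNetv2_train DATASET_ID CONFIG FOLD -tr nnUNetTrainer_50epochs` (placeholders, not the values from this issue).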

FabianIsensee commented 3 months ago

Seems like you are simply out of memory on your GPU.

RuntimeError: CUDA error: out of memory
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
Compile with TORCH_USE_CUDA_DSA to enable device-side assertions.

If you are training on the GPU in your workstation that is also driving your displays, other tasks may be taking up memory and causing the training to crash. Check nvidia-smi.
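As a quick check, the snippet below (a minimal sketch, not part of nnU-Net) prints the free and total memory of every GPU visible to PyTorch via torch.cuda.mem_get_info, which gives roughly the same picture as the memory column of nvidia-smi:

```python
# Minimal sketch: print free vs. total memory for every visible GPU so you can
# see whether other processes (e.g. the desktop environment) already occupy a
# large chunk of the card before training starts.
import torch

if not torch.cuda.is_available():
    print("No CUDA device visible to PyTorch.")
else:
    for idx in range(torch.cuda.device_count()):
        free_b, total_b = torch.cuda.mem_get_info(idx)  # bytes, from the CUDA driver
        print(f"GPU {idx} ({torch.cuda.get_device_name(idx)}): "
              f"{free_b / 1024**3:.1f} GiB free / {total_b / 1024**3:.1f} GiB total")
```

If the free memory is already low before nnUNetv2_train even starts, close the other GPU consumers (or monitor them with `watch -n 1 nvidia-smi`) and try again.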