MIC-DKFZ / nnUNet

Apache License 2.0

RuntimeError: One or more background workers are no longer alive. #2297

Open · NastaranVB opened this issue 1 week ago

NastaranVB commented 1 week ago

Hi! I'm facing RuntimeError: One or more background workers are no longer alive. Exiting. Please check the print statements above for the actual error message. when trying to run training. To train the model on the cluster, I use the following command in the Linux terminal:

CUDA_VISIBLE_DEVICES=0 nnUNetv2_train 300 3d_fullres 0 -tr nnUNetTrainer_250epochs

The error I get when running this command is shown below:

############################
INFO: You are using the old nnU-Net default plans. We have updated our recommendations. Please consider using those instead! Read more here: https://github.com/MIC-DKFZ/nnUNet/blob/master/documentation/resenc_presets.md
############################

Using device: cuda:0

#######################################################################
Please cite the following paper when using nnU-Net:
Isensee, F., Jaeger, P. F., Kohl, S. A., Petersen, J., & Maier-Hein, K. H. (2021). nnU-Net: a self-configuring method for deep learning-based biomedical image segmentation. Nature methods, 18(2), 203-211.
#######################################################################

2024-06-17 16:37:10.754485: do_dummy_2d_data_aug: True
2024-06-17 16:37:10.754877: Creating new 5-fold cross-validation split...
2024-06-17 16:37:10.755827: Desired fold for training: 0
2024-06-17 16:37:10.755879: This split has 39 training and 10 validation cases.
using pin_memory on device 0
Exception in thread Thread-1:
Traceback (most recent call last):
  File "/usr/lib64/python3.9/threading.py", line 980, in _bootstrap_inner
    self.run()
  File "/usr/lib64/python3.9/threading.py", line 917, in run
    self._target(*self._args, **self._kwargs)
  File "/mnt/nvme0n1p1/scratch/env_nnunetv2/lib64/python3.9/site-packages/batchgenerators/dataloading/nondet_multi_threaded_augmenter.py", line 125, in results_loop
    raise e
  File "/mnt/nvme0n1p1/scratch/env_nnunetv2/lib64/python3.9/site-packages/batchgenerators/dataloading/nondet_multi_threaded_augmenter.py", line 103, in results_loop
    raise RuntimeError("One or more background workers are no longer alive. Exiting. Please check the "
RuntimeError: One or more background workers are no longer alive. Exiting. Please check the print statements above for the actual error message
Traceback (most recent call last):
  File "/mnt/nvme0n1p1/scratch/env_nnunetv2/bin/nnUNetv2_train", line 33, in <module>
    sys.exit(load_entry_point('nnunetv2', 'console_scripts', 'nnUNetv2_train')())
  File "/mnt/nvme0n1p1/scratch/nnUNetFrame/nnUNet/nnunetv2/run/run_training.py", line 275, in run_training_entry
    run_training(args.dataset_name_or_id, args.configuration, args.fold, args.tr, args.p, args.pretrained_weights,
  File "/mnt/nvme0n1p1/scratch/nnUNetFrame/nnUNet/nnunetv2/run/run_training.py", line 211, in run_training
    nnunet_trainer.run_training()
  File "/mnt/nvme0n1p1/scratchnnUNetFrame/nnUNet/nnunetv2/training/nnUNetTrainer/nnUNetTrainer.py", line 1362, in run_training
    self.on_train_start()
  File "/mnt/nvme0n1p1/scratch/nnUNetFrame/nnUNet/nnunetv2/training/nnUNetTrainer/nnUNetTrainer.py", line 903, in on_train_start
    self.dataloader_train, self.dataloader_val = self.get_dataloaders()
  File "/mnt/nvme0n1p1/scratch/nnUNetFrame/nnUNet/nnunetv2/training/nnUNetTrainer/nnUNetTrainer.py", line 696, in get_dataloaders
    _ = next(mt_gen_train)
  File "/mnt/nvme0n1p1/scratch/env_nnunetv2/lib64/python3.9/site-packages/batchgenerators/dataloading/nondet_multi_threaded_augmenter.py", line 196, in __next__
    item = self.__get_next_item()
  File "/mnt/nvme0n1p1/scratch/env_nnunetv2/lib64/python3.9/site-packages/batchgenerators/dataloading/nondet_multi_threaded_augmenter.py", line 181, in __get_next_item
    raise RuntimeError("One or more background workers are no longer alive. Exiting. Please check the "
RuntimeError: One or more background workers are no longer alive. Exiting. Please check the print statements above for the actual error message

Any assistance to solve this error would be greatly appreciated.
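
(If it helps with debugging: as far as I can tell from the nnU-Net documentation, the number of background data-augmentation workers can be capped with the nnUNet_n_proc_DA environment variable, e.g.

nnUNet_n_proc_DA=4 CUDA_VISIBLE_DEVICES=0 nnUNetv2_train 300 3d_fullres 0 -tr nnUNetTrainer_250epochs

I have not yet confirmed whether a lower worker count avoids this error, and the variable name should be double-checked against the installed nnU-Net version.)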

NastaranVB commented 1 week ago

Hi @TaWald,

I wanted to follow up on an inquiry I made last week. Since you receive many emails daily, I thought it might have been missed. Could you please revisit my question and assist? Any help would be appreciated. Warm regards.

kndahl commented 6 days ago

Same for me. I have an A100 cluster with 28 cores and 119 GB of RAM. It works with 4 processes but dies if I set more. RAM usage with 4 processes is about 17-20 GB. I also use GeeseFS to mount the drive with the data.
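
If anyone else hits this: my guess is that the extra worker processes are being killed by the kernel out-of-memory handler, so it may be worth checking the kernel log with a generic

dmesg -T | grep -iE 'out of memory|killed process'

and, if that is the cause, capping the data-augmentation workers via the nnUNet_n_proc_DA environment variable (please verify the variable name against your installed nnU-Net version; the dataset/configuration below is copied from the original post):

export nnUNet_n_proc_DA=4
CUDA_VISIBLE_DEVICES=0 nnUNetv2_train 300 3d_fullres 0 -tr nnUNetTrainer_250epochs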