MIC-DKFZ / nnUNet

3d_lowres Inference RuntimeError: Some background workers are no longer alive #2182

Open chenney0830 opened 1 month ago

chenney0830 commented 1 month ago

Hello,

I am currently encountering the error below. It occurs while the validation results are being generated after training 3d_lowres: no files are produced in either the 'val' folder or the 'predict_from_next_stage' folder. I also get the same error when running 3d_lowres inference. According to my resource monitor, system memory runs out at that point, yet training and inference with 3d_fullres run smoothly without any issues.

Could you suggest how we might resolve this problem?

Thank you!

```
2024-05-14 23:55:29.948744: predicting 0001
2024-05-14 23:55:33.011472: predicting 0002
2024-05-14 23:55:35.686156: predicting 0003
2024-05-14 23:55:37.476846: predicting 0004
Traceback (most recent call last):
  File "/home/chenney/anaconda3/envs/nnUNet/bin/nnUNetv2_train", line 8, in <module>
    sys.exit(run_training_entry())
  File "/home/chenney/nnUNet/nnunetv2/run/run_training.py", line 268, in run_training_entry
    run_training(args.dataset_name_or_id, args.configuration, args.fold, args.tr, args.p, args.pretrained_weights,
  File "/home/chenney/nnUNet/nnunetv2/run/run_training.py", line 208, in run_training
    nnunet_trainer.perform_actual_validation(export_validation_probabilities)
  File "/home/chenney/nnUNet/nnunetv2/training/nnUNetTrainer/nnUNetTrainer.py", line 1168, in perform_actual_validation
    proceed = not check_workers_alive_and_busy(segmentation_export_pool, worker_list, results,
  File "/home/chenney/nnUNet/nnunetv2/utilities/file_path_utilities.py", line 103, in check_workers_alive_and_busy
    raise RuntimeError('Some background workers are no longer alive')
RuntimeError: Some background workers are no longer alive
```
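
For context, the traceback above originates from a liveness check on the segmentation-export worker pool. A paraphrased sketch of check_workers_alive_and_busy from nnunetv2/utilities/file_path_utilities.py is shown below; the details may differ in your installed version.

```python
from multiprocessing.pool import Pool
from typing import List


def check_workers_alive_and_busy(export_pool: Pool, worker_list: List, results_list: List,
                                 allowed_num_queued: int = 0) -> bool:
    # If any export worker process has died (e.g. OOM-killed), abort with the error seen above.
    alive = [worker.is_alive() for worker in worker_list]
    if not all(alive):
        raise RuntimeError('Some background workers are no longer alive')

    # Report "busy" when more results are pending than there are workers (plus allowed
    # queue slots), so the caller waits before submitting the next case.
    not_ready = [not result.ready() for result in results_list]
    return sum(not_ready) >= (len(export_pool._pool) + allowed_num_queued)
```

In other words, the RuntimeError is not the root cause: it only reports that at least one export worker has already died, which on Linux is typically the kernel's OOM killer terminating a worker when system RAM is exhausted.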

YUjh0729 commented 3 weeks ago

Hello, I ran into the same problem. After training completed, the error occurred during the model's automatic validation. The same issue also appears when I validate the model with the nnUNetv2_predict command. Did you manage to solve this? My error message is as follows:

```
2024-06-11 01:55:16.900306: predicting FLARE22_046
resizing data, order is 1
data shape (14, 113, 628, 628)
2024-06-11 01:55:26.419984: predicting FLARE22_047
resizing data, order is 1
data shape (14, 93, 465, 465)
------------------ If the number of not-yet-ready results is greater than the number of available worker processes plus the allowed queue size, return True; otherwise raise RuntimeError -------------------
Traceback (most recent call last):
  File "/home/yjh/.conda/envs/umamba/lib/python3.10/multiprocessing/resource_sharer.py", line 138, in _serve
    with self._listener.accept() as conn:
  File "/home/yjh/.conda/envs/umamba/lib/python3.10/multiprocessing/connection.py", line 466, in accept
    answer_challenge(c, self._authkey)
  File "/home/yjh/.conda/envs/umamba/lib/python3.10/multiprocessing/connection.py", line 757, in answer_challenge
    response = connection.recv_bytes(256)        # reject large message
  File "/home/yjh/.conda/envs/umamba/lib/python3.10/multiprocessing/connection.py", line 216, in recv_bytes
    buf = self._recv_bytes(maxlength)
  File "/home/yjh/.conda/envs/umamba/lib/python3.10/multiprocessing/connection.py", line 414, in _recv_bytes
    buf = self._recv(4)
  File "/home/yjh/.conda/envs/umamba/lib/python3.10/multiprocessing/connection.py", line 379, in _recv
    chunk = read(handle, remaining)
ConnectionResetError: [Errno 104] Connection reset by peer
Traceback (most recent call last):
  File "/home/yjh/.conda/envs/umamba/bin/nnUNetv2_train", line 33, in <module>
    sys.exit(load_entry_point('nnunetv2', 'console_scripts', 'nnUNetv2_train')())
  File "/mnt/e/yjh/project/U-Mamba-main/umamba/nnunetv2/run/run_training.py", line 268, in run_training_entry
    run_training(args.dataset_name_or_id, args.configuration, args.fold, args.tr, args.p, args.pretrained_weights,
  File "/mnt/e/yjh/project/U-Mamba-main/umamba/nnunetv2/run/run_training.py", line 208, in run_training
    nnunet_trainer.perform_actual_validation(export_validation_probabilities)
  File "/mnt/e/yjh/project/U-Mamba-main/umamba/nnunetv2/training/nnUNetTrainer/nnUNetTrainer.py", line 1154, in perform_actual_validation
    proceed = not check_workers_alive_and_busy(segmentation_export_pool, worker_list, results,
  File "/mnt/e/yjh/project/U-Mamba-main/umamba/nnunetv2/utilities/file_path_utilities.py", line 104, in check_workers_alive_and_busy
    raise RuntimeError('Some background workers are no longer alive')
RuntimeError: Some background workers are no longer alive
```
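
A common way to work around out-of-memory worker deaths during standalone inference is to lower the number of background processes. Below is a rough sketch using the nnUNetv2 Python predictor; the folder paths are placeholders and the exact constructor arguments may differ slightly between nnUNet versions, so check against your installed nnunetv2.inference code.

```python
# Hedged sketch: run standalone 3d_lowres inference with fewer background workers,
# so the preprocessing/export processes are less likely to be killed for lack of RAM.
import torch
from nnunetv2.inference.predict_from_raw_data import nnUNetPredictor

predictor = nnUNetPredictor(
    tile_step_size=0.5,
    use_gaussian=True,
    use_mirroring=True,
    device=torch.device('cuda', 0),
    verbose=False,
)
predictor.initialize_from_trained_model_folder(
    '/path/to/nnUNet_results/DatasetXXX/nnUNetTrainer__nnUNetPlans__3d_lowres',  # placeholder path
    use_folds=(0,),
    checkpoint_name='checkpoint_final.pth',
)
predictor.predict_from_files(
    '/path/to/input_images',      # placeholder input folder
    '/path/to/output_folder',     # placeholder output folder
    save_probabilities=False,
    overwrite=True,
    num_processes_preprocessing=1,         # fewer preprocessing workers -> less RAM
    num_processes_segmentation_export=1,   # fewer export workers -> less RAM
)
```

If your nnUNet version exposes them, the equivalent CLI knobs are the -npp and -nps options of nnUNetv2_predict.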

YUjh0729 commented 3 weeks ago

Moreover, when training the model on another machine with a 4080 GPU in the same environment (the current machine uses a 3090), I sometimes run into what appears to be a deadlock: CPU and GPU memory are both occupied, but utilization sits around 1%, and training gets stuck at a certain epoch and cannot progress. On the 3090 machine, training itself works fine, but during validation I hit the error "Some background workers are no longer alive." From the existing nnUNet issues I concluded that the CPU's RAM is being exhausted, but there is no useful solution available. It's frustrating!
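
When CPU RAM is the bottleneck during training, one thing worth trying is launching training with fewer worker processes. A minimal sketch is below, assuming your nnUNet version reads these environment variables: nnUNet_n_proc_DA is documented by nnU-Net for data-augmentation workers, while nnUNet_def_n_proc is an assumption on my part and should be verified against your installed nnunetv2/configuration.py.

```python
# Hedged sketch: launch nnUNetv2_train with environment variables that cap the
# number of background worker processes, to reduce peak CPU RAM usage.
import os
import subprocess

env = os.environ.copy()
env['nnUNet_n_proc_DA'] = '4'        # fewer data-augmentation workers
# env['nnUNet_def_n_proc'] = '2'     # assumed knob for the default process count; verify first

subprocess.run(
    ['nnUNetv2_train', 'DATASET_ID', '3d_lowres', '0', '--npz'],  # DATASET_ID is a placeholder
    env=env,
    check=True,
)
```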