Open mtw2156 opened 3 days ago
Hello,
I am also experiencing the issue mentioned above. Here is part of my logs for context:
2024-11-21 15:41:56.820905: This split has 157 training and 39 validation cases.
2024-11-21 15:41:56.821133: predicting case_196_0004
2024-11-21 15:41:56.823359: case_196_0004, shape torch.Size([1, 448, 946, 448]), rank 0
2024-11-21 15:43:39.985436: predicting case_196_0007
2024-11-21 15:43:40.018088: case_196_0007, shape torch.Size([1, 512, 755, 512]), rank 0
2024-11-21 15:45:33.126272: predicting case_196_0009
2024-11-21 15:45:33.154442: case_196_0009, shape torch.Size([1, 575, 399, 575]), rank 0
2024-11-21 15:46:56.187559: predicting case_196_0010
2024-11-21 15:46:56.204679: case_196_0010, shape torch.Size([1, 575, 399, 575]), rank 0
2024-11-21 15:48:36.251001: predicting case_196_0019
2024-11-21 15:48:36.273754: case_196_0019, shape torch.Size([1, 512, 647, 512]), rank 0
2024-11-21 15:50:14.743588: predicting case_196_0026
2024-11-21 15:50:14.771919: case_196_0026, shape torch.Size([1, 467, 642, 467]), rank 0
2024-11-21 15:51:34.419982: predicting case_196_0027
2024-11-21 15:51:34.440981: case_196_0027, shape torch.Size([1, 636, 722, 636]), rank 0
2024-11-21 15:54:19.972141: predicting case_196_0031
2024-11-21 15:54:20.014876: case_196_0031, shape torch.Size([1, 447, 2073, 447]), rank 0
Prediction on device was unsuccessful, probably due to a lack of memory. Moving results arrays to CPU
2024-11-21 16:02:21.664565: predicting case_196_0033
2024-11-21 16:02:21.720253: case_196_0033, shape torch.Size([1, 639, 940, 639]), rank 0
W1121 16:02:22.355000 139797994328448 torch/_dynamo/convert_frame.py:357] torch._dynamo hit config.cache_size_limit (8)
W1121 16:02:22.355000 139797994328448 torch/_dynamo/convert_frame.py:357] function: 'forward' (/home/user/anaconda3/envs/nnunet-2.4/lib/python3.10/site-packages/dynamic_network_architectures/architectures/unet.py:116)
W1121 16:02:22.355000 139797994328448 torch/_dynamo/convert_frame.py:357] last reason: tensor 'L['x']' stride mismatch at index 0. expected 189865984, actual 383821740
W1121 16:02:22.355000 139797994328448 torch/_dynamo/convert_frame.py:357] To log all recompilation reasons, use TORCH_LOGS="recompiles".
W1121 16:02:22.355000 139797994328448 torch/_dynamo/convert_frame.py:357] To diagnose recompilation issues, see https://pytorch.org/docs/master/compile/troubleshooting.html.
Prediction on device was unsuccessful, probably due to a lack of memory. Moving results arrays to CPU
It seems the memory issue is forcing the prediction to fall back to the CPU, and I am also seeing warnings related to torch._dynamo. What could be causing this?
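For what it is worth, here is what I am planning to try for the torch._dynamo side of things. This is only a sketch based on my reading of the warning; in particular, the nnUNet_compile environment variable is my assumption about how nnU-Net decides whether to wrap the network in torch.compile, not something I have confirmed.

```python
# Sketch only: settings I intend to try before the next prediction run. These
# would sit at the top of a Python script that drives nnU-Net, or be exported
# in the shell before launching nnUNetv2_train / nnUNetv2_predict.
import os

# The warning itself suggests this for logging every recompilation reason
# (the environment variable must be set before torch is imported):
os.environ["TORCH_LOGS"] = "recompiles"

# Assumption: nnU-Net v2 checks an nnUNet_compile environment variable and skips
# torch.compile when it is set to a falsy value, which would take the dynamo
# recompile cache (and its size limit of 8) out of the picture entirely.
os.environ["nnUNet_compile"] = "f"

import torch._dynamo

# Alternative: keep torch.compile but allow more recompilations than the default
# cache_size_limit of 8 reported above (the log shows recompiles triggered by
# stride mismatches, and 3D prediction sees many different input shapes).
torch._dynamo.config.cache_size_limit = 32
```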
Thank you in advance!
Additionally, for reference, I am seeing exactly the same log output as in the comment above.
I have also seen warnings related to torch._dynamo!
Thanks to anyone who provides help!!
Hi,
I wanted to share an issue regarding validation cases. I am using nnUNet on the HaNSeg dataset. I trained my model on all 5 folds using a custom configuration without issues, and the training logs do not indicate any problems during training or while predicting the validation cases. However, the validation predictions were not saved (or only some of them were), so I am now re-running validation with "nnUNetv2_train --val". It works for some of the cases, but it usually crashes before reaching the end; it is also very expensive computationally and often falls back to the CPU. I then created a new configuration, ran the dataset through preprocessing for it, transferred the model files over, and ran validation again. I still get the crash, although the predictions now stay on the GPU. The same happens for other folds.
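As a possible lower-memory route, I am also considering predicting the fold-4 validation images directly through the Python predictor rather than via nnUNetv2_train --val, with test-time mirroring off and a single worker for preprocessing and export. This is only a sketch I have not verified: the folder names are placeholders, and the keyword arguments reflect my understanding of the nnUNetPredictor API rather than anything confirmed for my version.

```python
# Unverified sketch: run fold-4 validation predictions with reduced memory use.
# MODEL_FOLDER / INPUT_FOLDER / OUTPUT_FOLDER are placeholders for my paths.
import torch
from nnunetv2.inference.predict_from_raw_data import nnUNetPredictor

predictor = nnUNetPredictor(
    tile_step_size=0.5,
    use_gaussian=True,
    use_mirroring=False,   # skip test-time mirroring to cut inference cost
    device=torch.device("cuda", 0),
    verbose=False,
)
predictor.initialize_from_trained_model_folder(
    "MODEL_FOLDER",        # the trained 3d_fullres_v3 results folder
    use_folds=(4,),
    checkpoint_name="checkpoint_final.pth",
)
predictor.predict_from_files(
    "INPUT_FOLDER",        # raw images of the fold-4 validation cases
    "OUTPUT_FOLDER",
    save_probabilities=False,
    overwrite=True,
    num_processes_preprocessing=1,         # fewer workers -> lower peak RAM
    num_processes_segmentation_export=1,
)
```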
Any help would be appreciated!
Here are my command and its output:
CUDA_VISIBLE_DEVICES=0 nnUNetv2_train 999 3d_fullres_v3 4 --val
Using device: cuda:0
#######################################################################
Please cite the following paper when using nnU-Net:
Isensee, F., Jaeger, P. F., Kohl, S. A., Petersen, J., & Maier-Hein, K. H. (2021). nnU-Net: a self-configuring method for deep learning-based biomedical image segmentation. Nature methods, 18(2), 203-211.
#######################################################################
2024-11-15 14:51:26.293600: Using splits from existing split file: /data/intern/Matt_Wilson_2024/nnUnet_preprocessed/Dataset999_HaNSeg/splits_final.json
2024-11-15 14:51:26.302730: The split file contains 5 splits.
2024-11-15 14:51:26.302880: Desired fold for training: 4
2024-11-15 14:51:26.302988: This split has 34 training and 6 validation cases.
2024-11-15 14:51:26.303285: predicting case_02
2024-11-15 14:51:26.643660: case_02, shape torch.Size([1, 136, 466, 466]), rank 0
2024-11-15 14:53:46.883094: predicting case_37
2024-11-15 14:53:47.123915: case_37, shape torch.Size([1, 124, 385, 385]), rank 0
2024-11-15 14:54:49.823392: predicting case_38
2024-11-15 14:54:50.046503: case_38, shape torch.Size([1, 136, 357, 357]), rank 0
2024-11-15 14:56:14.719611: predicting case_39
2024-11-15 14:56:15.031698: case_39, shape torch.Size([1, 135, 425, 425]), rank 0
2024-11-15 14:58:25.082591: predicting case_40
2024-11-15 14:58:25.407599: case_40, shape torch.Size([1, 126, 400, 400]), rank 0
Traceback (most recent call last):
File "/home/mtw2156/anaconda3/envs/env/lib/python3.10/multiprocessing/resource_sharer.py", line 138, in _serve
with self._listener.accept() as conn:
File "/home/mtw2156/anaconda3/envs/env/lib/python3.10/multiprocessing/connection.py", line 466, in accept
answer_challenge(c, self._authkey)
File "/home/mtw2156/anaconda3/envs/env/lib/python3.10/multiprocessing/connection.py", line 757, in answer_challenge
response = connection.recv_bytes(256) # reject large message
File "/home/mtw2156/anaconda3/envs/env/lib/python3.10/multiprocessing/connection.py", line 216, in recv_bytes
buf = self._recv_bytes(maxlength)
File "/home/mtw2156/anaconda3/envs/env/lib/python3.10/multiprocessing/connection.py", line 414, in _recv_bytes
buf = self._recv(4)
File "/home/mtw2156/anaconda3/envs/env/lib/python3.10/multiprocessing/connection.py", line 379, in _recv
chunk = read(handle, remaining)
ConnectionResetError: [Errno 104] Connection reset by peer
Traceback (most recent call last):
File "/home/mtw2156/anaconda3/envs/env/bin/nnUNetv2_train", line 8, in <module>
sys.exit(run_training_entry())
File "/data/intern/Matt_Wilson_2024/nnUnet_preprocessed/Dataset999_HaNSeg/nnUNet/nnUNet/nnUNet/nnunetv2/run/run_training.py", line 268, in run_training_entry
run_training(args.dataset_name_or_id, args.configuration, args.fold, args.tr, args.p, args.pretrained_weights,
File "/data/intern/Matt_Wilson_2024/nnUnet_preprocessed/Dataset999_HaNSeg/nnUNet/nnUNet/nnUNet/nnunetv2/run/run_training.py", line 208, in run_training
nnunet_trainer.perform_actual_validation(export_validation_probabilities)
File "/data/intern/Matt_Wilson_2024/nnUnet_preprocessed/Dataset999_HaNSeg/nnUNet/nnUNet/nnUNet/nnunetv2/training/nnUNetTrainer/nnUNetTrainer.py", line 1183, in perform_actual_validation
proceed = not check_workers_alive_and_busy(segmentation_export_pool, worker_list, results,
File "/data/intern/Matt_Wilson_2024/nnUnet_preprocessed/Dataset999_HaNSeg/nnUNet/nnUNet/nnUNet/nnunetv2/utilities/file_path_utilities.py", line 103, in check_workers_alive_and_busy
raise RuntimeError('Some background workers are no longer alive')
RuntimeError: Some background workers are no longer alive
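In case it helps anyone reading this: my current (unconfirmed) reading of the traceback is that one of the background segmentation-export workers died, and the ConnectionResetError is just the main process losing contact with it, which in my runs coincides with system RAM filling up during export. Below is a sketch of what I plan to try next. The nnUNet_def_n_proc environment variable is my assumption about how to lower nnU-Net's default number of background processes; if the installed version ignores it, the command simply runs as before.

```python
# Sketch: rerun fold-4 validation with fewer background workers to keep peak
# RAM lower. nnUNet_def_n_proc is assumed (not confirmed) to control nnU-Net's
# default process count for pools such as the segmentation export pool.
import os
import subprocess

env = dict(os.environ)
env["nnUNet_def_n_proc"] = "2"        # assumed: fewer export workers -> less RAM
env["CUDA_VISIBLE_DEVICES"] = "0"

subprocess.run(
    ["nnUNetv2_train", "999", "3d_fullres_v3", "4", "--val"],
    env=env,
    check=True,
)
```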