Closed: BramRoumen closed this issue 4 months ago.
I'm still encountering out-of-memory issues during external validation of the successfully trained and tested model using the MICCAI 2024 ToothFairy2 challenge dataset (ToothFairy2 Dataset).
The conversion of the `.mha` files to `.nii` format was done in a similar manner to this gist. The largest MICCAI image file is 80MB, while my custom dataset contains images of up to 270MB (240MB on average).
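For reference, a minimal sketch of such a conversion step, assuming SimpleITK and nnU-Net's `_0000` channel-suffix naming; the folder names are placeholders:

```python
import os
import SimpleITK as sitk

src_dir = "ToothFairy2_mha"   # placeholder: folder containing the .mha volumes
dst_dir = "imagesTs"          # placeholder: nnU-Net style input folder
os.makedirs(dst_dir, exist_ok=True)

for name in sorted(os.listdir(src_dir)):
    if not name.endswith(".mha"):
        continue
    case_id = name[:-len(".mha")]
    img = sitk.ReadImage(os.path.join(src_dir, name))
    # nnU-Net expects one file per channel, suffixed _0000 for a single-channel image
    sitk.WriteImage(img, os.path.join(dst_dir, f"{case_id}_0000.nii.gz"))
```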
I've attempted predictions on 10 and 48 images from the MICCAI dataset, using similar `--mem` and `--ntasks` parameters as previously described. However, both jobs ran out of memory.
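For context, here is a sketch of how such a job could be submitted. The dataset ID, configuration, fold, and paths are placeholders, and lowering nnU-Net's background worker counts (`-npp` for preprocessing, `-nps` for segmentation export) is one way to reduce the CPU RAM footprint that the error log below complains about:

```bash
#!/bin/bash
#SBATCH --job-name=nnunet_predict
#SBATCH --ntasks=6        # also tried --ntasks=1
#SBATCH --mem=52G         # also tried --mem=256G
#SBATCH --gres=gpu:1

# Placeholder dataset/config/fold; -npp and -nps can be lowered from their
# defaults to trade throughput for lower peak RAM usage.
nnUNetv2_predict \
    -i /path/to/ToothFairy2/imagesTs \
    -o /path/to/predictions \
    -d 501 -c 3d_fullres -f 0 \
    -npp 1 -nps 1
```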
Here are the complete output and error logs of the run with 10 images:

Output log:
#######################################################################
Please cite the following paper when using nnU-Net:
Isensee, F., Jaeger, P. F., Kohl, S. A., Petersen, J., & Maier-Hein, K. H. (2021). nnU-Net: a self-configuring method for deep learning-based biomedical image segmentation. Nature methods, 18(2), 203-211.
#######################################################################
There are 10 cases in the source folder
I am process 0 out of 1 (max process ID is 0, we start counting with 0!)
There are 10 cases that I would like to predict
Error log:
Traceback (most recent call last):
File "/trinity/home/r060801/venv_CMF_nnUNetv2_py3.11/bin/nnUNetv2_predict", line 8, in <module>
sys.exit(predict_entry_point())
^^^^^^^^^^^^^^^^^^^^^
File "/trinity/home/r060801/venv_CMF_nnUNetv2_py3.11/lib/python3.11/site-packages/nnunetv2/inference/predict_from_raw_data.py", line 864, in predict_entry_point
predictor.predict_from_files(args.i, args.o, save_probabilities=args.save_probabilities,
File "/trinity/home/r060801/venv_CMF_nnUNetv2_py3.11/lib/python3.11/site-packages/nnunetv2/inference/predict_from_raw_data.py", line 256, in predict_from_files
return self.predict_from_data_iterator(data_iterator, save_probabilities, num_processes_segmentation_export)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/trinity/home/r060801/venv_CMF_nnUNetv2_py3.11/lib/python3.11/site-packages/nnunetv2/inference/predict_from_raw_data.py", line 349, in predict_from_data_iterator
for preprocessed in data_iterator:
File "/trinity/home/r060801/venv_CMF_nnUNetv2_py3.11/lib/python3.11/site-packages/nnunetv2/inference/data_iterators.py", line 111, in preprocessing_iterator_fromfiles
raise RuntimeError('Background workers died. Look for the error message further up! If there is '
RuntimeError: Background workers died. Look for the error message further up! If there is none then your RAM was full and the worker was killed by the OS. Use fewer workers or get more RAM in that case!
slurmstepd: error: Detected 1 oom-kill event(s) in step 354763.batch cgroup. Some of your processes may have been killed by the cgroup out-of-memory handler.
I'm uncertain whether the issue originates from our end or is related to nnUNetv2. Any insights or suggestions would be greatly appreciated.
I encountered the same issue using the ToothFairy2 dataset.
Hi,
I managed to resolve the issue by resampling and resizing the images to match the original training set. Specifically, I resampled to an isotropic voxel spacing of 0.25mm and adjusted the shape to 481x681x681.
I hope it works for you, too!
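A minimal sketch of that resampling step, assuming SimpleITK; the interpolator is a choice, and the crop/pad to 481x681x681 is not shown:

```python
import SimpleITK as sitk

def resample_isotropic(image: sitk.Image, spacing_mm: float = 0.25) -> sitk.Image:
    """Resample an image to isotropic voxel spacing (here 0.25 mm)."""
    old_size = image.GetSize()
    old_spacing = image.GetSpacing()
    new_spacing = (spacing_mm,) * 3
    new_size = [int(round(sz * sp / spacing_mm)) for sz, sp in zip(old_size, old_spacing)]
    return sitk.Resample(
        image,
        new_size,
        sitk.Transform(),   # identity transform
        sitk.sitkLinear,    # use sitk.sitkNearestNeighbor for label maps
        image.GetOrigin(),
        new_spacing,
        image.GetDirection(),
        0,                  # default (background) value for voxels outside the input
        image.GetPixelID(),
    )

img = resample_isotropic(sitk.ReadImage("case_0000.nii.gz"))
# Cropping or padding to the training shape (481x681x681) would follow here.
sitk.WriteImage(img, "case_0000_resampled.nii.gz")
```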
Thanks a lot!
Hello!
Thank you very much for sharing nnUNet with us! I'm currently experiencing an issue with two trained nnUNetv2 3d-fullres models on a custom dataset (median image shape 481x681x681). Despite successful training, the process encounters an out-of-memory (OOM) kill event during the final validation stage, regardless of whether the `--npz` flag is enabled or disabled. The same OOM issue arises during prediction with the trained models. Previously, I successfully trained, validated, and tested another nnUNetv2 3d-fullres model on this dataset. The only difference between the successful model and the two that are failing is the number of labels: the successful model was trained with 5 labels, while the two failing models were trained with 11 and 33 labels, respectively.

I work in a cluster environment with 8 x Nvidia A40 48GB GPUs and 504GB of RAM available. My Slurm script specifies `--ntasks=6` and `--mem=52G`. I have even tried `--ntasks 1` and `--mem 256G`, yet it still runs out of memory.

This is the error message printed in the error log:
This seems to me to be an OOM-kill event caused by CPU RAM. The problem might be related to #417. During validation I monitored GPU usage with `nvidia-smi`, and it peaked at 18.6GB.

Here are the versions of the relevant software I'm using: Python 3.11.5, nnUNet 2.4.2, CUDA 11.6.
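For reference, GPU memory can be logged alongside the run like this (the interval and output file are arbitrary):

```bash
# Log GPU memory usage every 5 seconds while validation/prediction is running
nvidia-smi --query-gpu=timestamp,index,memory.used,memory.total --format=csv -l 5 > gpu_mem.csv &
```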
Thanks in advance!