Closed: BramRoumen closed this issue 4 months ago.
I'm still encountering out-of-memory issues during external validation of the successfully trained and tested model using the MICCAI 2024 ToothFairy2 challenge dataset (ToothFairy2 Dataset).
The conversion of the `.mha` files to `.nii` format was done in a similar manner to this gist. The largest MICCAI image file is 80MB, while my custom dataset contains images of up to 270MB (240MB on average).
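For reference, a minimal sketch of such a conversion step, assuming SimpleITK and nnU-Net's `_0000` channel-suffix naming; the folder names are placeholders:

```python
import os
import SimpleITK as sitk

src_dir = "ToothFairy2_mha"   # placeholder: folder containing the .mha volumes
dst_dir = "imagesTs"          # placeholder: nnU-Net style input folder
os.makedirs(dst_dir, exist_ok=True)

for name in sorted(os.listdir(src_dir)):
    if not name.endswith(".mha"):
        continue
    case_id = name[:-len(".mha")]
    img = sitk.ReadImage(os.path.join(src_dir, name))
    # nnU-Net expects one file per channel, suffixed _0000 for a single-channel image
    sitk.WriteImage(img, os.path.join(dst_dir, f"{case_id}_0000.nii.gz"))
```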
I've attempted predictions on 10 and 48 images from the MICCAI dataset, using similar `--mem` and `--ntasks` parameters as previously described. However, both jobs ran out of memory.
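For context, here is a sketch of how such a job could be submitted. The dataset ID, configuration, fold, and paths are placeholders, and lowering nnU-Net's background worker counts (`-npp` for preprocessing, `-nps` for segmentation export) is one way to reduce the CPU RAM footprint that the error log below complains about:

```bash
#!/bin/bash
#SBATCH --job-name=nnunet_predict
#SBATCH --ntasks=6        # also tried --ntasks=1
#SBATCH --mem=52G         # also tried --mem=256G
#SBATCH --gres=gpu:1

# Placeholder dataset/config/fold; -npp and -nps can be lowered from their
# defaults to trade throughput for lower peak RAM usage.
nnUNetv2_predict \
    -i /path/to/ToothFairy2/imagesTs \
    -o /path/to/predictions \
    -d 501 -c 3d_fullres -f 0 \
    -npp 1 -nps 1
```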
Here are the complete output and error logs of the run with 10 images:

Output log:
#######################################################################
Please cite the following paper when using nnU-Net:
Isensee, F., Jaeger, P. F., Kohl, S. A., Petersen, J., & Maier-Hein, K. H. (2021). nnU-Net: a self-configuring method for deep learning-based biomedical image segmentation. Nature methods, 18(2), 203-211.
#######################################################################
There are 10 cases in the source folder
I am process 0 out of 1 (max process ID is 0, we start counting with 0!)
There are 10 cases that I would like to predict
Error log:
Traceback (most recent call last):
File "/trinity/home/r060801/venv_CMF_nnUNetv2_py3.11/bin/nnUNetv2_predict", line 8, in <module>
sys.exit(predict_entry_point())
^^^^^^^^^^^^^^^^^^^^^
File "/trinity/home/r060801/venv_CMF_nnUNetv2_py3.11/lib/python3.11/site-packages/nnunetv2/inference/predict_from_raw_data.py", line 864, in predict_entry_point
predictor.predict_from_files(args.i, args.o, save_probabilities=args.save_probabilities,
File "/trinity/home/r060801/venv_CMF_nnUNetv2_py3.11/lib/python3.11/site-packages/nnunetv2/inference/predict_from_raw_data.py", line 256, in predict_from_files
return self.predict_from_data_iterator(data_iterator, save_probabilities, num_processes_segmentation_export)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/trinity/home/r060801/venv_CMF_nnUNetv2_py3.11/lib/python3.11/site-packages/nnunetv2/inference/predict_from_raw_data.py", line 349, in predict_from_data_iterator
for preprocessed in data_iterator:
File "/trinity/home/r060801/venv_CMF_nnUNetv2_py3.11/lib/python3.11/site-packages/nnunetv2/inference/data_iterators.py", line 111, in preprocessing_iterator_fromfiles
raise RuntimeError('Background workers died. Look for the error message further up! If there is '
RuntimeError: Background workers died. Look for the error message further up! If there is none then your RAM was full and the worker was killed by the OS. Use fewer workers or get more RAM in that case!
slurmstepd: error: Detected 1 oom-kill event(s) in step 354763.batch cgroup. Some of your processes may have been killed by the cgroup out-of-memory handler.
I'm uncertain whether the issue originates from our end or is related to nnUNetv2. Any insights or suggestions would be greatly appreciated.
I encountered the same issue using the ToothFairy2 dataset.
Hi,
I managed to resolve the issue by resampling and resizing the images to match the original training set. Specifically, I resampled to an isotropic voxel spacing of 0.25mm and adjusted the shape to 481x681x681.
I hope it works for you, too!
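A minimal sketch of that resampling step, assuming SimpleITK; the interpolator is a choice, and the crop/pad to 481x681x681 is not shown:

```python
import SimpleITK as sitk

def resample_isotropic(image: sitk.Image, spacing_mm: float = 0.25) -> sitk.Image:
    """Resample an image to isotropic voxel spacing (here 0.25 mm)."""
    old_size = image.GetSize()
    old_spacing = image.GetSpacing()
    new_spacing = (spacing_mm,) * 3
    new_size = [int(round(sz * sp / spacing_mm)) for sz, sp in zip(old_size, old_spacing)]
    return sitk.Resample(
        image,
        new_size,
        sitk.Transform(),   # identity transform
        sitk.sitkLinear,    # use sitk.sitkNearestNeighbor for label maps
        image.GetOrigin(),
        new_spacing,
        image.GetDirection(),
        0,                  # default (background) value for voxels outside the input
        image.GetPixelID(),
    )

img = resample_isotropic(sitk.ReadImage("case_0000.nii.gz"))
# Cropping or padding to the training shape (481x681x681) would follow here.
sitk.WriteImage(img, "case_0000_resampled.nii.gz")
```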
Thanks a lot!
Hello!
Thank you very much for sharing nnUNet with us! I'm currently experiencing an issue with two trained nnUNetv2 3d-fullres models on a custom dataset (median image shape 481x681x681). Despite successful training, the process encounters an out-of-memory (OOM) kill event during the final validation stage, regardless of whether the `--npz` flag is enabled or disabled. The same OOM issue arises during prediction with the trained models. Previously, I successfully trained, validated, and tested another nnUNetv2 3d-fullres model on this dataset. The only difference between the successful model and the two that are failing is the number of labels: the successful model was trained with 5 labels, while the two failing models were trained with 11 and 33 labels, respectively.

I work in a cluster environment with 8 x Nvidia A40 48GB GPUs and 504GB of RAM available. My Slurm script specifies `--ntasks=6` and `--mem=52G`. I have even tried `--ntasks 1` and `--mem 256G`, yet it still runs out of memory.

This is the error message printed in the error log:
This seems to me to be an OOM-kill event caused by CPU RAM. The problem might be related to #417. During validation I monitored GPU usage with `nvidia-smi`, and it peaked at 18.6GB.

Here are the versions of the relevant software I'm using: Python 3.11.5, nnUNet 2.4.2, CUDA 11.6.
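For reference, GPU memory can be logged alongside the run like this (the interval and output file are arbitrary):

```bash
# Log GPU memory usage every 5 seconds while validation/prediction is running
nvidia-smi --query-gpu=timestamp,index,memory.used,memory.total --format=csv -l 5 > gpu_mem.csv &
```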
Thanks in advance!