MIC-DKFZ / nnUNet


Memory Leak During Inference? #2522

Closed andy-s-ding closed 1 month ago

andy-s-ding commented 1 month ago

Hi, I have a question about inference. I trained nnU-Net on a set of high-resolution CT scans (512x512x512, 0.1 mm^3/voxel), and inference has worked well on other CT scans of the same resolution and size. I recently tried to run inference on a CT scan acquired with a C-arm (300x300x300, 0.5 mm^3/voxel), which has a significantly smaller file size, but the process hangs before inference even starts. Watching my System Monitor, I see memory usage slowly creep up over the course of a couple of hours until the process crashes completely.

I am running this on Ubuntu 22.04 LTS. File formats are .nii.gz. Below is the output from my Terminal. Any advice on this? Thank you!

(nnunetv2) andyding@andyding-Alienware-Aurora-R11:~/tbone-seg-nnunetv2$ nnUNetv2_predict -d Dataset101_TemporalBone -i /home/andyding/tbone-seg-nnunetv2/00_nnUNetv2_baseline_retrain/nnUNet_raw/Dataset101_TemporalBone/test_cadaver -o /home/andyding/tbone-seg-nnunetv2/00_nnUNetv2_baseline_retrain/nnUNet_raw/Dataset101_TemporalBone/test_cadaver/results -f  0 1 2 3 4 -tr nnUNetTrainer_300epochs -c 3d_fullres -p nnUNetPlans --verbose

#######################################################################
Please cite the following paper when using nnU-Net:
Isensee, F., Jaeger, P. F., Kohl, S. A., Petersen, J., & Maier-Hein, K. H. (2021). nnU-Net: a self-configuring method for deep learning-based biomedical image segmentation. Nature methods, 18(2), 203-211.
#######################################################################

/home/andyding/tbone-seg-nnunetv2/nnUNet/nnunetv2/inference/predict_from_raw_data.py:84: FutureWarning: You are using `torch.load` with `weights_only=False` (the current default value), which uses the default pickle module implicitly. It is possible to construct malicious pickle data which will execute arbitrary code during unpickling (See https://github.com/pytorch/pytorch/blob/main/SECURITY.md#untrusted-models for more details). In a future release, the default value for `weights_only` will be flipped to `True`. This limits the functions that could be executed during unpickling. Arbitrary objects will no longer be allowed to be loaded via this mode unless they are explicitly allowlisted by the user via `torch.serialization.add_safe_globals`. We recommend you start setting `weights_only=True` for any use case where you don't have full control of the loaded file. Please open an issue on GitHub for any issues related to this experimental feature.
  checkpoint = torch.load(join(model_training_output_dir, f'fold_{f}', checkpoint_name),
There are 1 cases in the source folder
I am process 0 out of 1 (max process ID is 0, we start counting with 0!)
There are 1 cases that I would like to predict
lin-tianyu commented 1 month ago

Same issue here, literally caused my server to shut down :(

andy-s-ding commented 1 month ago

Just wanted to follow up. I was wrong to assume that input file size was the main factor in these memory issues; it is actually the spatial dimensions of the input that matter. Inference on this image required predicting 630 patches, and the resulting prediction then had to be resampled for export. The resampling step appears to be the culprit, as outlined in #2192.

Rather than breaking my image into chunks, I used SimpleITK to resample the input image to the original_median_spacing_after_transp listed in the plans.json written to the output folder (luckily this file was created before the process crashed).
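For anyone running into the same thing, the resampling I did was along these lines (a minimal sketch with placeholder paths and spacing; take the actual target spacing from original_median_spacing_after_transp in your plans.json and double-check its axis order, since SimpleITK expects spacing in (x, y, z) order):

```python
import SimpleITK as sitk

img = sitk.ReadImage("/path/to/input_ct.nii.gz")

# Placeholder target spacing in mm, (x, y, z) order. Replace with the values
# from "original_median_spacing_after_transp" in plans.json (reordered if needed).
target_spacing = (0.5, 0.5, 0.5)

# Keep the physical extent the same: new_size = old_size * old_spacing / new_spacing
new_size = [
    int(round(sz * sp / tsp))
    for sz, sp, tsp in zip(img.GetSize(), img.GetSpacing(), target_spacing)
]

resampled = sitk.Resample(
    img,
    new_size,
    sitk.Transform(),   # identity transform
    sitk.sitkBSpline,   # smooth interpolation for CT intensities
    img.GetOrigin(),
    target_spacing,
    img.GetDirection(),
    0,                  # default value for samples outside the image
    img.GetPixelID(),
)

sitk.WriteImage(resampled, "/path/to/input_ct_resampled.nii.gz")
```

Note that this changes the input geometry, so the predicted segmentation comes back at the new spacing rather than that of the original scan.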

Inference on this resampled image completed without issue. I will close this issue, as it is similar to #2192.

abhishekpatil32 commented 1 month ago

Hello @andy-s-ding,

The model that I am using has very small spacing values in its plans.json.

The "original_median_spacing_after_transp": [0.43164101243019104, 0.31200000643730164, 0.43164101243019104 ]"

When I resample the file to this spacing, the resulting file is huge (>2 GB), and I then run into memory issues. Is there a way to work around this?