MIC-DKFZ / nnUNet

Apache License 2.0

Inferencer freezes after finishing the prediction #1896

Open mazensoufi opened 9 months ago

mazensoufi commented 9 months ago

Hi Fabian, First, thanks a lot for your amazing tool!

I'm facing an issue with inference using nnunetv2. I'm running it in 3D full-resolution mode, but the inference stops after issuing the "done with xxx.nii.gz" message. The node is still running, but there is no output in the predictions folder. Other cases were segmented properly; I'm facing this only with one specific case. I tried it on several nodes with different GPUs; the log below is from a node with an RTX A6000 (128 GB RAM, Intel(R) Xeon(R) W-2295 CPU @ 3.00GHz, 12 cores allocated for inference).

My log is here (I'm outputting some intermediate variables for debugging):


Namespace(dataset_id=21, model_type=['3d_fullres'], fold=4, prepare=False, nnUNet_raw=xxxx, nnUNet_preprocessed=xxxxx, nnUNet_results=xxxxx, indir=None, outdir=None, datalist=None, seed=0, gpu=0)
There are 1 cases in the source folder
I am process 0 out of 1 (max process ID is 0, we start counting with 0!)
There are 1 cases that I would like to predict
overwrite was set to False, so I am only working on cases that haven't been predicted yet. That's 1 cases.
old shape: (951, 512, 512), new_shape: [951 557 557], old_spacing: [0.8000490069389343, 0.7429999709129333, 0.7429999709129333], new_spacing: [0.800000011920929, 0.6830000281333923, 0.6830000281333923], fn_data: functools.partial(<function resample_data_or_seg_to_shape at 0x7fbd0907a340>, is_seg=False, order=3, order_z=0, force_separate_z=None)

Predicting xxxx_label.nii.gz:
perform_everything_on_gpu: False
Input shape: torch.Size([1, 951, 557, 557])
step_size: 0.5
mirror_axes: (0, 1, 2)
n_steps 792, image size is torch.Size([951, 557, 557]), tile_size [160, 128, 112], tile_step_size 0.5
steps:
[[0, 79, 158, 237, 316, 396, 475, 554, 633, 712, 791], [0, 61, 123, 184, 245, 306, 368, 429], [0, 56, 111, 167, 222, 278, 334, 389, 445]]
preallocating arrays
running prediction

  0%|          | 0/792 [00:00<?, ?it/s]
  0%|          | 1/792 [00:03<39:59,  3.03s/it]
  0%|          | 2/792 [00:03<19:10,  1.46s/it]
  0%|          | 3/792 [00:03<12:31,  1.05it/s]
  1%|          | 4/792 [00:04<09:24,  1.40it/s]
  1%|          | 5/792 [00:04<07:41,  1.71it/s]
  1%|          | 6/792 [00:04<06:38,  1.97it/s]
  1%|          | 7/792 [00:05<05:59,  2.19it/s]
  1%|          | 8/792 [00:05<05:32,  2.35it/s]
  1%|          | 9/792 [00:05<05:15,  2.48it/s]
  1%|▏         | 10/792 [00:06<05:03,  2.58it/s]
  1%|▏         | 11/792 [00:06<04:54,  2.65it/s]
..........
 99%|█████████▉| 788/792 [04:46<00:01,  2.79it/s]
100%|█████████▉| 789/792 [04:46<00:01,  2.79it/s]
100%|█████████▉| 790/792 [04:46<00:00,  2.74it/s]
100%|█████████▉| 791/792 [04:47<00:00,  2.76it/s]
100%|██████████| 792/792 [04:47<00:00,  2.76it/s]
100%|██████████| 792/792 [04:47<00:00,  2.75it/s]

Prediction done, transferring to CPU if needed
sending off prediction to background worker for resampling and export
done with xxxx_label.nii.gz

Inference settings are:

   num_parts: 1
   num_processes_preprocessing: 1
   num_processes_segmentation_export: 1
   overwrite: False
   torch.set_num_threads(1)
   torch.set_num_interop_threads(1)
   part_id: 0
   save_probabilities: False
   use_gaussian: True
   use_mirroring: True
   perform_everything_on_gpu: False
   device: torch.device('cuda', 0)
   default_num_processes: 12

I tried increasing the number of workers for preprocessing/segmentation export (to 4) and running everything on the GPU, but there was no change. Have you faced such an issue before?

Thanks for your help in advance,

FabianIsensee commented 9 months ago

Hey, so this is a relatively large image. The message "done with xxxx_label.nii.gz" merely indicates that the GPU prediction is done. After that, a background worker still has to resize the prediction to the correct shape, generate a segmentation map from it, and finally export it as a NIfTI file. For images this large that can take a while. You could check top/htop to see whether this process is still running. nnUNet_predict will only exit once the export of the prediction is complete. So being patient is the name of the game ;-) Allocating more workers won't help because the resizing is always single-threaded (thanks, scipy).
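For anyone who wants to check this programmatically rather than via top/htop, here is a minimal sketch. It uses psutil, which is a separate package and not part of nnU-Net, and it matches processes by command line, so the substring it looks for is an assumption about how the inference process appears on your system.

```python
# Rough sketch: list nnU-Net inference processes and their CPU usage to see
# whether the background export worker is still busy. Requires `pip install psutil`.
import psutil

for p in psutil.process_iter(["pid", "name", "cmdline"]):
    cmdline = " ".join(p.info["cmdline"] or [])
    if "nnUNetv2_predict" in cmdline or "nnunetv2" in cmdline:  # assumed name match
        # a worker pinned near 100% CPU here means the single-threaded resampling is still running
        print(p.info["pid"], p.status(), f"{p.cpu_percent(interval=0.5):.0f}% CPU")
```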

mazensoufi commented 9 months ago

@FabianIsensee Thanks a lot for your response. I understand the situation now; I will run it again and wait patiently :-)

FabianIsensee commented 8 months ago

Hey, how did it go? Did the inference complete?

mazensoufi commented 8 months ago

Hi, thanks for asking. Unfortunately no: even though I was patient (I waited a few days on a single volume), the job kept running but never produced any output. I then skipped the resampling (the voxel spacing was not much smaller than that of the training data), and the prediction completed quickly with acceptable accuracy, so the resampling step seems to be the bottleneck. I still haven't found the reason behind the issue.

FabianIsensee commented 8 months ago

You are not able to share your image & checkpoint, I presume? If you can share them, I could dig into it on our end :-)

mazensoufi commented 8 months ago

Thanks a lot, as always, for your help. Unfortunately, sharing the model and data is not possible due to ethical restrictions. I would appreciate it if you could share any solution you find in the future.

FabianIsensee commented 8 months ago

Well, that's a bummer! Have you tried debugging things on your end? You could try running inference in one of the other ways nnU-Net supports: https://github.com/MIC-DKFZ/nnUNet/blob/master/nnunetv2/inference/readme.md
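For reference, a minimal sketch of the Python API route described in that readme, matching the fold/dataset settings from this thread. The paths and the "Dataset021_xxxx" folder name are placeholders, and argument names may differ slightly between nnU-Net versions, so please check the linked readme for the exact signatures.

```python
import torch
from nnunetv2.inference.predict_from_raw_data import nnUNetPredictor

predictor = nnUNetPredictor(
    tile_step_size=0.5,
    use_gaussian=True,
    use_mirroring=True,
    perform_everything_on_gpu=False,   # mirrors the settings used in this thread
    device=torch.device("cuda", 0),
    verbose=False,
    allow_tqdm=True,
)
predictor.initialize_from_trained_model_folder(
    "/path/to/nnUNet_results/Dataset021_xxxx/nnUNetTrainer__nnUNetPlans__3d_fullres",  # placeholder path
    use_folds=(4,),
    checkpoint_name="checkpoint_final.pth",
)
predictor.predict_from_files(
    "/path/to/input_folder",    # placeholder
    "/path/to/output_folder",   # placeholder
    save_probabilities=False,
    overwrite=False,
    num_processes_preprocessing=1,
    num_processes_segmentation_export=1,
    folder_with_segmentations_from_prev_stage=None,
    num_parts=1,
    part_id=0,
)
```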

mazensoufi commented 8 months ago

Thanks, I will try the other inference methods on that case (I'm using some of those for processing large batches of volumes, so helpful!)

For debugging, I tried applying the inference to a part of the original volume by reducing the number of slices in the Z direction, but that did not change the behavior either. I will also dig deeper to see where the resampling gets stuck and share anything I find.
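In case it helps to isolate the step that hangs, here is a rough sketch of calling the export-time resampling directly, with the shapes, spacings and keyword arguments copied from the log above. The import path and the positional argument order are assumptions based on that log, so please double-check them against your installed nnU-Net version.

```python
import numpy as np
# assumed import path; the function name appears in the log above
from nnunetv2.preprocessing.resampling.default_resampling import resample_data_or_seg_to_shape

# dummy prediction in preprocessed space, (c, x, y, z), same size as in the log
pred = np.random.random_sample((1, 951, 557, 557)).astype(np.float32)

# resample back to the original geometry, as the background export worker would
out = resample_data_or_seg_to_shape(
    pred,
    new_shape=(951, 512, 512),                                   # original image shape
    current_spacing=(0.800000011920929, 0.6830000281333923, 0.6830000281333923),
    new_spacing=(0.8000490069389343, 0.7429999709129333, 0.7429999709129333),
    is_seg=False, order=3, order_z=0, force_separate_z=None,     # kwargs from the log
)
print(out.shape)   # if this never returns, the hang is in the resampling itself
```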

FabianIsensee commented 8 months ago

Any update on this?

mazensoufi commented 8 months ago

Sorry, no updates yet, as I got busy with other issues. I will close this for now and reopen it once I have updates. Thanks a lot for your help!

mazensoufi commented 3 months ago

@FabianIsensee Hi Fabian, sorry it took a while, but I tried debugging and found that the issue was related to the lru_cache decorator settings on the resampling functions in plans_handler.py.

The maxsize parameter defaults to 128 for several functions (e.g. the resampling ones, like resampling_fn_seg), but when the volume is very large, the number of subvolumes extracted from it becomes much larger than 128, and the CPU process doing the resampling seems to crash or wait indefinitely.

After setting maxsize to None, the problem was solved. However, that allows the cache to grow without bound, so I think it should be used carefully (or maybe adjusted dynamically based on the size of the segmented volume...).

Please let me know if you have any comments on this solution.

FabianIsensee commented 3 months ago

I am not sure I agree with that interpretation. We always specify a maxsize, and the value we specify is typically quite small (1-10); resampling_fn_seg, for example, has maxsize=1. Regardless, the caching here is completely unrelated to the subvolumes: not only because the subvolumes are merged before the result is passed to the resampling function, but more importantly because resampling_fn_seg only returns a function pointer to the resampling function and is not the resampling function itself. So no computed values are cached, just which function is supposed to be used. Since the output is also always the same, modifying the maxsize argument of the lru_cache wrapping that function should have no effect. Do you have any other explanation for why it works now, or did you change other things as well?
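A toy illustration of that point (this is not the actual plans_handler.py code): the lru_cache only memoizes which function pointer to return, so its maxsize has nothing to do with how many patches or voxels get resampled.

```python
from functools import lru_cache, partial

def resample_data_or_seg_to_shape(data, new_shape, current_spacing, new_spacing,
                                  is_seg=False, order=3):
    # stand-in for the real resampling: this is the expensive call,
    # and nothing it computes is ever cached
    return data

@lru_cache(maxsize=1)
def resampling_fn_seg():
    # only this cheap lookup is cached: a partial (function pointer + kwargs),
    # which is identical for every call
    return partial(resample_data_or_seg_to_shape, is_seg=True, order=1)

fn = resampling_fn_seg()   # cached after the first call; maxsize=1 is plenty
# fn(seg, new_shape, old_spacing, new_spacing) would then do the actual, uncached work
```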