mazensoufi opened 9 months ago
Hey, so this is a relatively large image. The message `done with xxxx_label.nii.gz` merely indicates that the GPU prediction is done. After that, a background worker still has to resize the prediction to the original shape, generate a segmentation map from it, and finally export it as a NIfTI file. For images this large, that can take a while. You can check top/htop to see whether this process is still running; nnUNet_predict will only exit once the export of the prediction is complete.
So being patient is the name of the game ;-) Allocating more workers won't help because the resizing is always single-threaded (thanks, scipy).
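The export step described above is conceptually a resize of the low-resolution prediction back to the original image shape. Here is a minimal, illustrative sketch of a nearest-neighbour resize for an integer segmentation map in pure NumPy; this is not nnU-Net's actual implementation (which uses scipy internally), just a toy version to show what the single-threaded worker spends its time on:

```python
import numpy as np

def resize_seg_nearest(seg: np.ndarray, target_shape: tuple) -> np.ndarray:
    """Nearest-neighbour resize of an integer segmentation map.

    Simplified stand-in for the single-threaded resampling nnU-Net
    performs during export; illustrative only.
    """
    # For each target voxel along each axis, compute the index of the
    # nearest source voxel, then gather with an outer index product.
    index_grids = [
        np.floor(np.arange(t) * s / t).astype(int)
        for s, t in zip(seg.shape, target_shape)
    ]
    return seg[np.ix_(*index_grids)]

# Upsample a tiny 2x2x2 label map to 4x4x4.
seg = np.arange(8).reshape(2, 2, 2)
big = resize_seg_nearest(seg, (4, 4, 4))
print(big.shape)  # (4, 4, 4)
```

For a real CT-sized volume the same index computation and gather touch hundreds of millions of voxels on a single core, which is why the export can take a long time after the GPU reports "done".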
@FabianIsensee Thanks a lot for your response. I understand the situation; I will run it again and wait patiently :-)
Hey, how did it go? Did the inference complete?
Hi, thanks for the question. Nope. Although I was patient (I waited a few days on a single volume), the job kept running but produced no output. I then skipped the resampling (the voxel spacing was not far from that of the training data), and inference completed quickly with acceptable accuracy, so the resampling step seems to be the bottleneck. I still haven't found the root cause.
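Whether resampling can be safely skipped comes down to how close the image's voxel spacing is to the spacing the model was trained on. A quick hypothetical check (the function name and the 5% tolerance are assumptions for illustration, not nnU-Net defaults):

```python
import numpy as np

def can_skip_resampling(image_spacing, target_spacing, rel_tol=0.05):
    """Return True if every axis' spacing is within rel_tol of the
    target spacing, in which case resampling changes little.

    The threshold is illustrative only, not an nnU-Net default.
    """
    image_spacing = np.asarray(image_spacing, dtype=float)
    target_spacing = np.asarray(target_spacing, dtype=float)
    rel_diff = np.abs(image_spacing - target_spacing) / target_spacing
    return bool(np.all(rel_diff <= rel_tol))

print(can_skip_resampling((0.8, 0.8, 1.0), (0.8, 0.8, 1.0)))  # True
print(can_skip_resampling((0.8, 0.8, 3.0), (0.8, 0.8, 1.0)))  # False
```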
You are not able to share your image & checkpoint, I presume? If you can share them, I could dig into it on our end :-)
Thanks a lot, as always, for your help. Unfortunately, sharing the model and data is not possible due to ethical restrictions. I would appreciate it if you could share any solution you find in the future.
Well that's a bummer! Have you tried debugging things on your end? You could try running inference in one of the other ways nnU-Net supports: https://github.com/MIC-DKFZ/nnUNet/blob/master/nnunetv2/inference/readme.md
Thanks, I will try the other inference methods on that case (I'm already using some of them for processing large batches of volumes, so this is helpful!).
For debugging, I tried applying inference to part of the original volume by reducing the number of slices in the Z direction, but that did not change the behavior either. I will dig deeper to see where the resampling gets stuck and share anything I find.
any update on this?
Sorry, no updates yet, as I got busy with other issues. I will close it for now and reopen once I have updates. Thanks a lot for your help!
@FabianIsensee Hi Fabian,
sorry it took a while, but I tried debugging and found the issue was related to the `lru_cache` decorator settings on the resampling functions in plans_handler.py. The `maxsize` parameter defaults to 128 in multiple functions (e.g. the resampling ones, like `resampling_fn_seg`), but when the volume is very large, the number of subvolumes extracted from it also becomes much larger than 128, causing the CPU doing the resampling to crash or wait indefinitely.
After setting `maxsize` to `None`, the problem was solved. However, that allows the cache to grow without bound, so I think it should be used carefully (or perhaps adjusted dynamically based on the segmented volume size...).
Please let me know if you have any comments on this solution.
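For reference, `functools.lru_cache`'s `maxsize` bounds how many distinct call signatures are kept in the cache; once the limit is exceeded, the least recently used entry is evicted and the function is recomputed on its next call. A small self-contained demonstration:

```python
from functools import lru_cache

calls = []  # records every actual execution (cache misses)

@lru_cache(maxsize=2)
def expensive(x):
    calls.append(x)
    return x * x

expensive(1)  # miss: computed
expensive(2)  # miss: computed
expensive(1)  # hit: served from cache, no new entry in `calls`
expensive(3)  # miss: evicts the least recently used entry (2)
expensive(2)  # miss again: was evicted, so recomputed
print(calls)  # [1, 2, 3, 2]
```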
I am not sure I agree with that interpretation. We always specify a maxsize, and the value we specify is typically quite small (1-10); `resampling_fn_seg`, for example, has maxsize=1.
Regardless, the caching here is completely unrelated to the subvolumes. Not only are the subvolumes merged before the result is passed to the resampling function, but more importantly, `resampling_fn_seg` only returns a function pointer to the resampling function and is not the resampling function itself. No computed values are cached, just which function is supposed to be used. Since the output is also always the same, modifying the maxsize argument of the lru_cache wrapping that function should have no effect.
Do you have any other explanation for why it works now, or did you change other things as well?
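The distinction above can be illustrated with a toy version: when the cached function only returns a function pointer, the heavy work itself runs on every call regardless of `maxsize`. The names below are illustrative stand-ins, not nnU-Net's actual code:

```python
from functools import lru_cache

work_runs = []  # records every execution of the heavy work

def heavy_resample(data):
    work_runs.append(len(data))  # the expensive part: runs on every call
    return [v * 2 for v in data]

@lru_cache(maxsize=1)
def get_resampling_fn(mode: str):
    # Only this lookup is cached: it returns a pointer to the
    # resampling function, never a resampled result.
    return {"seg": heavy_resample}[mode]

fn1 = get_resampling_fn("seg")
fn2 = get_resampling_fn("seg")  # cache hit: same function object
print(fn1 is fn2)               # True
fn1([1, 2, 3])
fn2([1, 2, 3])                  # the resampling itself still runs twice
print(work_runs)                # [3, 3]
```

Since only the cheap lookup is memoized, changing its `maxsize` cannot affect how often the resampling computation itself executes.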
Hi Fabian, first, thanks a lot for your amazing tool!
I'm facing an issue with inference using nnunetv2. I'm running it in 3D full-resolution mode, but the inference stops after issuing the "done with xxx.nii.gz" message. The node is still running, but there is no output in the predictions folder. Other cases were segmented properly; I'm facing this only with one specific case. I tried it on several nodes (different GPUs); the result below is from a node with an RTX A6000 (128 GB RAM, Intel(R) Xeon(R) W-2295 CPU @ 3.00GHz, 12 cores allocated for inference).
My log is here (I'm outputting some intermediate variables for debugging):
Inference settings are:
I tried increasing the number of workers for preprocessing/segmentation export (to 4) and performing everything on the GPU, but there was no change. Have you faced such an issue before?
Thanks for your help in advance,