MIC-DKFZ / nnUNet

Prediction - One or more background workers are no longer alive #1916

Closed SanBast closed 8 months ago

SanBast commented 9 months ago

Hello, I was trying to predict liver segmentations on my own with nnUNetv2 and ran into the error below. This never happened with other datasets (I tried prostate, spleen, hippocampus, heart, and pancreas):

Please cite the following paper when using nnU-Net:
Isensee, F., Jaeger, P. F., Kohl, S. A., Petersen, J., & Maier-Hein, K. H. (2021). nnU-Net: a self-configuring method for deep learning-based biomedical image segmentation. Nature methods, 18(2), 203-211.
#######################################################################

There are 110 cases in the source folder
I am process 0 out of 1 (max process ID is 0, we start counting with 0!)
There are 110 cases that I would like to predict
using pin_memory on device 0

Predicting liver_0:
perform_everything_on_gpu: True
Prediction done, transferring to CPU if needed
sending off prediction to background worker for resampling and export
done with liver_0

Predicting liver_1:
perform_everything_on_gpu: True
Prediction done, transferring to CPU if needed
sending off prediction to background worker for resampling and export
done with liver_1

Predicting liver_100:
perform_everything_on_gpu: True
Prediction done, transferring to CPU if needed
sending off prediction to background worker for resampling and export
done with liver_100
Exception in thread Thread-4:
Traceback (most recent call last):
  File "/usr/lib/python3.8/threading.py", line 932, in _bootstrap_inner
    self.run()
  File "/usr/lib/python3.8/threading.py", line 870, in run
    self._target(*self._args, **self._kwargs)
  File "/data/marciano/my_envs/FORE/lib/python3.8/site-packages/batchgenerators/dataloading/multi_threaded_augmenter.py", line 92, in results_loop
    raise RuntimeError("One or more background workers are no longer alive. Exiting. Please check the print"
RuntimeError: One or more background workers are no longer alive. Exiting. Please check the print statements above for the actual error message

From what I can see, I think the problem is in batchgenerators, but I don't know how to proceed. I have already read all the other threads about this issue (I even tried setting OMP_NUM_THREADS=1 on the command line), but nothing solved it.

Can you please help me with this? Thanks for your kind attention, V

SanBast commented 8 months ago

@GregorKoehler @FabianIsensee Hi again. While awaiting your response I ran more experiments. The problem still occurs only with the liver dataset. I suspect that nnUNetv2_plan_and_preprocess (and batchgenerators in particular) hits a limit when CPU memory is "limited". This does not happen with other datasets.

So I was wondering whether you could run the same experiment I did, i.e. take the MSD Liver dataset, split imagesTr into train, validation, and test sets (so using even fewer cases than in your original paper), and monitor the memory usage of the preprocessing. My concern is that for certain datasets nnUNet and batchgenerators need a minimum amount of memory, which should be documented for future users...

I'm available for any discussion :)
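
For anyone who wants to check the memory usage mentioned above on their own machine, here is a minimal sketch, not part of nnU-Net. It assumes a Unix system (the standard-library resource module is not available on Windows), and the helper name peak_child_memory is made up for illustration. Note that ru_maxrss is reported in kilobytes on Linux and in bytes on macOS, and that it reflects the largest single child process that has been waited for, not the sum over all workers.

```python
import resource
import subprocess
import sys


def peak_child_memory(command):
    """Run `command` to completion and return (exit_code, peak_rss).

    peak_rss is ru_maxrss for RUSAGE_CHILDREN: the high-water mark of the
    largest child process waited for so far (kilobytes on Linux, bytes on
    macOS). It is NOT the combined memory of several parallel workers.
    """
    completed = subprocess.run(command)
    peak_rss = resource.getrusage(resource.RUSAGE_CHILDREN).ru_maxrss
    return completed.returncode, peak_rss


if __name__ == "__main__":
    # Stand-in workload that allocates roughly 50 MB; replace the list with
    # the real preprocessing or prediction command you want to measure.
    code, peak = peak_child_memory(
        [sys.executable, "-c", "x = bytearray(50_000_000)"]
    )
    print("exit code:", code, "peak RSS (platform units):", peak)
```

A worker that gets killed for running out of memory usually shows up as a non-zero or negative exit code, which is a quicker hint than the traceback from the results thread.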

FabianIsensee commented 8 months ago

Are you on the latest master? We do not use batchgenerators in inference, nor during preprocessing. But yes, CPU memory can be a problem. We recommend reducing the number of workers: in nnUNetv2_plan_and_preprocess this is -np 1, and in nnUNetv2_predict it is -npp 1 -nps 1. We cannot go lower than a single worker, so at some point the RAM is just too small. A last resort would be a swap partition located on an SSD.
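
The worker flags above can be put together as in the following sketch. The helper build_predict_command is hypothetical (it is not an nnU-Net function), and the folder names, dataset ID 3, and configuration 3d_fullres are placeholder assumptions; substitute your own values. Only the flags -i, -o, -d, -c, -npp and -nps are taken from the nnUNetv2_predict command line.

```python
import shlex


def build_predict_command(input_dir, output_dir, dataset, configuration,
                          npp=1, nps=1):
    """Assemble an nnUNetv2_predict call with a reduced number of workers.

    npp: worker processes for preprocessing (-npp)
    nps: worker processes for segmentation export (-nps)
    """
    if npp < 1 or nps < 1:
        # A single worker per stage is the lowest possible setting.
        raise ValueError("npp and nps must both be at least 1")
    return [
        "nnUNetv2_predict",
        "-i", input_dir,
        "-o", output_dir,
        "-d", str(dataset),
        "-c", configuration,
        "-npp", str(npp),
        "-nps", str(nps),
    ]


# Placeholder arguments, assumed for illustration only.
cmd = build_predict_command("imagesTs", "predictions", 3, "3d_fullres")
print(shlex.join(cmd))
```

The resulting list can be passed to subprocess.run, or the printed line can be pasted into a shell. For preprocessing, the equivalent setting is -np 1 on nnUNetv2_plan_and_preprocess.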

SanBast commented 8 months ago

To be fair, I don't know. I just followed the instructions here and did a basic pip install nnunet.

But your question about the branch actually explains something I noticed, so maybe I am not on the right nnUNetv2 branch: with the library installed via pip, some modules differ from the ones on GitHub (this module is an example).

Let me clone the repo directly, and I'll come back here to update :)

Thanks for the hint
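
A quick way to see which packages are actually installed in the active environment is sketched below. It uses only the standard library (importlib.metadata, Python 3.8+); the helper name installed_versions is made up for illustration. As far as I know, the PyPI distribution named nnunet is the v1 code base and nnunetv2 is the v2 one, so treat that mapping as an assumption and confirm it against the repository's installation instructions.

```python
from importlib import metadata


def installed_versions(names):
    """Map each distribution name to its installed version, or None if absent."""
    versions = {}
    for name in names:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None
    return versions


# Distribution names relevant to this thread; None means "not installed".
print(installed_versions(["nnunet", "nnunetv2", "batchgenerators"]))
```

If the environment reports a version for nnunet but None for nnunetv2, the pip install did not pull in the v2 package, which would explain modules that differ from the ones on GitHub.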

FabianIsensee commented 8 months ago

any updates on your issue?

SanBast commented 8 months ago

Hi @FabianIsensee, apparently that solved the issue. Thank you very much for your time!

Best, V

FabianIsensee commented 8 months ago

Glad it works now!