MIC-DKFZ / nnUNet


Runtime error on Azure batch cluster #2206

Closed TucoFernandes closed 5 months ago

TucoFernandes commented 5 months ago

I trained a model using the nnU-Net framework on an Azure VM and it worked great. I then tested it on new images and the nnUNet predictor worked well too. Now I have a BIG batch of images to score, so I'm creating a pipeline job for that. I was able to preprocess the DICOM images in the pipeline job, but when the first NIfTI file is read, the process stops right after inference finishes, just as it is about to write out the resulting prediction. Has anyone run into an issue like this? Thank you in advance.
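For reference, the prediction step follows the standard nnUNetPredictor workflow, roughly like the sketch below. The model folder, dataset name, fold selection, and input/output paths are placeholders, not the exact configuration of my job:

```python
import torch
from nnunetv2.inference.predict_from_raw_data import nnUNetPredictor

# Instantiate the predictor (device choice depends on the cluster node)
predictor = nnUNetPredictor(device=torch.device("cuda"))

# Placeholder model folder / fold / checkpoint -- adjust to the trained model
predictor.initialize_from_trained_model_folder(
    "/mnt/models/Dataset001_Example/nnUNetTrainer__nnUNetPlans__3d_fullres",
    use_folds=(0,),
    checkpoint_name="checkpoint_final.pth",
)

# Predict from a folder of NIfTI files produced by the DICOM preprocessing step
predictor.predict_from_files(
    "/mnt/data/nifti_in",          # input folder (or list of lists of files)
    "/mnt/data/predictions_out",   # output folder for the predicted segmentations
    save_probabilities=False,
    overwrite=True,
    num_processes_preprocessing=2,
    num_processes_segmentation_export=2,
)
```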

This is the error:

Traceback (most recent call last):
  File "/azureml-envs/azureml_c4bcd504b3ff56cfbcfd2bba26738797/lib/python3.11/multiprocessing/process.py", line 314, in _bootstrap
    self.run()
  File "/azureml-envs/azureml_c4bcd504b3ff56cfbcfd2bba26738797/lib/python3.11/multiprocessing/process.py", line 108, in run
    self._target(*self._args, **self._kwargs)
  File "/azureml-envs/azureml_c4bcd504b3ff56cfbcfd2bba26738797/lib/python3.11/site-packages/nnunetv2/inference/data_iterators.py", line 58, in preprocess_fromfiles_save_to_queue
    raise e
  File "/azureml-envs/azureml_c4bcd504b3ff56cfbcfd2bba26738797/lib/python3.11/site-packages/nnunetv2/inference/data_iterators.py", line 50, in preprocess_fromfiles_save_to_queue
    target_queue.put(item, timeout=0.01)
  File "<string>", line 2, in put
  File "/azureml-envs/azureml_c4bcd504b3ff56cfbcfd2bba26738797/lib/python3.11/multiprocessing/managers.py", line 821, in _callmethod
    conn.send((self._id, methodname, args, kwds))
  File "/azureml-envs/azureml_c4bcd504b3ff56cfbcfd2bba26738797/lib/python3.11/multiprocessing/connection.py", line 206, in send
    self._send_bytes(_ForkingPickler.dumps(obj))
  File "/azureml-envs/azureml_c4bcd504b3ff56cfbcfd2bba26738797/lib/python3.11/multiprocessing/reduction.py", line 51, in dumps
    cls(buf, protocol).dump(obj)
  File "/azureml-envs/azureml_c4bcd504b3ff56cfbcfd2bba26738797/lib/python3.11/site-packages/torch/multiprocessing/reductions.py", line 568, in reduce_storage
    fd, size = storage._share_fd_cpu_()
  File "/azureml-envs/azureml_c4bcd504b3ff56cfbcfd2bba26738797/lib/python3.11/site-packages/torch/storage.py", line 294, in wrapper
    return fn(self, *args, **kwargs)
  File "/azureml-envs/azureml_c4bcd504b3ff56cfbcfd2bba26738797/lib/python3.11/site-packages/torch/storage.py", line 364, in _share_fd_cpu_
    return super()._share_fd_cpu_(*args, **kwargs)
RuntimeError: unable to write to file : No space left on device (28)

TucoFernandes commented 5 months ago

This issue was not related to nnU-Net. I solved it by changing the Docker configuration to use a larger shared-memory size. Thanks for the great tool. 👍
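For anyone hitting the same error on Azure ML: the container's default shared memory (/dev/shm) is small, and PyTorch's worker processes share tensors through it, which is what fails with "No space left on device". Below is a minimal sketch of raising it for an Azure ML v2 command job via the Python SDK; the script name, environment, compute target, and the 8g value are illustrative placeholders, and the exact SDK surface may differ for your Azure ML version:

```python
# Hypothetical sketch: raise the container's shared-memory size for an
# Azure ML v2 command job (names, paths, and sizes are placeholders).
from azure.ai.ml import command
from azure.ai.ml.entities import JobResourceConfiguration

job = command(
    code="./src",                               # folder containing the scoring script
    command="python run_nnunet_inference.py",   # hypothetical entry point
    environment="nnunet-inference-env@latest",  # hypothetical registered environment
    compute="batch-gpu-cluster",                # hypothetical compute target
)

# The default shm size inside the job container is often only 64 MB; torch's
# file-descriptor tensor sharing fails once it fills up.
job.resources = JobResourceConfiguration(shm_size="8g")
```

The job is then submitted as usual with `ml_client.jobs.create_or_update(job)`. Equivalently, raising `--shm-size` in whatever Docker configuration launches the container achieves the same effect.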

HuiLin0220 commented 4 months ago

What shared-memory size did you use?