MIC-DKFZ / nnDetection

nnDetection is a self-configuring framework for 3D (volumetric) medical object detection which can be applied to new data sets without manual intervention. It includes guides for 12 data sets that were used to develop and evaluate the performance of the proposed method.
Apache License 2.0

Shared Memory need 24GB - Can this requirement be reduced? #113

Closed QaisarRajput closed 1 year ago

QaisarRajput commented 2 years ago

Hi, thanks for the amazing repo. We are using this on our own dataset, on infrastructure that cannot provide shared memory beyond a certain point (12GB), and I am getting the error below. Is there a way to reduce this requirement, for example by reducing a multiprocessing parameter to lower the amount of concurrent processing? I am not sure that `det_num_threads=6` is the right knob, as that controls CPU usage while this error seems related to torch (GPU). Please correct me if there is a gap in my understanding.

    7.7 M     Trainable params
    0         Non-trainable params
    7.7 M     Total params
    30.798    Total estimated model params size (MB)
    Validation sanity check: 0it [00:00, ?it/s]INFO Using validation DataLoader3DOffset with {}
    INFO Building Sampling Cache for Dataloder
    Sampling Cache:   0%|          | 0/4 [00:00<?, ?it/s]
    Sampling Cache: 100%|██████████| 4/4 [00:00<00:00, 15917.66it/s]
    INFO Using 5 num_processes and 2 num_cached_per_queue for augmentation.
    INFO VALIDATION KEYS:
     odict_keys(['case_0', 'case_15', 'case_2', 'case_5'])
    Traceback (most recent call last):
      File "/usr/local/lib/python3.9/multiprocessing/queues.py", line 245, in _feed
        obj = _ForkingPickler.dumps(obj)
      File "/usr/local/lib/python3.9/multiprocessing/reduction.py", line 51, in dumps
        cls(buf, protocol).dump(obj)
      File "/usr/local/lib/python3.9/site-packages/torch/multiprocessing/reductions.py", line 358, in reduce_storage
        fd, size = storage._share_fd_cpu_()
    RuntimeError: unable to write to file </torch_211_624382216_0>: No space left on device (28)
    Traceback (most recent call last):
      File "/usr/local/lib/python3.9/multiprocessing/queues.py", line 245, in _feed
        obj = _ForkingPickler.dumps(obj)
      File "/usr/local/lib/python3.9/multiprocessing/reduction.py", line 51, in dumps
        cls(buf, protocol).dump(obj)
      File "/usr/local/lib/python3.9/site-packages/torch/multiprocessing/reductions.py", line 358, in reduce_storage
        fd, size = storage._share_fd_cpu_()
    RuntimeError: unable to write to file </torch_203_1922446424_4>: No space left on device (28)
    /home/dataiku/code/nndet/nndet/core/retina.py:358: UserWarning: __floordiv__ is deprecated, and its behavior will change in a future version of pytorch. It currently rounds toward 0 (like the 'trunc' function NOT 'floor'). This results in incorrect rounding for negative values. To keep the current behavior, use torch.div(a, b, rounding_mode='trunc'), or for actual floor division, use torch.div(a, b, rounding_mode='floor').
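For errors like the one in the traceback, the free space on the shared-memory mount can be checked directly before diving into framework settings. A minimal sketch, assuming a Linux host where torch's CPU tensor sharing is backed by `/dev/shm` (the paths are typical defaults, not nnDetection-specific):

```python
import shutil

def free_gib(path: str) -> float:
    """Free space in GiB on the filesystem backing `path`."""
    return shutil.disk_usage(path).free / 2**30

# On Linux, /dev/shm backs torch's shared-memory CPU tensors;
# "No space left on device" in the traceback means this mount
# (or the disk behind torch's temp files) filled up, not the GPU.
for mount in ("/dev/shm", "/tmp"):
    try:
        print(f"{mount}: {free_gib(mount):.1f} GiB free")
    except OSError:
        print(f"{mount}: not available on this system")
```

Inside Docker the default shm size is only 64 MB, so the container is usually started with something like `docker run --shm-size=12g ...` to raise it.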
mibaumgartner commented 2 years ago

Dear @QaisarRajput ,

Just to clarify: while the error is thrown by torch, the shared memory does not relate to the available VRAM (of the GPU) but to the RAM (of the CPU). I was able to run the container with 12GB of shared memory as well; maybe there is no space left on your hard drive/SSD? (Note: the error does not refer to missing VRAM, which would show up as a CUDA out of memory error.)

Best, Michael
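Besides freeing disk space, the shared-memory footprint can also be lowered by running fewer augmentation worker processes (the log above shows 5). A hedged sketch, assuming the `det_num_threads` environment variable that nnDetection's README describes for sizing the augmentation worker pool; the value 2 is purely illustrative:

```python
import os

# Fewer augmentation worker processes means fewer tensors queued in
# shared memory at once, at the cost of slower data loading.
# det_num_threads is read by nnDetection at startup (per its README),
# so it must be set before training is launched.
os.environ["det_num_threads"] = "2"
print(os.environ["det_num_threads"])
```

Setting it in the shell achieves the same thing, e.g. `det_num_threads=2 nndet_train ...`; the point is only that fewer concurrent workers reduce peak shared-memory use.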

mibaumgartner commented 1 year ago

Since there has been no update for some time, I'll close this issue for now. Please feel free to reopen it if the problem persists, or open a new one.