MASILab / 3DUX-Net


I ran into CUDA out of memory when validating #46

Open uto-lt opened 11 months ago

uto-lt commented 11 months ago

Hello, thanks for your excellent work; I am very interested in it. Everything goes well when I use the public dataset referenced in the code, but when I train on my own dataset, which is large and contains high-resolution images, it raises the following error.

Traceback (most recent call last):
  File "/home/lt/Desktop/FLARE23_code/3DUX-Net-main/main_finetune_pal.py", line 328, in <module>
    global_step, dice_val_best, global_step_best = train(
  File "/home/lt/Desktop/FLARE23_code/3DUX-Net-main/main_finetune_pal.py", line 288, in train
    dice_val = validation(epoch_iterator_val)
  File "/home/lt/Desktop/FLARE23_code/3DUX-Net-main/main_finetune_pal.py", line 245, in validation
    val_output_convert = [
  File "/home/lt/Desktop/FLARE23_code/3DUX-Net-main/main_finetune_pal.py", line 246, in <listcomp>
    post_pred(val_pred_tensor) for val_pred_tensor in val_outputs_list
  File "/home/lt/anaconda3/envs/lt_python39/lib/python3.9/site-packages/monai/utils/deprecate_utils.py", line 217, in _wrapper
    return func(*args, **kwargs)
  File "/home/lt/anaconda3/envs/lt_python39/lib/python3.9/site-packages/monai/utils/deprecate_utils.py", line 217, in _wrapper
    return func(*args, **kwargs)
  File "/home/lt/anaconda3/envs/lt_python39/lib/python3.9/site-packages/monai/utils/deprecate_utils.py", line 217, in _wrapper
    return func(*args, **kwargs)
  [Previous line repeated 1 more time]
  File "/home/lt/anaconda3/envs/lt_python39/lib/python3.9/site-packages/monai/transforms/post/array.py", line 242, in __call__
    img_t = one_hot(img_t, num_classes=to_onehot, dim=0)
  File "/home/lt/anaconda3/envs/lt_python39/lib/python3.9/site-packages/monai/networks/utils.py", line 96, in one_hot
    o = torch.zeros(size=sh, dtype=dtype, device=labels.device)
RuntimeError: CUDA out of memory. Tried to allocate 14.21 GiB (GPU 0; 47.54 GiB total capacity; 35.22 GiB already allocated; 4.58 GiB free; 40.77 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation.  See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF

It worked well during the training phase, but the code reported an error during the validation phase. Do you know how to solve this problem? I would appreciate it.

leeh43 commented 11 months ago

Hi, thank you for your interest in our work! What are your batch size and cache rate for training? I believe the preprocessing for the FLARE dataset resamples the volumes to 1.0x1.0x1.2 spacing, so it shouldn't cause any problem for either training or validation.

It would also be great to know the minimum and maximum resolution along the x, y, and z axes across all samples in your dataset.
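For reference, this kind of resampling is typically done with MONAI's Spacingd transform; the following is only a minimal sketch (the dictionary keys and interpolation modes are assumptions, not the exact pipeline from this repo):

```python
from monai.transforms import Compose, EnsureChannelFirstd, LoadImaged, Spacingd

# Sketch: resample images and labels to 1.0 x 1.0 x 1.2 spacing before training.
preprocess = Compose([
    LoadImaged(keys=["image", "label"]),
    EnsureChannelFirstd(keys=["image", "label"]),
    Spacingd(
        keys=["image", "label"],
        pixdim=(1.0, 1.0, 1.2),        # target voxel spacing
        mode=("bilinear", "nearest"),  # smooth interpolation for images, nearest-neighbor for labels
    ),
])
```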

uto-lt commented 11 months ago

Hi, thanks for your reply @leeh43. I have found a solution by passing device=torch.device('cpu') to sliding_window_inference, which keeps only the small patch currently being inferred on the GPU while the remaining patches are cached in CPU memory.
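For anyone hitting the same error, here is a minimal sketch of that workaround (variable names such as model and val_inputs, and the roi_size/sw_batch_size values, are placeholders rather than the exact settings from this repo):

```python
import torch
from monai.inferers import sliding_window_inference

with torch.no_grad():
    val_outputs = sliding_window_inference(
        val_inputs,                      # full validation volume
        roi_size=(96, 96, 96),           # patch size used during training
        sw_batch_size=2,
        predictor=model,
        overlap=0.5,
        sw_device=torch.device("cuda"),  # each patch is still inferred on the GPU
        device=torch.device("cpu"),      # stitched output is aggregated in CPU RAM
    )
```

With the aggregated prediction already on the CPU, the subsequent one_hot allocation in the post-processing transform happens in system RAM rather than GPU memory, which is where the original OOM occurred.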

By the way, I still have a question. I am participating in the FLARE23 competition, where the maximum resolution along the x, y, and z axes across all samples can be 512x512x512, and the competition requires us to use less than 28 GB of memory. If I want to use less memory, should the spacing be as large as possible or as small as possible?

Looking forward to your help~