torch.cuda.OutOfMemoryError: CUDA out of memory

YLM432423 commented 8 months ago

Hello, I ran the depth estimation training on a 4090 and hit torch.cuda.OutOfMemoryError: CUDA out of memory.

When I wrap the model with model = torch.nn.parallel.DataParallel(model, device_ids=[args.gpu]), training runs on a single 4090 (batch_size=3). However, it still runs out of memory during validation.

Traceback (most recent call last):
  File "/root/meta-prompts/depth/train.py", line 371, in <module>
    main()
  File "/root/meta-prompts/depth/train.py", line 165, in main
    results_dict, loss_val = validate(val_loader, model, criterion_d, 
  File "/root/meta-prompts/depth/train.py", line 306, in validate
    pred = model(input_RGB)
  File "/root/miniconda3/envs/MetaDepth/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
    return forward_call(*args, **kwargs)
  File "/root/miniconda3/envs/MetaDepth/lib/python3.9/site-packages/torch/nn/parallel/data_parallel.py", line 169, in forward
    return self.module(*inputs[0], **kwargs[0])
  File "/root/miniconda3/envs/MetaDepth/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
    return forward_call(*args, **kwargs)
  File "/root/meta-prompts/depth/models_depth/model.py", line 142, in forward
    conv_feats = self.encoder(x)
  File "/root/miniconda3/envs/MetaDepth/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
    return forward_call(*args, **kwargs)
  File "/root/meta-prompts/depth/models_depth/model.py", line 84, in forward
    latents = self.encoder_vq.encode(x).mode()
  File "/root/miniconda3/envs/MetaDepth/lib/python3.9/site-packages/ldm/models/autoencoder.py", line 83, in encode
    h = self.encoder(x)
  File "/root/miniconda3/envs/MetaDepth/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
    return forward_call(*args, **kwargs)
  File "/root/miniconda3/envs/MetaDepth/lib/python3.9/site-packages/ldm/modules/diffusionmodules/model.py", line 526, in forward
    h = self.down[i_level].block[i_block](hs[-1], temb)
  File "/root/miniconda3/envs/MetaDepth/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
    return forward_call(*args, **kwargs)
  File "/root/miniconda3/envs/MetaDepth/lib/python3.9/site-packages/ldm/modules/diffusionmodules/model.py", line 132, in forward
    h = nonlinearity(h)
  File "/root/miniconda3/envs/MetaDepth/lib/python3.9/site-packages/ldm/modules/diffusionmodules/model.py", line 43, in nonlinearity
    return x*torch.sigmoid(x)
torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 3.06 GiB (GPU 0; 23.65 GiB total capacity; 19.74 GiB already allocated; 1.18 GiB free; 21.84 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation.  See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF
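To make the numbers in the error message easier to read, here is a minimal sketch that parses them and does the arithmetic. The helper name `parse_oom` and the regular expressions are illustrative assumptions that match only the message format shown above; they are not part of PyTorch.

```python
import re

# The message text is copied from the traceback above.
OOM_MESSAGE = (
    "CUDA out of memory. Tried to allocate 3.06 GiB (GPU 0; 23.65 GiB total capacity; "
    "19.74 GiB already allocated; 1.18 GiB free; 21.84 GiB reserved in total by PyTorch)"
)


def parse_oom(message):
    """Extract the GiB figures from an OOM message in the format shown above.

    Hypothetical helper for illustration only; other PyTorch versions may
    word the message differently.
    """
    patterns = {
        "requested": r"Tried to allocate ([\d.]+) GiB",
        "total": r"([\d.]+) GiB total capacity",
        "allocated": r"([\d.]+) GiB already allocated",
        "free": r"([\d.]+) GiB free",
        "reserved": r"([\d.]+) GiB reserved",
    }
    return {key: float(re.search(pat, message).group(1)) for key, pat in patterns.items()}


stats = parse_oom(OOM_MESSAGE)

# Memory held by PyTorch's caching allocator but not occupied by live tensors.
cached_unused = stats["reserved"] - stats["allocated"]

# How far the single request exceeds the memory still free on the device.
shortfall = stats["requested"] - stats["free"]

print(f"requested:     {stats['requested']:.2f} GiB")
print(f"cached unused: {cached_unused:.2f} GiB")
print(f"shortfall:     {shortfall:.2f} GiB")
```

The single 3.06 GiB request is larger than the free device memory and larger than the cached-but-unused pool taken on its own, so one large activation in the encoder is enough to trigger the error. The `max_split_size_mb` hint in the message targets fragmentation only; it does not reduce the size of that request.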

YLM432423 commented 8 months ago

Why does training run normally, while GPU memory blows up during validation?

andreaeusebi commented 8 months ago

Hello @YLM432423, how did you solve this? I'm running the "dist_train.sh" script as a test and getting the same error.

YLM432423 commented 7 months ago

The cause is a data augmentation operation in the validation process, which increases GPU memory usage. You can remove that augmentation step. This does not affect the validity of the model's validation results.
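As a rough illustration of the fix described above, here is a minimal sketch of a validation step with the augmentation made optional. The thread does not say which augmentation the repository applies, so the horizontal-flip averaging, the function name `validate_step`, and the `use_flip_aug` flag are all assumptions for illustration. The sketch also runs under `torch.no_grad()`, which is standard practice for evaluation and which the repository's `validate` function may already do.

```python
import torch


@torch.no_grad()  # no autograd graph is kept, so activations are freed early
def validate_step(model, input_rgb, use_flip_aug=False):
    """Run one validation forward pass on a batch of shape (N, C, H, W).

    Hypothetical helper: `use_flip_aug` stands in for whatever augmentation
    the real validation code performs. Each extra augmented pass adds its
    own activations on top of the plain forward pass.
    """
    model.eval()
    pred = model(input_rgb)
    if use_flip_aug:
        # Assumed augmentation: predict on the horizontally flipped image,
        # flip the prediction back, and average with the plain prediction.
        pred_flip = model(torch.flip(input_rgb, dims=[3]))
        pred = 0.5 * (pred + torch.flip(pred_flip, dims=[3]))
    return pred


if __name__ == "__main__":
    # Tiny stand-in network so the sketch runs on CPU without the real model.
    toy_model = torch.nn.Conv2d(3, 1, kernel_size=3, padding=1)
    batch = torch.randn(2, 3, 8, 8)
    print(validate_step(toy_model, batch).shape)
```

Calling `validate_step(model, input_RGB)` with the default `use_flip_aug=False` corresponds to the suggestion of dropping the augmentation during validation.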

water221 commented 1 month ago

Hello, did you train the model on a single 4090 with 24 GB of video memory? Can the depth estimation task be trained on a single card? If so, please reply.

fudan-zvg / meta-prompts

torch.cuda.OutOfMemoryError: CUDA out of memory #8