Lornatang / SRGAN-PyTorch

A simple and complete implementation of super-resolution paper.
Apache License 2.0

CUDA out of memory #55

Closed leminhhuy132 closed 2 years ago

leminhhuy132 commented 2 years ago

Why do I get this error when I try to calculate PSNR during training, even though the batch_size in the config is very small?

Load all datasets successfully.
Build SRResNet model successfully.
Define all loss functions successfully.
Define all optimizer functions successfully.
Check whether the pretrained model is restored...
Epoch: [1][ 0/1989] Time 15.878 (15.878) Data 0.000 ( 0.000) Loss 0.267222 (0.267222)
Traceback (most recent call last):
  File "train_srresnet.py", line 463, in <module>
    main()
  File "train_srresnet.py", line 98, in main
    train_loss = train(model, train_prefetcher, pixel_criterion, optimizer, epoch, scaler, writer, psnr_model, ssim_model)
  File "train_srresnet.py", line 249, in train
    scaler.scale(loss).backward()
  File "/usr/local/lib/python3.7/dist-packages/torch/_tensor.py", line 363, in backward
    torch.autograd.backward(self, gradient, retain_graph, create_graph, inputs=inputs)
  File "/usr/local/lib/python3.7/dist-packages/torch/autograd/__init__.py", line 175, in backward
    allow_unreachable=True, accumulate_grad=True)  # Calls into the C++ engine to run the backward pass
RuntimeError: CUDA out of memory. Tried to allocate 288.00 MiB (GPU 0; 14.76 GiB total capacity; 11.70 GiB already allocated; 123.75 MiB free; 13.22 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation. See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF
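The hint at the end of the error message compares reserved memory with allocated memory. As a rough worked example, the sketch below pulls the sizes out of the message quoted above and computes that gap. The helper name `parse_oom_message` is hypothetical and the regular expressions only target this message format; nothing here comes from the repository.

```python
import re

# The size-related part of the error message quoted in this issue.
OOM_MESSAGE = (
    "RuntimeError: CUDA out of memory. Tried to allocate 288.00 MiB "
    "(GPU 0; 14.76 GiB total capacity; 11.70 GiB already allocated; "
    "123.75 MiB free; 13.22 GiB reserved in total by PyTorch)"
)

_UNIT_TO_MIB = {"MiB": 1.0, "GiB": 1024.0}


def parse_oom_message(message):
    """Return the sizes quoted in a CUDA OOM message, converted to MiB.

    Hypothetical helper: it only understands the wording shown above.
    """
    sizes = {}
    requested = re.search(r"Tried to allocate ([\d.]+) (MiB|GiB)", message)
    if requested:
        sizes["requested"] = float(requested.group(1)) * _UNIT_TO_MIB[requested.group(2)]
    pattern = r"([\d.]+) (MiB|GiB) (total capacity|already allocated|free|reserved)"
    for value, unit, label in re.findall(pattern, message):
        sizes[label] = float(value) * _UNIT_TO_MIB[unit]
    return sizes


sizes = parse_oom_message(OOM_MESSAGE)
# Memory held by PyTorch's caching allocator but not backing live tensors.
cached_but_unused = sizes["reserved"] - sizes["already allocated"]

print(f"requested:         {sizes['requested']:.2f} MiB")
print(f"free on device:    {sizes['free']:.2f} MiB")
print(f"cached but unused: {cached_but_unused:.2f} MiB")
```

For this message the gap is 13.22 GiB - 11.70 GiB = 1.52 GiB, while the failed request was 288 MiB with 123.75 MiB free. Whether that gap points to fragmentation or simply to a model and batch that do not fit is not established in this thread.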

Lornatang commented 2 years ago

What is the size of your images? Please also watch how CUDA memory usage fluctuates while the program is running.
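One way to watch the memory fluctuation from inside the training script is to log PyTorch's own counters at a few points per iteration. This is a minimal sketch, not code from the repository: `cuda_memory_mib` and `log_cuda_memory` are hypothetical helper names, and the `try`/`except` around the import only exists so the snippet can be pasted on a machine without PyTorch.

```python
try:
    import torch
except ImportError:  # allows importing the helper where PyTorch is missing
    torch = None

MIB = 1024 ** 2


def cuda_memory_mib(device=0):
    """Return allocated, reserved and peak-allocated CUDA memory in MiB.

    Returns None when PyTorch or a CUDA device is not available.
    """
    if torch is None or not torch.cuda.is_available():
        return None
    return {
        "allocated": torch.cuda.memory_allocated(device) / MIB,
        "reserved": torch.cuda.memory_reserved(device) / MIB,
        "peak_allocated": torch.cuda.max_memory_allocated(device) / MIB,
    }


def log_cuda_memory(tag, device=0):
    """Print one line of memory statistics, labelled with `tag`."""
    stats = cuda_memory_mib(device)
    if stats is None:
        print(f"[{tag}] CUDA not available")
    else:
        print(
            f"[{tag}] allocated={stats['allocated']:.1f} MiB "
            f"reserved={stats['reserved']:.1f} MiB "
            f"peak={stats['peak_allocated']:.1f} MiB"
        )
    return stats


log_cuda_memory("startup")
```

Calling `log_cuda_memory` after the forward pass, after `backward()`, and after the PSNR update in each iteration would show which step makes the allocated figure grow.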

leminhhuy132 commented 2 years ago

When I calculate the PSNR and update the PSNR values, I get the error above, and my CUDA memory usage is as shown below. If I calculate the PSNR but don't update the values, the error no longer occurs.
[Screenshot from 2022-06-10 08-15-13]
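A pattern that can produce exactly this symptom in PyTorch training loops in general is accumulating a metric as a tensor that still carries its autograd graph, so every update keeps the graph of that iteration alive. This thread does not confirm that this is the cause here. The sketch below shows a running-average meter that stores plain Python floats instead; `ScalarMeter` is a hypothetical name, not a class from the repository.

```python
class ScalarMeter:
    """Running average that stores plain Python floats only.

    Hypothetical helper. Converting each value to a float before adding
    it means the meter never holds a reference to a tensor or its graph.
    """

    def __init__(self):
        self.sum = 0.0
        self.count = 0

    def update(self, value, n=1):
        # Tensors expose .detach(); plain numbers do not.
        if hasattr(value, "detach"):
            value = value.detach()
        # float() on a tensor assumes it holds a single element.
        self.sum += float(value) * n
        self.count += n

    @property
    def avg(self):
        return self.sum / self.count if self.count else 0.0


meter = ScalarMeter()
meter.update(2.0, n=2)
meter.update(4.0, n=2)
print(meter.avg)

# Assumed usage inside train() (sr, hr and psnr_meter are illustrative names):
#     with torch.no_grad():
#         psnr = psnr_model(sr, hr).mean()
#     psnr_meter.update(psnr, n=sr.size(0))
```

Computing the metric under `torch.no_grad()` additionally avoids building a graph for the PSNR computation itself.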

Lornatang commented 2 years ago

The PSNR model is small and fast, so it shouldn't take up that much memory. Can you tell me the memory usage with and without PSNR?

leminhhuy132 commented 2 years ago

When I comment out the PSNR code in the train function, CUDA memory usage is as shown below.
[Screenshot from 2022-06-10 09-36-40]

Lornatang commented 2 years ago

Does this issue occur during training or during testing?

Lornatang commented 2 years ago

train() or validate()?

leminhhuy132 commented 2 years ago

only in train()

Lornatang commented 2 years ago

Which CUDA version? Which cuDNN version? Which PyTorch and Torchvision versions?
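The versions asked for here can be collected from Python in one go. The following is a small sketch with a hypothetical helper name, `version_report`; it reports the CUDA version that the installed PyTorch build was compiled against, which may differ from a system-wide CUDA toolkit.

```python
import platform


def version_report():
    """Collect Python, PyTorch, CUDA, cuDNN and Torchvision versions.

    Hypothetical helper; missing packages are reported as None.
    """
    report = {"python": platform.python_version()}
    try:
        import torch
    except ImportError:
        report["torch"] = None
        return report

    report["torch"] = torch.__version__
    report["torch_cuda_build"] = torch.version.cuda  # None for CPU-only builds
    report["cudnn"] = torch.backends.cudnn.version()  # integer, or None
    report["cuda_available"] = torch.cuda.is_available()
    if report["cuda_available"]:
        report["gpu"] = torch.cuda.get_device_name(0)

    try:
        import torchvision
        report["torchvision"] = torchvision.__version__
    except ImportError:
        report["torchvision"] = None
    return report


for name, value in version_report().items():
    print(f"{name}: {value}")
```

PyTorch also ships `python -m torch.utils.collect_env`, which prints a more complete environment summary suitable for pasting into an issue.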

Lornatang commented 2 years ago

In the current code, the PSNR calculation is not called in train(); it is only called in validate(). Can you check the differences between your local code and the latest code on Git?

leminhhuy132 commented 2 years ago

CUDA 11.1
cuDNN 8.0.5
PyTorch 1.11.0+cu113
Torchvision 0.12.0+cu113
I am running this in Google Colab.

leminhhuy132 commented 2 years ago

Yes, but I don't know why it gives the error when I call train().

Lornatang commented 2 years ago

Please share your config.py file and I will check it.

leminhhuy132 commented 2 years ago

I edited your code in this repo. Please help me check it: git@github.com:leminhhuy132/SRGAN_DACN.git

Lornatang commented 2 years ago

In config.py, line 46 sets batch_size to 256? Isn't that too big?

leminhhuy132 commented 2 years ago

I tried batch_size 16 and still got the error.

Lornatang commented 2 years ago

There may be multiple tasks running on the machine at the same time, or the machine may not have enough memory.
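To check whether other tasks are holding GPU memory, the processes currently using the GPU can be listed with `nvidia-smi`. The sketch below wraps that call from Python; `list_gpu_processes` is a hypothetical helper name, and it assumes the NVIDIA driver tools are installed and on the PATH.

```python
import shutil
import subprocess


def list_gpu_processes():
    """Return nvidia-smi's CSV listing of compute processes, or None.

    Hypothetical helper. Returns None when nvidia-smi is not on the PATH.
    """
    if shutil.which("nvidia-smi") is None:
        return None
    result = subprocess.run(
        [
            "nvidia-smi",
            "--query-compute-apps=pid,process_name,used_memory",
            "--format=csv",
        ],
        capture_output=True,
        text=True,
        check=False,  # do not raise if the driver reports an error
    )
    return result.stdout


listing = list_gpu_processes()
print(listing if listing is not None else "nvidia-smi not found")
```

In a notebook environment, a kernel from an earlier run that was never restarted can show up in this list and keep its memory until it is stopped.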