leminhhuy132 closed this issue 2 years ago
What is the size of your images? Watch how the CUDA memory fluctuates while the program runs.
When I calculate the PSNR and update it, I get the above error; my CUDA memory is as shown below. If I calculate the PSNR but don't update it, there is no error.
The PSNR model is small and fast; it shouldn't take up so much memory. Can you tell me the memory usage with and without PSNR?
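To illustrate why the metric itself should be cheap: PSNR is just a logarithm of a mean-squared error, a per-batch arithmetic reduction with no parameters. A minimal pure-Python sketch of the formula, assuming an 8-bit peak value of 255:

```python
import math

def psnr(mse: float, peak: float = 255.0) -> float:
    """Peak signal-to-noise ratio in dB from a mean-squared error."""
    return 10.0 * math.log10(peak ** 2 / mse)

# An MSE of 100 against an 8-bit peak:
print(round(psnr(100.0), 2))  # → 28.13
```

So any large memory difference with and without PSNR must come from how the tensors feeding it are handled, not from the metric itself.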
When I comment out the PSNR call in the train function, the CUDA memory is as shown below.
Does this issue occur during training or during testing?
train() or validate()?
Only in train().
CUDA version? CUDNN version? PyTorch & Torchvision version?
In the current code, the PSNR calculation is not called in train(), only in validate(). Check the difference between your local code and the latest code on Git.
CUDA 11.1, cuDNN 8.0.5, PyTorch 1.11.0+cu113, Torchvision 0.12.0+cu113. I run this in Google Colab.
Yes, but I don't know why calling train() gives this error.
Give me your config.py file and I'll check it.
I edited your code in this repo; please help me check it: git@github.com:leminhhuy132/SRGAN_DACN.git
In config.py, line 46, batch_size is 256? That is very big.
I tried batch_size 16; it still got the error.
There may be multiple tasks running on the machine at the same time, or the machine may not have enough memory
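If the problem is fragmentation rather than raw capacity (the OOM message notes that reserved memory far exceeds allocated memory), PyTorch's caching-allocator option can be tried. A sketch of setting it, assuming it runs before the first CUDA allocation (in practice, before importing torch in the training script); the 128 MiB value is an arbitrary example, not a recommendation from this repo:

```python
import os

# Cap the size of splittable allocator blocks to reduce fragmentation.
# Must be set before CUDA is initialized to take effect.
os.environ["PYTORCH_CUDA_ALLOC_CONF"] = "max_split_size_mb:128"

print(os.environ["PYTORCH_CUDA_ALLOC_CONF"])
```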
Why do I get this error when I calculate PSNR during training, even though the configured batch_size is very small?
```
Load all datasets successfully.
Build SRResNet model successfully.
Define all loss functions successfully.
Define all optimizer functions successfully.
Check whether the pretrained model is restored...
Epoch: [1][   0/1989] Time 15.878 (15.878) Data  0.000 ( 0.000) Loss 0.267222 (0.267222)
Traceback (most recent call last):
  File "train_srresnet.py", line 463, in <module>
    main()
  File "train_srresnet.py", line 98, in main
    train_loss = train(model, train_prefetcher, pixel_criterion, optimizer, epoch, scaler, writer, psnr_model, ssim_model)
  File "train_srresnet.py", line 249, in train
    scaler.scale(loss).backward()
  File "/usr/local/lib/python3.7/dist-packages/torch/_tensor.py", line 363, in backward
    torch.autograd.backward(self, gradient, retain_graph, create_graph, inputs=inputs)
  File "/usr/local/lib/python3.7/dist-packages/torch/autograd/__init__.py", line 175, in backward
    allow_unreachable=True, accumulate_grad=True)  # Calls into the C++ engine to run the backward pass
RuntimeError: CUDA out of memory. Tried to allocate 288.00 MiB (GPU 0; 14.76 GiB total capacity; 11.70 GiB already allocated; 123.75 MiB free; 13.22 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation.  See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF
```
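One common cause of exactly this pattern (OOM only when a metric is computed during training, regardless of batch size) is computing the metric on tensors that are still attached to the autograd graph, which keeps the generator's activations alive. A hedged sketch of the fix; `psnr_detached` is a hypothetical helper, not the repo's `psnr_model`, and images are assumed scaled to [0, 1]:

```python
import torch

def psnr_detached(sr: torch.Tensor, hr: torch.Tensor) -> torch.Tensor:
    # Detach and disable grad so the metric does not retain the model's
    # autograd graph; otherwise each PSNR call during training keeps
    # activations alive and CUDA memory grows until backward() fails.
    with torch.no_grad():
        mse = torch.mean((sr.detach() - hr.detach()) ** 2)
        return 10.0 * torch.log10(1.0 / mse)  # peak value 1.0 for [0, 1] images

sr = torch.rand(1, 3, 8, 8, requires_grad=True)  # stand-in for model output
hr = torch.rand(1, 3, 8, 8)
value = psnr_detached(sr, hr)
```

If the metric must use the same tensor that feeds the loss, detaching only the metric's copy (as above) leaves the training step itself unchanged.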