bingykang / Fewshot_Detection

Few-shot Object Detection via Feature Reweighting
https://arxiv.org/abs/1812.01866

out of memory for fine tuning #22

Open quanvuong opened 4 years ago

quanvuong commented 4 years ago

I am reproducing the results using the instructions provided in the README file.

I was able to train the base model and obtain an AP of 0.6862, which matches what the paper reports. However, when I tried to run the fine-tuning command, the process exits with an out-of-memory error during the backward pass.

I am training on four GeForce GTX 1080 Ti GPUs, each with roughly 12 GB of memory. Did you use GPUs with more memory, or is something weird happening?

quanvuong commented 4 years ago

Adding del loss and torch.cuda.empty_cache() solves this problem.

quanvuong commented 4 years ago

Actually, using empty_cache() makes the GPU operations really slow (the fine-tuning step would take about 60 hours). Is there another workaround?

If I simply do del loss without emptying the cache, the out-of-memory error still happens.

shenglih commented 4 years ago

torch.cuda.empty_cache()

Hi @quanvuong, would you mind elaborating on where to add these? Much appreciated!

quanvuong commented 4 years ago

You can add them right after loss.backward().
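
Something like this; the loop and variable names below are only illustrative, not the exact code from this repo's training script:

    import torch

    def train_one_epoch(model, criterion, optimizer, train_loader):
        model.train()
        for data, target in train_loader:
            optimizer.zero_grad()
            output = model(data.cuda())
            loss = criterion(output, target.cuda())
            loss.backward()
            optimizer.step()

            # Drop the references so the autograd graph can be freed, then
            # return cached blocks to the CUDA driver. empty_cache() is what
            # slows training down, since memory has to be re-requested from
            # the driver on every iteration.
            del loss, output
            torch.cuda.empty_cache()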

thsunkid commented 4 years ago

Hi @quanvuong,

Have you solved this problem? I got a similar one when evaluating the baseline model: a CUDA out-of-memory error caused by data accumulating across iterations. I am using torch 0.4.1. I already tried torch.cuda.empty_cache() together with del metax, mask, but it doesn't help.

thsunkid commented 4 years ago


In my case, I used torch v0.4.1 instead of v0.3.1 like the author. I solved the problem by wrapping validation in with torch.no_grad(): in PyTorch 0.4 the volatile flag on Variable no longer disables gradient tracking, so the graph and activations accumulate in GPU memory during evaluation.
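
Roughly, the evaluation loop just needs to run under no_grad. This is only a minimal sketch with made-up function and variable names, not the repo's exact eval code:

    import torch

    def validate(model, test_loader):
        model.eval()
        # On PyTorch >= 0.4, volatile=True on Variable is a no-op, so the
        # autograd graph stays alive unless autograd is disabled explicitly.
        with torch.no_grad():
            for data, target in test_loader:
                output = model(data.cuda())
                # ... compute detections / mAP here ...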

Fangyi-Chen commented 4 years ago

Based on my understanding, there are two reasons for the out-of-memory error during tuning:

  1. During the tuning phase, 20 classes instead of 15 are fed into the reweighting net, which increases GPU memory usage.
  2. During the tuning phase, multi-scale training means the input images can be as large as 600+ pixels, which leads to dynamic (and sometimes much higher) memory usage.

Possible solutions:

  1. Decrease the batch size a little.
  2. Resize the input images carefully, i.e. cap the maximum multi-scale size (see the sketch below).
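
For the second point, one option is to cap the upper end of the multi-scale range in the data loading code. A rough sketch, assuming a pytorch-yolo2-style loader that samples a resolution which is a multiple of 32; the exact place and numbers in this repo may differ:

    import random

    # Assumed bounds: typical multi-scale YOLOv2 training samples 320-608.
    # Lowering the maximum bounds the per-batch memory footprint.
    MIN_SIZE = 320
    MAX_SIZE = 480  # instead of 608; tune for your GPUs

    def sample_train_size(stride=32):
        """Pick a training resolution that is a multiple of stride."""
        return random.choice(range(MIN_SIZE, MAX_SIZE + 1, stride))

Reducing the batch size (or raising subdivisions, if the darknet-style cfg files expose it) trades speed for memory in the same way.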