bingykang / Fewshot_Detection

Few-shot Object Detection via Feature Reweighting
https://arxiv.org/abs/1812.01866

out of memory for base training #37

Closed infrontofme closed 4 years ago

infrontofme commented 4 years ago

I am reproducing the result using the instruction provided in the README file.

I am training the base model on one GeForce GTX 1080 Ti with 12GB of memory, and I modified batch_size=32.

After it runs for about 20 epochs, a CUDA runtime error occurs.

2020-05-29 13:14:00 epoch 20/177, processed 291080 samples, lr 0.000333
291112: nGT 77, recall 66, proposals 235, loss: x 2.222131, y 2.640358, w 2.185382, h 1.743314, conf 52.697956, cls 99.193832, total 160.682968
291144: nGT 77, recall 62, proposals 243, loss: x 1.478266, y 1.245305, w 2.208532, h 0.684470, conf 43.636837, cls 76.594849, total 125.848259
291176: nGT 70, recall 63, proposals 243, loss: x 1.873798, y 1.179447, w 1.839549, h 1.049649, conf 52.927620, cls 101.017876, total 159.887939
291208: nGT 75, recall 67, proposals 175, loss: x 1.820341, y 1.697263, w 1.052775, h 0.799489, conf 50.626663, cls 113.858749, total 169.855286
291240: nGT 105, recall 93, proposals 253, loss: x 3.521058, y 2.495901, w 3.214825, h 2.059216, conf 74.303398, cls 172.366638, total 257.961029
THCudaCheck FAIL file=/pytorch/torch/lib/THC/generic/THCStorage.cu line=58 error=2 : out of memory
Traceback (most recent call last):
  File "train_meta.py", line 325, in <module>
    train(epoch)
  File "train_meta.py", line 223, in train
    loss.backward()
  File "/home/super/anaconda3/envs/torch0.3.1/lib/python2.7/site-packages/torch/autograd/variable.py", line 167, in backward
    torch.autograd.backward(self, gradient, retain_graph, create_graph, retain_variables)
  File "/home/super/anaconda3/envs/torch0.3.1/lib/python2.7/site-packages/torch/autograd/__init__.py", line 99, in backward
    variables, grad_variables, retain_graph)
RuntimeError: cuda runtime error (2) : out of memory at /pytorch/torch/lib/THC/generic/THCStorage.cu:5

How can I solve this problem? Thanks :)

infrontofme commented 4 years ago

I solved it by adding `del loss, output` after `loss.backward()`, as well as reducing the batch size.
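For reference, the fix looks roughly like this inside the training loop. This is a minimal sketch, not the repo's exact code: `net`, `optimizer`, `criterion`, and `loader` are stand-ins for the objects `train_meta.py` builds, and `torch.cuda.empty_cache()` is an optional extra step beyond what was described above.

```python
import torch
import torch.nn as nn

def train_one_epoch(net, optimizer, criterion, loader, device):
    """One epoch with explicit cleanup of graph-holding tensors.

    Sketch only: net/optimizer/criterion/loader are placeholders for the
    objects built in train_meta.py, not the repo's actual loop.
    """
    net.train()
    total = 0.0
    for data, target in loader:
        data, target = data.to(device), target.to(device)
        optimizer.zero_grad()
        output = net(data)
        loss = criterion(output, target)
        loss.backward()
        optimizer.step()
        total += loss.item()
        # Drop the references that keep the autograd graph and activations
        # alive, so they can be freed before the next iteration allocates.
        del loss, output
    if device.type == "cuda":
        # Optionally return cached blocks to the allocator between epochs.
        torch.cuda.empty_cache()
    return total
```

Deleting `loss` and `output` matters because any Python reference to them keeps the whole computation graph resident on the GPU; releasing them each iteration prevents the gradual memory growth that shows up only after many epochs.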

zhoushuang66 commented 4 years ago

That method did not work for me. Out of memory still appears every 20 epochs or so. Is there any other way? Thanks.

zhoushuang66 commented 4 years ago

Thanks a lot, it is solved.