fanq15 / FewX

FewX is an open-source toolbox on top of Detectron2 for data-limited instance-level recognition tasks.
https://github.com/fanq15/FewX
MIT License

How much memory is needed for inference? #51

Open huangluyao opened 3 years ago

huangluyao commented 3 years ago

My graphics card is a GTX 1660 Ti with 6 GB of memory. When I run this code it reports an error: RuntimeError: CUDA out of memory. Tried to allocate 1.00 GiB (GPU 0; 5.81 GiB total capacity; 2.90 GiB already allocated; 420.50 MiB free; 3.84 GiB reserved in total by PyTorch)

selimlouis commented 3 years ago

I have a similar problem. I just want to test the whole thing on my GTX 970 with 4 GB of memory.

I get:

Traceback (most recent call last):
  File "fsod_train_net.py", line 118, in <module>
    args=(args,),
  File "/home/selim/anaconda3/envs/FewX/lib/python3.7/site-packages/detectron2/engine/launch.py", line 62, in launch
    main_func(*args)
  File "fsod_train_net.py", line 106, in main
    return trainer.train()
  File "/home/selim/anaconda3/envs/FewX/lib/python3.7/site-packages/detectron2/engine/defaults.py", line 431, in train
    super().train(self.start_iter, self.max_iter)
  File "/home/selim/anaconda3/envs/FewX/lib/python3.7/site-packages/detectron2/engine/train_loop.py", line 138, in train
    self.run_step()
  File "/home/selim/anaconda3/envs/FewX/lib/python3.7/site-packages/detectron2/engine/defaults.py", line 441, in run_step
    self._trainer.run_step()
  File "/home/selim/anaconda3/envs/FewX/lib/python3.7/site-packages/detectron2/engine/train_loop.py", line 232, in run_step
    loss_dict = self.model(data)
  File "/home/selim/anaconda3/envs/FewX/lib/python3.7/site-packages/torch/nn/modules/module.py", line 727, in _call_impl
    result = self.forward(*input, **kwargs)
  File "/home/selim/FewShot/FewX/fewx/modeling/fsod/fsod_rcnn.py", line 153, in forward
    support_features = self.backbone(support_images)
  File "/home/selim/anaconda3/envs/FewX/lib/python3.7/site-packages/torch/nn/modules/module.py", line 727, in _call_impl
    result = self.forward(*input, **kwargs)
  File "/home/selim/anaconda3/envs/FewX/lib/python3.7/site-packages/detectron2/modeling/backbone/resnet.py", line 444, in forward
    x = self.stem(x)
  File "/home/selim/anaconda3/envs/FewX/lib/python3.7/site-packages/torch/nn/modules/module.py", line 727, in _call_impl
    result = self.forward(*input, **kwargs)
  File "/home/selim/anaconda3/envs/FewX/lib/python3.7/site-packages/detectron2/modeling/backbone/resnet.py", line 355, in forward
    x = self.conv1(x)
  File "/home/selim/anaconda3/envs/FewX/lib/python3.7/site-packages/torch/nn/modules/module.py", line 727, in _call_impl
    result = self.forward(*input, **kwargs)
  File "/home/selim/anaconda3/envs/FewX/lib/python3.7/site-packages/detectron2/layers/wrappers.py", line 88, in forward
    x = self.norm(x)
  File "/home/selim/anaconda3/envs/FewX/lib/python3.7/site-packages/torch/nn/modules/module.py", line 727, in _call_impl
    result = self.forward(*input, **kwargs)
  File "/home/selim/anaconda3/envs/FewX/lib/python3.7/site-packages/detectron2/layers/batch_norm.py", line 65, in forward
    eps=self.eps,
  File "/home/selim/anaconda3/envs/FewX/lib/python3.7/site-packages/torch/nn/functional.py", line 2058, in batch_norm
    training, momentum, eps, torch.backends.cudnn.enabled
RuntimeError: CUDA out of memory. Tried to allocate 1000.00 MiB (GPU 0; 3.94 GiB total capacity; 2.15 GiB already allocated; 340.25 MiB free; 2.79 GiB reserved in total by PyTorch)

I tried halving the BATCH_SIZE_PER_IMAGE and IMS_PER_BATCH settings in the config, but I still run into memory problems. I don't want to make them too small, since I think that would hurt the results. Not an expert though.

Did anyone find a solution?
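
For anyone trying the same settings, here is a minimal sketch of how those keys could be lowered without editing the config file, assuming fsod_train_net.py follows Detectron2's standard train_net.py pattern where trailing KEY VALUE pairs override the config. The config path, the ROI_HEADS scope for BATCH_SIZE_PER_IMAGE, and the halved values are placeholders, not the repo's exact numbers:

    # Sketch only: replace the config path with whichever FSOD config you actually use.
    python3 fsod_train_net.py --num-gpus 1 \
        --config-file configs/fsod/your_config.yaml \
        SOLVER.IMS_PER_BATCH 2 \
        MODEL.ROI_HEADS.BATCH_SIZE_PER_IMAGE 64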

selimlouis commented 3 years ago

OK, so I continued trying to get it to work.

I had success after setting SOLVER.IMS_PER_BATCH to 1 in configs/fsod/Base-FSOD-C4.yaml.

I did not run a complete training process, since it would have taken 2 days and 11 hours, but it started training without issues. Hope this helps someone else too.
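
For reference, the edit amounts to something like this in configs/fsod/Base-FSOD-C4.yaml (only the changed key is shown; everything else in your copy of the file stays as it is):

    # configs/fsod/Base-FSOD-C4.yaml -- only the changed key is shown
    SOLVER:
      IMS_PER_BATCH: 1   # one image per step so a ~4 GB GPU can fit the model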

xiaohei1001 commented 2 years ago

It depends on your support set. Maybe you can try making RPN.POST_NMS_TOPK_TEST smaller.
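
As a sketch, the change would look roughly like this in the config (Detectron2's MODEL.RPN.POST_NMS_TOPK_TEST defaults to 1000; the value below is just an example, and how much it can be lowered without hurting accuracy depends on your setup):

    # Sketch: keep fewer proposals after NMS at test time to save GPU memory.
    # 1000 is the Detectron2 default; 500 is an arbitrary example value.
    MODEL:
      RPN:
        POST_NMS_TOPK_TEST: 500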