NVIDIA / ContrastiveLosses4VRD

Implementation for the CVPR2019 paper "Graphical Contrastive Losses for Scene Graph Generation"

out of memory error when training #1

Closed ghost closed 5 years ago

ghost commented 5 years ago

Following the README, to train the relationship network using a VGG16 backbone, I ran:

python tools/train_net_step_rel.py --dataset vg --cfg configs/vg/e2e_faster_rcnn_VGG16_8_epochs_vg_v3_default_node_contrastive_loss_w_so_p_aware_margin_point2_so_weight_point5_no_spt.yaml --nw 8 --use_tfboard

and hit an out-of-memory error:

THCudaCheck FAIL file=/opt/conda/conda-bld/pytorch_1524586445097/work/aten/src/THC/generic/THCStorage.cu line=58 error=2 : out of memory

Someone advised reducing the batch size, but the batch size is already at the minimum, which equals the number of GPUs I use. I don't know of any other way to avoid this error.
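For context, the batching arithmetic in Detectron.pytorch-style training scripts usually works out as below (a sketch of my understanding, not this repo's exact code; the variable names are mine):

```python
# Assumed Detectron.pytorch-style batching: the batch is split evenly
# across GPUs, so the smallest legal batch is one image per GPU.
num_gpus = 8                           # e.g. torch.cuda.device_count()
batch_size = 8                         # must be divisible by num_gpus
assert batch_size % num_gpus == 0
ims_per_gpu = batch_size // num_gpus   # -> 1, already the minimum
```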

ghost commented 5 years ago

Here are the detailed logs:

INFO train_net_step_rel.py: 406: Training starts !
INFO net_rel.py:  48: Changing learning rate 0.000000 -> 0.003333
[Jun06-13-05-06_5350d9cd6e33_step_with_prd_cls_v3][e2e_faster_rcnn_VGG16_8_epochs_vg_v3_default_node_contrastive_loss_w_so_p_aware_margin_point2_so_weight_point5_no_spt.yaml][Step 1 / 62723]
                loss: 2.861859, lr: 0.003333 backbone_lr: 0.000333 time: 10.651057, eta: 7 days, 17:34:26
                accuracy_cls: 0.824219, accuracy_cls_ttl: 0.654297
                loss_rpn_cls: 0.110185, loss_rpn_bbox: 0.164759, loss_cls: 0.584456, loss_bbox: 0.230197, loss_cls_ttl: 1.230690, loss_contrastive_sbj: 0.254844, loss_contrastive_obj: 0.152084, loss_so_contrastive_sbj: 0.042665, loss_so_contrastive_obj: 0.064662, loss_p_contrastive_sbj: 0.015739, loss_p_contrastive_obj: 0.011577
THCudaCheck FAIL file=/opt/conda/conda-bld/pytorch_1524586445097/work/aten/src/THC/generic/THCStorage.cu line=58 error=2 : out of memory
INFO train_net_step_rel.py: 476: Save ckpt on exception ...
INFO train_net_step_rel.py: 139: save model: Outputs/e2e_faster_rcnn_VGG16_8_epochs_vg_v3_default_node_contrastive_loss_w_so_p_aware_margin_point2_so_weight_point5_no_spt/Jun06-13-05-06_5350d9cd6e33_step_with_prd_cls_v3/ckpt/model_step1.pth
INFO train_net_step_rel.py: 478: Save ckpt done.
Traceback (most recent call last):
  File "tools/train_net_step_rel.py", line 461, in main
    loss.backward()
  File "/opt/conda/lib/python3.6/site-packages/torch/tensor.py", line 93, in backward
    torch.autograd.backward(self, gradient, retain_graph, create_graph)
  File "/opt/conda/lib/python3.6/site-packages/torch/autograd/__init__.py", line 89, in backward
    allow_unreachable=True)  # allow_unreachable flag
  File "/opt/conda/lib/python3.6/site-packages/torch/autograd/function.py", line 76, in apply
    return self._forward_cls.backward(self, *args)
  File "/home/xiefangyuan/workspace/codes/ContrastiveLosses4VRD/Detectron_pytorch/lib/nn/parallel/_functions.py", line 28, in backward
    return (None,) + ReduceAddCoalesced.apply(ctx.input_device, ctx.num_inputs, *grad_outputs)
  File "/home/xiefangyuan/workspace/codes/ContrastiveLosses4VRD/Detectron_pytorch/lib/nn/parallel/_functions.py", line 39, in forward
    return comm.reduce_add_coalesced(grads, destination)
  File "/opt/conda/lib/python3.6/site-packages/torch/cuda/comm.py", line 120, in reduce_add_coalesced
    flat_result = reduce_add(flat_tensors, destination)
  File "/opt/conda/lib/python3.6/site-packages/torch/cuda/comm.py", line 74, in reduce_add
    result = inp.new(device=destination).resize_as_(inp).zero_()
RuntimeError: cuda runtime error (2) : out of memory at /opt/conda/conda-bld/pytorch_1524586445097/work/aten/src/THC/generic/THCStorage.cu:58
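
For what it's worth, the trace shows the allocation failing inside torch.cuda.comm.reduce_add, i.e. while DataParallel sums the per-GPU gradients onto a single destination device during loss.backward(). A standalone illustration of that reduction step (not this repo's code):

```python
import torch
from torch.cuda import comm

# DataParallel's backward gathers gradients from every GPU and sums them
# on one destination device; the result buffer allocated there is where
# the OOM in the trace above occurs.
grads = [torch.ones(4, device='cuda:%d' % i)
         for i in range(torch.cuda.device_count())]
total = comm.reduce_add(grads, destination=0)  # allocates on cuda:0
```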
Prudhvinik1 commented 5 years ago

@Xie-Fangyuan Did you find a solution to the out-of-memory issue? I am facing the same problem.

jz462 commented 5 years ago

Hi @Xie-Fangyuan, unfortunately the most direct solution is to use a bigger GPU (a 16GB GPU should be enough). That said, this code always freezes the object detector's weights when training the relationship detector, so in principle you could modify it to run detection once, save all detected objects to a file, and then simply load those detections when training the relationship detector; I'm sure that would reduce memory usage significantly. It would take quite a lot of work, and I'm afraid I don't have time for it right now, but it's definitely a feature I'll consider adding later on.
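
A rough sketch of that caching idea (everything here, the detector call, its output format, and the file layout, is hypothetical, not this repo's actual API):

```python
import pickle
import torch

def cache_detections(detector, dataloader, out_path='detections.pkl'):
    """Pass 1: run the frozen detector once over the dataset and cache its
    outputs per image, so relationship training never needs to hold the
    detector (and its activations) in GPU memory."""
    detector.eval()
    cache = {}
    with torch.no_grad():
        for images, image_ids in dataloader:
            for img, img_id in zip(images, image_ids):
                dets = detector(img.unsqueeze(0).cuda())  # e.g. boxes/scores/labels
                cache[img_id] = {k: v.cpu() for k, v in dets.items()}
    with open(out_path, 'wb') as f:
        pickle.dump(cache, f)

def load_detections(path='detections.pkl'):
    """Pass 2: during relationship training, look detections up here instead
    of running the detector's forward pass."""
    with open(path, 'rb') as f:
        return pickle.load(f)
```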