Here are the detailed logs:
INFO train_net_step_rel.py: 406: Training starts !
INFO net_rel.py: 48: Changing learning rate 0.000000 -> 0.003333
[Jun06-13-05-06_5350d9cd6e33_step_with_prd_cls_v3][e2e_faster_rcnn_VGG16_8_epochs_vg_v3_default_node_contrastive_loss_w_so_p_aware_margin_point2_so_weight_point5_no_spt.yaml][Step 1 / 62723]
loss: 2.861859, lr: 0.003333 backbone_lr: 0.000333 time: 10.651057, eta: 7 days, 17:34:26
accuracy_cls: 0.824219, accuracy_cls_ttl: 0.654297
loss_rpn_cls: 0.110185, loss_rpn_bbox: 0.164759, loss_cls: 0.584456, loss_bbox: 0.230197, loss_cls_ttl: 1.230690, loss_contrastive_sbj: 0.254844, loss_contrastive_obj: 0.152084, loss_so_contrastive_sbj: 0.042665, loss_so_contrastive_obj: 0.064662, loss_p_contrastive_sbj: 0.015739, loss_p_contrastive_obj: 0.011577
THCudaCheck FAIL file=/opt/conda/conda-bld/pytorch_1524586445097/work/aten/src/THC/generic/THCStorage.cu line=58 error=2 : out of memory
INFO train_net_step_rel.py: 476: Save ckpt on exception ...
INFO train_net_step_rel.py: 139: save model: Outputs/e2e_faster_rcnn_VGG16_8_epochs_vg_v3_default_node_contrastive_loss_w_so_p_aware_margin_point2_so_weight_point5_no_spt/Jun06-13-05-06_5350d9cd6e33_step_with_prd_cls_v3/ckpt/model_step1.pth
INFO train_net_step_rel.py: 478: Save ckpt done.
Traceback (most recent call last):
  File "tools/train_net_step_rel.py", line 461, in main
    loss.backward()
  File "/opt/conda/lib/python3.6/site-packages/torch/tensor.py", line 93, in backward
    torch.autograd.backward(self, gradient, retain_graph, create_graph)
  File "/opt/conda/lib/python3.6/site-packages/torch/autograd/__init__.py", line 89, in backward
    allow_unreachable=True) # allow_unreachable flag
  File "/opt/conda/lib/python3.6/site-packages/torch/autograd/function.py", line 76, in apply
    return self._forward_cls.backward(self, *args)
  File "/home/xiefangyuan/workspace/codes/ContrastiveLosses4VRD/Detectron_pytorch/lib/nn/parallel/_functions.py", line 28, in backward
    return (None,) + ReduceAddCoalesced.apply(ctx.input_device, ctx.num_inputs, *grad_outputs)
  File "/home/xiefangyuan/workspace/codes/ContrastiveLosses4VRD/Detectron_pytorch/lib/nn/parallel/_functions.py", line 39, in forward
    return comm.reduce_add_coalesced(grads, destination)
  File "/opt/conda/lib/python3.6/site-packages/torch/cuda/comm.py", line 120, in reduce_add_coalesced
    flat_result = reduce_add(flat_tensors, destination)
  File "/opt/conda/lib/python3.6/site-packages/torch/cuda/comm.py", line 74, in reduce_add
    result = inp.new(device=destination).resize_as_(inp).zero_()
RuntimeError: cuda runtime error (2) : out of memory at /opt/conda/conda-bld/pytorch_1524586445097/work/aten/src/THC/generic/THCStorage.cu:58
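For readers skimming the log: the OOM is raised while DataParallel sums gradients across GPUs inside loss.backward(), and the script catches the exception and saves a checkpoint before exiting (the "Save ckpt on exception" lines above). Below is a minimal, self-contained sketch of that save-on-OOM pattern; the model, optimizer, and save_ckpt details are illustrative placeholders, not the repository's exact code.

import torch
import torch.nn as nn

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model = nn.Linear(10, 1).to(device)
optimizer = torch.optim.SGD(model.parameters(), lr=0.003333)

def save_ckpt(step):
    # Mirrors the "save model: .../ckpt/model_step1.pth" line in the log.
    torch.save({"step": step,
                "model": model.state_dict(),
                "optimizer": optimizer.state_dict()},
               "model_step%d.pth" % step)

step = 1
try:
    x = torch.randn(4, 10, device=device)
    loss = model(x).pow(2).mean()
    loss.backward()
    optimizer.step()
except RuntimeError as e:
    # A CUDA OOM surfaces as a RuntimeError whose message contains
    # "out of memory"; save progress before re-raising.
    if "out of memory" in str(e):
        print("Save ckpt on exception ...")
        save_ckpt(step)
    raise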
@Xie-Fangyuan Did you find a solution to the out-of-memory issue? I am facing the same one.
Hi @Xie-Fangyuan, unfortunately the most direct solution is to use a bigger GPU (a 16 GB GPU should be enough). That said, this code always freezes the object detector's weights while training the relationship detector, so in principle you could modify it to detect and save all objects to a file once, then simply load those detections when training the relationship detector; I'm sure that would reduce your memory usage significantly. It certainly requires quite a lot of work, and I'm afraid I don't have the time for it right now, but it is definitely a feature I'll consider adding later on.
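Here is a rough, hypothetical sketch of that precompute-and-cache idea; `object_detector`, `dataset`, and the cache format are placeholders, and wiring this into the repo's data loader would take more work, as noted above.

import pickle
import torch

def cache_detections(object_detector, dataset, cache_path):
    # The detector's weights are frozen during relationship training, so its
    # outputs per image never change and can be computed once, up front.
    object_detector.eval()
    cache = {}
    with torch.no_grad():  # no gradients needed, which also saves memory
        for image_id, image in dataset:
            boxes, labels, scores = object_detector(image)
            cache[image_id] = {"boxes": boxes.cpu(),
                               "labels": labels.cpu(),
                               "scores": scores.cpu()}
    with open(cache_path, "wb") as f:
        pickle.dump(cache, f)

def load_detections(cache_path):
    # At relationship-training time, read boxes from disk instead of running
    # the detector, keeping its activations out of GPU memory entirely.
    with open(cache_path, "rb") as f:
        return pickle.load(f)

The cost is one detection pass over the dataset plus disk space; the payoff is that the detector's forward and backward activations no longer compete with the relationship branch for GPU memory.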
Following the README step "To train our relationship network using a VGG16 backbone, run ...", I met an out-of-memory error:
Someone advised reducing the batch size, but the batch size is already at its minimum, which equals the number of GPUs I use. I just don't know any other way to avoid this error.
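For reference, a hedged sketch of one remaining knob besides the batch size: shrinking the input resolution. The config keys below (TRAIN.IMS_PER_BATCH, TRAIN.SCALES, TRAIN.MAX_SIZE) and the merge_cfg_from_list helper are assumed from the Detectron.pytorch codebase this repo builds on (see the Detectron_pytorch paths in the traceback); verify the exact names and defaults in lib/core/config.py before relying on this.

# Assumed Detectron.pytorch-style config API; not code from this repo.
from core.config import cfg, merge_cfg_from_list

merge_cfg_from_list([
    'TRAIN.IMS_PER_BATCH', '1',  # one image per GPU: already the minimum,
                                 # so total batch = NUM_GPUS, as noted above
    'TRAIN.SCALES', '(400,)',    # e.g. shrink from the common (600,) default
    'TRAIN.MAX_SIZE', '667',     # e.g. shrink from the common 1000 default
])
print(cfg.NUM_GPUS * cfg.TRAIN.IMS_PER_BATCH)  # effective batch size

Smaller inputs mean smaller feature maps and activations, which is usually the biggest memory saving after the batch size itself; the trade-off is some accuracy loss, especially on small objects.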