Media-Smart / vedadet

A single stage object detection toolbox based on PyTorch
Apache License 2.0

RuntimeError: CUDA out of memory. #78

Open vokhidovhusan opened 2 years ago

vokhidovhusan commented 2 years ago

I'm running into a problem while training TinaFace. I tried to change the batch size, but I could not find where to reduce it. Is there any other way to solve the 'out of memory' issue? Thanks.

$ CUDA_VISIBLE_DEVICES="1" python tools/trainval.py configs/trainval/tinaface/tinaface_r50_fpn_bn.py
configs/trainval/tinaface/tinaface_r50_fpn_bn.py
2021-12-29 14:19:44,242 - vedadet - WARNING - EvalHook is not in modes ['train']
2021-12-29 14:19:44,243 - vedadet - INFO - Loading weights from torchvision://resnet50
2021-12-29 14:19:44,324 - vedadet - WARNING - The model and loaded state dict do not match exactly

unexpected key in source state_dict: backbone.fc.weight, backbone.fc.bias

missing keys in source state_dict: neck.0.lateral_convs.0.conv.weight, neck.0.lateral_convs.0.bn.weight, neck.0.lateral_convs.0.bn.bias, neck.0.lateral_convs.0.bn.running_mean, neck.0.lateral_convs.0.bn.running_var, neck.0.lateral_convs.1.conv.weight, neck.0.lateral_convs.1.bn.weight, neck.0.lateral_convs.1.bn.bias, neck.0.lateral_convs.1.bn.running_mean, neck.0.lateral_convs.1.bn.running_var, neck.0.lateral_convs.2.conv.weight, neck.0.lateral_convs.2.bn.weight, neck.0.lateral_convs.2.bn.bias, neck.0.lateral_convs.2.bn.running_mean, neck.0.lateral_convs.2.bn.running_var, neck.0.lateral_convs.3.conv.weight, neck.0.lateral_convs.3.bn.weight, neck.0.lateral_convs.3.bn.bias, neck.0.lateral_convs.3.bn.running_mean, neck.0.lateral_convs.3.bn.running_var, neck.0.fpn_convs.0.conv.weight, neck.0.fpn_convs.0.bn.weight, neck.0.fpn_convs.0.bn.bias, neck.0.fpn_convs.0.bn.running_mean, neck.0.fpn_convs.0.bn.running_var, neck.0.fpn_convs.1.conv.weight, neck.0.fpn_convs.1.bn.weight, neck.0.fpn_convs.1.bn.bias, neck.0.fpn_convs.1.bn.running_mean, neck.0.fpn_convs.1.bn.running_var, neck.0.fpn_convs.2.conv.weight, neck.0.fpn_convs.2.bn.weight, neck.0.fpn_convs.2.bn.bias, neck.0.fpn_convs.2.bn.running_mean, neck.0.fpn_convs.2.bn.running_var, neck.0.fpn_convs.3.conv.weight, neck.0.fpn_convs.3.bn.weight, neck.0.fpn_convs.3.bn.bias, neck.0.fpn_convs.3.bn.running_mean, neck.0.fpn_convs.3.bn.running_var, neck.0.fpn_convs.4.conv.weight, neck.0.fpn_convs.4.bn.weight, neck.0.fpn_convs.4.bn.bias, neck.0.fpn_convs.4.bn.running_mean, neck.0.fpn_convs.4.bn.running_var, neck.0.fpn_convs.5.conv.weight, neck.0.fpn_convs.5.bn.weight, neck.0.fpn_convs.5.bn.bias, neck.0.fpn_convs.5.bn.running_mean, neck.0.fpn_convs.5.bn.running_var, neck.1.level_convs.0.0.conv.weight, neck.1.level_convs.0.0.bn.weight, neck.1.level_convs.0.0.bn.bias, neck.1.level_convs.0.0.bn.running_mean, neck.1.level_convs.0.0.bn.running_var, neck.1.level_convs.0.1.conv.weight, neck.1.level_convs.0.1.bn.weight, 
neck.1.level_convs.0.1.bn.bias, neck.1.level_convs.0.1.bn.running_mean, neck.1.level_convs.0.1.bn.running_var, neck.1.level_convs.0.2.conv.weight, neck.1.level_convs.0.2.bn.weight, neck.1.level_convs.0.2.bn.bias, neck.1.level_convs.0.2.bn.running_mean, neck.1.level_convs.0.2.bn.running_var, neck.1.level_convs.0.3.conv.weight, neck.1.level_convs.0.3.bn.weight, neck.1.level_convs.0.3.bn.bias, neck.1.level_convs.0.3.bn.running_mean, neck.1.level_convs.0.3.bn.running_var, neck.1.level_convs.0.4.conv.weight, neck.1.level_convs.0.4.bn.weight, neck.1.level_convs.0.4.bn.bias, neck.1.level_convs.0.4.bn.running_mean, neck.1.level_convs.0.4.bn.running_var, bbox_head.cls_convs.0.conv.weight, bbox_head.cls_convs.0.bn.weight, bbox_head.cls_convs.0.bn.bias, bbox_head.cls_convs.0.bn.running_mean, bbox_head.cls_convs.0.bn.running_var, bbox_head.cls_convs.1.conv.weight, bbox_head.cls_convs.1.bn.weight, bbox_head.cls_convs.1.bn.bias, bbox_head.cls_convs.1.bn.running_mean, bbox_head.cls_convs.1.bn.running_var, bbox_head.cls_convs.2.conv.weight, bbox_head.cls_convs.2.bn.weight, bbox_head.cls_convs.2.bn.bias, bbox_head.cls_convs.2.bn.running_mean, bbox_head.cls_convs.2.bn.running_var, bbox_head.cls_convs.3.conv.weight, bbox_head.cls_convs.3.bn.weight, bbox_head.cls_convs.3.bn.bias, bbox_head.cls_convs.3.bn.running_mean, bbox_head.cls_convs.3.bn.running_var, bbox_head.reg_convs.0.conv.weight, bbox_head.reg_convs.0.bn.weight, bbox_head.reg_convs.0.bn.bias, bbox_head.reg_convs.0.bn.running_mean, bbox_head.reg_convs.0.bn.running_var, bbox_head.reg_convs.1.conv.weight, bbox_head.reg_convs.1.bn.weight, bbox_head.reg_convs.1.bn.bias, bbox_head.reg_convs.1.bn.running_mean, bbox_head.reg_convs.1.bn.running_var, bbox_head.reg_convs.2.conv.weight, bbox_head.reg_convs.2.bn.weight, bbox_head.reg_convs.2.bn.bias, bbox_head.reg_convs.2.bn.running_mean, bbox_head.reg_convs.2.bn.running_var, bbox_head.reg_convs.3.conv.weight, bbox_head.reg_convs.3.bn.weight, bbox_head.reg_convs.3.bn.bias, 
bbox_head.reg_convs.3.bn.running_mean, bbox_head.reg_convs.3.bn.running_var, bbox_head.retina_cls.weight, bbox_head.retina_cls.bias, bbox_head.retina_reg.weight, bbox_head.retina_reg.bias, bbox_head.retina_iou.weight, bbox_head.retina_iou.bias

/home/husan/anaconda3/envs/vedadet/lib/python3.9/site-packages/torch/nn/functional.py:3631: UserWarning: Default upsampling behavior when mode=bilinear is changed to align_corners=False since 0.4.0. Please specify align_corners=True if the old behavior is desired. See the documentation of nn.Upsample for details.
  warnings.warn(
2021-12-29 14:21:11,247 - vedadet - INFO - Epoch [1][100/3221] lr: 0.001043, loss_cls: 0.4956, loss_bbox: 1.0588, loss_iou: 0.6937, loss: 2.2481
2021-12-29 14:22:38,785 - vedadet - INFO - Epoch [1][200/3221] lr: 0.001718, loss_cls: 0.3853, loss_bbox: 0.8330, loss_iou: 0.6639, loss: 1.8822
2021-12-29 14:24:05,818 - vedadet - INFO - Epoch [1][300/3221] lr: 0.002393, loss_cls: 0.3949, loss_bbox: 0.8550, loss_iou: 0.6795, loss: 1.9294
2021-12-29 14:25:32,723 - vedadet - INFO - Epoch [1][400/3221] lr: 0.003068, loss_cls: 0.6944, loss_bbox: 0.8749, loss_iou: 0.6726, loss: 2.2419
2021-12-29 14:27:00,039 - vedadet - INFO - Epoch [1][500/3221] lr: 0.003743, loss_cls: 0.3498, loss_bbox: 0.5091, loss_iou: 0.5636, loss: 1.4225
2021-12-29 14:28:27,258 - vedadet - INFO - Epoch [1][600/3221] lr: 0.00375, loss_cls: 0.3247, loss_bbox: 0.7541, loss_iou: 0.6728, loss: 1.7517
2021-12-29 14:29:54,545 - vedadet - INFO - Epoch [1][700/3221] lr: 0.00375, loss_cls: 0.4938, loss_bbox: 0.7308, loss_iou: 0.6423, loss: 1.8669
2021-12-29 14:31:21,637 - vedadet - INFO - Epoch [1][800/3221] lr: 0.00375, loss_cls: 0.2444, loss_bbox: 0.6331, loss_iou: 0.6152, loss: 1.4927
2021-12-29 14:32:49,291 - vedadet - INFO - Epoch [1][900/3221] lr: 0.00375, loss_cls: 0.3112, loss_bbox: 0.7092, loss_iou: 0.6260, loss: 1.6463
2021-12-29 14:34:16,761 - vedadet - INFO - Epoch [1][1000/3221] lr: 0.00375, loss_cls: 0.2516, loss_bbox: 0.6487, loss_iou: 0.6169, loss: 1.5171
2021-12-29 14:35:44,477 - vedadet - INFO - Epoch [1][1100/3221] lr: 0.00375, loss_cls: 0.2114, loss_bbox: 0.4864, loss_iou: 0.5374, loss: 1.2352
2021-12-29 14:37:12,607 - vedadet - INFO - Epoch [1][1200/3221] lr: 0.00375, loss_cls: 0.1675, loss_bbox: 0.4839, loss_iou: 0.5422, loss: 1.1936
2021-12-29 14:38:40,031 - vedadet - INFO - Epoch [1][1300/3221] lr: 0.00375, loss_cls: 0.2287, loss_bbox: 0.5490, loss_iou: 0.5496, loss: 1.3273
Traceback (most recent call last):
  File "/home/husan/projects/face_detection/vedadet/tools/trainval.py", line 66, in <module>
    main()
  File "/home/husan/projects/face_detection/vedadet/tools/trainval.py", line 62, in main
    trainval(cfg, distributed, logger)
  File "/home/husan/projects/face_detection/vedadet/vedadet/assembler/trainval.py", line 86, in trainval
    looper.start(cfg.max_epochs)
  File "/home/husan/projects/face_detection/vedadet/vedacore/loopers/epoch_based_looper.py", line 29, in start
    self.epoch_loop(mode)
  File "/home/husan/projects/face_detection/vedadet/vedacore/loopers/epoch_based_looper.py", line 17, in epoch_loop
    self.cur_results[mode] = engine(data)
  File "/home/husan/anaconda3/envs/vedadet/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1102, in _call_impl
    return forward_call(*input, **kwargs)
  File "/home/husan/projects/face_detection/vedadet/vedacore/parallel/data_parallel.py", line 30, in forward
    return self.module(*inputs[0], **kwargs[0])
  File "/home/husan/anaconda3/envs/vedadet/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1102, in _call_impl
    return forward_call(*input, **kwargs)
  File "/home/husan/projects/face_detection/vedadet/vedadet/engines/train_engine.py", line 20, in forward
    return self.forward_impl(**data)
  File "/home/husan/projects/face_detection/vedadet/vedadet/engines/train_engine.py", line 29, in forward_impl
    losses = self.criterion.loss(feats, img_metas, gt_labels, gt_bboxes,
  File "/home/husan/projects/face_detection/vedadet/vedadet/criteria/iou_bbox_anchor_criterion.py", line 412, in loss
    cls_reg_targets = self.get_targets(
  File "/home/husan/projects/face_detection/vedadet/vedadet/criteria/iou_bbox_anchor_criterion.py", line 252, in get_targets
    results = multi_apply(
  File "/home/husan/projects/face_detection/vedadet/vedacore/misc/utils.py", line 16, in multi_apply
    return tuple(map(list, zip(*map_results)))
  File "/home/husan/projects/face_detection/vedadet/vedadet/criteria/iou_bbox_anchor_criterion.py", line 137, in _get_targets_single
    assign_result = self.assigner.assign(
  File "/home/husan/projects/face_detection/vedadet/vedadet/misc/bbox/assigners/max_iou_assigner.py", line 107, in assign
    overlaps = self.iou_calculator(gt_bboxes, bboxes)
  File "/home/husan/projects/face_detection/vedadet/vedadet/misc/bbox/iou_calculators/iou2d_calculator.py", line 32, in __call__
    return bbox_overlaps(bboxes1, bboxes2, mode, is_aligned)
  File "/home/husan/projects/face_detection/vedadet/vedadet/misc/bbox/bbox.py", line 79, in bbox_overlaps
    wh = (rb - lt).clamp(min=0)  # [rows, cols, 2]
RuntimeError: CUDA out of memory. Tried to allocate 80.00 MiB (GPU 0; 7.80 GiB total capacity; 6.34 GiB already allocated; 36.44 MiB free; 6.50 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation.  See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF
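
For reading the numbers in this error: the traceback ends in `bbox_overlaps`, where the comment `# [rows, cols, 2]` shows that the failing line builds a dense pairwise tensor, presumably one row per ground-truth box and one column per anchor. The following is a rough back-of-the-envelope sketch, not code from vedadet. It assumes float32 elements (the dtype is not shown in the log), and the helper name `pairwise_tensor_bytes` is hypothetical. The figures are copied from the message above; the allocator may round requests up, so the pair count is an upper bound.

```python
MIB = 1024 ** 2
GIB = 1024 ** 3


def pairwise_tensor_bytes(rows, cols, last_dim=2, bytes_per_elem=4):
    """Bytes for one dense [rows, cols, last_dim] tensor (float32 assumed)."""
    return rows * cols * last_dim * bytes_per_elem


# Figures copied from the RuntimeError message.
failed_alloc = 80 * MIB   # "Tried to allocate 80.00 MiB"
allocated = 6.34 * GIB    # "6.34 GiB already allocated"
reserved = 6.50 * GIB     # "6.50 GiB reserved in total by PyTorch"

# Largest rows * cols product that fits in the failed request.
max_pairs = failed_alloc // pairwise_tensor_bytes(1, 1)

# Memory PyTorch has reserved but is not currently using.
cache_slack_mib = (reserved - allocated) / MIB

print(max_pairs)               # 10485760 box/anchor pairs
print(round(cache_slack_mib))  # 164 (MiB)
```

Under these assumptions the reserved and allocated figures are close together, so the `max_split_size_mb` hint at the end of the message probably does not apply here. The intermediate scales with the number of ground-truth boxes times the number of anchors, so an image with many faces or a large training resolution can push a run over the limit well into an epoch, as happened here after iteration 1300. Lowering the per-GPU batch size or the training image size in the config is the more likely remedy. In mmdetection-style configs the batch size is usually the `samples_per_gpu` entry of the `data` dict; that `tinaface_r50_fpn_bn.py` uses the same name is an assumption to check against the file.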