PaddlePaddle / PaddleDetection

Object Detection toolkit based on PaddlePaddle. It supports object detection, instance segmentation, multiple object tracking and real-time multi-person keypoint detection.

Mask evaluation CUDA OOM because of original image size before augmentation (downscaled) #8821

Open ExtReMLapin opened 7 months ago

ExtReMLapin commented 7 months ago

Search before asking

Bug Component

Validation

Describe the Bug

Training on detection works fine with my own dataset; however, when switching from cascade_rcnn_r50_vd_fpn_ssld_2x to cascade_mask_rcnn_r50_vd_fpn_ssld_2x it hits a CUDA OOM on the first eval.

All the batch sizes are set to the lowest possible values to reduce VRAM usage; it uses only 7/40 GB of VRAM during training.

When the first epoch finishes, VRAM usage jumps from ~7 GB to 20-38 GB and then it hits a CUDA OOM.

[02/20 16:34:57] ppdet.engine INFO: Epoch: [0] [1600/1640] learning_rate: 0.010000 loss_rpn_cls: 0.110011 loss_rpn_reg: 0.069790 loss_bbox_cls_stage0: 0.184521 loss_bbox_reg_stage0: 0.167128 loss_bbox_cls_stage1: 0.110849 loss_bbox_reg_stage1: 0.149792 loss_bbox_cls_stage2: 0.059931 loss_bbox_reg_stage2: 0.061344 loss_mask: 0.533583 loss: 1.436356 eta: 10:23:44 batch_cost: 0.1259 data_cost: 0.0339 ips: 7.9413 images/s
[02/20 16:35:01] ppdet.engine INFO: Epoch: [0] [1620/1640] learning_rate: 0.010000 loss_rpn_cls: 0.074318 loss_rpn_reg: 0.062690 loss_bbox_cls_stage0: 0.186860 loss_bbox_reg_stage0: 0.196775 loss_bbox_cls_stage1: 0.121721 loss_bbox_reg_stage1: 0.184845 loss_bbox_cls_stage2: 0.070398 loss_bbox_reg_stage2: 0.099984 loss_mask: 0.485207 loss: 1.514559 eta: 10:27:12 batch_cost: 0.2233 data_cost: 0.1298 ips: 4.4790 images/s
[02/20 16:35:06] ppdet.utils.checkpoint INFO: Save checkpoint: output
EvalReader 14
loading annotations into memory...
Done (t=0.06s)
creating index...
index created!
[02/20 16:35:07] ppdet.data.source.coco INFO: Load [702 samples valid, 0 samples invalid] in file dataset/fp_zones/val.json.
self.use_shared_memory  False
loading annotations into memory...
Done (t=0.07s)
creating index...
index created!
[02/20 16:35:09] ppdet.engine INFO: Eval iter: 0
Traceback (most recent call last):
  File "/opt/paddlepaddledetect/PaddleDetection/tools/train.py", line 209, in <module>
    main()
  File "/opt/paddlepaddledetect/PaddleDetection/tools/train.py", line 205, in main
    run(FLAGS, cfg)
  File "/opt/paddlepaddledetect/PaddleDetection/tools/train.py", line 158, in run
    trainer.train(FLAGS.eval)
  File "/opt/paddlepaddledetect/PaddleDetection/ppdet/engine/trainer.py", line 639, in train
    self._eval_with_loader(self._eval_loader)
  File "/opt/paddlepaddledetect/PaddleDetection/ppdet/engine/trainer.py", line 672, in _eval_with_loader
    outs = self.model(data)
  File "/opt/paddlepaddledetect/venv/lib/python3.10/site-packages/paddle/nn/layer/layers.py", line 1254, in __call__
    return self.forward(*inputs, **kwargs)
  File "/opt/paddlepaddledetect/PaddleDetection/ppdet/modeling/architectures/meta_arch.py", line 76, in forward
    outs.append(self.get_pred())
  File "/opt/paddlepaddledetect/PaddleDetection/ppdet/modeling/architectures/cascade_rcnn.py", line 136, in get_pred
    bbox_pred, bbox_num, mask_pred = self._forward()
  File "/opt/paddlepaddledetect/PaddleDetection/ppdet/modeling/architectures/cascade_rcnn.py", line 120, in _forward
    mask_pred = self.mask_post_process(mask_out, bbox_pred, bbox_num,
  File "/opt/paddlepaddledetect/PaddleDetection/ppdet/modeling/post_process.py", line 250, in __call__
    pred_mask = paste_mask(mask_out_i[:, None, :, :],
  File "/opt/paddlepaddledetect/PaddleDetection/ppdet/modeling/post_process.py", line 679, in paste_mask
    grid = paddle.stack([gx, gy], axis=3)
  File "/opt/paddlepaddledetect/venv/lib/python3.10/site-packages/paddle/tensor/manipulation.py", line 1842, in stack
    return _C_ops.stack(x, axis)
MemoryError:

--------------------------------------
C++ Traceback (most recent call last):
--------------------------------------
0   stack_ad_func(std::vector<paddle::Tensor, std::allocator<paddle::Tensor> > const&, int)
1   paddle::experimental::stack(std::vector<paddle::Tensor, std::allocator<paddle::Tensor> > const&, int)
2   void phi::funcs::LaunchStackKernel<phi::GPUContext, float, long, (phi::funcs::SegmentedArraySize)4>(phi::GPUContext const&, long, long, long, std::vector<phi::DenseTensor const*, std::allocator<phi::DenseTensor const*> > const&, phi::DenseTensor*)
3   float* phi::DeviceContext::Alloc<float>(phi::TensorBase*, unsigned long, bool) const
4   phi::DeviceContext::Impl::Alloc(phi::TensorBase*, phi::Place const&, phi::DataType, unsigned long, bool, bool) const
5   phi::DenseTensor::AllocateFrom(phi::Allocator*, phi::DataType, unsigned long, bool)
6   paddle::memory::allocation::Allocator::Allocate(unsigned long)
7   paddle::memory::allocation::StatAllocator::AllocateImpl(unsigned long)
8   paddle::memory::allocation::Allocator::Allocate(unsigned long)
9   paddle::memory::allocation::Allocator::Allocate(unsigned long)
10  paddle::memory::allocation::Allocator::Allocate(unsigned long)
11  paddle::memory::allocation::CUDAAllocator::AllocateImpl(unsigned long)
12  std::string phi::enforce::GetCompleteTraceBackString<std::string >(std::string&&, char const*, int)
13  phi::enforce::GetCurrentTraceBackString[abi:cxx11](bool)

cascade_mask_fpn_reader.yml

worker_num: 4
TrainReader:
  sample_transforms:
  - Decode: {}
  - RandomDistort: {}

  - Mosaic:
      prob: 1.0
      input_dim: [1300, 1300]
      degrees: [-10, 10]
      scale: [0.1, 2.0]
      shear: [-2, 2]
      translate: [-0.1, 0.1]
      enable_mixup: True
      mixup_prob: 1.0
      mixup_scale: [0.5, 1.5]

  - RandomResize: {target_size: [[640, 1333], [672, 1333], [704, 1333], [736, 1333], [768, 1333], [800, 1333]], interp: 2, keep_ratio: True}
  - RandomFlip: {prob: 0.5}
  - NormalizeImage: {is_scale: true, mean: [0.485,0.456,0.406], std: [0.229, 0.224,0.225]}
  - Permute: {}
  batch_transforms:
  - PadBatch: {pad_to_stride: 32}
  batch_size: 1
  shuffle: true
  drop_last: true
  collate_batch: false
  use_shared_memory: true

EvalReader:
  sample_transforms:
  - Decode: {}
  - Resize: {interp: 2, target_size: [800, 1333], keep_ratio: True}
  - NormalizeImage: {is_scale: true, mean: [0.485,0.456,0.406], std: [0.229, 0.224,0.225]}
  - Permute: {}
  batch_transforms:
  - PadBatch: {pad_to_stride: 32}
  batch_size: 1
  shuffle: false
  drop_last: false

cascade_mask_rcnn_r50_vd_fpn_ssld_2x_fp_zones.yml

_BASE_: [
  '../datasets/fp_zones.yml',
  '../runtime.yml',
  '_base_/optimizer_1x.yml',
  '_base_/cascade_mask_rcnn_r50_fpn.yml',
  '_base_/cascade_mask_fpn_reader.yml',
]
pretrain_weights: https://paddledet.bj.bcebos.com/models/pretrained/ResNet50_vd_ssld_v2_pretrained.pdparams
weights: output/cascade_mask_rcnn_r50_vd_fpn_ssld_2x_coco/model_final

ResNet:
  depth: 50
  variant: d
  norm_type: bn
  freeze_at: 0
  return_idx: [0,1,2,3]
  num_stages: 4
  lr_mult_list: [0.05, 0.05, 0.1, 0.15]

epoch: 150
LearningRate:
  base_lr: 0.01
  schedulers:
  - !PiecewiseDecay
    gamma: 0.1
    milestones: [125, 140]
  - !LinearWarmup
    start_factor: 0.1
    steps: 1000

Environment

Bug description confirmation

Are you willing to submit a PR?

ExtReMLapin commented 7 months ago

I'm trying to debug it. I added a debug print inside paste_mask (from the stack trace above):

print('x0_int, y0_int, x1_int, y1_int:', x0_int, y0_int, x1_int, y1_int)

And it prints:


x0_int, y0_int, x1_int, y1_int: 0 0 Tensor(shape=[], dtype=int32, place=Place(gpu:0), stop_gradient=True,
       4288) Tensor(shape=[], dtype=int32, place=Place(gpu:0), stop_gradient=True,
       2848)
x0_int, y0_int, x1_int, y1_int: 0 0 Tensor(shape=[], dtype=int32, place=Place(gpu:0), stop_gradient=True,
       2000) Tensor(shape=[], dtype=int32, place=Place(gpu:0), stop_gradient=True,
       3000)
x0_int, y0_int, x1_int, y1_int: 0 0 Tensor(shape=[], dtype=int32, place=Place(gpu:0), stop_gradient=True,
       4288) Tensor(shape=[], dtype=int32, place=Place(gpu:0), stop_gradient=True,
       2848)
x0_int, y0_int, x1_int, y1_int: 0 0 Tensor(shape=[], dtype=int32, place=Place(gpu:0), stop_gradient=True,
       4288) Tensor(shape=[], dtype=int32, place=Place(gpu:0), stop_gradient=True,
       2848)

before crashing. Shouldn't the resolution be capped because of the EvalReader Resize?
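For context, a rough back-of-the-envelope for what the paddle.stack([gx, gy], axis=3) allocation costs at those printed sizes; the [num_masks, im_h, im_w, 2] float32 shape is my reading of the traceback, and the 100-detection count is a hypothetical example:

# Estimate the size of the sampling grid stacked in paste_mask.
def grid_bytes(num_masks, im_h, im_w, bytes_per_elem=4):
    # grid = paddle.stack([gx, gy], axis=3) -> [num_masks, im_h, im_w, 2] float32
    return num_masks * im_h * im_w * 2 * bytes_per_elem

for im_w, im_h in [(4288, 2848), (2000, 3000)]:  # sizes seen in the debug prints
    print(f"{im_w}x{im_h}: {grid_bytes(1, im_h, im_w) / 1024**2:.0f} MB per mask, "
          f"{grid_bytes(100, im_h, im_w) / 1024**3:.1f} GB for 100 masks")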

ExtReMLapin commented 7 months ago

Alright, after debugging a little more, I feel like the image is correctly capped when sent to the GPU but later scaled back up, which is weird and fills GPU memory.

ExtReMLapin commented 7 months ago

I can now confirm it's caused by the raw image resolutions; it's probably trying to scale the predicted mask back up to the original image resolution.

After resizing all my dataset images to 1333, the issue is gone. Still a bug, though.
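For anyone hitting the same wall, the workaround is roughly this (a minimal sketch using Pillow, not my exact script; the image folder is hypothetical, and the COCO annotations have to be rescaled by the same factor):

# Cap the longest side of every dataset image at 1333 px (sketch, adjust paths).
from pathlib import Path
from PIL import Image

MAX_SIDE = 1333
root = Path("dataset/fp_zones/images")  # hypothetical location of the raw images

for img_path in root.rglob("*.jpg"):
    with Image.open(img_path) as im:
        scale = MAX_SIDE / max(im.size)
        if scale < 1.0:
            new_size = (round(im.width * scale), round(im.height * scale))
            im.resize(new_size, Image.BILINEAR).save(img_path)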

LokeZhou commented 6 months ago

You can try calling paddle.device.cuda.empty_cache() after each training epoch.
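Something like this (a sketch of where the call goes; train_one_epoch is a placeholder, not PaddleDetection's trainer API, and it assumes a GPU build of Paddle):

import paddle

def train_one_epoch():
    # stand-in for a real training epoch that leaves cached allocations behind
    feats = paddle.randn([8, 256, 200, 304])
    return float(feats.mean())

for epoch in range(2):
    train_one_epoch()
    paddle.device.cuda.empty_cache()  # release cached GPU blocks before eval starts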

ExtReMLapin commented 6 months ago

You misunderstand the issue: it's the validation process ITSELF that causes the OOM error.

With a batch size of 1 it should NOT use more than a gigabyte of VRAM, and yet here it OOMs because of the original resolution of the validation images (even though the augmentations scale them down). Padding won't help, as it adds pixels.

After detecting at the scaled-down resolution, for some reason the mask is scaled UP to the original image resolution, and this is where it hits the CUDA OOM.
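To make the memory behaviour concrete, something shaped like the sketch below would bound the peak to one box-sized map per detection; this is purely illustrative and NOT PaddleDetection's paste_mask:

import numpy as np
import paddle
import paddle.nn.functional as F

# Paste one predicted mask at a time into a uint8 canvas at original resolution,
# instead of building an [N, im_h, im_w, 2] float32 grid for all detections at once.
def paste_masks_one_by_one(masks, boxes, im_h, im_w, thresh=0.5):
    # masks: [N, 28, 28] paddle tensor of mask probabilities
    # boxes: [N, 4] numpy array of (x0, y0, x1, y1) in original-image pixels
    out = np.zeros((len(boxes), im_h, im_w), dtype=np.uint8)
    for i, (x0, y0, x1, y1) in enumerate(boxes.round().astype(int)):
        x0, y0 = max(x0, 0), max(y0, 0)
        x1, y1 = min(x1, im_w), min(y1, im_h)
        if x1 <= x0 or y1 <= y0:
            continue
        m = F.interpolate(masks[i].unsqueeze(0).unsqueeze(0),
                          size=[y1 - y0, x1 - x0], mode='bilinear')
        out[i, y0:y1, x0:x1] = (m[0, 0] >= thresh).numpy().astype(np.uint8)
    return out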