PaddlePaddle / PaddleDetection

Object Detection toolkit based on PaddlePaddle. It supports object detection, instance segmentation, multiple object tracking and real-time multi-person keypoint detection.
Apache License 2.0

Unable to train `RT-DETR` because of `pretrain_weights`. #8141

Open seareale opened 1 year ago

seareale commented 1 year ago

Issue confirmation: Search before asking

Please ask your question

I executed the command `python tools/train.py -c configs/rtdetr/rtdetr_r50vd_6x_coco.yml --eval`, so the pretrained weights referenced in `rtdetr_r50vd.yml` were downloaded: https://github.com/PaddlePaddle/PaddleDetection/blob/78c6b82fbcd633bdf6f27fa12d820a3581770ca5/configs/rtdetr/_base_/rtdetr_r50vd.yml#L2

But I got the error below.

[04/24 06:55:56] ppdet.utils.checkpoint INFO: Finish loading model weights: /home/haejin/.cache/paddle/weights/ResNet50_vd_ssld_v2_pretrained.pdparams
Traceback (most recent call last):
File "/home/haejin/workspace/PaddleDetection/tools/train.py", line 204, in <module>
    main()
  File "/home/haejin/workspace/PaddleDetection/tools/train.py", line 200, in main
    run(FLAGS, cfg)
  File "/home/haejin/workspace/PaddleDetection/tools/train.py", line 153, in run
    trainer.train(FLAGS.eval)
  File "/home/haejin/workspace/PaddleDetection/ppdet/engine/trainer.py", line 542, in train
    outputs = model(data)
  File "/home/haejin/miniconda3/envs/paddle/lib/python3.10/site-packages/paddle/fluid/dygraph/layers.py", line 930, in __call__
    return self._dygraph_call_func(*inputs, **kwargs)
  File "/home/haejin/miniconda3/envs/paddle/lib/python3.10/site-packages/paddle/fluid/dygraph/layers.py", line 915, in _dygraph_call_func
    outputs = self.forward(*inputs, **kwargs)
  File "/home/haejin/workspace/PaddleDetection/ppdet/modeling/architectures/meta_arch.py", line 60, in forward
    out = self.get_loss()
  File "/home/haejin/workspace/PaddleDetection/ppdet/modeling/architectures/detr.py", line 113, in get_loss
    return self._forward()
  File "/home/haejin/workspace/PaddleDetection/ppdet/modeling/architectures/detr.py", line 87, in _forward
    out_transformer = self.transformer(body_feats, pad_mask, self.inputs)
  File "/home/haejin/miniconda3/envs/paddle/lib/python3.10/site-packages/paddle/fluid/dygraph/layers.py", line 930, in __call__
    return self._dygraph_call_func(*inputs, **kwargs)
  File "/home/haejin/miniconda3/envs/paddle/lib/python3.10/site-packages/paddle/fluid/dygraph/layers.py", line 915, in _dygraph_call_func
    outputs = self.forward(*inputs, **kwargs)
  File "/home/haejin/workspace/CapAI/train/lesion_detection/paddle/PaddleDetection/ppdet/modeling/transformers/rtdetr_transformer.py", line 442, in forward
    get_contrastive_denoising_training_group(gt_meta,
  File "/home/haejin/workspace/CapAI/train/lesion_detection/paddle/PaddleDetection/ppdet/modeling/transformers/utils.py", line 258, in get_contrastive_denoising_training_group
    dn_positive_idx = paddle.split(dn_positive_idx,
  File "/home/haejin/miniconda3/envs/paddle/lib/python3.10/site-packages/paddle/tensor/manipulation.py", line 954, in split
    return paddle.fluid.layers.split(
  File "/home/haejin/miniconda3/envs/paddle/lib/python3.10/site-packages/paddle/fluid/layers/nn.py", line 5097, in split
    _C_ops.split(input, out, *attrs)
ValueError: (InvalidArgument) Sum of Attr(num_or_sections) must be equal to the input's size along the split dimension. But received Attr(num_or_sections) = [100, 100, 100, 100], input(X)'s shape = [1638400], Attr(dim) = 0.
  [Hint: Expected sum_of_section == input_axis_dim, but received sum_of_section:400 != input_axis_dim:1638400.] (at /paddle/paddle/fluid/operators/split_op.h:100)
  [operator < split > error]
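
For what it's worth, the message itself only says that paddle.split was asked for four sections of 100 (400 in total) from a tensor whose split axis has size 1638400, so the section list cannot match the input. A tiny standalone sketch (toy shapes, unrelated to the actual model) that trips the same check:

```python
import paddle

x = paddle.rand([6, 4])

# OK: the sections sum to the size of axis 0 (2 + 4 == 6)
a, b = paddle.split(x, num_or_sections=[2, 4], axis=0)

# Fails the same sum-of-sections check as in the traceback above,
# because 100 + 100 != 6
paddle.split(x, num_or_sections=[100, 100], axis=0)
```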

How to solve the problem?

eltoto1219 commented 1 year ago

I get a similar problem while running the following command:

python tools/infer.py -c configs/rtdetr/rtdetr_r50vd_6x_coco.yml -o weights=https://bj.bcebos.com/v1/paddledet/models/rtdetr_r50vd_6x_coco.pdparams --infer_dir ~/rf_calibration_images/

W0424 15:22:18.319972 832877 gpu_resources.cc:61] Please NOTE: device: 0, GPU Compute Capability: 8.0, Driver API Version: 12.0, Runtime API Version: 10.2
W0424 15:22:18.323467 832877 gpu_resources.cc:91] device: 0, cuDNN Version: 7.6.
[04/24 15:22:28] ppdet.utils.checkpoint INFO: Finish loading model weights: /home/antonio/.cache/paddle/weights/rtdetr_r50vd_6x_coco.pdparams
[04/24 15:22:28] train INFO: Found 500 inference images in total.
[04/24 15:22:28] ppdet.data.source.category WARNING: anno_file 'dataset/coco/annotations/instances_val2017.json' is None or not set or not exist, please recheck TrainDataset/EvalDataset/TestDataset.anno_path, otherwise the default categories will be used by metric_type.
[04/24 15:22:28] ppdet.data.source.category WARNING: metric_type: COCO, load default categories of COCO.
  0%|                     | 0/500 [12:41<?, ?it/s]
Traceback (most recent call last):
  File "tools/infer.py", line 237, in <module>
    main()
  File "tools/infer.py", line 233, in main
    run(FLAGS, cfg)
  File "tools/infer.py", line 183, in run
    trainer.predict(
  File "/home/antonio/PaddleDetection/ppdet/engine/trainer.py", line 991, in predict
    outs = self.model(data)
  File "/home/antonio/anaconda3/envs/torch/lib/python3.8/site-packages/paddle/fluid/dygraph/layers.py", line 930, in __call__
    return self._dygraph_call_func(*inputs, **kwargs)
  File "/home/antonio/anaconda3/envs/torch/lib/python3.8/site-packages/paddle/fluid/dygraph/layers.py", line 915, in _dygraph_call_func
    outputs = self.forward(*inputs, **kwargs)
  File "/home/antonio/PaddleDetection/ppdet/modeling/architectures/meta_arch.py", line 76, in forward
    outs.append(self.get_pred())
  File "/home/antonio/PaddleDetection/ppdet/modeling/architectures/detr.py", line 116, in get_pred
    return self._forward()
  File "/home/antonio/PaddleDetection/ppdet/modeling/architectures/detr.py", line 87, in _forward
    out_transformer = self.transformer(body_feats, pad_mask, self.inputs)
  File "/home/antonio/anaconda3/envs/torch/lib/python3.8/site-packages/paddle/fluid/dygraph/layers.py", line 930, in __call__
    return self._dygraph_call_func(*inputs, **kwargs)
  File "/home/antonio/anaconda3/envs/torch/lib/python3.8/site-packages/paddle/fluid/dygraph/layers.py", line 915, in _dygraph_call_func
    outputs = self.forward(*inputs, **kwargs)
  File "/home/antonio/PaddleDetection/ppdet/modeling/transformers/rtdetr_transformer.py", line 457, in forward
    out_bboxes, out_logits = self.decoder(
  File "/home/antonio/anaconda3/envs/torch/lib/python3.8/site-packages/paddle/fluid/dygraph/layers.py", line 930, in __call__
    return self._dygraph_call_func(*inputs, **kwargs)
  File "/home/antonio/anaconda3/envs/torch/lib/python3.8/site-packages/paddle/fluid/dygraph/layers.py", line 915, in _dygraph_call_func
    outputs = self.forward(*inputs, **kwargs)
  File "/home/antonio/PaddleDetection/ppdet/modeling/transformers/rtdetr_transformer.py", line 230, in forward
    output = layer(output, ref_points_input, memory,
  File "/home/antonio/anaconda3/envs/torch/lib/python3.8/site-packages/paddle/fluid/dygraph/layers.py", line 930, in __call__
    return self._dygraph_call_func(*inputs, **kwargs)
  File "/home/antonio/anaconda3/envs/torch/lib/python3.8/site-packages/paddle/fluid/dygraph/layers.py", line 915, in _dygraph_call_func
    outputs = self.forward(*inputs, **kwargs)
  File "/home/antonio/PaddleDetection/ppdet/modeling/transformers/rtdetr_transformer.py", line 189, in forward
    tgt2 = self.cross_attn(
  File "/home/antonio/anaconda3/envs/torch/lib/python3.8/site-packages/paddle/fluid/dygraph/layers.py", line 930, in __call__
    return self._dygraph_call_func(*inputs, **kwargs)
  File "/home/antonio/anaconda3/envs/torch/lib/python3.8/site-packages/paddle/fluid/dygraph/layers.py", line 915, in _dygraph_call_func
    outputs = self.forward(*inputs, **kwargs)
  File "/home/antonio/PaddleDetection/ppdet/modeling/transformers/rtdetr_transformer.py", line 104, in forward
    output = self.ms_deformable_attn_core(
  File "/home/antonio/PaddleDetection/ppdet/modeling/transformers/utils.py", line 89, in deformable_attention_core_func
    value_list = value.split(split_shape, axis=1)
  File "/home/antonio/anaconda3/envs/torch/lib/python3.8/site-packages/paddle/tensor/manipulation.py", line 954, in split
    return paddle.fluid.layers.split(
  File "/home/antonio/anaconda3/envs/torch/lib/python3.8/site-packages/paddle/fluid/layers/nn.py", line 5097, in split
    _C_ops.split(input, out, *attrs)
ValueError: (InvalidArgument) Sum of Attr(num_or_sections) must be equal to the input's size along the split dimension. But received Attr(num_or_sections) = [901165597, -1115482905, 1007894246], input(X)'s shape = [1, 8400, 8, 32], Attr(dim) = 1.
  [Hint: Expected sum_of_section == input_axis_dim, but received sum_of_section:793576938 != input_axis_dim:8400.] (at /paddle/paddle/fluid/operators/split_op.h:100)
  [operator < split > error]

Python version: 3.8.16, PaddlePaddle version: 2.3.2
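
For reference, this is roughly the split that deformable_attention_core_func performs; the level sizes and variable names below are assumptions for a 640x640 input with strides 8/16/32 (6400 + 1600 + 400 = 8400), not values taken from the run above. The huge, partly negative sections in the traceback ([901165597, -1115482905, 1007894246]) look like uninitialized memory rather than real level sizes.

```python
import paddle

# Assumed shapes: [batch, sum(h*w), num_heads, head_dim] for a 640x640 input
value = paddle.rand([1, 8400, 8, 32])
value_spatial_shapes = [[80, 80], [40, 40], [20, 20]]  # strides 8, 16, 32

# Per-level token counts; this is what split_shape should contain
split_shape = [h * w for h, w in value_spatial_shapes]  # [6400, 1600, 400]

# Works because 6400 + 1600 + 400 == 8400
value_list = value.split(split_shape, axis=1)
print([v.shape for v in value_list])
```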

lyuwenyu commented 1 year ago

The problem you describe cannot be reproduced, please make sure you are using the latest code.

[screenshot]

seareale commented 1 year ago

@lyuwenyu I'm sorry I'm late here, and thank you for replying. I used my custom dataset converted to COCO format; there was no problem using it in other projects. I also used the latest commit of the develop branch, and all parameters except the dataset are default values.
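
In case it helps with reproducing, here is a small sketch for sanity-checking the converted annotations (the path is a placeholder). It counts images that ended up with zero boxes after conversion, which is the kind of degenerate input that seems to be involved in this failure:

```python
import json
from collections import Counter

# placeholder path to the converted COCO-format training annotations
with open("dataset/custom/annotations/train.json") as f:
    coco = json.load(f)

# number of annotations per image id
ann_counts = Counter(a["image_id"] for a in coco["annotations"])
empty = [img["id"] for img in coco["images"] if ann_counts[img["id"]] == 0]
print(f"{len(empty)} of {len(coco['images'])} images have no annotations")
```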

PhD-TianLv commented 1 year ago

@lyuwenyu @seareale

Cause of the error: in Paddle 2.3, a memory leak occurs when the tensor's value is 0, as shown in the screenshot below.

[screenshot: Screenshot 2023-05-11 at 23 43 40]

Attempted solutions. Solution 1: upgrade paddlepaddle to version 2.4, but a new error appears:

W0512 00:07:35.062287 513134 system_allocator.cc:234] cudaHostAlloc failed.
W0512 00:07:35.062326 513134 naive_best_fit_allocator.cc:591] cudaHostAlloc Cannot allocate 32 bytes in CUDAPinnedPlace

--------------------------------------
C++ Traceback (most recent call last):
--------------------------------------
0   std::thread::_State_impl<std::thread::_Invoker<std::tuple<ThreadPool::ThreadPool(unsigned long)::{lambda()#1}> > >::_M_run()
1   std::__future_base::_State_baseV2::_M_do_set(std::function<std::unique_ptr<std::__future_base::_Result_base, std::__future_base::_Result_base::_Deleter> ()>*, bool*)
2   void paddle::memory::Copy<phi::GPUPinnedPlace, phi::Place>(phi::GPUPinnedPlace, void*, phi::Place, void const*, unsigned long)
3   void paddle::memory::Copy<phi::Place, phi::Place>(phi::Place, void*, phi::Place, void const*, unsigned long, void*)
4   void paddle::memory::Copy<phi::GPUPinnedPlace, phi::CPUPlace>(phi::GPUPinnedPlace, void*, phi::CPUPlace, void const*, unsigned long)

----------------------
Error Message Summary:
----------------------
FatalError: `Segmentation fault` is detected by the operating system.
  [TimeInfo: *** Aborted at 1683821255 (unix time) try "date -d @1683821255" if you are using GNU date ***]
  [SignalInfo: *** SIGSEGV (@0x0) received by PID 512908 (TID 0x7f964f6c9700) from PID 0 ***]

Segmentation fault
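
As a quick sanity check of whether the upgraded 2.4 wheel works with the local CUDA/cuDNN setup at all (independent of PaddleDetection), Paddle's built-in self-test can be run first; if this also crashes, the problem is the install rather than the RT-DETR code path:

```python
import paddle

# Prints the installed version and runs a small test computation on the
# available devices; this is the standard way to verify a PaddlePaddle install.
print(paddle.__version__)
paddle.utils.run_check()
```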

Solution 2 (to be tried on the morning of May 12, 2023): modify the get_denoising_training_group function in ppdet/modeling/transformers/utils.py, changing torch.zeros so the tensor is initialized with bool values.
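
A rough sketch of that dtype change (names and sizes are placeholders, not the repository's exact code): create the attention mask directly as a bool tensor instead of a zero-valued tensor with the default float dtype.

```python
import paddle

# placeholder size; in the real code this would depend on the number of
# denoising groups and object queries
tgt_size = 200

# before (as described above): a zero-valued tensor with the default dtype
# attn_mask = paddle.zeros([tgt_size, tgt_size])

# proposed change: initialize the mask with a bool dtype from the start
attn_mask = paddle.zeros([tgt_size, tgt_size], dtype='bool')
```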