(InvalidArgument) Sum of Attr(num_or_sections) must be equal to the input's size along the split dimension.

guanshanjushi commented 1 year ago

Search before asking

[X] I have searched the issues and found no similar bug report.

Bug Component

Training

Describe the Bug

I ran into the following problem while training RT-DETR:

INFO 2023-04-25 14:15:35,478 utils.py:148] Note: NumExpr detected 16 cores but "NUMEXPR_MAX_THREADS" not set, so enforcing safe limit of 8.
loading annotations into memory...
Done (t=0.58s)
creating index...
index created!
[04/25 14:15:37] ppdet.data.source.coco INFO: Load [4849 samples valid, 1 samples invalid] in file /home/wxp/wxp_dataset/newcityevent/验证训练集/dataset_科技部课题/train/coco_train.json.
W0425 14:15:37.660985 65296 gpu_resources.cc:61] Please NOTE: device: 0, GPU Compute Capability: 8.6, Driver API Version: 12.0, Runtime API Version: 10.2
W0425 14:15:37.663920 65296 gpu_resources.cc:91] device: 0, cuDNN Version: 7.6.
[04/25 14:15:39] ppdet.utils.checkpoint INFO: ['fc.bias', 'fc.weight', 'last_conv.weight'] in pretrained weight is not used in the model, and its will not be loaded
[04/25 14:15:39] ppdet.utils.checkpoint INFO: Finish loading model weights: /home/wxp/project_wxp/github/YOLO/PaddleDetection/pretrain_weights/PPHGNetV2_L_ssld_pretrained.pdparams
Traceback (most recent call last):
  File "/home/wxp/project_wxp/github/YOLO/PaddleDetection/tools/train.py", line 204, in <module>
    main()
  File "/home/wxp/project_wxp/github/YOLO/PaddleDetection/tools/train.py", line 200, in main
    run(FLAGS, cfg)
  File "/home/wxp/project_wxp/github/YOLO/PaddleDetection/tools/train.py", line 153, in run
    trainer.train(FLAGS.eval)
  File "/home/wxp/project_wxp/github/YOLO/PaddleDetection/ppdet/engine/trainer.py", line 542, in train
    outputs = model(data)
  File "/home/wxp/anaconda3/lib/python3.9/site-packages/paddle/fluid/dygraph/layers.py", line 930, in __call__
    return self._dygraph_call_func(*inputs, **kwargs)
  File "/home/wxp/anaconda3/lib/python3.9/site-packages/paddle/fluid/dygraph/layers.py", line 915, in _dygraph_call_func
    outputs = self.forward(*inputs, **kwargs)
  File "/home/wxp/project_wxp/github/YOLO/PaddleDetection/ppdet/modeling/architectures/meta_arch.py", line 60, in forward
    out = self.get_loss()
  File "/home/wxp/project_wxp/github/YOLO/PaddleDetection/ppdet/modeling/architectures/detr.py", line 113, in get_loss
    return self._forward()
  File "/home/wxp/project_wxp/github/YOLO/PaddleDetection/ppdet/modeling/architectures/detr.py", line 87, in _forward
    out_transformer = self.transformer(body_feats, pad_mask, self.inputs)
  File "/home/wxp/anaconda3/lib/python3.9/site-packages/paddle/fluid/dygraph/layers.py", line 930, in __call__
    return self._dygraph_call_func(*inputs, **kwargs)
  File "/home/wxp/anaconda3/lib/python3.9/site-packages/paddle/fluid/dygraph/layers.py", line 915, in _dygraph_call_func
    outputs = self.forward(*inputs, **kwargs)
  File "/home/wxp/project_wxp/github/YOLO/PaddleDetection/ppdet/modeling/transformers/rtdetr_transformer.py", line 442, in forward
    get_contrastive_denoising_training_group(gt_meta,
  File "/home/wxp/project_wxp/github/YOLO/PaddleDetection/ppdet/modeling/transformers/utils.py", line 258, in get_contrastive_denoising_training_group
    dn_positive_idx = paddle.split(dn_positive_idx,
  File "/home/wxp/anaconda3/lib/python3.9/site-packages/paddle/tensor/manipulation.py", line 954, in split
    return paddle.fluid.layers.split(
  File "/home/wxp/anaconda3/lib/python3.9/site-packages/paddle/fluid/layers/nn.py", line 5097, in split
    _C_ops.split(input, out, attrs)
ValueError: (InvalidArgument) Sum of Attr(num_or_sections) must be equal to the input's size along the split dimension. But received Attr(num_or_sections) = [80, 52], input(X)'s shape = [1638400], Attr(dim) = 0.
  [Hint: Expected sum_of_section == input_axis_dim, but received sum_of_section:132 != input_axis_dim:1638400.] (at /paddle/paddle/fluid/operators/split_op.h:100)
  [operator < split > error]
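For readers unfamiliar with the check that fails here, the rule in the error message can be sketched in plain Python. This is only an illustration of the invariant (section sizes must add up to the tensor's length along the split axis), not Paddle's implementation; the function name `check_split_sections` is hypothetical, and the numbers are the ones reported in the error above.

```python
def check_split_sections(axis_size, sections):
    """Raise ValueError if the section sizes do not add up to axis_size.

    Hypothetical helper that mirrors the rule stated in the error message;
    it is not part of Paddle.
    """
    total = sum(sections)
    if total != axis_size:
        raise ValueError(
            f"sum of sections {total} != input axis size {axis_size}"
        )
    return total


# Numbers from the error message: sections [80, 52], input shape [1638400].
try:
    check_split_sections(1638400, [80, 52])
except ValueError as err:
    print(err)

# A tensor of length 132 would satisfy the same section list.
check_split_sections(132, [80, 52])
```

The gap between 132 and 1638400 is large, which may suggest that the tensor being split has an unexpected size rather than that the section list is slightly off.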

Environment

OS: ubuntu20.04
paddlepaddle: 2.3.2
cuda: 10.2
cudnn: 7.6
gcc: 7.5.0

Bug description confirmation

[X] I confirm that the bug replication steps, code change instructions, and environment information have been provided, and the problem can be reproduced.

Are you willing to submit a PR?

[x] I'd like to help by submitting a PR!

guanshanjushi commented 1 year ago

I am training on my own dataset. I changed num_class and also set batch_size=2 and worker_num=2:

worker_num: 2
TrainReader:
  sample_transforms:
    - Decode: {}
    - RandomDistort: {prob: 0.8}
    - RandomExpand: {fill_value: [123.675, 116.28, 103.53]}
    - RandomCrop: {prob: 0.8}
    - RandomFlip: {}
  batch_transforms:
    - BatchRandomResize: {target_size: [480, 512, 544, 576, 608, 640, 640, 640, 672, 704, 736, 768, 800], random_size: True, random_interp: True, keep_ratio: False}
    - NormalizeImage: {mean: [0., 0., 0.], std: [1., 1., 1.], norm_type: none}
    - NormalizeBox: {}
    - BboxXYXY2XYWH: {}
    - Permute: {}
  batch_size: 2
  shuffle: true
  drop_last: true
  collate_batch: false
  use_shared_memory: false

EvalReader:
  sample_transforms:
    - Decode: {}
    - Resize: {target_size: [640, 640], keep_ratio: False, interp: 2}
    - NormalizeImage: {mean: [0., 0., 0.], std: [1., 1., 1.], norm_type: none}
    - Permute: {}
  batch_size: 2
  shuffle: false
  drop_last: false

TestReader:
  inputs_def:
    image_shape: [3, 640, 640]
  sample_transforms:
    - Decode: {}
    - Resize: {target_size: [640, 640], keep_ratio: False, interp: 2}
    - NormalizeImage: {mean: [0., 0., 0.], std: [1., 1., 1.], norm_type: none}
    - Permute: {}
  batch_size: 1
  shuffle: false
  drop_last: false

guanshanjushi commented 1 year ago

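Found the cause. After repeated testing, the problem comes down to the CUDA version of the Paddle build: training the latest RT-DETR with a CUDA 10.2 build of Paddle produces the error above, while switching to a CUDA 11.1 build makes it go away. Also make sure the cuDNN version matches the one Paddle requires. May no one be troubled by environment configuration.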
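One possible reading of this fix, offered as context rather than a confirmed diagnosis: the log reports GPU Compute Capability 8.6 with Runtime API Version 10.2, and CUDA added support for compute capability 8.6 in release 11.1. The sketch below compares the two values from the log against a small lookup table. The table is deliberately incomplete, and the names `MIN_CUDA_FOR_CAPABILITY`, `parse_version`, and `runtime_supports_gpu` are hypothetical helpers, not part of Paddle or CUDA.

```python
# Minimum CUDA toolkit release that added support for a given GPU compute
# capability. Illustrative and incomplete; consult NVIDIA's documentation
# for an authoritative list.
MIN_CUDA_FOR_CAPABILITY = {
    (7, 5): (10, 0),  # Turing
    (8, 0): (11, 0),  # Ampere (A100)
    (8, 6): (11, 1),  # Ampere (for example RTX 30 series)
}


def parse_version(text):
    """Turn a string such as '10.2' into a tuple of ints, e.g. (10, 2)."""
    return tuple(int(part) for part in text.strip().split("."))


def runtime_supports_gpu(capability, cuda_runtime):
    """Return True or False, or None if the capability is not in the table."""
    minimum = MIN_CUDA_FOR_CAPABILITY.get(parse_version(capability))
    if minimum is None:
        return None
    return parse_version(cuda_runtime) >= minimum


# Values taken from the log in this issue.
print(runtime_supports_gpu("8.6", "10.2"))  # False
print(runtime_supports_gpu("8.6", "11.1"))  # True
```

With Paddle installed, `paddle.version.cuda()` reports the CUDA version the installed wheel was built against, and `paddle.utils.run_check()` runs a basic installation check; both can help confirm which build is actually in use.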
