After replacing FairMOT's backbone with X101, training works normally, but eval_mot floods the output with warnings

zouhan6806504 commented 2 months ago

Search before asking

[X] I have searched the question and found no related answer.

Please ask your question

The configuration file is as follows:

use_gpu: true
use_xpu: false
use_mlu: false
use_npu: false
log_iter: 100
save_dir: /home/aistudio/output
#save_dir: /home/aistudio/output
snapshot_epoch: 1
print_flops: false
print_params: false

# Exporting the model
export:
  post_process: True  # Whether post-processing is included in the network when export model.
  nms: True           # Whether NMS is included in the network when export model.
  benchmark: False    # It is used to testing model performance, if set `True`, post-process and NMS will not be exported.
  fuse_conv_bn: False

metric: MCMOT
num_classes: 228
#/home/aistudio/data/mot
TrainDataset:
  !MCMOTDataSet
    dataset_dir: /home/aistudio/data/mot # change this to your own dataset directory
    image_lists: ['IKCEST.train']
    data_fields: ['image', 'gt_bbox', 'gt_class', 'gt_ide']
    label_list: /home/aistudio/data/mot/label_list.txt

EvalMOTDataset:
  !MOTImageFolder
    dataset_dir: /home/aistudio/data/mot
    data_root: IKCEST/images/test/
    keep_ori_im: False # set True if save visualization images or video, or used in DeepSORT
    anno_path: /home/aistudio/data/mot/label_list.txt

TestMOTDataset:
  !MOTImageFolder
    dataset_dir: /home/aistudio/data/mot/IKCEST/images/test/
    keep_ori_im: True # set True if save visualization images or video
    anno_path: /home/aistudio/data/mot/label_list.txt

#pretrain_weights: https://paddledet.bj.bcebos.com/models/centernet_dla34_140e_coco.pdparams
pretrain_weights: https://paddledet.bj.bcebos.com/models/pretrained/ResNeXt101_vd_64x4d_pretrained.pdparams
architecture: FairMOT
for_mot: True

FairMOT:
  detector: CenterNet
  reid: FairMOTEmbeddingHead
  loss: FairMOTLoss
  tracker: JDETracker # multi-class tracker

CenterNet:
  backbone: ResNet
  neck: CenterNetDLAFPN
  head: CenterNetHead
  post_process: CenterNetPostProcess

ResNet:
  # for ResNeXt: groups, base_width, base_channels
  depth: 101
  groups: 64
  base_width: 4
  variant: d
  norm_type: bn
  freeze_at: 0
  return_idx: [0,1,2,3]
  num_stages: 4
  dcn_v2_stages: [1,2,3]

CenterNetDLAFPN:
  down_ratio: 4
  last_level: 3
  out_channel: 0
  first_level: 0
  dcn_v2: True
  with_sge: True

CenterNetHead:
  head_planes: 256
  prior_bias: -2.19
  regress_ltrb: False
  size_loss: 'L1'
  loss_weight: {'heatmap': 1.0, 'size': 0.1, 'offset': 1.0, 'iou': 0.0}
  add_iou: False

FairMOTEmbeddingHead:
  ch_head: 256
  ch_emb: 128

CenterNetPostProcess:
  max_per_img: 200
  down_ratio: 4
  regress_ltrb: False

JDETracker:
  conf_thres: 0.4
  tracked_thresh: 0.4
  metric_type: cosine
  min_box_area: 0
  vertical_ratio: 0 # for pedestrian
  use_byte: True
  match_thres: 0.8
  low_conf_thres: 0.2

weights: /home/aistudio/output/dla
#weights: /home/aistudio/output/dla

epoch: 30
LearningRate:
  base_lr: 0.00025
  schedulers:
    - name: CosineDecay
      max_epochs: 36
    - name: LinearWarmup
      start_factor: 0.
      epochs: 1

OptimizerBuilder:
  regularizer: false
  optimizer:
    type: AdamW
    weight_decay: 0.0001
    param_groups:
      - params: ['absolute_pos_embed', 'relative_position_bias_table', 'norm']
        weight_decay: 0.0

worker_num: 1
TrainReader:
  inputs_def:
    image_shape: [3, 608, 1088]
  sample_transforms:
    - Decode: {}
    - RGBReverse: {}
    - AugmentHSV: {}
    - LetterBoxResize: {target_size: [608, 1088]}
    - MOTRandomAffine: {reject_outside: False}
    - RandomFlip: {}
    - BboxXYXY2XYWH: {}
    - NormalizeBox: {}
    - NormalizeImage: {mean: [0, 0, 0], std: [1, 1, 1]}
    - RGBReverse: {}
    - Permute: {}
  batch_transforms:
    - Gt2FairMOTTarget: {}
  batch_size: 4
  shuffle: True
  drop_last: True
  use_shared_memory: True

EvalMOTReader:
  sample_transforms:
    - Decode: {}
    - LetterBoxResize: {target_size: [608, 1088]}
    - NormalizeImage: {mean: [0, 0, 0], std: [1, 1, 1], is_scale: True}
    - Permute: {}
  batch_size: 1

TestMOTReader:
  inputs_def:
    image_shape: [3, 608, 1088]
  sample_transforms:
    - Decode: {}
    - LetterBoxResize: {target_size: [608, 1088]}
    - NormalizeImage: {mean: [0, 0, 0], std: [1, 1, 1], is_scale: True}
    - Permute: {}
  batch_size: 1

During eval, the following warning is printed over and over:

Warning:: 0D Tensor cannot be used as 'Tensor.numpy()[0]' . In order to avoid this problem, 0D Tensor will be changed to 1D numpy currently, but it's not correct and will be removed in release 2.6. For Tensor contain only one element, Please modify 'Tensor.numpy()[0]' to 'float(Tensor)' as soon as possible, otherwise 'Tensor.numpy()[0]' will raise error in release 2.6.
I0913 17:10:51.835693 264737 eager_method.cc:140] Warning:: 0D Tensor cannot be used as 'Tensor.numpy()[0]' . In order to avoid this problem, 0D Tensor will be changed to 1D numpy currently, but it's not correct and will be removed in release 2.6. For Tensor contain only one element, Please modify 'Tensor.numpy()[0]' to 'float(Tensor)' as soon as possible, otherwise 'Tensor.numpy()[0]' will raise error in release 2.6.
I0913 17:10:51.835812 264737 eager_method.cc:140]

When assembling components like this, what else needs to be changed?

Bobholamovic commented 2 months ago

Please provide the PaddleDetection and Paddle versions you are using so that we can investigate.

zouhan6806504 commented 1 month ago

Shared link for the project "ikcest2024_notebook" (valid for three days): https://aistudio.baidu.com/studio/project/partial/verify/8294004/4a9858342afa488d8eac9acd98c09667 Run the cells in order to reproduce the issue. Versions: PaddleDetection 2.7, Paddle 2.5.

Bobholamovic commented 1 month ago

You may need to change x.numpy()[0] to np.array(x.numpy())[0] in these two places: https://github.com/search?q=repo%3APaddlePaddle%2FPaddleDetection%20%22.numpy()%5B0%5D%22&type=code
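
For reference, the warning text itself recommends replacing 'Tensor.numpy()[0]' with 'float(Tensor)'. The following is a minimal sketch of that conversion, not PaddleDetection code: `to_scalar` is a hypothetical helper name, and `FakeScalarTensor` is a stand-in class used only so the example runs without Paddle installed. It assumes the real tensor holds exactly one element.

```python
class FakeScalarTensor:
    """Stand-in for a single-element tensor. NOT a Paddle class."""

    def __init__(self, value):
        self._value = value

    def item(self):
        # Mirrors the usual "give me the single element" accessor.
        return self._value

    def __float__(self):
        return float(self._value)


def to_scalar(value):
    """Return a one-element tensor-like object or a number as a Python float.

    Hypothetical helper: it avoids the `value.numpy()[0]` indexing pattern
    that the warning complains about.
    """
    item = getattr(value, "item", None)
    if callable(item):
        return float(item())
    return float(value)


print(to_scalar(FakeScalarTensor(0.4)))
print(to_scalar(3))
```

Whether this removes every occurrence of the warning depends on where the 0D tensors are produced in the evaluation path, which this thread does not establish.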

zouhan6806504 commented 1 month ago

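After changing these places, the "0D Tensor cannot be used as 'Tensor.numpy()[0]'" warning still appears. I tried switching Paddle to version 2.6; the warning did disappear from the log, but the generated txt files are all empty.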
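
To confirm which result files are actually empty, a small check like the one below may help. It is only a sketch: `find_empty_results` is a hypothetical helper, and the directory passed to it has to be wherever eval_mot wrote its txt files in your run (the path in the comment is an assumption, not taken from this thread).

```python
from pathlib import Path


def find_empty_results(result_dir):
    """Return the names of .txt files in result_dir that have no content.

    A file containing only whitespace is treated as empty.
    """
    empty = []
    for path in sorted(Path(result_dir).glob("*.txt")):
        if not path.read_text().strip():
            empty.append(path.name)
    return empty


# Example call (assumed path, adjust to your own output directory):
# print(find_empty_results("output/mot_results"))
```

If only some sequences come back empty, the cause is more likely data- or threshold-related than a failure of the whole evaluation pipeline.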

Bobholamovic commented 1 month ago

Are there any error messages in the log?

zouhan6806504 commented 1 month ago

After switching to Paddle 2.6, the log shows no errors.

Bobholamovic commented 1 month ago

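If there are no errors, could the empty results be a model quality issue? Did you use your own data, and what is the model's accuracy on the validation set?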

zouhan6806504 commented 1 month ago

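The training data is identical, so I don't think it is a model problem. I trained with --eval and compared against dla34: the loss is lower than dla34's. All I did was swap the backbone in fairmot_dla34, DLA -> ResNet101.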

zouhan6806504 commented 1 month ago

The DLA loss never gets below about 4.4, while resnet101 reaches 4.2. Below is the resnet101 training log:

[09/12 09:28:19] ppdet.engine INFO: Epoch: [13] [10500/10687] learning_rate: 0.000174 loss: 4.230835 heatmap_loss: 0.332204 size_loss: 0.522570 offset_loss: 0.156625 det_loss: 0.551854 reid_loss: 1159.913696 eta: 5 days, 7:38:55 batch_cost: 2.6880 data_cost: 0.0003 ips: 1.4881 images/s
[09/12 09:32:47] ppdet.engine INFO: Epoch: [13] [10600/10687] learning_rate: 0.000174 loss: 4.205076 heatmap_loss: 0.312019 size_loss: 0.504238 offset_loss: 0.155205 det_loss: 0.523159 reid_loss: 1159.933716 eta: 5 days, 7:34:25 batch_cost: 2.6793 data_cost: 0.0003 ips: 1.4929 images/s

And the DLA log:

[09/10 13:17:44] ppdet.engine INFO: Epoch: [13] [5200/5343] learning_rate: 0.000279 loss: 4.377353 heatmap_loss: 0.517096 size_loss: 0.592028 offset_loss: 0.163601 det_loss: 0.738086 reid_loss: 1160.759399 eta: 4 days, 0:43:39 batch_cost: 2.8901 data_cost: 1.8178 ips: 2.7681 images/s
[09/10 13:22:36] ppdet.engine INFO: Epoch: [13] [5300/5343] learning_rate: 0.000279 loss: 4.386300 heatmap_loss: 0.523919 size_loss: 0.597360 offset_loss: 0.163826 det_loss: 0.752244 reid_loss: 1160.773315 eta: 4 days, 0:26:39 batch_cost: 2.9197 data_cost: 1.8222 ips: 2.7400 images/s

The batch sizes of the two runs differ by a factor of two because of GPU memory limits.
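
To compare the two runs field by field rather than by total loss only, the log lines above can be parsed with a short script. This is a sketch: `parse_losses` is a hypothetical helper, and the two strings are abbreviated copies of the last resnet101 and DLA log lines quoted above.

```python
import re

LOSS_KEYS = ["loss", "heatmap_loss", "size_loss", "offset_loss", "det_loss", "reid_loss"]


def parse_losses(line):
    """Extract the loss fields from one ppdet training log line."""
    values = {}
    for key in LOSS_KEYS:
        # \b keeps "loss" from matching inside "heatmap_loss" and similar names.
        match = re.search(rf"\b{key}: ([0-9.]+)", line)
        if match:
            values[key] = float(match.group(1))
    return values


resnet_line = (
    "Epoch: [13] [10600/10687] learning_rate: 0.000174 loss: 4.205076 "
    "heatmap_loss: 0.312019 size_loss: 0.504238 offset_loss: 0.155205 "
    "det_loss: 0.523159 reid_loss: 1159.933716"
)
dla_line = (
    "Epoch: [13] [5300/5343] learning_rate: 0.000279 loss: 4.386300 "
    "heatmap_loss: 0.523919 size_loss: 0.597360 offset_loss: 0.163826 "
    "det_loss: 0.752244 reid_loss: 1160.773315"
)

resnet = parse_losses(resnet_line)
dla = parse_losses(dla_line)
for key in LOSS_KEYS:
    diff = resnet[key] - dla[key]
    print(f"{key:>13}: resnet101={resnet[key]:.4f} dla34={dla[key]:.4f} diff={diff:+.4f}")
```

Note that a lower training loss does not by itself show that evaluation produces detections above `conf_thres`, so the comparison is a sanity check rather than evidence that the eval output should be non-empty.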

Bobholamovic commented 1 month ago

On what data are the test results empty?

lyuwenyu commented 1 month ago

Does DLA produce normal output?

TingquanGao commented 5 days ago

This issue has had no response for a long time and will be closed. You can reopen it or open a new issue if you still have questions.

From Bot

PaddlePaddle / PaddleDetection

After replacing FairMOT's backbone with X101, training works normally, but eval_mot floods the output with warnings #9139
