PaddlePaddle / PaddleDetection

Object Detection toolkit based on PaddlePaddle. It supports object detection, instance segmentation, multiple object tracking and real-time multi-person keypoint detection.
Apache License 2.0

Single-machine multi-GPU training of rt-detrv2-r101: loss backpropagation fails with ValueError: (InvalidArgument) Required tensor shall not be nullptr, but received nullptr. #9094

Closed DoctorDream closed 6 days ago

DoctorDream commented 2 months ago

Search before asking

Bug Component

Training

Describe the Bug

When I train rt-detr with the following command:

python -m paddle.distributed.launch --gpus 0,1,2 tools/train.py -c configs/rtdetrv2/rtdetrv2_r101vd_6x_coco.yml --fleet --eval

the following error occurs:

Traceback (most recent call last):
  File "/home/zqy/zqy/Codes/PaddleDetection/tools/train.py", line 209, in <module>
    main()
  File "/home/zqy/zqy/Codes/PaddleDetection/tools/train.py", line 205, in main
    run(FLAGS, cfg)
  File "/home/zqy/zqy/Codes/PaddleDetection/tools/train.py", line 158, in run
    trainer.train(FLAGS.eval)
  File "/home/zqy/zqy/Codes/PaddleDetection/ppdet/engine/trainer.py", line 614, in train
    loss.backward()
  File "/usr/local/lib/python3.10/dist-packages/decorator.py", line 232, in fun
    return caller(func, *(extras + args), **kw)
  File "/usr/local/lib/python3.10/dist-packages/paddle/base/wrapped_decorator.py", line 26, in __impl__
    return wrapped_func(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/paddle/base/framework.py", line 593, in __impl__
    return func(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/paddle/base/dygraph/tensor_patch_methods.py", line 342, in backward
    core.eager.run_backward([self], grad_tensor, retain_graph)
ValueError: (InvalidArgument) Required tensor shall not be nullptr, but received nullptr.
  [Hint: tensor should not be null.] (at ../paddle/phi/core/device_context.cc:142)

The error does not occur when I train on a single GPU.
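Since the failure appears only in the multi-GPU run, one way to narrow it down is to repeat the same config with 1, 2, and 3 GPUs. The sketch below only builds the command lines; it uses the flags shown in the command above, while the helper name `build_train_command` and the plain single-process form (`python tools/train.py -c ... --eval`, without `--fleet`) are assumptions that do not come from this thread.

```python
import shlex

# Config path taken from the command in the report above.
CONFIG = "configs/rtdetrv2/rtdetrv2_r101vd_6x_coco.yml"


def build_train_command(config, gpus=None, evaluate=True):
    """Return the argv list for one training run.

    gpus: list of GPU ids for a distributed launch, or None for a plain
    single-process run (the single-process form is an assumption).
    """
    if gpus:
        cmd = [
            "python", "-m", "paddle.distributed.launch",
            "--gpus", ",".join(str(g) for g in gpus),
            "tools/train.py", "-c", config, "--fleet",
        ]
    else:
        cmd = ["python", "tools/train.py", "-c", config]
    if evaluate:
        cmd.append("--eval")
    return cmd


if __name__ == "__main__":
    # Print one command per GPU setup so the runs can be compared by hand.
    for gpus in (None, [0, 1], [0, 1, 2]):
        print(shlex.join(build_train_command(CONFIG, gpus)))
```

If the two-GPU run fails the same way as the three-GPU run, the problem is likely tied to distributed training in general rather than to one particular device.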

Environment

Bug description confirmation

Are you willing to submit a PR?

Sunting78 commented 2 months ago

Hello, could you try switching to release/2.7.1?

DoctorDream commented 2 months ago

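> Hello, could you try switching to release/2.7.1?

Hello, after switching to the PaddleDetection release/2.7.1 branch, I found that there is no rtdetrv2 folder under configs. When I ran the following command from the release/2.7.0 branch: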

python -m paddle.distributed.launch --gpus 0,1,2 tools/train.py -c configs/rtdetrv2/rtdetrv2_r101vd_6x_coco.yml --fleet --eval

I got the following error:

Traceback (most recent call last):
  File "/home/zqy/zqy/Codes/PaddleDetection-release-2.7.1/tools/train.py", line 213, in <module>
    main()
  File "/home/zqy/zqy/Codes/PaddleDetection-release-2.7.1/tools/train.py", line 209, in main
    run(FLAGS, cfg)
  File "/home/zqy/zqy/Codes/PaddleDetection-release-2.7.1/tools/train.py", line 149, in run
    trainer = Trainer(cfg, mode='train')
  File "/home/zqy/zqy/Codes/PaddleDetection-release-2.7.1/ppdet/engine/trainer.py", line 116, in __init__
    self.model = create(cfg.architecture)
  File "/home/zqy/zqy/Codes/PaddleDetection-release-2.7.1/ppdet/core/workspace.py", line 255, in create
    cls_kwargs.update(cls.from_config(config, **kwargs))
  File "/home/zqy/zqy/Codes/PaddleDetection-release-2.7.1/ppdet/modeling/architectures/detr.py", line 63, in from_config
    transformer = create(cfg['transformer'], **kwargs)
  File "/home/zqy/zqy/Codes/PaddleDetection-release-2.7.1/ppdet/core/workspace.py", line 229, in create
    raise ValueError("The module {} is not registered".format(name))
ValueError: The module RTDETRTransformerv2 is not registered
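The traceback above ends in a name lookup: the config asks for `RTDETRTransformerv2`, and the code checked out on this branch has no class registered under that name. The following is a minimal sketch of that mechanism, assuming a simple dictionary-based registry. It is not the actual `ppdet/core/workspace.py`; `REGISTRY`, `register`, `missing_modules`, and the `num_queries` parameter are hypothetical names used for illustration only.

```python
# Hypothetical minimal registry; NOT PaddleDetection's real workspace code.
REGISTRY = {}


def register(cls):
    """Class decorator that records a class under its own name."""
    REGISTRY[cls.__name__] = cls
    return cls


def create(name, **kwargs):
    """Instantiate a registered class by name, or fail like the traceback."""
    if name not in REGISTRY:
        raise ValueError("The module {} is not registered".format(name))
    return REGISTRY[name](**kwargs)


@register
class RTDETRTransformer:
    """Stand-in for a module that exists on the checked-out branch."""

    def __init__(self, num_queries=300):  # num_queries is illustrative
        self.num_queries = num_queries


def missing_modules(names):
    """Return the names a config refers to that are not registered."""
    return [n for n in names if n not in REGISTRY]


if __name__ == "__main__":
    print(missing_modules(["RTDETRTransformer", "RTDETRTransformerv2"]))
    try:
        create("RTDETRTransformerv2")
    except ValueError as err:
        print(err)
```

Under this reading, the error points to a mismatch between the config file and the branch's code rather than to a problem with the training data.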

lyuwenyu commented 2 months ago

First, try running v1 on the same data to see whether the problem also occurs.

DoctorDream commented 2 months ago

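> First, try running v1 on the same data to see whether the problem also occurs.

Sorry, I made a mistake when filling in the runtime environment earlier. The corrected version is:

Environment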

After testing, multi-GPU training of rtdetr works fine in this environment, but multi-GPU training of rtdetr v2 fails with the error reported above:

ValueError: (InvalidArgument) Required tensor shall not be nullptr, but received nullptr.
  [Hint: tensor should not be null.] (at ../paddle/phi/core/device_context.cc:142)

DoctorDream commented 2 months ago

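> First, try running v1 on the same data to see whether the problem also occurs.

Is there a short-term solution to this problem? Thank you for your effort.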

lyuwenyu commented 2 months ago

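Got it. I will find time to look into this soon. In fact, there is not much difference between v1 and v2 in the first stage of training.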

yski commented 1 month ago

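> First, try running v1 on the same data to see whether the problem also occurs.

Could you please take a look at the export problem? With v2, training and inference both work fine, but export fails with an error, using paddle3.0b1 + paddledetection develop.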

DoctorDream commented 1 month ago

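> First, try running v1 on the same data to see whether the problem also occurs.

> Could you please take a look at the export problem? With v2, training and inference both work fine, but export fails with an error, using paddle3.0b1 + paddledetection develop.

Does multi-GPU training work for you as well?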

zhang-prog commented 1 month ago

@lyuwenyu could you take a look?

yski commented 1 month ago

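> First, try running v1 on the same data to see whether the problem also occurs.

> Could you please take a look at the export problem? With v2, training and inference both work fine, but export fails with an error, using paddle3.0b1 + paddledetection develop.

> Does multi-GPU training work for you as well?

I have not tested that. I have only been using a single GPU on Windows.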

zhang-prog commented 6 days ago

This issue has had no response for a long time and will be closed. You can reopen it or open a new issue if you are still confused.