PaddlePaddle / PaddleDetection

Object Detection toolkit based on PaddlePaddle. It supports object detection, instance segmentation, multiple object tracking and real-time multi-person keypoint detection.
Apache License 2.0

[NPU] Single-machine multi-card distributed training fails with RuntimeError: (NotFound) The kernel sync_batch_norm is not registered. #8441

Open 535205856 opened 1 year ago

535205856 commented 1 year ago

Search before asking

Bug Component

Training

Describe the Bug

Multi-card training on a single machine fails when launched with this command:

python -m paddle.distributed.fleet.launch --run_mode=collective --npus="4,5,6,7" tools/train.py -c configs/yolov3/yolov3_darknet53_270e_roadsign.yml -o use_npu=True

Single-card training on the same machine works.

Multi-card training on a single machine fails with the following error:

Traceback (most recent call last):
  File "tools/train.py", line 205, in <module>
    main()
  File "tools/train.py", line 201, in main
    run(FLAGS, cfg)
  File "tools/train.py", line 151, in run
    trainer.train(FLAGS.eval)
  File "/workspace/PaddleDetection/ppdet/engine/trainer.py", line 539, in train
    outputs = model(data)
  File "/opt/py37env/lib/python3.7/site-packages/paddle/nn/layer/layers.py", line 1253, in __call__
    return self.forward(*inputs, **kwargs)
  File "/opt/py37env/lib/python3.7/site-packages/paddle/distributed/parallel.py", line 534, in forward
    outputs = self._layers(*inputs, **kwargs)
  File "/opt/py37env/lib/python3.7/site-packages/paddle/nn/layer/layers.py", line 1253, in __call__
    return self.forward(*inputs, **kwargs)
  File "/workspace/PaddleDetection/ppdet/modeling/architectures/meta_arch.py", line 60, in forward
    out = self.get_loss()
  File "/workspace/PaddleDetection/ppdet/modeling/architectures/yolo.py", line 147, in get_loss
    return self.forward()
  File "/workspace/PaddleDetection/ppdet/modeling/architectures/yolo.py", line 81, in forward
    body_feats = self.backbone(self.inputs)
  File "/opt/py37env/lib/python3.7/site-packages/paddle/nn/layer/layers.py", line 1253, in __call__
    return self.forward(*inputs, **kwargs)
  File "/workspace/PaddleDetection/ppdet/modeling/backbones/darknet.py", line 330, in forward
    out = self.conv0(x)
  File "/opt/py37env/lib/python3.7/site-packages/paddle/nn/layer/layers.py", line 1253, in __call__
    return self.forward(*inputs, **kwargs)
  File "/workspace/PaddleDetection/ppdet/modeling/backbones/darknet.py", line 77, in forward
    out = self.batch_norm(out)
  File "/opt/py37env/lib/python3.7/site-packages/paddle/nn/layer/layers.py", line 1253, in __call__
    return self.forward(*inputs, **kwargs)
  File "/opt/py37env/lib/python3.7/site-packages/paddle/nn/layer/norm.py", line 1557, in forward
    False,
RuntimeError: (NotFound) The kernel sync_batch_norm is not registered.
  [Hint: Expected iter != kernels.end(), but received iter == kernels.end().] (at /paddle/paddle/phi/core/kernel_factory.cc:219)

Environment

The host is an Ubuntu machine with Ascend 910 NPUs and a Kunpeng 920 ARM CPU. The container image is the one given in the NPU documentation: registry.baidubce.com/device/paddle-npu:cann601-ubuntu18-aarch64-gcc82

Bug description confirmation

Are you willing to submit a PR?

lyuwenyu commented 1 year ago

Try turning off sync_batch_norm. Does it run then?

535205856 commented 1 year ago

> Try turning off sync_batch_norm. Does it run then?

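The PaddleCustomDevice side answered that the NPU does not support the sync_bn operator and that it should be replaced with the regular bn operator. I am using PaddleDetection v2.6.0. The code under that tag has not been changed, but it has been changed on the release/2.6 branch. I did not expect the two to differ.

I would like to ask how far NPU support goes in the PaddlePaddle toolkits such as PaddleDetection. Which operators used by the models are supported on NPU and which are not? How do the supported operators perform? Which models are confirmed to work without problems? Is there a report or document anywhere that covers this in detail?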
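For readers hitting the same error, here is a minimal sketch of the workaround described above: replacing sync_bn with regular bn when the run targets an NPU. The helper name disable_sync_bn_for_npu, the norm_type key and the flat dict shape of the config are assumptions made for illustration. This is not the actual change made on release/2.6, which the thread does not show.

```python
import copy


# Hypothetical helper, not part of PaddleDetection: swaps the synchronized
# batch-norm setting for plain batch norm when the config targets an NPU.
def disable_sync_bn_for_npu(cfg):
    """Return a copy of cfg with norm_type 'sync_bn' replaced by 'bn' if use_npu is set."""
    patched = copy.deepcopy(cfg)
    if patched.get("use_npu", False) and patched.get("norm_type") == "sync_bn":
        patched["norm_type"] = "bn"
    return patched


# Assumed shape of the loaded config: a flat dict with a top-level norm_type key.
example_cfg = {"use_npu": True, "norm_type": "sync_bn", "architecture": "YOLOv3"}
patched_cfg = disable_sync_bn_for_npu(example_cfg)
print(patched_cfg["norm_type"])
```

If the config in use really exposes such a key, the same override could presumably be passed on the command line with -o, the way the original command passes use_npu=True. Note that plain bn computes statistics per card, so results may differ slightly from a sync_bn run with small per-card batch sizes.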

lyuwenyu commented 1 year ago

For the state of support you will have to ask PaddleCustomDevice. The toolkits are users of it as well.