[NPU] Single-machine multi-card distributed training fails with RuntimeError: (NotFound) The kernel sync_batch_norm is not registered.

535205856 commented 1 year ago

Search before asking

[X] I have searched the issues and found no similar bug report.

Bug Component

Training

Describe the Bug

Single-machine multi-card training fails when launched with:

python -m paddle.distributed.fleet.launch --run_mode=collective --npus="4,5,6,7" tools/train.py -c configs/yolov3/yolov3_darknet53_270e_roadsign.yml -o use_npu=True

Single-machine single-card training works.

The multi-card run aborts with the following error:

Traceback (most recent call last):
  File "tools/train.py", line 205, in <module>
    main()
  File "tools/train.py", line 201, in main
    run(FLAGS, cfg)
  File "tools/train.py", line 151, in run
    trainer.train(FLAGS.eval)
  File "/workspace/PaddleDetection/ppdet/engine/trainer.py", line 539, in train
    outputs = model(data)
  File "/opt/py37env/lib/python3.7/site-packages/paddle/nn/layer/layers.py", line 1253, in __call__
    return self.forward(*inputs, **kwargs)
  File "/opt/py37env/lib/python3.7/site-packages/paddle/distributed/parallel.py", line 534, in forward
    outputs = self._layers(*inputs, **kwargs)
  File "/opt/py37env/lib/python3.7/site-packages/paddle/nn/layer/layers.py", line 1253, in __call__
    return self.forward(*inputs, **kwargs)
  File "/workspace/PaddleDetection/ppdet/modeling/architectures/meta_arch.py", line 60, in forward
    out = self.get_loss()
  File "/workspace/PaddleDetection/ppdet/modeling/architectures/yolo.py", line 147, in get_loss
    return self._forward()
  File "/workspace/PaddleDetection/ppdet/modeling/architectures/yolo.py", line 81, in _forward
    body_feats = self.backbone(self.inputs)
  File "/opt/py37env/lib/python3.7/site-packages/paddle/nn/layer/layers.py", line 1253, in __call__
    return self.forward(*inputs, **kwargs)
  File "/workspace/PaddleDetection/ppdet/modeling/backbones/darknet.py", line 330, in forward
    out = self.conv0(x)
  File "/opt/py37env/lib/python3.7/site-packages/paddle/nn/layer/layers.py", line 1253, in __call__
    return self.forward(*inputs, **kwargs)
  File "/workspace/PaddleDetection/ppdet/modeling/backbones/darknet.py", line 77, in forward
    out = self.batch_norm(out)
  File "/opt/py37env/lib/python3.7/site-packages/paddle/nn/layer/layers.py", line 1253, in __call__
    return self.forward(*inputs, **kwargs)
  File "/opt/py37env/lib/python3.7/site-packages/paddle/nn/layer/norm.py", line 1557, in forward
    False,
RuntimeError: (NotFound) The kernel sync_batch_norm is not registered.
  [Hint: Expected iter != kernels.end(), but received iter == kernels.end().] (at /paddle/paddle/phi/core/kernel_factory.cc:219)

Environment

The host is an Ubuntu machine with Ascend 910 NPUs and a Kunpeng 920 ARM CPU. The Docker image is the one given in the NPU documentation: registry.baidubce.com/device/paddle-npu:cann601-ubuntu18-aarch64-gcc82

Bug description confirmation

[X] I confirm that the bug replication steps, code change instructions, and environment information have been provided, and the problem can be reproduced.

Are you willing to submit a PR?

[X] I'd like to help by submitting a PR!

lyuwenyu commented 1 year ago

Could you try turning off sync_batch_norm and see whether it runs?

535205856 commented 1 year ago

> Could you try turning off sync_batch_norm and see whether it runs?

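The PaddleCustomDevice side replied that the NPU does not support the sync_bn operator and that it has to be replaced with the regular bn operator. I am using PaddleDetection v2.6.0. That tag does not include the change, but the release/2.6 branch does; I did not expect the two to differ.

I would also like to ask how far NPU support goes in the PaddlePaddle toolkits such as PaddleDetection. Which operators used by the models are supported on NPU and which are not? How do the supported operators perform? Which models are confirmed to work without problems? Is there a detailed report or document covering this anywhere?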
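For anyone stuck on a tag without the fix, the workaround described above (fall back from sync_bn to plain bn on NPU) could be sketched roughly as follows. This is only an illustration, not the actual change in release/2.6: the helper name `disable_sync_bn_on_npu` is hypothetical, and it assumes the loaded config behaves like a dict with a top-level `norm_type` key (values `sync_bn` / `bn`) next to the `use_npu` flag from the launch command.

```python
# Hypothetical helper, not part of PaddleDetection. It assumes the config
# is a dict-like object with top-level "norm_type" and "use_npu" keys.

def disable_sync_bn_on_npu(cfg):
    """Return a copy of cfg where 'sync_bn' is replaced by 'bn' if use_npu is set."""
    patched = dict(cfg)  # shallow copy so the caller's config is left untouched
    if patched.get("use_npu") and patched.get("norm_type") == "sync_bn":
        # The NPU backend has no sync_batch_norm kernel registered,
        # so fall back to ordinary (per-card) batch norm.
        patched["norm_type"] = "bn"
    return patched


cfg = {"architecture": "YOLOv3", "norm_type": "sync_bn", "use_npu": True}
print(disable_sync_bn_on_npu(cfg)["norm_type"])  # prints: bn
```

Note that with plain bn each card normalizes over its own mini-batch only, so results with small per-card batch sizes may differ from a sync_bn run.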

lyuwenyu commented 1 year ago

For the support status you will have to ask PaddleCustomDevice; the toolkits are only users of it as well.

PaddlePaddle / PaddleDetection