Problem installing Paddle on a Huawei NPU card

Baymax0525 commented 1 year ago

Issue Description

I am using the Docker image tagged latest-dev-cann5.0.2.alpha005-gcc82-x86_64 from https://hub.docker.com/r/paddlepaddle/paddle/tags?page=1&name=cann. However, the official site only documents the build procedure for the ARM architecture: https://www.paddlepaddle.org.cn/documentation/docs/zh/guides/hardware_support/npu_docs/paddle_install_cn.html

My machine is x86, so I removed -DWITH_ARM=ON from the build options. The full build command is:

cmake .. -DPY_VERSION=3.7 -DWITH_ASCEND=OFF -DWITH_ASCEND_CL=ON -DWITH_ASCEND_INT64=ON -DWITH_DISTRIBUTE=ON -DWITH_TESTING=ON -DON_INFER=ON -DCMAKE_BUILD_TYPE=Release -DCMAKE_EXPORT_COMPILE_COMMANDS=ON

The build produced the wheel paddlepaddle_npu-0.0.0-cp37-cp37m-linux_x86_64.whl. After installing it with pip, import paddle prints the following message:

UserWarning: We will fallback into legacy dygraph on NPU/XPU/MLU/IPU/ROCM devices. Because we only support new eager dygraph mode on CPU/GPU currently.

At https://gitee.com/ascend/modelzoo/issues/I571T4#note_13626125 I saw that someone managed to build and install it, but I have not succeeded. Could anyone help take a look? Thanks.

Version & Environment Information

Installed from the Docker image
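Editor's sketch: the build options quoted in the issue description can be assembled programmatically so that the ARM switch is added only on ARM hosts. The flag values are copied from the command in the post; the helper name `cmake_command` and the rule "add `-DWITH_ARM=ON` only on aarch64/arm64" are assumptions made for illustration, not something the Paddle build system provides.

```python
import platform

# Flags copied from the build command quoted in the issue description.
BASE_FLAGS = {
    "PY_VERSION": "3.7",
    "WITH_ASCEND": "OFF",
    "WITH_ASCEND_CL": "ON",
    "WITH_ASCEND_INT64": "ON",
    "WITH_DISTRIBUTE": "ON",
    "WITH_TESTING": "ON",
    "ON_INFER": "ON",
    "CMAKE_BUILD_TYPE": "Release",
    "CMAKE_EXPORT_COMPILE_COMMANDS": "ON",
}


def cmake_command(machine=None):
    """Return the cmake argument list for the given machine type.

    `machine` defaults to platform.machine() of the current host.
    """
    machine = (machine or platform.machine()).lower()
    flags = dict(BASE_FLAGS)
    # Assumption: WITH_ARM is only wanted on ARM hosts, as in the official guide.
    if machine in ("aarch64", "arm64"):
        flags["WITH_ARM"] = "ON"
    return ["cmake", ".."] + [f"-D{key}={value}" for key, value in flags.items()]


if __name__ == "__main__":
    # Print the command for the current host instead of running it.
    print(" ".join(cmake_command()))
```

On an x86_64 host this prints the same command as in the post; the list form can be handed to `subprocess.run` from the build directory if desired.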

paddle-bot[bot] commented 1 year ago

Hi! We've received your issue; please be patient while we arrange for a technician to respond as soon as possible. Please double-check that you have provided a clear problem description, reproduction code, environment and version details, and error messages. You may also look for answers in the official API docs, the FAQ, past GitHub issues, and the AI community. Have a nice day!

sljlp commented 1 year ago

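This warning can be ignored on NPU. It says that eager mode is not supported for now, but that generally does not affect running programs. What specific error did you run into?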

Baymax0525 commented 1 year ago

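> This warning can be ignored on NPU. It says that eager mode is not supported for now, but that generally does not affect running programs. What specific error did you run into?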

/opt/conda/lib/python3.7/site-packages/paddle/fluid/framework.py:189: UserWarning: We will fallback into legacy dygraph on NPU/XPU/MLU/IPU/ROCM devices. Because we only support new eager dygraph mode on CPU/GPU currently.
  "We will fallback into legacy dygraph on NPU/XPU/MLU/IPU/ROCM devices. Because we only support new eager dygraph mode on CPU/GPU currently. "
Running verify PaddlePaddle program ...
I1014 09:06:06.187234 54891 interpretercore.cc:235] New Executor is Running.
Traceback (most recent call last):
File "<string>", line 1, in <module>
File "/opt/conda/lib/python3.7/site-packages/paddle/utils/install_check.py", line 266, in run_check
  _run_static_single(use_cuda, use_xpu, use_npu)
File "/opt/conda/lib/python3.7/site-packages/paddle/utils/install_check.py", line 170, in _run_static_single
  exe.run(startup_prog)
File "/opt/conda/lib/python3.7/site-packages/paddle/fluid/executor.py", line 1461, in run
  six.reraise(*sys.exc_info())
File "/opt/conda/lib/python3.7/site-packages/six.py", line 703, in reraise
  raise value
File "/opt/conda/lib/python3.7/site-packages/paddle/fluid/executor.py", line 1457, in run
  return_merged=return_merged)
File "/opt/conda/lib/python3.7/site-packages/paddle/fluid/executor.py", line 1666, in _run_impl
  return_numpy)
File "/opt/conda/lib/python3.7/site-packages/paddle/fluid/executor.py", line 630, in run
  fetch_list)._move_to_list()
MemoryError: In user code:

File "<string>", line 1, in <module>

File "/opt/conda/lib/python3.7/site-packages/paddle/utils/install_check.py", line 266, in run_check
  _run_static_single(use_cuda, use_xpu, use_npu)
File "/opt/conda/lib/python3.7/site-packages/paddle/utils/install_check.py", line 156, in _run_static_single
  input, out, weight = _simple_network()
File "/opt/conda/lib/python3.7/site-packages/paddle/utils/install_check.py", line 33, in _simple_network
  attr=paddle.ParamAttr(initializer=paddle.nn.initializer.Constant(0.1)))
File "/opt/conda/lib/python3.7/site-packages/paddle/fluid/layers/tensor.py", line 151, in create_parameter
  default_initializer)
File "/opt/conda/lib/python3.7/site-packages/paddle/fluid/layer_helper_base.py", line 383, in create_parameter
  **attr._to_kwargs(with_initializer=True))
File "/opt/conda/lib/python3.7/site-packages/paddle/fluid/framework.py", line 3790, in create_parameter
  initializer(param, self)
File "/opt/conda/lib/python3.7/site-packages/paddle/fluid/initializer.py", line 54, in __call__
  return self.forward(param, block)
File "/opt/conda/lib/python3.7/site-packages/paddle/fluid/initializer.py", line 191, in forward
  stop_gradient=True)
File "/opt/conda/lib/python3.7/site-packages/paddle/fluid/framework.py", line 3840, in append_op
  attrs=kwargs.get("attrs", None))
File "/opt/conda/lib/python3.7/site-packages/paddle/fluid/framework.py", line 2748, in __init__
  for frame in traceback.extract_stack():

ResourceExhaustedError: Not enough available NPU memory.
  [Hint: Expected available_to_alloc > 0, but received available_to_alloc:0 <= 0:0.] (at /home/longiuser/workspace/quanjia/Paddle/paddle/fluid/platform/device/npu/npu_info.cc:160)
  [operator < fill_constant > error]

I checked npu-smi info and the cards look healthy:

+------------------------------------------------------------------------------+
| npu-smi 21.0.2                       Version: 21.0.2                         |
+-------------------+-----------------+----------------------------------------+
| NPU     Name      | Health          | Power(W)          Temp(C)              |
| Chip    Device    | Bus-Id          | AICore(%)         Memory-Usage(MB)     |
+===================+=================+========================================+
| 976     310       | OK              | 12.8              59                   |
| 0       0         | 0000:3D:00.0    | 0                 2703 / 8192          |
+===================+=================+========================================+
| 992     310       | OK              | 12.8              60                   |
| 0       1         | 0000:3E:00.0    | 0                 2703 / 8192          |
+===================+=================+========================================+
| 1008    310       | OK              | 12.8              62                   |
| 0       2         | 0000:3F:00.0    | 0                 2703 / 8192          |
+===================+=================+========================================+
| 1024    310       | OK              | 12.8              60                   |
| 0       3         | 0000:40:00.0    | 0                 2703 / 8192          |
+===================+=================+========================================+
| 2176    310       | OK              | 12.8              63                   |
| 0       4         | 0000:88:00.0    | 0                 2703 / 8192          |
+===================+=================+========================================+
| 2192    310       | OK              | 12.8              63                   |
| 0       5         | 0000:89:00.0    | 0                 2703 / 8192          |
+===================+=================+========================================+
| 2208    310       | OK              | 12.8              62                   |
| 0       6         | 0000:8A:00.0    | 0                 2703 / 8192          |
+===================+=================+========================================+
| 2224    310       | OK              | 12.8              63                   |
| 0       7         | 0000:8B:00.0    | 0                 2703 / 8192          |
+===================+=================+========================================+

ronny1996 commented 1 year ago

Currently the NPU backend supports the Ascend 910; the Ascend 310 has not been adapted yet.
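Editor's sketch: the chip model can be read from the `npu-smi info` output before building, which would have flagged the 310 cards earlier in this thread. The parser below is a rough illustration that assumes the table layout shown in this thread (a device row of the form `| <id> <name> | <health> | ...`). The helper names and the "name starts with 910" rule are assumptions derived from the reply above, not an official compatibility check.

```python
import re

# Matches a device row such as "| 976 310 | OK | 12.8 59 |".
# Bus-Id rows do not match because their second cell contains colons.
DEVICE_ROW = re.compile(r"^\|\s*(\d+)\s+([^\s|]+)\s*\|\s*(\w+)\s*\|")

# Assumption: any chip whose name starts with "910" counts as supported.
SUPPORTED_PREFIXES = ("910",)


def chip_names(npu_smi_text):
    """Return the chip name of every device row found in the text."""
    names = []
    for line in npu_smi_text.splitlines():
        match = DEVICE_ROW.match(line.strip())
        if match:
            names.append(match.group(2))
    return names


def unsupported_chips(npu_smi_text):
    """Return the chip names that do not look like an Ascend 910."""
    return [n for n in chip_names(npu_smi_text) if not n.startswith(SUPPORTED_PREFIXES)]


# Two devices copied from the output posted earlier in this thread.
SAMPLE_310 = """
| NPU Name | Health | Power(W) Temp(C) |
| Chip Device | Bus-Id | AICore(%) Memory-Usage(MB) |
| 976 310 | OK | 12.8 59 |
| 0 0 | 0000:3D:00.0 | 0 2703 / 8192 |
| 992 310 | OK | 12.8 60 |
| 0 1 | 0000:3E:00.0 | 0 2703 / 8192 |
"""

if __name__ == "__main__":
    print("chips found:", chip_names(SAMPLE_310))
    print("not supported by the Paddle NPU backend:", unsupported_chips(SAMPLE_310))
```

In practice the text would come from running `npu-smi info` on the host rather than from a pasted sample.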

Baymax0525 commented 1 year ago

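> Currently the NPU backend supports the Ascend 910; the Ascend 310 has not been adapted yet.

I see, thank you. I have only just started working with NPUs and can find very little material. Do you have any resources you could share? Thanks!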

Baymax0525 commented 1 year ago

I installed Paddle successfully on an Ascend 910, but it reports that dynamic graph mode is not supported (only support new eager dygraph mode on CPU/GPU).

(base) λ ai-training /PaddleSeg {release/2.6}  python -c "import paddle; paddle.utils.run_check()"

grep: warning: GREP_OPTIONS is deprecated; please use an alias or script
/opt/conda/lib/python3.7/site-packages/paddle/fluid/framework.py:189: UserWarning: We will fallback into legacy dygraph on NPU/XPU/MLU/IPU/ROCM devices. Because we only support new eager dygraph mode on CPU/GPU currently. 
  "We will fallback into legacy dygraph on NPU/XPU/MLU/IPU/ROCM devices. Because we only support new eager dygraph mode on CPU/GPU currently. "
Running verify PaddlePaddle program ... 
I1017 11:12:41.262614 57120 interpretercore.cc:235] New Executor is Running.
I1017 11:12:48.488742 57120 interpretercore_util.cc:430] Standalone Executor is Used.
PaddlePaddle works well on 1 NPU.
PaddlePaddle works well on 1 NPUs.
PaddlePaddle is installed successfully! Let's start deep learning with PaddlePaddle now.

sljlp commented 1 year ago

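The new dynamic graph mode is not yet supported on NPU, but you can still train with Paddle normally using the legacy dynamic graph. Have you run into any specific error?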

Baymax0525 commented 1 year ago

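> Eager mode is not yet supported on NPU, but you can still train with Paddle normally. Have you run into any specific error?

There is no error, but training takes just as long as on the CPU. npu-smi info reports the following card usage: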

+------------------------------------------------------------------------------------+
| npu-smi 1.8.20                   Version: 20.2.2                                   |
+----------------------+---------------+---------------------------------------------+
| NPU   Name           | Health        | Power(W)   Temp(C)                          |
| Chip                 | Bus-Id        | AICore(%)  Memory-Usage(MB)  HBM-Usage(MB)  |
+======================+===============+=============================================+
| 1     910B           | OK            | 82.7       79                               |
| 0                    | 0000:3B:00.0  | 0          2170 / 15505      0    / 32255   |
+======================+===============+=============================================+

sljlp commented 1 year ago

Call paddle.device.set_device("npu") at the start of the program.
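Editor's sketch: the advice above can be illustrated with a small script that selects the device before any tensor is created. `paddle.device.set_device` is the call named in the reply; the helper `resolve_device`, the CPU fallback, and the use of `paddle.is_compiled_with_npu` (assumed to be available in an NPU build such as the Paddle 2.3 build discussed here) are illustrative assumptions.

```python
def resolve_device(preferred, npu_available):
    """Pick the device string to pass to set_device (pure helper, no paddle needed)."""
    if preferred == "npu" and not npu_available:
        return "cpu"  # assumption: fall back to CPU rather than fail
    return preferred


def main():
    try:
        import paddle
    except ImportError:
        print("paddle is not installed; nothing to demonstrate")
        return
    # Assumption: is_compiled_with_npu exists in this build; treat it as False if missing.
    has_npu = getattr(paddle, "is_compiled_with_npu", lambda: False)()
    device = resolve_device("npu", has_npu)
    # Must run before tensors or layers are created so they land on this device.
    paddle.device.set_device(device)
    x = paddle.ones([2, 2])
    print("requested:", device, "| tensor placed on:", x.place)


if __name__ == "__main__":
    main()
```

Printing `x.place` gives a quick confirmation of where tensors actually end up, independent of what `npu-smi info` shows.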

Baymax0525 commented 1 year ago

> Call paddle.device.set_device("npu") at the start of the program.

I tested with PaddleSeg and added --device=npu. The card is indeed being used, but the training time still has not decreased. I built and installed Paddle 2.3; could this be related to the version?

python train.py --config configs/quick_start/pp_liteseg_optic_disc_512x512.yml --device=npu --iters 6000 --do_eval --save_interval 20 --save_dir output/pp_liteseg_optic_disc

+------------------------------------------------------------------------------------+
| npu-smi 1.8.20                   Version: 20.2.2                                   |
+----------------------+---------------+---------------------------------------------+
| NPU   Name           | Health        | Power(W)   Temp(C)                          |
| Chip                 | Bus-Id        | AICore(%)  Memory-Usage(MB)  HBM-Usage(MB)  |
+======================+===============+=============================================+
| 1     910B           | OK            | 83.3       80                               |
| 0                    | 0000:3B:00.0  | 3          2325 / 15505      28384/ 32255   |
+======================+===============+=============================================+

qili93 commented 1 year ago

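Hi, @Baymax0525

The CANN operator library lacks many of the operators that PaddleSeg models need, so a fair number of operators in PaddleSeg-style models do not yet have an NPU implementation. These models do run functionally, but the missing operators fall back to the CPU by default, which makes actual model performance very poor, essentially close to CPU-level performance.

We will try to contact Huawei CANN to have these missing operators developed so that PaddleSeg's gaps are filled. In the meantime, we have upgraded to the latest CANN version, CANN 512. You can try the image registry.baidubce.com/device/paddle-npu:cann512-x86_64-gcc82. We will also publish PaddlePaddle's latest models and images for Ascend on the PaddlePaddle official website.

Thanks!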
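Editor's sketch: the fallback behaviour described above can be modelled with a toy cost estimate, which shows why a few missing NPU kernels can pull the whole run toward CPU speed. Everything here is hypothetical: the op names, the set of ops with NPU kernels, the cost numbers, and the per-switch transfer cost are made up for illustration and do not reflect Paddle's real dispatcher or any measured timings.

```python
def split_by_backend(model_ops, npu_kernels):
    """Partition op names into those with an NPU kernel and those that fall back to CPU."""
    on_npu = [op for op in model_ops if op in npu_kernels]
    on_cpu = [op for op in model_ops if op not in npu_kernels]
    return on_npu, on_cpu


def estimated_time(model_ops, npu_kernels, cpu_cost, npu_cost, transfer_cost):
    """Sum a per-op cost and charge a transfer each time execution switches device."""
    total = 0.0
    previous = None
    for op in model_ops:
        where = "npu" if op in npu_kernels else "cpu"
        total += npu_cost if where == "npu" else cpu_cost
        if previous is not None and where != previous:
            total += transfer_cost  # assumption: data moves between host and device
        previous = where
    return total


# Hypothetical model: op_c and op_d have no NPU kernel.
OPS = ["op_a", "op_b", "op_c", "op_a", "op_d"]
NPU_KERNELS = {"op_a", "op_b"}

if __name__ == "__main__":
    on_npu, on_cpu = split_by_backend(OPS, NPU_KERNELS)
    print("runs on NPU:", on_npu)
    print("falls back to CPU:", on_cpu)
    print("mixed run:", estimated_time(OPS, NPU_KERNELS, cpu_cost=10, npu_cost=1, transfer_cost=2))
    print("all on NPU:", estimated_time(OPS, set(OPS), cpu_cost=10, npu_cost=1, transfer_cost=2))
    print("all on CPU:", estimated_time(OPS, set(), cpu_cost=10, npu_cost=1, transfer_cost=2))
```

With these made-up numbers, two missing kernels out of five ops bring the estimate much closer to the all-CPU figure than to the all-NPU one, which matches the qualitative point of the reply.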

Baymax0525 commented 1 year ago

PaddlePaddle / Paddle

Problem installing Paddle on a Huawei NPU card #47007
