PaddlePaddle / Paddle

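PArallel Distributed Deep LEarning: Machine Learning Framework from Industrial Practice (PaddlePaddle (飞桨) core framework: high-performance single-machine and distributed training, plus cross-platform deployment, for deep learning and machine learning)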
http://www.paddlepaddle.org/
Apache License 2.0

Error when compiling PaddlePaddle in a Phytium + Kylin V10 + Kunlun R200 environment #68926

Open UnlimitedWand opened 1 month ago

UnlimitedWand commented 1 month ago

Issue Description

Environment: the officially provided Kylin V10 arm64 Docker image registry.baidubce.com/device/paddle-xpu:kylinv10-aarch64-gcc82-py310
Version being built: v3.0.0-beta1

1. cmake:

cmake .. -DCMAKE_EXPORT_COMPILE_COMMANDS=ON -DCMAKE_CXX_FLAGS="-Wno-error -w" -DPY_VERSION=3.10 -DPYTHON_EXECUTABLE=`which python3` -DWITH_CUSTOM_DEVICE=OFF -DWITH_TESTING=OFF -DON_INFER=ON -DWITH_DISTRIBUTE=ON -DWITH_ARM=ON -DWITH_XPU=ON -DWITH_XPU_BKCL=ON -DWITH_AARCH64=ON -DWITH_XPU_XFT=ON -DWITH_XPU_XHPC=ON

2. make:

make TARGET=ARMV8 -j16

3. After running the steps above, the build fails with the following error:

/work/Paddle/paddle/phi/backends/xpu/xpu_context.cc: In member function ‘void phi::XPUContext::Impl::Init(int64_t, int64_t)’:
/work/Paddle/paddle/phi/backends/xpu/xpu_context.cc:196:64: error: no matching function for call to ‘baidu::xpu::api::Context::set_overload_alloc(phi::XPUContext::Impl::Init(int64_t, int64_t)::<lambda(size_t)>&, phi::XPUContext::Impl::Init(int64_t, int64_t)::<lambda()>&, phi::XPUContext::Impl::Init(int64_t, int64_t)::<lambda()>&)’
     overload_alloc_fn, overload_free_fn, overload_save_fn);
                                                          ^
In file included from /work/Paddle/build/third_party/install/xpu/include/xpu/xdnn_types.h:14,
                 from /work/Paddle/build/third_party/install/xpu/include/xpu/xdnn.h:5,
                 from /work/Paddle/paddle/phi/backends/xpu/xpu_header.h:29,
                 from /work/Paddle/paddle/phi/backends/xpu/xpu_context.h:23,
                 from /work/Paddle/paddle/phi/backends/xpu/xpu_context.cc:15:
/work/Paddle/build/third_party/install/xpu/include/xpu/refactor/context/newcontext.h:94:10: note: candidate: ‘void baidu::xpu::api::Context::set_overload_alloc(void* (*)(size_t), void (*)())’
     void set_overload_alloc(void* (*overload_func)(size_t cnt), void (*overload_free)()) {
          ^~~~~~~~~~~~~~~~~~
/work/Paddle/build/third_party/install/xpu/include/xpu/refactor/context/newcontext.h:94:10: note: candidate expects 2 arguments, 3 provided
make[2]: *** [paddle/phi/CMakeFiles/phi.dir/build.make:1574: paddle/phi/CMakeFiles/phi.dir/backends/xpu/xpu_context.cc.o] Error 1
make[2]: *** Waiting for unfinished jobs....
[ 15%] Building CXX object paddle/phi/CMakeFiles/phi.dir/kernels/funcs/math/tree2col.cc.o
^Cmake[2]: *** Deleting file 'paddle/phi/CMakeFiles/phi.dir/kernels/funcs/eigen/slice.cc.o'
make[2]: *** [paddle/phi/CMakeFiles/phi.dir/build.make:1840: paddle/phi/CMakeFiles/phi.dir/kernels/funcs/eigen/pad.cc.o] Interrupt
make[2]: *** [paddle/phi/CMakeFiles/phi.dir/build.make:1896: paddle/phi/CMakeFiles/phi.dir/kernels/funcs/eigen/slice.cc.o] Interrupt
make[2]: *** [paddle/phi/CMakeFiles/phi.dir/build.make:1854: paddle/phi/CMakeFiles/phi.dir/kernels/funcs/eigen/reverse.cc.o] Interrupt
make[2]: *** [paddle/phi/CMakeFiles/phi.dir/build.make:1756: paddle/phi/CMakeFiles/phi.dir/kernels/funcs/eigen/broadcast.cc.o] Interrupt
make[1]: *** [CMakeFiles/Makefile2:4912: paddle/phi/CMakeFiles/phi.dir/all] Interrupt
make: *** [Makefile:136: all] Interrupt
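
The compiler notes say that the installed XPU SDK header declares set_overload_alloc with two parameters while xpu_context.cc passes three, which suggests the downloaded SDK and the Paddle source come from different versions. A minimal sketch for confirming what the installed header actually declares; the helper name show_decl is hypothetical, and the path in the usage comment is the one printed in the log above.

```shell
#!/bin/sh
# Hypothetical helper: print every line of a header that mentions a symbol,
# prefixed with its line number.
show_decl() {
  header="$1"
  symbol="$2"
  if [ ! -f "$header" ]; then
    echo "header not found: $header" >&2
    return 2
  fi
  # -n prints line numbers, -F treats the symbol as a fixed string.
  grep -n -F -- "$symbol" "$header"
}

# Usage (path copied from the error log; adjust to your build directory):
# show_decl /work/Paddle/build/third_party/install/xpu/include/xpu/refactor/context/newcontext.h set_overload_alloc
```

If the printed declaration takes two function pointers, the header is older than what the three-argument call in xpu_context.cc expects.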

How can this problem be resolved? Thank you to everyone for taking the time to look at this issue.

Version & Environment Information

OS: kylin V10
GCC version: (GCC) 8.2.0
Clang version: N/A
CMake version: version 3.27.7
Libc version: glibc 2.28
Python version: 3.10.13

zhangting2020 commented 1 month ago

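Hello, I checked with the relevant colleagues and none of them have run into this problem. Also, because this is a domestic (Chinese) hardware and OS environment, we currently have no device on which to reproduce it. For now, please try to resolve it based on the error message.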

UnlimitedWand commented 1 month ago

OK, thanks for the reply!

dynamicheart commented 3 weeks ago

Try changing the date at

https://github.com/PaddlePaddle/Paddle/blob/v3.0.0-beta1/cmake/external/xpu.cmake#L32

to 20240523, then rebuild.

Alternatively, build with v3.0.0-beta2.
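
The suggested edit can be scripted. This is a minimal sketch, assuming that line 32 of cmake/external/xpu.cmake holds an 8-digit date string (the file contents are not reproduced in this thread, so inspect the line before relying on it). The helper name set_date_on_line is hypothetical.

```shell
#!/bin/sh
# Hypothetical helper: replace the first 8-digit number on one line of a file.
# A backup of the original file is kept next to it with a .bak suffix.
set_date_on_line() {
  file="$1"
  line="$2"
  new_date="$3"
  # Print the line before and after the edit so the change can be reviewed.
  sed -n "${line}p" "$file"
  sed -i.bak -E "${line}s/[0-9]{8}/${new_date}/" "$file"
  sed -n "${line}p" "$file"
}

# Usage, from the root of a v3.0.0-beta1 checkout:
# set_date_on_line cmake/external/xpu.cmake 32 20240523
```

Only the addressed line is touched; other dates in the file stay as they are.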

UnlimitedWand commented 3 weeks ago

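Thanks for the reply. With v3.0.0-beta2 the function error above no longer appears, but during the build the links assembled from these variables are incorrect, and file downloads fail. The first problem was a wrong link for the aarch64 xre package; after I changed it to xre-kylin_v10_server_aarch64.tar.gz it downloaded successfully. Other files still fail to download, though: the variables concatenated into their links are empty, so it looks as if aarch64 has not been adapted.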
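
An empty variable spliced into a download link often leaves an empty path segment behind (two consecutive slashes after the host). The following is only a heuristic sketch for spotting such links in the build output; has_empty_segment is a hypothetical helper and the example URLs are placeholders, not links from this build.

```shell
#!/bin/sh
# Hypothetical helper: succeed (exit 0) if the URL contains an empty path
# segment, i.e. "//" anywhere after the scheme separator.
has_empty_segment() {
  url="$1"
  rest="${url#*://}"   # drop "https://" or similar, if present
  case "$rest" in
    *//*) return 0 ;;
    *)    return 1 ;;
  esac
}

# Usage with placeholder URLs:
# has_empty_segment "https://example.com/sdk//pkg.tar.gz" && echo "suspicious link"
```

It does not catch an empty variable inside a file name, so a clean result is not proof that the link is correct.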

dynamicheart commented 3 weeks ago

Does the first approach work? That is, build from v3.0.0-beta1 with the following change applied:

change the date at https://github.com/PaddlePaddle/Paddle/blob/v3.0.0-beta1/cmake/external/xpu.cmake#L32 to 20240523, then rebuild.

dynamicheart commented 3 weeks ago

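I checked v3.0.0-beta2. The address https://klx-sdk-release-public.su.bcebos.com/xhpc/eb35/20240927/ currently has no Kylin artifacts, so I suggest rolling back to v3.0.0-beta1 for now and trying again after modifying the download link.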

UnlimitedWand commented 3 weeks ago

The first approach fails with:

In file included from /work/Paddle/paddle/phi/kernels/fusion/xpu/conv2d_xpu_kernel.cc:20:
/work/Paddle/paddle/phi/kernels/xpu/xpu_api_wrapper.h:25:10: fatal error: xblas/cublasLt.h: No such file or directory
 #include "xblas/cublasLt.h"
          ^~~~~~~~~~~~~~~~~~
compilation terminated.
make[2]: *** [paddle/phi/CMakeFiles/phi.dir/build.make:2666: paddle/phi/CMakeFiles/phi.dir/kernels/fusion/xpu/conv2d_xpu_kernel.cc.o] Error 1
make[2]: *** Waiting for unfinished jobs....
make[1]: *** [CMakeFiles/Makefile2:4912: paddle/phi/CMakeFiles/phi.dir/all] Error 2
make: *** [Makefile:136: all] Error 2
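
The fatal error means the compiler could not find xblas/cublasLt.h on its include path. A minimal sketch for checking whether the downloaded XPU SDK contains that header at all; find_header is a hypothetical helper, and the install root in the usage comment is inferred from the include paths in the first error log.

```shell
#!/bin/sh
# Hypothetical helper: list files under a root directory whose path ends
# with the given relative header path.
find_header() {
  root="$1"
  rel="$2"
  # -path matches the whole path, so "*/xblas/cublasLt.h" matches at any depth.
  find "$root" -type f -path "*/$rel" 2>/dev/null
}

# Usage (install root inferred from the first error log):
# find_header /work/Paddle/build/third_party/install/xpu xblas/cublasLt.h
```

Empty output means the header is not part of the installed SDK package.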

dynamicheart commented 3 weeks ago

OK, noted. We will look into this and fix the related issues as soon as possible.

dynamicheart commented 3 weeks ago

Could you try again with -DWITH_XPU_XHPC=OFF and rebuild?

dynamicheart commented 3 weeks ago

Otherwise the only option is to use version 2.6; the latest version no longer supports Kylin V10 + R200.