PaddlePaddle / Paddle

PArallel Distributed Deep LEarning: Machine Learning Framework from Industrial Practice (PaddlePaddle core framework: high-performance single-machine and distributed training and cross-platform deployment for deep learning & machine learning)
http://www.paddlepaddle.org/
Apache License 2.0

Failed to build Release/3.0-beta (a842a0f) #66033

Closed: leo0519 closed this issue 3 weeks ago

leo0519 commented 1 month ago

Describe the Bug

Paddle fails to build on the release/3.0-beta branch. According to the error message below, the flash-attention version appears to be wrong, since the argument types at the call site do not match the library's declarations.
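To see the mismatch directly, one can compare the declaration the compiler picked up from the bundled flash-attention headers with the call site reported in the log (line 124 of flash_attn_kernel.cu). A minimal diagnostic sketch follows; the include path is taken from the -I flags in the log, while the header name flash_attn.h and the cmake/external/ location are assumptions that may differ in the actual tree.

# Diagnostic sketch (not part of the original report) -- run from the Paddle
# source root. Header name and cmake paths are assumptions; adjust as needed.
grep -n -A 30 "flash_attn_varlen_fwd" \
    build/third_party/install/flashattn/include/flash_attn.h

# Print the failing call site (the error below points at line 124).
sed -n '110,170p' paddle/phi/kernels/gpu/flash_attn_kernel.cu

# Check which flash-attention commit the build system pins for this branch.
grep -rn -i -E "tag|commit" cmake/external/ | grep -i flashattn || true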

[43/2096] Building CUDA object paddle/phi/CMakeFiles/phi_kernel_gpu.dir/kernels/gpu/flash_attn_kernel.cu.o
FAILED: paddle/phi/CMakeFiles/phi_kernel_gpu.dir/kernels/gpu/flash_attn_kernel.cu.o 
/usr/local/bin/ccache /usr/local/cuda/bin/nvcc -forward-unknown-to-host-compiler -DBRPC_WITH_GLOG -DCUDA_TOOLKIT_ROOT_DIR=\"/usr/local/cuda\" -DCUDA_VERSION_MAJOR=\"12\" -DCUDA_VERSION_MINOR=\"5\" -DCUDNN_MAJOR_VERSION=\"9\" -DEIGEN_USE_GPU -DPADDLE_DISABLE_PROFILER -DPADDLE_DLL_EXPORT -DPADDLE_ON_INFERENCE -DPADDLE_USE_OPENBLAS -DPADDLE_USE_PTHREAD_BARRIER -DPADDLE_USE_PTHREAD_SPINLOCK -DPADDLE_VERSION=2.6.1 -DPADDLE_VERSION_INTEGER=2006001 -DPADDLE_WITH_AVX -DPADDLE_WITH_CCCL -DPADDLE_WITH_CRYPTO -DPADDLE_WITH_CUDA -DPADDLE_WITH_CUDNN_FRONTEND -DPADDLE_WITH_CUPTI -DPADDLE_WITH_CUTLASS -DPADDLE_WITH_DGC -DPADDLE_WITH_DISTRIBUTE -DPADDLE_WITH_FLASHATTN -DPADDLE_WITH_GLOO -DPADDLE_WITH_INFERENCE_API_TEST -DPADDLE_WITH_MEMORY_EFFICIENT_ATTENTION -DPADDLE_WITH_NCCL -DPADDLE_WITH_POCKETFFT -DPADDLE_WITH_PSCORE -DPADDLE_WITH_RPC -DPADDLE_WITH_SSE3 -DPADDLE_WITH_TENSORRT -DPADDLE_WITH_TESTING -DPADDLE_WITH_XBYAK -DPHI_SHARED -DSPCONV_WITH_CUTLASS=0 -DSTATIC_IR -DTRT_PLUGIN_FP16_AVALIABLE -DXBYAK64 -DXBYAK_NO_OP_NAMES -Dphi_kernel_gpu_EXPORTS -I/home/scratch.ylichen_sw/paddle-gitlab/third_party/cccl/thrust -I/home/scratch.ylichen_sw/paddle-gitlab/third_party/cccl/libcudacxx/include -I/home/scratch.ylichen_sw/paddle-gitlab/third_party/cccl/cub -I/home/scratch.ylichen_sw/paddle-gitlab/build -I/home/scratch.ylichen_sw/paddle-gitlab/paddle/fluid/framework/io -I/home/scratch.ylichen_sw/paddle-gitlab/build/third_party/install/zlib/include -I/home/scratch.ylichen_sw/paddle-gitlab/build/third_party/install -I/home/scratch.ylichen_sw/paddle-gitlab/build/third_party/install/gflags/include -I/home/scratch.ylichen_sw/paddle-gitlab/build/third_party/install/glog/include -I/home/scratch.ylichen_sw/paddle-gitlab/third_party/cutlass -I/home/scratch.ylichen_sw/paddle-gitlab/third_party/cutlass/include -I/home/scratch.ylichen_sw/paddle-gitlab/third_party/cutlass/tools/util/include -I/home/scratch.ylichen_sw/paddle-gitlab/third_party/eigen3 -I/home/scratch.ylichen_sw/paddle-gitlab/third_party/threadpool -I/home/scratch.ylichen_sw/paddle-gitlab/third_party/dlpack/include -I/home/scratch.ylichen_sw/paddle-gitlab/build/third_party/install/xxhash/include -I/home/scratch.ylichen_sw/paddle-gitlab/build/third_party/install/warpctc/include -I/home/scratch.ylichen_sw/paddle-gitlab/build/third_party/install/warprnnt/include -I/home/scratch.ylichen_sw/paddle-gitlab/build/third_party/install/utf8proc/include -I/home/scratch.ylichen_sw/paddle-gitlab/build/third_party/install/openblas/include -I/home/scratch.ylichen_sw/paddle-gitlab/build/third_party/install/protobuf/include -I/home/scratch.ylichen_sw/paddle-gitlab/build/third_party/nlohmann_json/include -I/usr/include/python3.10 -I/usr/local/lib/python3.10/dist-packages/numpy/core/include -I/home/scratch.ylichen_sw/paddle-gitlab/build/third_party/pybind/src/extern_pybind/include -I/home/scratch.ylichen_sw/paddle-gitlab/build/third_party/install/gtest/include -I/home/scratch.ylichen_sw/paddle-gitlab/third_party/cccl -I/home/scratch.ylichen_sw/paddle-gitlab/build/third_party/install/gloo/include -I/home/scratch.ylichen_sw/paddle-gitlab/build/third_party/install/snappy/include -I/home/scratch.ylichen_sw/paddle-gitlab/build/third_party/install/leveldb/include -I/home/scratch.ylichen_sw/paddle-gitlab/build/third_party/install/brpc/include -I/home/scratch.ylichen_sw/paddle-gitlab/build/third_party/install/libmct/include -I/home/scratch.ylichen_sw/paddle-gitlab/build/third_party/install/rocksdb/include -I/home/scratch.ylichen_sw/paddle-gitlab/build/third_party/install/jemalloc/include 
-I/home/scratch.ylichen_sw/paddle-gitlab/build/third_party/install/xbyak/include -I/home/scratch.ylichen_sw/paddle-gitlab/build/third_party/install/xbyak/include/xbyak -I/home/scratch.ylichen_sw/paddle-gitlab/build/third_party/install/dgc/include -I/home/scratch.ylichen_sw/paddle-gitlab/build/third_party/install/cryptopp/include -I/home/scratch.ylichen_sw/paddle-gitlab/build/third_party/pocketfft/src -I/home/scratch.ylichen_sw/paddle-gitlab/build/third_party/install/flashattn/include -I/home/scratch.ylichen_sw/paddle-gitlab/build/third_party/cudnn-frontend/src/extern_cudnn_frontend/include -I/usr/local/cuda/targets/x86_64-linux/include -I/usr/local/cuda/include -I/home/scratch.ylichen_sw/paddle-gitlab -I/home/scratch.ylichen_sw/paddle-gitlab/build/../paddle/fluid/framework/io --cudart shared -D_MWAITXINTRIN_H_INCLUDED -D__STRICT_ANSI__ -Wno-deprecated-gpu-targets  -gencode arch=compute_80,code=sm_80 -w --expt-relaxed-constexpr --expt-extended-lambda  -Xcompiler="-Wall" -Xcompiler="-Wextra" -Xcompiler="-Werror" -Xcompiler="-fPIC" -Xcompiler="-fno-omit-frame-pointer" -Xcompiler="-Wno-unused-parameter" -Xcompiler="-Wno-unused-function" -Xcompiler="-Wno-error=literal-suffix" -Xcompiler="-Wno-error=unused-local-typedefs" -Xcompiler="-Wno-error=unused-function" -Xcompiler="-Wno-error=array-bounds" -Xcompiler="-mavx" -O3 -DNDEBUG -std=c++17 -Xcompiler=-fPIC -MD -MT paddle/phi/CMakeFiles/phi_kernel_gpu.dir/kernels/gpu/flash_attn_kernel.cu.o -MF paddle/phi/CMakeFiles/phi_kernel_gpu.dir/kernels/gpu/flash_attn_kernel.cu.o.d -x cu -c /home/scratch.ylichen_sw/paddle-gitlab/paddle/phi/kernels/gpu/flash_attn_kernel.cu -o paddle/phi/CMakeFiles/phi_kernel_gpu.dir/kernels/gpu/flash_attn_kernel.cu.o
/home/scratch.ylichen_sw/paddle-gitlab/paddle/phi/kernels/gpu/flash_attn_kernel.cu(124): error: no instance of function template "phi::dynload::DynLoad__flash_attn_varlen_fwd::operator()" matches the argument list
            argument types are: (const void *, const void *, const void *, const int32_t *, const int32_t *, void *, void *, void *, void *, int, int64_t, int64_t, int, int, int, int, int, int, float, float, float, __nv_bool, __nv_bool, __nv_bool, cudaStream_t, uint64_t, uint64_t, const void *, int64_t *, int64_t, int64_t, int64_t, int64_t, int64_t, int64_t, int64_t, int64_t, long, long, long, long, __nv_bool)
            object type is: phi::dynload::DynLoad__flash_attn_varlen_fwd
    bool succ = phi::dynload::flash_attn_varlen_fwd(
                ^
/home/scratch.ylichen_sw/paddle-gitlab/paddle/phi/backends/dynload/flashattn.h(55): note #3327-D: candidate function template "phi::dynload::DynLoad__flash_attn_varlen_fwd::operator()" failed deduction
  struct DynLoad__flash_attn_fwd { template <typename... Args> auto operator()(Args... args) -> decltype(flash_attn_fwd(args...)) { using flashattnFunc = decltype(&::flash_attn_fwd); std::call_once(flashattn_dso_flag, []() { flashattn_dso_handle = phi::dynload::GetFlashAttnDsoHandle(); }); static void* p_flash_attn_fwd = dlsym(flashattn_dso_handle, "flash_attn_fwd"); return reinterpret_cast<flashattnFunc>(p_flash_attn_fwd)(args...); } }; extern DynLoad__flash_attn_fwd flash_attn_fwd; struct DynLoad__flash_attn_varlen_fwd { template <typename... Args> auto operator()(Args... args) -> decltype(flash_attn_varlen_fwd(args...)) { using flashattnFunc = decltype(&::flash_attn_varlen_fwd); std::call_once(flashattn_dso_flag, []() { flashattn_dso_handle = phi::dynload::GetFlashAttnDsoHandle(); }); static void* p_flash_attn_varlen_fwd = dlsym(flashattn_dso_handle, "flash_attn_varlen_fwd"); return reinterpret_cast<flashattnFunc>(p_flash_attn_varlen_fwd)(args...); } }; extern DynLoad__flash_attn_varlen_fwd flash_attn_varlen_fwd; struct DynLoad__flash_attn_bwd { template <typename... Args> auto operator()(Args... args) -> decltype(flash_attn_bwd(args...)) { using flashattnFunc = decltype(&::flash_attn_bwd); std::call_once(flashattn_dso_flag, []() { flashattn_dso_handle = phi::dynload::GetFlashAttnDsoHandle(); }); static void* p_flash_attn_bwd = dlsym(flashattn_dso_handle, "flash_attn_bwd"); return reinterpret_cast<flashattnFunc>(p_flash_attn_bwd)(args...); } }; extern DynLoad__flash_attn_bwd flash_attn_bwd; struct DynLoad__flash_attn_varlen_bwd { template <typename... Args> auto operator()(Args... args) -> decltype(flash_attn_varlen_bwd(args...)) { using flashattnFunc = decltype(&::flash_attn_varlen_bwd); std::call_once(flashattn_dso_flag, []() { flashattn_dso_handle = phi::dynload::GetFlashAttnDsoHandle(); }); static void* p_flash_attn_varlen_bwd = dlsym(flashattn_dso_handle, "flash_attn_varlen_bwd"); return reinterpret_cast<flashattnFunc>(p_flash_attn_varlen_bwd)(args...); } }; extern DynLoad__flash_attn_varlen_bwd flash_attn_varlen_bwd; struct DynLoad__flash_attn_fwd_with_bias_and_mask { template <typename... Args> auto operator()(Args... args) -> decltype(flash_attn_fwd_with_bias_and_mask(args...)) { using flashattnFunc = decltype(&::flash_attn_fwd_with_bias_and_mask); std::call_once(flashattn_dso_flag, []() { flashattn_dso_handle = phi::dynload::GetFlashAttnDsoHandle(); }); static void* p_flash_attn_fwd_with_bias_and_mask = dlsym(flashattn_dso_handle, "flash_attn_fwd_with_bias_and_mask"); return reinterpret_cast<flashattnFunc>(p_flash_attn_fwd_with_bias_and_mask)(args...); } }; extern DynLoad__flash_attn_fwd_with_bias_and_mask flash_attn_fwd_with_bias_and_mask; struct DynLoad__flash_attn_bwd_with_bias_and_mask { template <typename... Args> auto operator()(Args... args) -> decltype(flash_attn_bwd_with_bias_and_mask(args...)) { using flashattnFunc = decltype(&::flash_attn_bwd_with_bias_and_mask); std::call_once(flashattn_dso_flag, []() { flashattn_dso_handle = phi::dynload::GetFlashAttnDsoHandle(); }); static void* p_flash_attn_bwd_with_bias_and_mask = dlsym(flashattn_dso_handle, "flash_attn_bwd_with_bias_and_mask"); return reinterpret_cast<flashattnFunc>(p_flash_attn_bwd_with_bias_and_mask)(args...); } }; extern DynLoad__flash_attn_bwd_with_bias_and_mask flash_attn_bwd_with_bias_and_mask; struct DynLoad__flash_attn_error { template <typename... Args> auto operator()(Args... 
args) -> decltype(flash_attn_error(args...)) { using flashattnFunc = decltype(&::flash_attn_error); std::call_once(flashattn_dso_flag, []() { flashattn_dso_handle = phi::dynload::GetFlashAttnDsoHandle(); }); static void* p_flash_attn_error = dlsym(flashattn_dso_handle, "flash_attn_error"); return reinterpret_cast<flashattnFunc>(p_flash_attn_error)(args...); } }; extern DynLoad__flash_attn_error flash_attn_error;;
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                  ^
          detected during:
            instantiation of "void phi::FlashAttnUnpaddedBaseKernel<T,Context>(const Context &, const phi::DenseTensor &, const phi::DenseTensor &, const phi::DenseTensor &, const phi::DenseTensor &, const phi::DenseTensor &, const paddle::optional<phi::DenseTensor> &, const paddle::optional<phi::DenseTensor> &, int64_t, int64_t, float, float, __nv_bool, __nv_bool, __nv_bool, const std::string &, phi::DenseTensor *, phi::DenseTensor *, phi::DenseTensor *, phi::DenseTensor *, __nv_bool) [with T=phi::dtype::float16, Context=phi::GPUContext]" at line 216
            instantiation of "void phi::FlashAttnUnpaddedKernel<T,Context>(const Context &, const phi::DenseTensor &, const phi::DenseTensor &, const phi::DenseTensor &, const phi::DenseTensor &, const phi::DenseTensor &, const paddle::optional<phi::DenseTensor> &, const paddle::optional<phi::DenseTensor> &, int64_t, int64_t, float, float, __nv_bool, __nv_bool, __nv_bool, const std::string &, phi::DenseTensor *, phi::DenseTensor *, phi::DenseTensor *, phi::DenseTensor *) [with T=phi::dtype::float16, Context=phi::GPUContext]" at line 566

/home/scratch.ylichen_sw/paddle-gitlab/paddle/phi/kernels/gpu/flash_attn_kernel.cu(124): error: no instance of function template "phi::dynload::DynLoad__flash_attn_varlen_fwd::operator()" matches the argument list
            argument types are: (const void *, const void *, const void *, const int32_t *, const int32_t *, void *, void *, void *, void *, int, int64_t, int64_t, int, int, int, int, int, int, float, float, float, __nv_bool, __nv_bool, __nv_bool, cudaStream_t, uint64_t, uint64_t, const void *, int64_t *, int64_t, int64_t, int64_t, int64_t, int64_t, int64_t, int64_t, int64_t, long, long, long, long, __nv_bool)
            object type is: phi::dynload::DynLoad__flash_attn_varlen_fwd
    bool succ = phi::dynload::flash_attn_varlen_fwd(
                ^
/home/scratch.ylichen_sw/paddle-gitlab/paddle/phi/backends/dynload/flashattn.h(55): note #3327-D: candidate function template "phi::dynload::DynLoad__flash_attn_varlen_fwd::operator()" failed deduction
  struct DynLoad__flash_attn_fwd { template <typename... Args> auto operator()(Args... args) -> decltype(flash_attn_fwd(args...)) { using flashattnFunc = decltype(&::flash_attn_fwd); std::call_once(flashattn_dso_flag, []() { flashattn_dso_handle = phi::dynload::GetFlashAttnDsoHandle(); }); static void* p_flash_attn_fwd = dlsym(flashattn_dso_handle, "flash_attn_fwd"); return reinterpret_cast<flashattnFunc>(p_flash_attn_fwd)(args...); } }; extern DynLoad__flash_attn_fwd flash_attn_fwd; struct DynLoad__flash_attn_varlen_fwd { template <typename... Args> auto operator()(Args... args) -> decltype(flash_attn_varlen_fwd(args...)) { using flashattnFunc = decltype(&::flash_attn_varlen_fwd); std::call_once(flashattn_dso_flag, []() { flashattn_dso_handle = phi::dynload::GetFlashAttnDsoHandle(); }); static void* p_flash_attn_varlen_fwd = dlsym(flashattn_dso_handle, "flash_attn_varlen_fwd"); return reinterpret_cast<flashattnFunc>(p_flash_attn_varlen_fwd)(args...); } }; extern DynLoad__flash_attn_varlen_fwd flash_attn_varlen_fwd; struct DynLoad__flash_attn_bwd { template <typename... Args> auto operator()(Args... args) -> decltype(flash_attn_bwd(args...)) { using flashattnFunc = decltype(&::flash_attn_bwd); std::call_once(flashattn_dso_flag, []() { flashattn_dso_handle = phi::dynload::GetFlashAttnDsoHandle(); }); static void* p_flash_attn_bwd = dlsym(flashattn_dso_handle, "flash_attn_bwd"); return reinterpret_cast<flashattnFunc>(p_flash_attn_bwd)(args...); } }; extern DynLoad__flash_attn_bwd flash_attn_bwd; struct DynLoad__flash_attn_varlen_bwd { template <typename... Args> auto operator()(Args... args) -> decltype(flash_attn_varlen_bwd(args...)) { using flashattnFunc = decltype(&::flash_attn_varlen_bwd); std::call_once(flashattn_dso_flag, []() { flashattn_dso_handle = phi::dynload::GetFlashAttnDsoHandle(); }); static void* p_flash_attn_varlen_bwd = dlsym(flashattn_dso_handle, "flash_attn_varlen_bwd"); return reinterpret_cast<flashattnFunc>(p_flash_attn_varlen_bwd)(args...); } }; extern DynLoad__flash_attn_varlen_bwd flash_attn_varlen_bwd; struct DynLoad__flash_attn_fwd_with_bias_and_mask { template <typename... Args> auto operator()(Args... args) -> decltype(flash_attn_fwd_with_bias_and_mask(args...)) { using flashattnFunc = decltype(&::flash_attn_fwd_with_bias_and_mask); std::call_once(flashattn_dso_flag, []() { flashattn_dso_handle = phi::dynload::GetFlashAttnDsoHandle(); }); static void* p_flash_attn_fwd_with_bias_and_mask = dlsym(flashattn_dso_handle, "flash_attn_fwd_with_bias_and_mask"); return reinterpret_cast<flashattnFunc>(p_flash_attn_fwd_with_bias_and_mask)(args...); } }; extern DynLoad__flash_attn_fwd_with_bias_and_mask flash_attn_fwd_with_bias_and_mask; struct DynLoad__flash_attn_bwd_with_bias_and_mask { template <typename... Args> auto operator()(Args... args) -> decltype(flash_attn_bwd_with_bias_and_mask(args...)) { using flashattnFunc = decltype(&::flash_attn_bwd_with_bias_and_mask); std::call_once(flashattn_dso_flag, []() { flashattn_dso_handle = phi::dynload::GetFlashAttnDsoHandle(); }); static void* p_flash_attn_bwd_with_bias_and_mask = dlsym(flashattn_dso_handle, "flash_attn_bwd_with_bias_and_mask"); return reinterpret_cast<flashattnFunc>(p_flash_attn_bwd_with_bias_and_mask)(args...); } }; extern DynLoad__flash_attn_bwd_with_bias_and_mask flash_attn_bwd_with_bias_and_mask; struct DynLoad__flash_attn_error { template <typename... Args> auto operator()(Args... 
args) -> decltype(flash_attn_error(args...)) { using flashattnFunc = decltype(&::flash_attn_error); std::call_once(flashattn_dso_flag, []() { flashattn_dso_handle = phi::dynload::GetFlashAttnDsoHandle(); }); static void* p_flash_attn_error = dlsym(flashattn_dso_handle, "flash_attn_error"); return reinterpret_cast<flashattnFunc>(p_flash_attn_error)(args...); } }; extern DynLoad__flash_attn_error flash_attn_error;;
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                  ^
          detected during:
            instantiation of "void phi::FlashAttnUnpaddedBaseKernel<T,Context>(const Context &, const phi::DenseTensor &, const phi::DenseTensor &, const phi::DenseTensor &, const phi::DenseTensor &, const phi::DenseTensor &, const paddle::optional<phi::DenseTensor> &, const paddle::optional<phi::DenseTensor> &, int64_t, int64_t, float, float, __nv_bool, __nv_bool, __nv_bool, const std::string &, phi::DenseTensor *, phi::DenseTensor *, phi::DenseTensor *, phi::DenseTensor *, __nv_bool) [with T=phi::dtype::bfloat16, Context=phi::GPUContext]" at line 216
            instantiation of "void phi::FlashAttnUnpaddedKernel<T,Context>(const Context &, const phi::DenseTensor &, const phi::DenseTensor &, const phi::DenseTensor &, const phi::DenseTensor &, const phi::DenseTensor &, const paddle::optional<phi::DenseTensor> &, const paddle::optional<phi::DenseTensor> &, int64_t, int64_t, float, float, __nv_bool, __nv_bool, __nv_bool, const std::string &, phi::DenseTensor *, phi::DenseTensor *, phi::DenseTensor *, phi::DenseTensor *) [with T=phi::dtype::bfloat16, Context=phi::GPUContext]" at line 566

/home/scratch.ylichen_sw/paddle-gitlab/paddle/phi/kernels/gpu/flash_attn_kernel.cu(396): error: no instance of function template "phi::dynload::DynLoad__flash_attn_fwd::operator()" matches the argument list
            argument types are: (const void *, const void *, const void *, void *, void *, void *, void *, int, int64_t, int64_t, int, int, int, int, int, int, float, float, const float, __nv_bool, __nv_bool, __nv_bool, cudaStream_t, uint64_t, uint64_t, const void *, int64_t *, const void *, int64_t *, int, int64_t, int64_t, int64_t, int64_t, int64_t, int64_t, int64_t, int64_t, int64_t, int64_t, int64_t, int64_t)
            object type is: phi::dynload::DynLoad__flash_attn_fwd
    bool succ = phi::dynload::flash_attn_fwd(
                ^
/home/scratch.ylichen_sw/paddle-gitlab/paddle/phi/backends/dynload/flashattn.h(55): note #3327-D: candidate function template "phi::dynload::DynLoad__flash_attn_fwd::operator()" failed deduction
  struct DynLoad__flash_attn_fwd { template <typename... Args> auto operator()(Args... args) -> decltype(flash_attn_fwd(args...)) { using flashattnFunc = decltype(&::flash_attn_fwd); std::call_once(flashattn_dso_flag, []() { flashattn_dso_handle = phi::dynload::GetFlashAttnDsoHandle(); }); static void* p_flash_attn_fwd = dlsym(flashattn_dso_handle, "flash_attn_fwd"); return reinterpret_cast<flashattnFunc>(p_flash_attn_fwd)(args...); } }; extern DynLoad__flash_attn_fwd flash_attn_fwd; struct DynLoad__flash_attn_varlen_fwd { template <typename... Args> auto operator()(Args... args) -> decltype(flash_attn_varlen_fwd(args...)) { using flashattnFunc = decltype(&::flash_attn_varlen_fwd); std::call_once(flashattn_dso_flag, []() { flashattn_dso_handle = phi::dynload::GetFlashAttnDsoHandle(); }); static void* p_flash_attn_varlen_fwd = dlsym(flashattn_dso_handle, "flash_attn_varlen_fwd"); return reinterpret_cast<flashattnFunc>(p_flash_attn_varlen_fwd)(args...); } }; extern DynLoad__flash_attn_varlen_fwd flash_attn_varlen_fwd; struct DynLoad__flash_attn_bwd { template <typename... Args> auto operator()(Args... args) -> decltype(flash_attn_bwd(args...)) { using flashattnFunc = decltype(&::flash_attn_bwd); std::call_once(flashattn_dso_flag, []() { flashattn_dso_handle = phi::dynload::GetFlashAttnDsoHandle(); }); static void* p_flash_attn_bwd = dlsym(flashattn_dso_handle, "flash_attn_bwd"); return reinterpret_cast<flashattnFunc>(p_flash_attn_bwd)(args...); } }; extern DynLoad__flash_attn_bwd flash_attn_bwd; struct DynLoad__flash_attn_varlen_bwd { template <typename... Args> auto operator()(Args... args) -> decltype(flash_attn_varlen_bwd(args...)) { using flashattnFunc = decltype(&::flash_attn_varlen_bwd); std::call_once(flashattn_dso_flag, []() { flashattn_dso_handle = phi::dynload::GetFlashAttnDsoHandle(); }); static void* p_flash_attn_varlen_bwd = dlsym(flashattn_dso_handle, "flash_attn_varlen_bwd"); return reinterpret_cast<flashattnFunc>(p_flash_attn_varlen_bwd)(args...); } }; extern DynLoad__flash_attn_varlen_bwd flash_attn_varlen_bwd; struct DynLoad__flash_attn_fwd_with_bias_and_mask { template <typename... Args> auto operator()(Args... args) -> decltype(flash_attn_fwd_with_bias_and_mask(args...)) { using flashattnFunc = decltype(&::flash_attn_fwd_with_bias_and_mask); std::call_once(flashattn_dso_flag, []() { flashattn_dso_handle = phi::dynload::GetFlashAttnDsoHandle(); }); static void* p_flash_attn_fwd_with_bias_and_mask = dlsym(flashattn_dso_handle, "flash_attn_fwd_with_bias_and_mask"); return reinterpret_cast<flashattnFunc>(p_flash_attn_fwd_with_bias_and_mask)(args...); } }; extern DynLoad__flash_attn_fwd_with_bias_and_mask flash_attn_fwd_with_bias_and_mask; struct DynLoad__flash_attn_bwd_with_bias_and_mask { template <typename... Args> auto operator()(Args... args) -> decltype(flash_attn_bwd_with_bias_and_mask(args...)) { using flashattnFunc = decltype(&::flash_attn_bwd_with_bias_and_mask); std::call_once(flashattn_dso_flag, []() { flashattn_dso_handle = phi::dynload::GetFlashAttnDsoHandle(); }); static void* p_flash_attn_bwd_with_bias_and_mask = dlsym(flashattn_dso_handle, "flash_attn_bwd_with_bias_and_mask"); return reinterpret_cast<flashattnFunc>(p_flash_attn_bwd_with_bias_and_mask)(args...); } }; extern DynLoad__flash_attn_bwd_with_bias_and_mask flash_attn_bwd_with_bias_and_mask; struct DynLoad__flash_attn_error { template <typename... Args> auto operator()(Args... 
args) -> decltype(flash_attn_error(args...)) { using flashattnFunc = decltype(&::flash_attn_error); std::call_once(flashattn_dso_flag, []() { flashattn_dso_handle = phi::dynload::GetFlashAttnDsoHandle(); }); static void* p_flash_attn_error = dlsym(flashattn_dso_handle, "flash_attn_error"); return reinterpret_cast<flashattnFunc>(p_flash_attn_error)(args...); } }; extern DynLoad__flash_attn_error flash_attn_error;;
                                                                    ^
          detected during:
            instantiation of "void phi::FlashAttnBaseKernel<T,Context>(const Context &, const phi::DenseTensor &, const phi::DenseTensor &, const phi::DenseTensor &, const paddle::optional<phi::DenseTensor> &, const paddle::optional<phi::DenseTensor> &, const paddle::optional<phi::DenseTensor> &, float, __nv_bool, __nv_bool, __nv_bool, const std::string &, int, phi::DenseTensor *, phi::DenseTensor *, phi::DenseTensor *, phi::DenseTensor *) [with T=phi::dtype::float16, Context=phi::GPUContext]" at line 481
            instantiation of "void phi::FlashAttnKernel<T,Context>(const Context &, const phi::DenseTensor &, const phi::DenseTensor &, const phi::DenseTensor &, const paddle::optional<phi::DenseTensor> &, const paddle::optional<phi::DenseTensor> &, float, __nv_bool, __nv_bool, __nv_bool, const std::string &, phi::DenseTensor *, phi::DenseTensor *, phi::DenseTensor *, phi::DenseTensor *) [with T=phi::dtype::float16, Context=phi::GPUContext]" at line 586

/home/scratch.ylichen_sw/paddle-gitlab/paddle/phi/kernels/gpu/flash_attn_kernel.cu(396): error: no instance of function template "phi::dynload::DynLoad__flash_attn_fwd::operator()" matches the argument list
            argument types are: (const void *, const void *, const void *, void *, void *, void *, void *, int, int64_t, int64_t, int, int, int, int, int, int, float, float, const float, __nv_bool, __nv_bool, __nv_bool, cudaStream_t, uint64_t, uint64_t, const void *, int64_t *, const void *, int64_t *, int, int64_t, int64_t, int64_t, int64_t, int64_t, int64_t, int64_t, int64_t, int64_t, int64_t, int64_t, int64_t)
            object type is: phi::dynload::DynLoad__flash_attn_fwd
    bool succ = phi::dynload::flash_attn_fwd(
                ^
/home/scratch.ylichen_sw/paddle-gitlab/paddle/phi/backends/dynload/flashattn.h(55): note #3327-D: candidate function template "phi::dynload::DynLoad__flash_attn_fwd::operator()" failed deduction
  struct DynLoad__flash_attn_fwd { template <typename... Args> auto operator()(Args... args) -> decltype(flash_attn_fwd(args...)) { using flashattnFunc = decltype(&::flash_attn_fwd); std::call_once(flashattn_dso_flag, []() { flashattn_dso_handle = phi::dynload::GetFlashAttnDsoHandle(); }); static void* p_flash_attn_fwd = dlsym(flashattn_dso_handle, "flash_attn_fwd"); return reinterpret_cast<flashattnFunc>(p_flash_attn_fwd)(args...); } }; extern DynLoad__flash_attn_fwd flash_attn_fwd; struct DynLoad__flash_attn_varlen_fwd { template <typename... Args> auto operator()(Args... args) -> decltype(flash_attn_varlen_fwd(args...)) { using flashattnFunc = decltype(&::flash_attn_varlen_fwd); std::call_once(flashattn_dso_flag, []() { flashattn_dso_handle = phi::dynload::GetFlashAttnDsoHandle(); }); static void* p_flash_attn_varlen_fwd = dlsym(flashattn_dso_handle, "flash_attn_varlen_fwd"); return reinterpret_cast<flashattnFunc>(p_flash_attn_varlen_fwd)(args...); } }; extern DynLoad__flash_attn_varlen_fwd flash_attn_varlen_fwd; struct DynLoad__flash_attn_bwd { template <typename... Args> auto operator()(Args... args) -> decltype(flash_attn_bwd(args...)) { using flashattnFunc = decltype(&::flash_attn_bwd); std::call_once(flashattn_dso_flag, []() { flashattn_dso_handle = phi::dynload::GetFlashAttnDsoHandle(); }); static void* p_flash_attn_bwd = dlsym(flashattn_dso_handle, "flash_attn_bwd"); return reinterpret_cast<flashattnFunc>(p_flash_attn_bwd)(args...); } }; extern DynLoad__flash_attn_bwd flash_attn_bwd; struct DynLoad__flash_attn_varlen_bwd { template <typename... Args> auto operator()(Args... args) -> decltype(flash_attn_varlen_bwd(args...)) { using flashattnFunc = decltype(&::flash_attn_varlen_bwd); std::call_once(flashattn_dso_flag, []() { flashattn_dso_handle = phi::dynload::GetFlashAttnDsoHandle(); }); static void* p_flash_attn_varlen_bwd = dlsym(flashattn_dso_handle, "flash_attn_varlen_bwd"); return reinterpret_cast<flashattnFunc>(p_flash_attn_varlen_bwd)(args...); } }; extern DynLoad__flash_attn_varlen_bwd flash_attn_varlen_bwd; struct DynLoad__flash_attn_fwd_with_bias_and_mask { template <typename... Args> auto operator()(Args... args) -> decltype(flash_attn_fwd_with_bias_and_mask(args...)) { using flashattnFunc = decltype(&::flash_attn_fwd_with_bias_and_mask); std::call_once(flashattn_dso_flag, []() { flashattn_dso_handle = phi::dynload::GetFlashAttnDsoHandle(); }); static void* p_flash_attn_fwd_with_bias_and_mask = dlsym(flashattn_dso_handle, "flash_attn_fwd_with_bias_and_mask"); return reinterpret_cast<flashattnFunc>(p_flash_attn_fwd_with_bias_and_mask)(args...); } }; extern DynLoad__flash_attn_fwd_with_bias_and_mask flash_attn_fwd_with_bias_and_mask; struct DynLoad__flash_attn_bwd_with_bias_and_mask { template <typename... Args> auto operator()(Args... args) -> decltype(flash_attn_bwd_with_bias_and_mask(args...)) { using flashattnFunc = decltype(&::flash_attn_bwd_with_bias_and_mask); std::call_once(flashattn_dso_flag, []() { flashattn_dso_handle = phi::dynload::GetFlashAttnDsoHandle(); }); static void* p_flash_attn_bwd_with_bias_and_mask = dlsym(flashattn_dso_handle, "flash_attn_bwd_with_bias_and_mask"); return reinterpret_cast<flashattnFunc>(p_flash_attn_bwd_with_bias_and_mask)(args...); } }; extern DynLoad__flash_attn_bwd_with_bias_and_mask flash_attn_bwd_with_bias_and_mask; struct DynLoad__flash_attn_error { template <typename... Args> auto operator()(Args... 
args) -> decltype(flash_attn_error(args...)) { using flashattnFunc = decltype(&::flash_attn_error); std::call_once(flashattn_dso_flag, []() { flashattn_dso_handle = phi::dynload::GetFlashAttnDsoHandle(); }); static void* p_flash_attn_error = dlsym(flashattn_dso_handle, "flash_attn_error"); return reinterpret_cast<flashattnFunc>(p_flash_attn_error)(args...); } }; extern DynLoad__flash_attn_error flash_attn_error;;
                                                                    ^
          detected during:
            instantiation of "void phi::FlashAttnBaseKernel<T,Context>(const Context &, const phi::DenseTensor &, const phi::DenseTensor &, const phi::DenseTensor &, const paddle::optional<phi::DenseTensor> &, const paddle::optional<phi::DenseTensor> &, const paddle::optional<phi::DenseTensor> &, float, __nv_bool, __nv_bool, __nv_bool, const std::string &, int, phi::DenseTensor *, phi::DenseTensor *, phi::DenseTensor *, phi::DenseTensor *) [with T=phi::dtype::bfloat16, Context=phi::GPUContext]" at line 481
            instantiation of "void phi::FlashAttnKernel<T,Context>(const Context &, const phi::DenseTensor &, const phi::DenseTensor &, const phi::DenseTensor &, const paddle::optional<phi::DenseTensor> &, const paddle::optional<phi::DenseTensor> &, float, __nv_bool, __nv_bool, __nv_bool, const std::string &, phi::DenseTensor *, phi::DenseTensor *, phi::DenseTensor *, phi::DenseTensor *) [with T=phi::dtype::bfloat16, Context=phi::GPUContext]" at line 586

4 errors detected in the compilation of "/home/scratch.ylichen_sw/paddle-gitlab/paddle/phi/kernels/gpu/flash_attn_kernel.cu".
ninja: build stopped: subcommand failed.

Additional Supplementary Information

ARCH_FLAGS="-march=sandybridge -mtune=broadwell"
CXX_FLAGS="-DCUDNN_WARN_DEPRECATED $ARCH_FLAGS"
CUDA_FLAGS="-DCUDNN_WARN_DEPRECATED -t2 --forward-unknown-to-host-compiler -Xfatbin=-compress-all $ARCH_FLAGS"
export SKIP_DOWNLOAD_INFERENCE_DATA=ON

cmake -Bbuild -S. \
    -GNinja \
    -DCMAKE_CXX_FLAGS="$CXX_FLAGS" \
    -DCMAKE_EXPORT_COMPILE_COMMANDS=ON \
    -DCMAKE_BUILD_TYPE=Release \
    -DCMAKE_CUDA_FLAGS="$CUDA_FLAGS" \
    -DCUDA_ARCH_NAME=Manual \
    -DCUDA_ARCH_BIN="80 90" \
    -DWITH_INCREMENTAL_COVERAGE=OFF \
    -DWITH_INFERENCE_API_TEST=ON \
    -DWITH_DISTRIBUTE=ON \
    -DWITH_COVERAGE=OFF \
    -DWITH_TENSORRT=ON \
    -DWITH_TESTING=ON \
    -DWITH_ROCM=OFF \
    -DWITH_RCCL=OFF \
    -DWITH_STRIP=ON \
    -DWITH_MKL=OFF \
    -DWITH_AVX=ON \
    -DWITH_GPU=ON \
    -DWITH_PYTHON=ON \
    -DWITH_CUDNN_FRONTEND=ON \
    -DPY_VERSION=$PYVER \
    -Wno-dev

cmake --build paddle/build -j$((`nproc`))
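If the failure is caused by a stale flash-attention checkout left over from a previous build, one workaround is to drop the cached copy so that the configure step re-fetches the commit pinned by release/3.0-beta. A minimal sketch under that assumption (the exact third_party directory names may differ from your checkout):

# Sketch only: remove cached flash-attention dirs so CMake re-fetches the
# pinned commit; rm -rf is a no-op for paths that do not exist.
rm -rf third_party/flashattn build/third_party/flashattn build/third_party/install/flashattn
# Re-run the same cmake configure command as above, then rebuild:
cmake --build build -j"$(nproc)"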
wangguan1995 commented 1 month ago

Hi there, could you provide more information about your CUDA version and GPU?

leo0519 commented 1 month ago

Hi @wangguan1995, the environment is CUDA 12.5, but according to the error message, the issue appears to be an API mismatch between Paddle and the flash-attention library rather than anything CUDA-specific.

leo0519 commented 3 weeks ago

Closing this issue since it is not related to a bug in Paddle.