Dao-AILab / flash-attention

Fast and memory-efficient exact attention
BSD 3-Clause "New" or "Revised" License

Installation failed with flash-attention>1 #586

Open Vezora-Corp opened 1 year ago

Vezora-Corp commented 1 year ago

Trying to run: pip install flash-attn --no-build-isolation

System:
CUDA toolkit build: cuda_11.8.r11.8/compiler.31833905_0
OS / GPU: Windows 11, RTX 3090
Python: 3.11.4
PyTorch: 2.0.1+cu117
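(For context, the version strings above come from checks along these lines; the commands below are just a sketch, not a verbatim transcript of what I ran. Note the CUDA toolkit is 11.8 while the PyTorch wheel is built for cu117.)

    # CUDA toolkit version that nvcc will use to build the extension
    nvcc --version

    # CUDA version the installed PyTorch wheel was built against
    python -c "import torch; print(torch.__version__, torch.version.cuda)"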

Installing a build without FlashAttention 2 does work, e.g. pip install "flash-attn<2". I tried "pip install flash-attn===1.0.4 --no-build-isolation" and it succeeded. I pasted as much of the error output as wasn't cut off due to length into the attached file ErrorMessage.txt. A sketch of that pre-2.x workaround is below.
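For anyone else hitting this, the workaround is just pinning the version (the quoting matters so the shell doesn't treat <2 as a redirect; the exact 1.x pin is simply the version that happened to build for me):

    # flash-attn >= 2 fails to compile on this Windows setup, so pin a 1.x release
    pip install "flash-attn<2" --no-build-isolation

    # or pin the specific pre-2.x version that built successfully here
    pip install flash-attn==1.0.4 --no-build-isolation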

cholleme commented 1 year ago

I have the same error.

The final error is:

C:/MachineLearning/MiniSDXL/flash-attention/csrc/cutlass/include\cute/algorithm/functional.hpp(104): error: no instance of overloaded function "cute::abs" matches the argument list
            argument types are: (const cute::_1)

In a long string of template instantiations.

C:/MachineLearning/MiniSDXL/flash-attention/csrc/cutlass/include\cute/algorithm/functional.hpp(104): error: no instance of overloaded function "cute::abs" matches the argument list
            argument types are: (const cute::_1)
          detected during:
            instantiation of "decltype(auto) cute::abs_fn::operator()(T &&) const [with T=const cute::_1 &]" 
C:/MachineLearning/MiniSDXL/flash-attention/csrc/cutlass/include\cute/algorithm/tuple_algorithms.hpp(283): here
            instantiation of "auto cute::transform_leaf(const T &, F &&) [with T=cute::_1, F=cute::abs_fn &]" 
C:/MachineLearning/MiniSDXL/flash-attention/csrc/cutlass/include\cute/algorithm/tuple_algorithms.hpp(281): here
            instantiation of function "lambda [](const auto &)->auto [with <auto-1>=cute::_1]" 
C:/MachineLearning/MiniSDXL/flash-attention/csrc/cutlass/include\cute/algorithm/tuple_algorithms.hpp(114): here
            instantiation of "auto cute::detail::tapply(T &&, F &&, G &&, cute::seq<I...>) [with T=const cute::tuple<cute::C<1>, int> &, F=lambda [](const auto &)->auto &, G=lambda [](const auto &...)->auto, I=<0, 1>]" 
C:/MachineLearning/MiniSDXL/flash-attention/csrc/cutlass/include\cute/algorithm/tuple_algorithms.hpp(236): here
            instantiation of "auto cute::transform(const T &, F &&) [with T=cute::tuple<cute::C<1>, int>, F=lambda [](const auto &)->auto]" 
C:/MachineLearning/MiniSDXL/flash-attention/csrc/cutlass/include\cute/algorithm/tuple_algorithms.hpp(281): here
            [ 6 instantiation contexts not shown ]
            instantiation of "auto cute::tile_to_shape(const cute::ComposedLayout<A, O, B> &, const Shape &, const ModeOrder &) [with A=cute::Swizzle<2, 3, 3>, O=cute::C<0>, B=cute::Layout<cute::tuple<cute::C<8>, cute::C<32>>, cute::tuple<cute::_32, cute::_1>>, Shape=cute::tuple<cute::C<64>, cute::C<96>>, ModeOrder=cute::GenColMajor]" 
C:\MachineLearning\MiniSDXL\flash-attention\csrc\flash_attn\src\kernel_traits.h(239): here
            instantiation of class "Flash_bwd_kernel_traits<kHeadDim_, kBlockM_, kBlockN_, kNWarps_, AtomLayoutMSdP_, AtomLayoutNdKV, AtomLayoutMdQ, Is_V_in_regs_, No_double_buffer_, elem_type, Base> [with kHeadDim_=96, kBlockM_=64, kBlockN_=128, kNWarps_=8, AtomLayoutMSdP_=2, AtomLayoutNdKV=4, AtomLayoutMdQ=4, Is_V_in_regs_=false, No_double_buffer_=false, elem_type=cutlass::half_t, Base=Flash_kernel_traits<96, 64, 128, 8, cutlass::half_t>]" 
C:\MachineLearning\MiniSDXL\flash-attention\csrc\flash_attn\src\flash_bwd_launch_template.h(49): here
            instantiation of "void run_flash_bwd_seqk_parallel<Kernel_traits,Is_dropout>(Flash_bwd_params &, cudaStream_t, __nv_bool) [with Kernel_traits=Flash_bwd_kernel_traits<96, 64, 128, 8, 2, 4, 4, false, false, cutlass::half_t, Flash_kernel_traits<96, 64, 128, 8, cutlass::half_t>>, Is_dropout=true]" 
C:\MachineLearning\MiniSDXL\flash-attention\csrc\flash_attn\src\flash_bwd_launch_template.h(135): here
            instantiation of "void run_flash_bwd<Kernel_traits,Is_dropout>(Flash_bwd_params &, cudaStream_t, __nv_bool) [with Kernel_traits=Flash_bwd_kernel_traits<96, 64, 128, 8, 2, 4, 4, false, false, cutlass::half_t, Flash_kernel_traits<96, 64, 128, 8, cutlass::half_t>>, Is_dropout=true]" 
C:\MachineLearning\MiniSDXL\flash-attention\csrc\flash_attn\src\flash_bwd_launch_template.h(211): here
            instantiation of "void run_mha_bwd_hdim96<T>(Flash_bwd_params &, cudaStream_t, __nv_bool) [with T=cutlass::half_t]" 
C:\MachineLearning\MiniSDXL\flash-attention\csrc\flash_attn\src\flash_bwd_hdim96_fp16_sm80.cu(9): here

This then leads to error types being generated and lots of other arithmetic template instantiations failing.

yogurt7771 commented 5 months ago

I got the same error. Does anyone know how to fix it?