Dao-AILab / flash-attention

Fast and memory-efficient exact attention
BSD 3-Clause "New" or "Revised" License

pip install flash-attn --no-build-isolation failing #420

Open ari9dam opened 1 year ago

ari9dam commented 1 year ago

pip install flash-attn --no-build-isolation fails but pip install flash-attn==1.0.9 --no-build-isolation works

Based on this, can you say what I might try to fix the error?

  torch.__version__  = 2.0.1+cu117
  fatal: not a git repository (or any of the parent directories): .git
  running install
  /opt/conda/envs/ptca/lib/python3.8/site-packages/setuptools/command/install.py:34: SetuptoolsDeprecationWarning: setup.py install is deprecated. Use build and pip and other standards-based tools.
    warnings.warn(
  running build
  running build_py
  creating build
  creating build/lib.linux-x86_64-cpython-38
  creating build/lib.linux-x86_64-cpython-38/flash_attn
  copying flash_attn/tmp.py -> build/lib.linux-x86_64-cpython-38/flash_attn
  copying flash_attn/fav2_interface.py -> build/lib.linux-x86_64-cpython-38/flash_attn
  copying flash_attn/bert_padding.py -> build/lib.linux-x86_64-cpython-38/flash_attn
  copying flash_attn/__init__.py -> build/lib.linux-x86_64-cpython-38/flash_attn
  copying flash_attn/flash_attn_triton_tmp.py -> build/lib.linux-x86_64-cpython-38/flash_attn
  copying flash_attn/flash_attn_triton_og.py -> build/lib.linux-x86_64-cpython-38/flash_attn
  copying flash_attn/fused_softmax.py -> build/lib.linux-x86_64-cpython-38/flash_attn
  copying flash_attn/flash_attn_interface.py -> build/lib.linux-x86_64-cpython-38/flash_attn
  copying flash_attn/rotary.py -> build/lib.linux-x86_64-cpython-38/flash_attn
  copying flash_attn/flash_blocksparse_attn_interface.py -> build/lib.linux-x86_64-cpython-38/flash_attn
  copying flash_attn/flash_attn_triton_single_query.py -> build/lib.linux-x86_64-cpython-38/flash_attn
  copying flash_attn/flash_attention.py -> build/lib.linux-x86_64-cpython-38/flash_attn
  copying flash_attn/flash_attn_triton_tmp_og.py -> build/lib.linux-x86_64-cpython-38/flash_attn
  copying flash_attn/flash_attn_triton_varlen.py -> build/lib.linux-x86_64-cpython-38/flash_attn
  copying flash_attn/attention_kernl.py -> build/lib.linux-x86_64-cpython-38/flash_attn
  copying flash_attn/flash_attn_triton.py -> build/lib.linux-x86_64-cpython-38/flash_attn
  copying flash_attn/flash_blocksparse_attention.py -> build/lib.linux-x86_64-cpython-38/flash_attn
  creating build/lib.linux-x86_64-cpython-38/flash_attn/modules
  copying flash_attn/modules/block.py -> build/lib.linux-x86_64-cpython-38/flash_attn/modules
  copying flash_attn/modules/__init__.py -> build/lib.linux-x86_64-cpython-38/flash_attn/modules
  copying flash_attn/modules/mlp.py -> build/lib.linux-x86_64-cpython-38/flash_attn/modules
  copying flash_attn/modules/embedding.py -> build/lib.linux-x86_64-cpython-38/flash_attn/modules
  copying flash_attn/modules/mha.py -> build/lib.linux-x86_64-cpython-38/flash_attn/modules
  creating build/lib.linux-x86_64-cpython-38/flash_attn/models
  copying flash_attn/models/falcon.py -> build/lib.linux-x86_64-cpython-38/flash_attn/models
  copying flash_attn/models/opt.py -> build/lib.linux-x86_64-cpython-38/flash_attn/models
  copying flash_attn/models/__init__.py -> build/lib.linux-x86_64-cpython-38/flash_attn/models
  copying flash_attn/models/bert.py -> build/lib.linux-x86_64-cpython-38/flash_attn/models
  copying flash_attn/models/gpt.py -> build/lib.linux-x86_64-cpython-38/flash_attn/models
  copying flash_attn/models/gpt_neox.py -> build/lib.linux-x86_64-cpython-38/flash_attn/models
  copying flash_attn/models/gptj.py -> build/lib.linux-x86_64-cpython-38/flash_attn/models
  copying flash_attn/models/vit.py -> build/lib.linux-x86_64-cpython-38/flash_attn/models
  copying flash_attn/models/llama.py -> build/lib.linux-x86_64-cpython-38/flash_attn/models
  creating build/lib.linux-x86_64-cpython-38/flash_attn/layers
  copying flash_attn/layers/__init__.py -> build/lib.linux-x86_64-cpython-38/flash_attn/layers
  copying flash_attn/layers/rotary.py -> build/lib.linux-x86_64-cpython-38/flash_attn/layers
  copying flash_attn/layers/patch_embed.py -> build/lib.linux-x86_64-cpython-38/flash_attn/layers
  creating build/lib.linux-x86_64-cpython-38/flash_attn/losses
  copying flash_attn/losses/cross_entropy.py -> build/lib.linux-x86_64-cpython-38/flash_attn/losses
  copying flash_attn/losses/__init__.py -> build/lib.linux-x86_64-cpython-38/flash_attn/losses
  creating build/lib.linux-x86_64-cpython-38/flash_attn/ops
  copying flash_attn/ops/fused_dense.py -> build/lib.linux-x86_64-cpython-38/flash_attn/ops
  copying flash_attn/ops/gelu_activation.py -> build/lib.linux-x86_64-cpython-38/flash_attn/ops
  copying flash_attn/ops/activations.py -> build/lib.linux-x86_64-cpython-38/flash_attn/ops
  copying flash_attn/ops/rms_norm.py -> build/lib.linux-x86_64-cpython-38/flash_attn/ops
  copying flash_attn/ops/__init__.py -> build/lib.linux-x86_64-cpython-38/flash_attn/ops
  copying flash_attn/ops/layer_norm.py -> build/lib.linux-x86_64-cpython-38/flash_attn/ops
  creating build/lib.linux-x86_64-cpython-38/flash_attn/utils
  copying flash_attn/utils/pretrained.py -> build/lib.linux-x86_64-cpython-38/flash_attn/utils
  copying flash_attn/utils/__init__.py -> build/lib.linux-x86_64-cpython-38/flash_attn/utils
  copying flash_attn/utils/distributed.py -> build/lib.linux-x86_64-cpython-38/flash_attn/utils
  copying flash_attn/utils/generation.py -> build/lib.linux-x86_64-cpython-38/flash_attn/utils
  copying flash_attn/utils/benchmark.py -> build/lib.linux-x86_64-cpython-38/flash_attn/utils
  running build_ext
  building 'flash_attn_2_cuda' extension
  creating /tmp/pip-install-qabfcpz5/flash-attn_a0d5561303b0450dbca282f73f7bdd3d/build/temp.linux-x86_64-cpython-38
  creating /tmp/pip-install-qabfcpz5/flash-attn_a0d5561303b0450dbca282f73f7bdd3d/build/temp.linux-x86_64-cpython-38/csrc
  creating /tmp/pip-install-qabfcpz5/flash-attn_a0d5561303b0450dbca282f73f7bdd3d/build/temp.linux-x86_64-cpython-38/csrc/flash_attn
  creating /tmp/pip-install-qabfcpz5/flash-attn_a0d5561303b0450dbca282f73f7bdd3d/build/temp.linux-x86_64-cpython-38/csrc/flash_attn/src
  Emitting ninja build file /tmp/pip-install-qabfcpz5/flash-attn_a0d5561303b0450dbca282f73f7bdd3d/build/temp.linux-x86_64-cpython-38/build.ninja...
  Compiling objects...
  Allowing ninja to set a default number of workers... (overridable by setting the environment variable MAX_JOBS=N)
  [1/33] /usr/local/cuda/bin/nvcc  -I/tmp/pip-install-qabfcpz5/flash-attn_a0d5561303b0450dbca282f73f7bdd3d/csrc/flash_attn -I/tmp/pip-install-qabfcpz5/flash-attn_a0d5561303b0450dbca282f73f7bdd3d/csrc/flash_attn/src -I/tmp/pip-install-qabfcpz5/flash-attn_a0d5561303b0450dbca282f73f7bdd3d/csrc/cutlass/include -I/opt/conda/envs/ptca/lib/python3.8/site-packages/torch/include -I/opt/conda/envs/ptca/lib/python3.8/site-packages/torch/include/torch/csrc/api/include -I/opt/conda/envs/ptca/lib/python3.8/site-packages/torch/include/TH -I/opt/conda/envs/ptca/lib/python3.8/site-packages/torch/include/THC -I/usr/local/cuda/include -I/opt/conda/envs/ptca/include/python3.8 -c -c /tmp/pip-install-qabfcpz5/flash-attn_a0d5561303b0450dbca282f73f7bdd3d/csrc/flash_attn/src/flash_bwd_hdim128_bf16_sm80.cu -o /tmp/pip-install-qabfcpz5/flash-attn_a0d5561303b0450dbca282f73f7bdd3d/build/temp.linux-x86_64-cpython-38/csrc/flash_attn/src/flash_bwd_hdim128_bf16_sm80.o -D__CUDA_NO_HALF_OPERATORS__ -D__CUDA_NO_HALF_CONVERSIONS__ -D__CUDA_NO_BFLOAT16_CONVERSIONS__ -D__CUDA_NO_HALF2_OPERATORS__ --expt-relaxed-constexpr --compiler-options ''"'"'-fPIC'"'"'' -O3 -std=c++17 -U__CUDA_NO_HALF_OPERATORS__ -U__CUDA_NO_HALF_CONVERSIONS__ -U__CUDA_NO_HALF2_OPERATORS__ -U__CUDA_NO_BFLOAT16_CONVERSIONS__ --expt-relaxed-constexpr --expt-extended-lambda --use_fast_math --ptxas-options=-v -lineinfo -gencode arch=compute_80,code=sm_80 --threads 4 -DTORCH_API_INCLUDE_EXTENSION_H '-DPYBIND11_COMPILER_TYPE="_gcc"' '-DPYBIND11_STDLIB="_libstdcpp"' '-DPYBIND11_BUILD_ABI="_cxxabi1011"' -DTORCH_EXTENSION_NAME=flash_attn_2_cuda -D_GLIBCXX_USE_CXX11_ABI=0
  FAILED: /tmp/pip-install-qabfcpz5/flash-attn_a0d5561303b0450dbca282f73f7bdd3d/build/temp.linux-x86_64-cpython-38/csrc/flash_attn/src/flash_bwd_hdim128_bf16_sm80.o
  /usr/local/cuda/bin/nvcc  -I/tmp/pip-install-qabfcpz5/flash-attn_a0d5561303b0450dbca282f73f7bdd3d/csrc/flash_attn -I/tmp/pip-install-qabfcpz5/flash-attn_a0d5561303b0450dbca282f73f7bdd3d/csrc/flash_attn/src -I/tmp/pip-install-qabfcpz5/flash-attn_a0d5561303b0450dbca282f73f7bdd3d/csrc/cutlass/include -I/opt/conda/envs/ptca/lib/python3.8/site-packages/torch/include -I/opt/conda/envs/ptca/lib/python3.8/site-packages/torch/include/torch/csrc/api/include -I/opt/conda/envs/ptca/lib/python3.8/site-packages/torch/include/TH -I/opt/conda/envs/ptca/lib/python3.8/site-packages/torch/include/THC -I/usr/local/cuda/include -I/opt/conda/envs/ptca/include/python3.8 -c -c /tmp/pip-install-qabfcpz5/flash-attn_a0d5561303b0450dbca282f73f7bdd3d/csrc/flash_attn/src/flash_bwd_hdim128_bf16_sm80.cu -o /tmp/pip-install-qabfcpz5/flash-attn_a0d5561303b0450dbca282f73f7bdd3d/build/temp.linux-x86_64-cpython-38/csrc/flash_attn/src/flash_bwd_hdim128_bf16_sm80.o -D__CUDA_NO_HALF_OPERATORS__ -D__CUDA_NO_HALF_CONVERSIONS__ -D__CUDA_NO_BFLOAT16_CONVERSIONS__ -D__CUDA_NO_HALF2_OPERATORS__ --expt-relaxed-constexpr --compiler-options ''"'"'-fPIC'"'"'' -O3 -std=c++17 -U__CUDA_NO_HALF_OPERATORS__ -U__CUDA_NO_HALF_CONVERSIONS__ -U__CUDA_NO_HALF2_OPERATORS__ -U__CUDA_NO_BFLOAT16_CONVERSIONS__ --expt-relaxed-constexpr --expt-extended-lambda --use_fast_math --ptxas-options=-v -lineinfo -gencode arch=compute_80,code=sm_80 --threads 4 -DTORCH_API_INCLUDE_EXTENSION_H '-DPYBIND11_COMPILER_TYPE="_gcc"' '-DPYBIND11_STDLIB="_libstdcpp"' '-DPYBIND11_BUILD_ABI="_cxxabi1011"' -DTORCH_EXTENSION_NAME=flash_attn_2_cuda -D_GLIBCXX_USE_CXX11_ABI=0
  /opt/conda/envs/ptca/lib/python3.8/site-packages/torch/include/c10/util/irange.h(54): warning #186-D: pointless comparison of unsigned integer with zero
            detected during:
              instantiation of "__nv_bool c10::detail::integer_iterator<I, one_sided, <unnamed>>::operator==(const c10::detail::integer_iterator<I, one_sided, <unnamed>> &) const [with I=size_t, one_sided=false, <unnamed>=0]"
  (61): here
              instantiation of "__nv_bool c10::detail::integer_iterator<I, one_sided, <unnamed>>::operator!=(const c10::detail::integer_iterator<I, one_sided, <unnamed>> &) const [with I=size_t, one_sided=false, <unnamed>=0]"
  /opt/conda/envs/ptca/lib/python3.8/site-packages/torch/include/c10/core/TensorImpl.h(77): here

  /opt/conda/envs/ptca/lib/python3.8/site-packages/torch/include/c10/util/irange.h(54): warning #186-D: pointless comparison of unsigned integer with zero
            detected during:
              instantiation of "__nv_bool c10::detail::integer_iterator<I, one_sided, <unnamed>>::operator==(const c10::detail::integer_iterator<I, one_sided, <unnamed>> &) const [with I=size_t, one_sided=true, <unnamed>=0]"
  (61): here
              instantiation of "__nv_bool c10::detail::integer_iterator<I, one_sided, <unnamed>>::operator!=(const c10::detail::integer_iterator<I, one_sided, <unnamed>> &) const [with I=size_t, one_sided=true, <unnamed>=0]"
  /opt/conda/envs/ptca/lib/python3.8/site-packages/torch/include/c10/core/TensorImpl.h(2327): here
              instantiation of "__nv_bool c10::TensorImpl::SetDimsTemplate(c10::ArrayRef<T>) [with T=int64_t, <unnamed>=void]"
  /opt/conda/envs/ptca/lib/python3.8/site-packages/torch/include/c10/core/TensorImpl.h(2337): here

  Killed
  [2/33] c++ -MMD -MF /tmp/pip-install-qabfcpz5/flash-attn_a0d5561303b0450dbca282f73f7bdd3d/build/temp.linux-x86_64-cpython-38/csrc/flash_attn/flash_api.o.d -pthread -B /opt/conda/envs/ptca/compiler_compat -Wl,--sysroot=/ -Wsign-compare -DNDEBUG -g -fwrapv -O3 -Wall -Wstrict-prototypes -fPIC -I/tmp/pip-install-qabfcpz5/flash-attn_a0d5561303b0450dbca282f73f7bdd3d/csrc/flash_attn -I/tmp/pip-install-qabfcpz5/flash-attn_a0d5561303b0450dbca282f73f7bdd3d/csrc/flash_attn/src -I/tmp/pip-install-qabfcpz5/flash-attn_a0d5561303b0450dbca282f73f7bdd3d/csrc/cutlass/include -I/opt/conda/envs/ptca/lib/python3.8/site-packages/torch/include -I/opt/conda/envs/ptca/lib/python3.8/site-packages/torch/include/torch/csrc/api/include -I/opt/conda/envs/ptca/lib/python3.8/site-packages/torch/include/TH -I/opt/conda/envs/ptca/lib/python3.8/site-packages/torch/include/THC -I/usr/local/cuda/include -I/opt/conda/envs/ptca/include/python3.8 -c -c /tmp/pip-install-qabfcpz5/flash-attn_a0d5561303b0450dbca282f73f7bdd3d/csrc/flash_attn/flash_api.cpp -o /tmp/pip-install-qabfcpz5/flash-attn_a0d5561303b0450dbca282f73f7bdd3d/build/temp.linux-x86_64-cpython-38/csrc/flash_attn/flash_api.o -O3 -std=c++17 -DTORCH_API_INCLUDE_EXTENSION_H '-DPYBIND11_COMPILER_TYPE="_gcc"' '-DPYBIND11_STDLIB="_libstdcpp"' '-DPYBIND11_BUILD_ABI="_cxxabi1011"' -DTORCH_EXTENSION_NAME=flash_attn_2_cuda -D_GLIBCXX_USE_CXX11_ABI=0
  cc1plus: warning: command line option ‘-Wstrict-prototypes’ is valid for C/ObjC but not for C++
  /tmp/pip-install-qabfcpz5/flash-attn_a0d5561303b0450dbca282f73f7bdd3d/csrc/flash_attn/flash_api.cpp: In function ‘void set_params_fprop(Flash_fwd_params&, size_t, size_t, size_t, size_t, size_t, size_t, size_t, size_t, size_t, at::Tensor, at::Tensor, at::Tensor, at::Tensor, void*, void*, void*, void*, float, float, bool)’:
  /tmp/pip-install-qabfcpz5/flash-attn_a0d5561303b0450dbca282f73f7bdd3d/csrc/flash_attn/flash_api.cpp:42:38: warning: ‘void* memset(void*, int, size_t)’ clearing an object of non-trivial type ‘struct Flash_fwd_params’; use assignment or value-initialization instead [-Wclass-memaccess]
     42 |     memset(&params, 0, sizeof(params));
        |                                      ^
  In file included from /tmp/pip-install-qabfcpz5/flash-attn_a0d5561303b0450dbca282f73f7bdd3d/csrc/flash_attn/flash_api.cpp:11:
  /tmp/pip-install-qabfcpz5/flash-attn_a0d5561303b0450dbca282f73f7bdd3d/csrc/flash_attn/src/flash.h:52:8: note: ‘struct Flash_fwd_params’ declared here
     52 | struct Flash_fwd_params : public Qkv_params {
        |        ^~~~~~~~~~~~~~~~
  [3/33] /usr/local/cuda/bin/nvcc  -I/tmp/pip-install-qabfcpz5/flash-attn_a0d5561303b0450dbca282f73f7bdd3d/csrc/flash_attn -I/tmp/pip-install-qabfcpz5/flash-attn_a0d5561303b0450dbca282f73f7bdd3d/csrc/flash_attn/src -I/tmp/pip-install-qabfcpz5/flash-attn_a0d5561303b0450dbca282f73f7bdd3d/csrc/cutlass/include -I/opt/conda/envs/ptca/lib/python3.8/site-packages/torch/include -I/opt/conda/envs/ptca/lib/python3.8/site-packages/torch/include/torch/csrc/api/include -I/opt/conda/envs/ptca/lib/python3.8/site-packages/torch/include/TH -I/opt/conda/envs/ptca/lib/python3.8/site-packages/torch/include/THC -I/usr/local/cuda/include -I/opt/conda/envs/ptca/include/python3.8 -c -c /tmp/pip-install-qabfcpz5/flash-attn_a0d5561303b0450dbca282f73f7bdd3d/csrc/flash_attn/src/flash_bwd_hdim128_fp16_sm80.cu -o /tmp/pip-install-qabfcpz5/flash-attn_a0d5561303b0450dbca282f73f7bdd3d/build/temp.linux-x86_64-cpython-38/csrc/flash_attn/src/flash_bwd_hdim128_fp16_sm80.o -D__CUDA_NO_HALF_OPERATORS__ -D__CUDA_NO_HALF_CONVERSIONS__ -D__CUDA_NO_BFLOAT16_CONVERSIONS__ -D__CUDA_NO_HALF2_OPERATORS__ --expt-relaxed-constexpr --compiler-options ''"'"'-fPIC'"'"'' -O3 -std=c++17 -U__CUDA_NO_HALF_OPERATORS__ -U__CUDA_NO_HALF_CONVERSIONS__ -U__CUDA_NO_HALF2_OPERATORS__ -U__CUDA_NO_BFLOAT16_CONVERSIONS__ --expt-relaxed-constexpr --expt-extended-lambda --use_fast_math --ptxas-options=-v -lineinfo -gencode arch=compute_80,code=sm_80 --threads 4 -DTORCH_API_INCLUDE_EXTENSION_H '-DPYBIND11_COMPILER_TYPE="_gcc"' '-DPYBIND11_STDLIB="_libstdcpp"' '-DPYBIND11_BUILD_ABI="_cxxabi1011"' -DTORCH_EXTENSION_NAME=flash_attn_2_cuda -D_GLIBCXX_USE_CXX11_ABI=0
  /opt/conda/envs/ptca/lib/python3.8/site-packages/torch/include/c10/util/irange.h(54): warning #186-D: pointless comparison of unsigned integer with zero
            detected during:
              instantiation of "__nv_bool c10::detail::integer_iterator<I, one_sided, <unnamed>>::operator==(const c10::detail::integer_iterator<I, one_sided, <unnamed>> &) const [with I=size_t, one_sided=false, <unnamed>=0]"
  (61): here
              instantiation of "__nv_bool c10::detail::integer_iterator<I, one_sided, <unnamed>>::operator!=(const c10::detail::integer_iterator<I, one_sided, <unnamed>> &) const [with I=size_t, one_sided=false, <unnamed>=0]"
  /opt/conda/envs/ptca/lib/python3.8/site-packages/torch/include/c10/core/TensorImpl.h(77): here

  /opt/conda/envs/ptca/lib/python3.8/site-packages/torch/include/c10/util/irange.h(54): warning #186-D: pointless comparison of unsigned integer with zero
            detected during:
              instantiation of "__nv_bool c10::detail::integer_iterator<I, one_sided, <unnamed>>::operator==(const c10::detail::integer_iterator<I, one_sided, <unnamed>> &) const [with I=size_t, one_sided=true, <unnamed>=0]"
  (61): here
              instantiation of "__nv_bool c10::detail::integer_iterator<I, one_sided, <unnamed>>::operator!=(const c10::detail::integer_iterator<I, one_sided, <unnamed>> &) const [with I=size_t, one_sided=true, <unnamed>=0]"
  /opt/conda/envs/ptca/lib/python3.8/site-packages/torch/include/c10/core/TensorImpl.h(2327): here
              instantiation of "__nv_bool c10::TensorImpl::SetDimsTemplate(c10::ArrayRef<T>) [with T=int64_t, <unnamed>=void]"
  /opt/conda/envs/ptca/lib/python3.8/site-packages/torch/include/c10/core/TensorImpl.h(2337): here

  ptxas info    : 2 bytes gmem
  ptxas info    : Compiling entry function '_Z44flash_bwd_dq_dk_dv_loop_seqk_parallel_kernelI23Flash_bwd_kernel_traitsILi128ELi64ELi64ELi8ELi4ELi2ELi2ELb1ELb0EN7cutlass6half_tE19Flash_kernel_traitsILi128ELi64ELi64ELi8ES2_EELb0ELb0ELb0ELb0EEv16Flash_bwd_params' for 'sm_80'
  ptxas info    : Function properties for _Z44flash_bwd_dq_dk_dv_loop_seqk_parallel_kernelI23Flash_bwd_kernel_traitsILi128ELi64ELi64ELi8ELi4ELi2ELi2ELb1ELb0EN7cutlass6half_tE19Flash_kernel_traitsILi128ELi64ELi64ELi8ES2_EELb0ELb0ELb0ELb0EEv16Flash_bwd_params
      0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
  ptxas info    : Used 252 registers, 696 bytes cmem[0]
  ptxas info    : Compiling entry function '_Z44flash_bwd_dq_dk_dv_loop_seqk_parallel_kernelI23Flash_bwd_kernel_traitsILi128ELi64ELi64ELi8ELi4ELi2ELi2ELb1ELb0EN7cutlass6half_tE19Flash_kernel_traitsILi128ELi64ELi64ELi8ES2_EELb0ELb0ELb0ELb1EEv16Flash_bwd_params' for 'sm_80'
  ptxas info    : Function properties for _Z44flash_bwd_dq_dk_dv_loop_seqk_parallel_kernelI23Flash_bwd_kernel_traitsILi128ELi64ELi64ELi8ELi4ELi2ELi2ELb1ELb0EN7cutlass6half_tE19Flash_kernel_traitsILi128ELi64ELi64ELi8ES2_EELb0ELb0ELb0ELb1EEv16Flash_bwd_params
      0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
  ptxas info    : Used 252 registers, 696 bytes cmem[0]
  ptxas info    : Compiling entry function '_Z44flash_bwd_dq_dk_dv_loop_seqk_parallel_kernelI23Flash_bwd_kernel_traitsILi128ELi64ELi64ELi8ELi4ELi2ELi2ELb1ELb0EN7cutlass6half_tE19Flash_kernel_traitsILi128ELi64ELi64ELi8ES2_EELb0ELb0ELb1ELb0EEv16Flash_bwd_params' for 'sm_80'
  ptxas info    : Function properties for _Z44flash_bwd_dq_dk_dv_loop_seqk_parallel_kernelI23Flash_bwd_kernel_traitsILi128ELi64ELi64ELi8ELi4ELi2ELi2ELb1ELb0EN7cutlass6half_tE19Flash_kernel_traitsILi128ELi64ELi64ELi8ES2_EELb0ELb0ELb1ELb0EEv16Flash_bwd_params
      0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
  ptxas info    : Used 255 registers, 696 bytes cmem[0]
  ptxas info    : Compiling entry function '_Z44flash_bwd_dq_dk_dv_loop_seqk_parallel_kernelI23Flash_bwd_kernel_traitsILi128ELi64ELi64ELi8ELi4ELi2ELi2ELb1ELb0EN7cutlass6half_tE19Flash_kernel_traitsILi128ELi64ELi64ELi8ES2_EELb0ELb0ELb1ELb1EEv16Flash_bwd_params' for 'sm_80'
  ptxas info    : Function properties for _Z44flash_bwd_dq_dk_dv_loop_seqk_parallel_kernelI23Flash_bwd_kernel_traitsILi128ELi64ELi64ELi8ELi4ELi2ELi2ELb1ELb0EN7cutlass6half_tE19Flash_kernel_traitsILi128ELi64ELi64ELi8ES2_EELb0ELb0ELb1ELb1EEv16Flash_bwd_params
      0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
  ptxas info    : Used 255 registers, 696 bytes cmem[0]
  ptxas info    : Compiling entry function '_Z44flash_bwd_dq_dk_dv_loop_seqk_parallel_kernelI23Flash_bwd_kernel_traitsILi128ELi64ELi64ELi8ELi4ELi2ELi2ELb1ELb0EN7cutlass6half_tE19Flash_kernel_traitsILi128ELi64ELi64ELi8ES2_EELb0ELb1ELb0ELb0EEv16Flash_bwd_params' for 'sm_80'
  ptxas info    : Function properties for _Z44flash_bwd_dq_dk_dv_loop_seqk_parallel_kernelI23Flash_bwd_kernel_traitsILi128ELi64ELi64ELi8ELi4ELi2ELi2ELb1ELb0EN7cutlass6half_tE19Flash_kernel_traitsILi128ELi64ELi64ELi8ES2_EELb0ELb1ELb0ELb0EEv16Flash_bwd_params
      0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
  ptxas info    : Used 252 registers, 696 bytes cmem[0]
  ptxas info    : Compiling entry function '_Z44flash_bwd_dq_dk_dv_loop_seqk_parallel_kernelI23Flash_bwd_kernel_traitsILi128ELi64ELi64ELi8ELi4ELi2ELi2ELb1ELb0EN7cutlass6half_tE19Flash_kernel_traitsILi128ELi64ELi64ELi8ES2_EELb0ELb1ELb0ELb1EEv16Flash_bwd_params' for 'sm_80'
  ptxas info    : Function properties for _Z44flash_bwd_dq_dk_dv_loop_seqk_parallel_kernelI23Flash_bwd_kernel_traitsILi128ELi64ELi64ELi8ELi4ELi2ELi2ELb1ELb0EN7cutlass6half_tE19Flash_kernel_traitsILi128ELi64ELi64ELi8ES2_EELb0ELb1ELb0ELb1EEv16Flash_bwd_params
      0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
  ptxas info    : Used 255 registers, 696 bytes cmem[0]
  ptxas info    : Compiling entry function '_Z44flash_bwd_dq_dk_dv_loop_seqk_parallel_kernelI23Flash_bwd_kernel_traitsILi128ELi64ELi64ELi8ELi4ELi2ELi2ELb1ELb0EN7cutlass6half_tE19Flash_kernel_traitsILi128ELi64ELi64ELi8ES2_EELb0ELb1ELb1ELb0EEv16Flash_bwd_params' for 'sm_80'
  ptxas info    : Function properties for _Z44flash_bwd_dq_dk_dv_loop_seqk_parallel_kernelI23Flash_bwd_kernel_traitsILi128ELi64ELi64ELi8ELi4ELi2ELi2ELb1ELb0EN7cutlass6half_tE19Flash_kernel_traitsILi128ELi64ELi64ELi8ES2_EELb0ELb1ELb1ELb0EEv16Flash_bwd_params
      0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
  ptxas info    : Used 255 registers, 696 bytes cmem[0]
  ptxas info    : Compiling entry function '_Z44flash_bwd_dq_dk_dv_loop_seqk_parallel_kernelI23Flash_bwd_kernel_traitsILi128ELi64ELi64ELi8ELi4ELi2ELi2ELb1ELb0EN7cutlass6half_tE19Flash_kernel_traitsILi128ELi64ELi64ELi8ES2_EELb0ELb1ELb1ELb1EEv16Flash_bwd_params' for 'sm_80'
  ptxas info    : Function properties for _Z44flash_bwd_dq_dk_dv_loop_seqk_parallel_kernelI23Flash_bwd_kernel_traitsILi128ELi64ELi64ELi8ELi4ELi2ELi2ELb1ELb0EN7cutlass6half_tE19Flash_kernel_traitsILi128ELi64ELi64ELi8ES2_EELb0ELb1ELb1ELb1EEv16Flash_bwd_params
      0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
  ptxas info    : Used 255 registers, 696 bytes cmem[0]
  ptxas info    : Compiling entry function '_Z44flash_bwd_dq_dk_dv_loop_seqk_parallel_kernelI23Flash_bwd_kernel_traitsILi128ELi64ELi128ELi8ELi2ELi4ELi2ELb0ELb0EN7cutlass6half_tE19Flash_kernel_traitsILi128ELi64ELi128ELi8ES2_EELb0ELb0ELb0ELb0EEv16Flash_bwd_params' for 'sm_80'
  ptxas info    : Function properties for _Z44flash_bwd_dq_dk_dv_loop_seqk_parallel_kernelI23Flash_bwd_kernel_traitsILi128ELi64ELi128ELi8ELi2ELi4ELi2ELb0ELb0EN7cutlass6half_tE19Flash_kernel_traitsILi128ELi64ELi128ELi8ES2_EELb0ELb0ELb0ELb0EEv16Flash_bwd_params
      0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
  ptxas info    : Used 255 registers, 696 bytes cmem[0]
  ptxas info    : Compiling entry function '_Z44flash_bwd_dq_dk_dv_loop_seqk_parallel_kernelI23Flash_bwd_kernel_traitsILi128ELi64ELi128ELi8ELi2ELi4ELi2ELb0ELb0EN7cutlass6half_tE19Flash_kernel_traitsILi128ELi64ELi128ELi8ES2_EELb0ELb0ELb0ELb1EEv16Flash_bwd_params' for 'sm_80'
  ptxas info    : Function properties for _Z44flash_bwd_dq_dk_dv_loop_seqk_parallel_kernelI23Flash_bwd_kernel_traitsILi128ELi64ELi128ELi8ELi2ELi4ELi2ELb0ELb0EN7cutlass6half_tE19Flash_kernel_traitsILi128ELi64ELi128ELi8ES2_EELb0ELb0ELb0ELb1EEv16Flash_bwd_params
      0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
  ptxas info    : Used 255 registers, 696 bytes cmem[0]
  ptxas info    : Compiling entry function '_Z44flash_bwd_dq_dk_dv_loop_seqk_parallel_kernelI23Flash_bwd_kernel_traitsILi128ELi64ELi128ELi8ELi2ELi4ELi2ELb0ELb0EN7cutlass6half_tE19Flash_kernel_traitsILi128ELi64ELi128ELi8ES2_EELb0ELb0ELb1ELb0EEv16Flash_bwd_params' for 'sm_80'
  ptxas info    : Function properties for _Z44flash_bwd_dq_dk_dv_loop_seqk_parallel_kernelI23Flash_bwd_kernel_traitsILi128ELi64ELi128ELi8ELi2ELi4ELi2ELb0ELb0EN7cutlass6half_tE19Flash_kernel_traitsILi128ELi64ELi128ELi8ES2_EELb0ELb0ELb1ELb0EEv16Flash_bwd_params
      0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
  ptxas info    : Used 255 registers, 696 bytes cmem[0]
  ptxas info    : Compiling entry function '_Z44flash_bwd_dq_dk_dv_loop_seqk_parallel_kernelI23Flash_bwd_kernel_traitsILi128ELi64ELi128ELi8ELi2ELi4ELi2ELb0ELb0EN7cutlass6half_tE19Flash_kernel_traitsILi128ELi64ELi128ELi8ES2_EELb0ELb0ELb1ELb1EEv16Flash_bwd_params' for 'sm_80'
  ptxas info    : Function properties for _Z44flash_bwd_dq_dk_dv_loop_seqk_parallel_kernelI23Flash_bwd_kernel_traitsILi128ELi64ELi128ELi8ELi2ELi4ELi2ELb0ELb0EN7cutlass6half_tE19Flash_kernel_traitsILi128ELi64ELi128ELi8ES2_EELb0ELb0ELb1ELb1EEv16Flash_bwd_params
      0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
  ptxas info    : Used 255 registers, 696 bytes cmem[0]
  ptxas info    : Compiling entry function '_Z44flash_bwd_dq_dk_dv_loop_seqk_parallel_kernelI23Flash_bwd_kernel_traitsILi128ELi64ELi128ELi8ELi2ELi4ELi2ELb0ELb0EN7cutlass6half_tE19Flash_kernel_traitsILi128ELi64ELi128ELi8ES2_EELb0ELb1ELb0ELb0EEv16Flash_bwd_params' for 'sm_80'
  ptxas info    : Function properties for _Z44flash_bwd_dq_dk_dv_loop_seqk_parallel_kernelI23Flash_bwd_kernel_traitsILi128ELi64ELi128ELi8ELi2ELi4ELi2ELb0ELb0EN7cutlass6half_tE19Flash_kernel_traitsILi128ELi64ELi128ELi8ES2_EELb0ELb1ELb0ELb0EEv16Flash_bwd_params
      0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
  ptxas info    : Used 255 registers, 696 bytes cmem[0]
  ptxas info    : Compiling entry function '_Z44flash_bwd_dq_dk_dv_loop_seqk_parallel_kernelI23Flash_bwd_kernel_traitsILi128ELi64ELi128ELi8ELi2ELi4ELi2ELb0ELb0EN7cutlass6half_tE19Flash_kernel_traitsILi128ELi64ELi128ELi8ES2_EELb0ELb1ELb0ELb1EEv16Flash_bwd_params' for 'sm_80'
  ptxas info    : Function properties for _Z44flash_bwd_dq_dk_dv_loop_seqk_parallel_kernelI23Flash_bwd_kernel_traitsILi128ELi64ELi128ELi8ELi2ELi4ELi2ELb0ELb0EN7cutlass6half_tE19Flash_kernel_traitsILi128ELi64ELi128ELi8ES2_EELb0ELb1ELb0ELb1EEv16Flash_bwd_params
      0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
  ptxas info    : Used 255 registers, 696 bytes cmem[0]
  ptxas info    : Compiling entry function '_Z44flash_bwd_dq_dk_dv_loop_seqk_parallel_kernelI23Flash_bwd_kernel_traitsILi128ELi64ELi128ELi8ELi2ELi4ELi2ELb0ELb0EN7cutlass6half_tE19Flash_kernel_traitsILi128ELi64ELi128ELi8ES2_EELb0ELb1ELb1ELb0EEv16Flash_bwd_params' for 'sm_80'
  ptxas info    : Function properties for _Z44flash_bwd_dq_dk_dv_loop_seqk_parallel_kernelI23Flash_bwd_kernel_traitsILi128ELi64ELi128ELi8ELi2ELi4ELi2ELb0ELb0EN7cutlass6half_tE19Flash_kernel_traitsILi128ELi64ELi128ELi8ES2_EELb0ELb1ELb1ELb0EEv16Flash_bwd_params
      0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
  ptxas info    : Used 255 registers, 696 bytes cmem[0]
  ptxas info    : Compiling entry function '_Z44flash_bwd_dq_dk_dv_loop_seqk_parallel_kernelI23Flash_bwd_kernel_traitsILi128ELi64ELi128ELi8ELi2ELi4ELi2ELb0ELb0EN7cutlass6half_tE19Flash_kernel_traitsILi128ELi64ELi128ELi8ES2_EELb0ELb1ELb1ELb1EEv16Flash_bwd_params' for 'sm_80'
  ptxas info    : Function properties for _Z44flash_bwd_dq_dk_dv_loop_seqk_parallel_kernelI23Flash_bwd_kernel_traitsILi128ELi64ELi128ELi8ELi2ELi4ELi2ELb0ELb0EN7cutlass6half_tE19Flash_kernel_traitsILi128ELi64ELi128ELi8ES2_EELb0ELb1ELb1ELb1EEv16Flash_bwd_params
      0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
  ptxas info    : Used 255 registers, 696 bytes cmem[0]
  ptxas info    : Compiling entry function '_Z27flash_bwd_convert_dq_kernelI23Flash_bwd_kernel_traitsILi128ELi64ELi64ELi8ELi4ELi2ELi2ELb1ELb0EN7cutlass6half_tE19Flash_kernel_traitsILi128ELi64ELi64ELi8ES2_EEEv16Flash_bwd_params' for 'sm_80'
  ptxas info    : Function properties for _Z27flash_bwd_convert_dq_kernelI23Flash_bwd_kernel_traitsILi128ELi64ELi64ELi8ELi4ELi2ELi2ELb1ELb0EN7cutlass6half_tE19Flash_kernel_traitsILi128ELi64ELi64ELi8ES2_EEEv16Flash_bwd_params
      0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
  ptxas info    : Used 32 registers, 696 bytes cmem[0]
  ptxas info    : Compiling entry function '_Z44flash_bwd_dq_dk_dv_loop_seqk_parallel_kernelI23Flash_bwd_kernel_traitsILi128ELi64ELi64ELi8ELi4ELi2ELi2ELb1ELb0EN7cutlass6half_tE19Flash_kernel_traitsILi128ELi64ELi64ELi8ES2_EELb1ELb0ELb0ELb0EEv16Flash_bwd_params' for 'sm_80'
  ptxas info    : Function properties for _Z44flash_bwd_dq_dk_dv_loop_seqk_parallel_kernelI23Flash_bwd_kernel_traitsILi128ELi64ELi64ELi8ELi4ELi2ELi2ELb1ELb0EN7cutlass6half_tE19Flash_kernel_traitsILi128ELi64ELi64ELi8ES2_EELb1ELb0ELb0ELb0EEv16Flash_bwd_params
      0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
  ptxas info    : Used 255 registers, 696 bytes cmem[0]
  ptxas info    : Compiling entry function '_Z44flash_bwd_dq_dk_dv_loop_seqk_parallel_kernelI23Flash_bwd_kernel_traitsILi128ELi64ELi64ELi8ELi4ELi2ELi2ELb1ELb0EN7cutlass6half_tE19Flash_kernel_traitsILi128ELi64ELi64ELi8ES2_EELb1ELb0ELb0ELb1EEv16Flash_bwd_params' for 'sm_80'
  ptxas info    : Function properties for _Z44flash_bwd_dq_dk_dv_loop_seqk_parallel_kernelI23Flash_bwd_kernel_traitsILi128ELi64ELi64ELi8ELi4ELi2ELi2ELb1ELb0EN7cutlass6half_tE19Flash_kernel_traitsILi128ELi64ELi64ELi8ES2_EELb1ELb0ELb0ELb1EEv16Flash_bwd_params
      0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
  ptxas info    : Used 255 registers, 696 bytes cmem[0]
  ptxas info    : Compiling entry function '_Z44flash_bwd_dq_dk_dv_loop_seqk_parallel_kernelI23Flash_bwd_kernel_traitsILi128ELi64ELi64ELi8ELi4ELi2ELi2ELb1ELb0EN7cutlass6half_tE19Flash_kernel_traitsILi128ELi64ELi64ELi8ES2_EELb1ELb0ELb1ELb0EEv16Flash_bwd_params' for 'sm_80'
  ptxas info    : Function properties for _Z44flash_bwd_dq_dk_dv_loop_seqk_parallel_kernelI23Flash_bwd_kernel_traitsILi128ELi64ELi64ELi8ELi4ELi2ELi2ELb1ELb0EN7cutlass6half_tE19Flash_kernel_traitsILi128ELi64ELi64ELi8ES2_EELb1ELb0ELb1ELb0EEv16Flash_bwd_params
      0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
  ptxas info    : Used 255 registers, 696 bytes cmem[0]
  ptxas info    : Compiling entry function '_Z44flash_bwd_dq_dk_dv_loop_seqk_parallel_kernelI23Flash_bwd_kernel_traitsILi128ELi64ELi64ELi8ELi4ELi2ELi2ELb1ELb0EN7cutlass6half_tE19Flash_kernel_traitsILi128ELi64ELi64ELi8ES2_EELb1ELb0ELb1ELb1EEv16Flash_bwd_params' for 'sm_80'
  ptxas info    : Function properties for _Z44flash_bwd_dq_dk_dv_loop_seqk_parallel_kernelI23Flash_bwd_kernel_traitsILi128ELi64ELi64ELi8ELi4ELi2ELi2ELb1ELb0EN7cutlass6half_tE19Flash_kernel_traitsILi128ELi64ELi64ELi8ES2_EELb1ELb0ELb1ELb1EEv16Flash_bwd_params
      0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
  ptxas info    : Used 255 registers, 696 bytes cmem[0]
  ptxas info    : Compiling entry function '_Z44flash_bwd_dq_dk_dv_loop_seqk_parallel_kernelI23Flash_bwd_kernel_traitsILi128ELi64ELi64ELi8ELi4ELi2ELi2ELb1ELb0EN7cutlass6half_tE19Flash_kernel_traitsILi128ELi64ELi64ELi8ES2_EELb1ELb1ELb0ELb0EEv16Flash_bwd_params' for 'sm_80'
  ptxas info    : Function properties for _Z44flash_bwd_dq_dk_dv_loop_seqk_parallel_kernelI23Flash_bwd_kernel_traitsILi128ELi64ELi64ELi8ELi4ELi2ELi2ELb1ELb0EN7cutlass6half_tE19Flash_kernel_traitsILi128ELi64ELi64ELi8ES2_EELb1ELb1ELb0ELb0EEv16Flash_bwd_params
      0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
  ptxas info    : Used 255 registers, 696 bytes cmem[0]
  ptxas info    : Compiling entry function '_Z44flash_bwd_dq_dk_dv_loop_seqk_parallel_kernelI23Flash_bwd_kernel_traitsILi128ELi64ELi64ELi8ELi4ELi2ELi2ELb1ELb0EN7cutlass6half_tE19Flash_kernel_traitsILi128ELi64ELi64ELi8ES2_EELb1ELb1ELb0ELb1EEv16Flash_bwd_params' for 'sm_80'
  ptxas info    : Function properties for _Z44flash_bwd_dq_dk_dv_loop_seqk_parallel_kernelI23Flash_bwd_kernel_traitsILi128ELi64ELi64ELi8ELi4ELi2ELi2ELb1ELb0EN7cutlass6half_tE19Flash_kernel_traitsILi128ELi64ELi64ELi8ES2_EELb1ELb1ELb0ELb1EEv16Flash_bwd_params
      0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
  ptxas info    : Used 252 registers, 696 bytes cmem[0]
  ptxas info    : Compiling entry function '_Z44flash_bwd_dq_dk_dv_loop_seqk_parallel_kernelI23Flash_bwd_kernel_traitsILi128ELi64ELi64ELi8ELi4ELi2ELi2ELb1ELb0EN7cutlass6half_tE19Flash_kernel_traitsILi128ELi64ELi64ELi8ES2_EELb1ELb1ELb1ELb0EEv16Flash_bwd_params' for 'sm_80'
  ptxas info    : Function properties for _Z44flash_bwd_dq_dk_dv_loop_seqk_parallel_kernelI23Flash_bwd_kernel_traitsILi128ELi64ELi64ELi8ELi4ELi2ELi2ELb1ELb0EN7cutlass6half_tE19Flash_kernel_traitsILi128ELi64ELi64ELi8ES2_EELb1ELb1ELb1ELb0EEv16Flash_bwd_params
      0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
  ptxas info    : Used 252 registers, 696 bytes cmem[0]
  ptxas info    : Compiling entry function '_Z44flash_bwd_dq_dk_dv_loop_seqk_parallel_kernelI23Flash_bwd_kernel_traitsILi128ELi64ELi64ELi8ELi4ELi2ELi2ELb1ELb0EN7cutlass6half_tE19Flash_kernel_traitsILi128ELi64ELi64ELi8ES2_EELb1ELb1ELb1ELb1EEv16Flash_bwd_params' for 'sm_80'
  ptxas info    : Function properties for _Z44flash_bwd_dq_dk_dv_loop_seqk_parallel_kernelI23Flash_bwd_kernel_traitsILi128ELi64ELi64ELi8ELi4ELi2ELi2ELb1ELb0EN7cutlass6half_tE19Flash_kernel_traitsILi128ELi64ELi64ELi8ES2_EELb1ELb1ELb1ELb1EEv16Flash_bwd_params
      0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
  ptxas info    : Used 254 registers, 696 bytes cmem[0]
  ptxas info    : Compiling entry function '_Z25flash_bwd_dot_do_o_kernelILb1E23Flash_bwd_kernel_traitsILi128ELi64ELi64ELi8ELi4ELi2ELi2ELb1ELb0EN7cutlass6half_tE19Flash_kernel_traitsILi128ELi64ELi64ELi8ES2_EEEv16Flash_bwd_params' for 'sm_80'
  ptxas info    : Function properties for _Z25flash_bwd_dot_do_o_kernelILb1E23Flash_bwd_kernel_traitsILi128ELi64ELi64ELi8ELi4ELi2ELi2ELb1ELb0EN7cutlass6half_tE19Flash_kernel_traitsILi128ELi64ELi64ELi8ES2_EEEv16Flash_bwd_params
      0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
  ptxas info    : Used 46 registers, 696 bytes cmem[0]
  ptxas info    : Compiling entry function '_Z27flash_bwd_convert_dq_kernelI23Flash_bwd_kernel_traitsILi128ELi64ELi128ELi8ELi2ELi4ELi2ELb0ELb0EN7cutlass6half_tE19Flash_kernel_traitsILi128ELi64ELi128ELi8ES2_EEEv16Flash_bwd_params' for 'sm_80'
  ptxas info    : Function properties for _Z27flash_bwd_convert_dq_kernelI23Flash_bwd_kernel_traitsILi128ELi64ELi128ELi8ELi2ELi4ELi2ELb0ELb0EN7cutlass6half_tE19Flash_kernel_traitsILi128ELi64ELi128ELi8ES2_EEEv16Flash_bwd_params
      0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
  ptxas info    : Used 32 registers, 696 bytes cmem[0]
  ptxas info    : Compiling entry function '_Z44flash_bwd_dq_dk_dv_loop_seqk_parallel_kernelI23Flash_bwd_kernel_traitsILi128ELi64ELi128ELi8ELi2ELi4ELi2ELb0ELb0EN7cutlass6half_tE19Flash_kernel_traitsILi128ELi64ELi128ELi8ES2_EELb1ELb0ELb0ELb0EEv16Flash_bwd_params' for 'sm_80'
  ptxas info    : Function properties for _Z44flash_bwd_dq_dk_dv_loop_seqk_parallel_kernelI23Flash_bwd_kernel_traitsILi128ELi64ELi128ELi8ELi2ELi4ELi2ELb0ELb0EN7cutlass6half_tE19Flash_kernel_traitsILi128ELi64ELi128ELi8ES2_EELb1ELb0ELb0ELb0EEv16Flash_bwd_params
      0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
  ptxas info    : Used 255 registers, 696 bytes cmem[0]
  ptxas info    : Compiling entry function '_Z44flash_bwd_dq_dk_dv_loop_seqk_parallel_kernelI23Flash_bwd_kernel_traitsILi128ELi64ELi128ELi8ELi2ELi4ELi2ELb0ELb0EN7cutlass6half_tE19Flash_kernel_traitsILi128ELi64ELi128ELi8ES2_EELb1ELb0ELb0ELb1EEv16Flash_bwd_params' for 'sm_80'
  ptxas info    : Function properties for _Z44flash_bwd_dq_dk_dv_loop_seqk_parallel_kernelI23Flash_bwd_kernel_traitsILi128ELi64ELi128ELi8ELi2ELi4ELi2ELb0ELb0EN7cutlass6half_tE19Flash_kernel_traitsILi128ELi64ELi128ELi8ES2_EELb1ELb0ELb0ELb1EEv16Flash_bwd_params
      0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
  ptxas info    : Used 255 registers, 696 bytes cmem[0]
  ptxas info    : Compiling entry function '_Z44flash_bwd_dq_dk_dv_loop_seqk_parallel_kernelI23Flash_bwd_kernel_traitsILi128ELi64ELi128ELi8ELi2ELi4ELi2ELb0ELb0EN7cutlass6half_tE19Flash_kernel_traitsILi128ELi64ELi128ELi8ES2_EELb1ELb0ELb1ELb0EEv16Flash_bwd_params' for 'sm_80'
  ptxas info    : Function properties for _Z44flash_bwd_dq_dk_dv_loop_seqk_parallel_kernelI23Flash_bwd_kernel_traitsILi128ELi64ELi128ELi8ELi2ELi4ELi2ELb0ELb0EN7cutlass6half_tE19Flash_kernel_traitsILi128ELi64ELi128ELi8ES2_EELb1ELb0ELb1ELb0EEv16Flash_bwd_params
      0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
  ptxas info    : Used 254 registers, 696 bytes cmem[0]
  ptxas info    : Compiling entry function '_Z44flash_bwd_dq_dk_dv_loop_seqk_parallel_kernelI23Flash_bwd_kernel_traitsILi128ELi64ELi128ELi8ELi2ELi4ELi2ELb0ELb0EN7cutlass6half_tE19Flash_kernel_traitsILi128ELi64ELi128ELi8ES2_EELb1ELb0ELb1ELb1EEv16Flash_bwd_params' for 'sm_80'
  ptxas info    : Function properties for _Z44flash_bwd_dq_dk_dv_loop_seqk_parallel_kernelI23Flash_bwd_kernel_traitsILi128ELi64ELi128ELi8ELi2ELi4ELi2ELb0ELb0EN7cutlass6half_tE19Flash_kernel_traitsILi128ELi64ELi128ELi8ES2_EELb1ELb0ELb1ELb1EEv16Flash_bwd_params
      0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
  ptxas info    : Used 254 registers, 696 bytes cmem[0]
  ptxas info    : Compiling entry function '_Z44flash_bwd_dq_dk_dv_loop_seqk_parallel_kernelI23Flash_bwd_kernel_traitsILi128ELi64ELi128ELi8ELi2ELi4ELi2ELb0ELb0EN7cutlass6half_tE19Flash_kernel_traitsILi128ELi64ELi128ELi8ES2_EELb1ELb1ELb0ELb0EEv16Flash_bwd_params' for 'sm_80'
  ptxas info    : Function properties for _Z44flash_bwd_dq_dk_dv_loop_seqk_parallel_kernelI23Flash_bwd_kernel_traitsILi128ELi64ELi128ELi8ELi2ELi4ELi2ELb0ELb0EN7cutlass6half_tE19Flash_kernel_traitsILi128ELi64ELi128ELi8ES2_EELb1ELb1ELb0ELb0EEv16Flash_bwd_params
      16 bytes stack frame, 12 bytes spill stores, 12 bytes spill loads
  ptxas info    : Used 255 registers, 696 bytes cmem[0]
  ptxas info    : Compiling entry function '_Z44flash_bwd_dq_dk_dv_loop_seqk_parallel_kernelI23Flash_bwd_kernel_traitsILi128ELi64ELi128ELi8ELi2ELi4ELi2ELb0ELb0EN7cutlass6half_tE19Flash_kernel_traitsILi128ELi64ELi128ELi8ES2_EELb1ELb1ELb0ELb1EEv16Flash_bwd_params' for 'sm_80'
  ptxas info    : Function properties for _Z44flash_bwd_dq_dk_dv_loop_seqk_parallel_kernelI23Flash_bwd_kernel_traitsILi128ELi64ELi128ELi8ELi2ELi4ELi2ELb0ELb0EN7cutlass6half_tE19Flash_kernel_traitsILi128ELi64ELi128ELi8ES2_EELb1ELb1ELb0ELb1EEv16Flash_bwd_params
      8 bytes stack frame, 8 bytes spill stores, 8 bytes spill loads
  ptxas info    : Used 255 registers, 696 bytes cmem[0]
  ptxas info    : Compiling entry function '_Z44flash_bwd_dq_dk_dv_loop_seqk_parallel_kernelI23Flash_bwd_kernel_traitsILi128ELi64ELi128ELi8ELi2ELi4ELi2ELb0ELb0EN7cutlass6half_tE19Flash_kernel_traitsILi128ELi64ELi128ELi8ES2_EELb1ELb1ELb1ELb0EEv16Flash_bwd_params' for 'sm_80'
  ptxas info    : Function properties for _Z44flash_bwd_dq_dk_dv_loop_seqk_parallel_kernelI23Flash_bwd_kernel_traitsILi128ELi64ELi128ELi8ELi2ELi4ELi2ELb0ELb0EN7cutlass6half_tE19Flash_kernel_traitsILi128ELi64ELi128ELi8ES2_EELb1ELb1ELb1ELb0EEv16Flash_bwd_params
      0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
  ptxas info    : Used 255 registers, 696 bytes cmem[0]
  ptxas info    : Compiling entry function '_Z44flash_bwd_dq_dk_dv_loop_seqk_parallel_kernelI23Flash_bwd_kernel_traitsILi128ELi64ELi128ELi8ELi2ELi4ELi2ELb0ELb0EN7cutlass6half_tE19Flash_kernel_traitsILi128ELi64ELi128ELi8ES2_EELb1ELb1ELb1ELb1EEv16Flash_bwd_params' for 'sm_80'
  ptxas info    : Function properties for _Z44flash_bwd_dq_dk_dv_loop_seqk_parallel_kernelI23Flash_bwd_kernel_traitsILi128ELi64ELi128ELi8ELi2ELi4ELi2ELb0ELb0EN7cutlass6half_tE19Flash_kernel_traitsILi128ELi64ELi128ELi8ES2_EELb1ELb1ELb1ELb1EEv16Flash_bwd_params
      0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
  ptxas info    : Used 255 registers, 696 bytes cmem[0]
  ptxas info    : Compiling entry function '_Z25flash_bwd_dot_do_o_kernelILb1E23Flash_bwd_kernel_traitsILi128ELi64ELi128ELi8ELi2ELi4ELi2ELb0ELb0EN7cutlass6half_tE19Flash_kernel_traitsILi128ELi64ELi128ELi8ES2_EEEv16Flash_bwd_params' for 'sm_80'
  ptxas info    : Function properties for _Z25flash_bwd_dot_do_o_kernelILb1E23Flash_bwd_kernel_traitsILi128ELi64ELi128ELi8ELi2ELi4ELi2ELb0ELb0EN7cutlass6half_tE19Flash_kernel_traitsILi128ELi64ELi128ELi8ES2_EEEv16Flash_bwd_params
      0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
  ptxas info    : Used 46 registers, 696 bytes cmem[0]
  ninja: build stopped: subcommand failed.
  Traceback (most recent call last):
    File "/opt/conda/envs/ptca/lib/python3.8/site-packages/torch/utils/cpp_extension.py", line 1893, in _run_ninja_build
      subprocess.run(
    File "/opt/conda/envs/ptca/lib/python3.8/subprocess.py", line 516, in run
      raise CalledProcessError(retcode, process.args,
  subprocess.CalledProcessError: Command '['ninja', '-v']' returned non-zero exit status 1.

  The above exception was the direct cause of the following exception:

  Traceback (most recent call last):
    File "<string>", line 2, in <module>
    File "<pip-setuptools-caller>", line 34, in <module>
    File "/tmp/pip-install-qabfcpz5/flash-attn_a0d5561303b0450dbca282f73f7bdd3d/setup.py", line 202, in <module>
      setup(
    File "/opt/conda/envs/ptca/lib/python3.8/site-packages/setuptools/__init__.py", line 108, in setup
      return distutils.core.setup(**attrs)
    File "/opt/conda/envs/ptca/lib/python3.8/site-packages/setuptools/_distutils/core.py", line 185, in setup
      return run_commands(dist)
    File "/opt/conda/envs/ptca/lib/python3.8/site-packages/setuptools/_distutils/core.py", line 201, in run_commands
      dist.run_commands()
    File "/opt/conda/envs/ptca/lib/python3.8/site-packages/setuptools/_distutils/dist.py", line 969, in run_commands
      self.run_command(cmd)
    File "/opt/conda/envs/ptca/lib/python3.8/site-packages/setuptools/dist.py", line 1221, in run_command
      super().run_command(command)
    File "/opt/conda/envs/ptca/lib/python3.8/site-packages/setuptools/_distutils/dist.py", line 988, in run_command
      cmd_obj.run()
    File "/opt/conda/envs/ptca/lib/python3.8/site-packages/setuptools/command/install.py", line 68, in run
      return orig.install.run(self)
    File "/opt/conda/envs/ptca/lib/python3.8/site-packages/setuptools/_distutils/command/install.py", line 697, in run
      self.run_command('build')
    File "/opt/conda/envs/ptca/lib/python3.8/site-packages/setuptools/_distutils/cmd.py", line 318, in run_command
      self.distribution.run_command(command)
    File "/opt/conda/envs/ptca/lib/python3.8/site-packages/setuptools/dist.py", line 1221, in run_command
      super().run_command(command)
    File "/opt/conda/envs/ptca/lib/python3.8/site-packages/setuptools/_distutils/dist.py", line 988, in run_command
      cmd_obj.run()
    File "/opt/conda/envs/ptca/lib/python3.8/site-packages/setuptools/_distutils/command/build.py", line 131, in run
      self.run_command(cmd_name)
    File "/opt/conda/envs/ptca/lib/python3.8/site-packages/setuptools/_distutils/cmd.py", line 318, in run_command
      self.distribution.run_command(command)
    File "/opt/conda/envs/ptca/lib/python3.8/site-packages/setuptools/dist.py", line 1221, in run_command
      super().run_command(command)
    File "/opt/conda/envs/ptca/lib/python3.8/site-packages/setuptools/_distutils/dist.py", line 988, in run_command
      cmd_obj.run()
    File "/opt/conda/envs/ptca/lib/python3.8/site-packages/setuptools/command/build_ext.py", line 84, in run
      _build_ext.run(self)
    File "/opt/conda/envs/ptca/lib/python3.8/site-packages/setuptools/_distutils/command/build_ext.py", line 345, in run
      self.build_extensions()
    File "/opt/conda/envs/ptca/lib/python3.8/site-packages/torch/utils/cpp_extension.py", line 843, in build_extensions
      build_ext.build_extensions(self)
    File "/opt/conda/envs/ptca/lib/python3.8/site-packages/setuptools/_distutils/command/build_ext.py", line 467, in build_extensions
      self._build_extensions_serial()
    File "/opt/conda/envs/ptca/lib/python3.8/site-packages/setuptools/_distutils/command/build_ext.py", line 493, in _build_extensions_serial
      self.build_extension(ext)
    File "/opt/conda/envs/ptca/lib/python3.8/site-packages/setuptools/command/build_ext.py", line 246, in build_extension
      _build_ext.build_extension(self, ext)
    File "/opt/conda/envs/ptca/lib/python3.8/site-packages/setuptools/_distutils/command/build_ext.py", line 548, in build_extension
      objects = self.compiler.compile(
    File "/opt/conda/envs/ptca/lib/python3.8/site-packages/torch/utils/cpp_extension.py", line 658, in unix_wrap_ninja_compile
      _write_ninja_file_and_compile_objects(
    File "/opt/conda/envs/ptca/lib/python3.8/site-packages/torch/utils/cpp_extension.py", line 1574, in _write_ninja_file_and_compile_objects
      _run_ninja_build(
    File "/opt/conda/envs/ptca/lib/python3.8/site-packages/torch/utils/cpp_extension.py", line 1909, in _run_ninja_build
      raise RuntimeError(message) from e
  RuntimeError: Error compiling objects for extension
  [end of output]

note: This error originates from a subprocess, and is likely not a problem with pip.
error: legacy-install-failure

× Encountered error while trying to install package.
╰─> flash-attn

note: This is an issue with the package mentioned above, not pip.
hint: See above for output from the failure.

The command '/bin/bash --login -c pip install flash-attn' returned a non-zero code: 1
2023/08/03 22:09:43 Container failed during run: acb_step_0. No retries remaining.
failed to run step ID: acb_step_0: exit status 1

shiqingzhangCSU commented 1 year ago

try MAX_JOBS=4
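
(A minimal sketch of that suggestion: MAX_JOBS caps the parallel compile jobs, as the build log above notes with "overridable by setting the environment variable MAX_JOBS=N". The bare "Killed" line in the log is typical of the kernel's OOM killer ending an nvcc process when memory runs out.)

  # Cap parallel compile jobs so nvcc's peak memory stays within the machine's RAM.
  MAX_JOBS=4 pip install flash-attn --no-build-isolation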

ari9dam commented 1 year ago

Tried MAX_JOBS=4. It failed as well. MAX_JOBS=1 timed out after 1h30m.

ari9dam commented 1 year ago

@tridao ? Any pointers? Thanks in advance!

tridao commented 1 year ago

There's not enough info here (there's no error message from the compilation log pointing to any specific line). You can try the recommended docker file from Nvidia.
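
(For anyone wanting a concrete starting point, a minimal sketch using one of NVIDIA's NGC PyTorch images; the tag below is only an example, but these images ship with a matching CUDA toolkit and nvcc preinstalled:)

  # Example tag only; pick a current release from the NGC catalog.
  docker run --gpus all -it nvcr.io/nvidia/pytorch:23.07-py3
  # then, inside the container:
  MAX_JOBS=4 pip install flash-attn --no-build-isolation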

monuminu commented 1 year ago

How do I add pip install flash-attn --no-build-isolation to requirements.txt?
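
(A hedged note on this: as far as I know, --no-build-isolation is a pip command-line flag and cannot be placed inside requirements.txt itself, so a common workaround is a two-step install:)

  # Install everything else first, then flash-attn with isolation disabled.
  pip install -r requirements.txt
  pip install flash-attn --no-build-isolation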

zhaoyukoon commented 1 year ago

I met with a similar problem; I fixed it by installing the latest PyTorch via:

pip3 install --pre torch torchvision torchaudio --index-url https://download.pytorch.org/whl/nightly/cu121

Besides, using MAX_JOBS=4 would reduce memory usage.

sadransh commented 1 year ago

@ari9dam
IMO, use MAX_JOBS=1 to find out the exact error. In my case:

.../python3.8/site-packages/torch/include/torch/csrc/python_headers.h:12:10: fatal error: Python.h: No such file or directory
   12 | #include <Python.h>
      |          ^~~~~~~~~~

which was addressed with the steps below. Also, make sure nothing else (like a training session) is running on the GPU; I'm not sure how this affects the build, but killing it helped:

sudo apt install python3-dev
pip install --pre torch torchvision torchaudio --index-url https://download.pytorch.org/whl/nightly/cu121
MAX_JOBS=8 pip install flash-attn --no-build-isolation

This was on a 216 GB RAM system. I installed PyTorch with cu121 support, and CUDA 12.1 was also installed manually!

ari9dam commented 1 year ago

I think there are a wide variety of factors in play here. For me, I could not build the Docker image with flash-attn on an A100, but (someone) was able to build the same image on a V100 (which I later used on an A100). I did not see a gain with flash-attn V2 on LLAMA V2; flash-attn V1 performed better than V2. I replaced the LLAMA MHA forward function, as is commonly done in hot-patching. My per-iteration time (per-device batch size 10, seq len 4K) increased from 88 seconds to 94 seconds when I switched to V2.


wbbeyourself commented 1 year ago

torch 2.1.0, cuda 12.1, g++ 10.2.1

Ran:

apt-get update && apt-get install -y g++
pip install packaging
pip install ninja
pip install flash-attn --no-build-isolation

The error is as follows:

Building wheels for collected packages: flash-attn
  Building wheel for flash-attn (setup.py): started
  Building wheel for flash-attn (setup.py): still running...
  Building wheel for flash-attn (setup.py): finished with status 'error'
  error: subprocess-exited-with-error

  × python setup.py bdist_wheel did not run successfully.
  │ exit code: 1
  ╰─> [10 lines of output]
      No CUDA runtime is found, using CUDA_HOME='/usr/local/cuda'
      fatal: not a git repository (or any of the parent directories): .git

      torch.__version__  = 2.1.0.dev20230815+cu121

      running bdist_wheel
      Guessing wheel URL: https://github.com/Dao-AILab/flash-attention/releases/download/v2.1.1/flash_attn-2.1.1+cu121torch2.1cxx11abiFALSE-cp39-cp39-linux_x86_64.whl
      error: Remote end closed connection without response
      [end of output]

  note: This error originates from a subprocess, and is likely not a problem with pip.
  ERROR: Failed building wheel for flash-attn
  Running setup.py clean for flash-attn
Failed to build flash-attn
ERROR: Could not build wheels for flash-attn, which is required to install pyproject.toml-based projects
ERROR: executor failed running [/bin/sh -c pip install flash-attn --no-build-isolation]: runc did not terminate successfully: exit status 1
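
(A hedged aside on the output above: the "Guessing wheel URL" step means setup.py first tries to download a prebuilt wheel from the GitHub releases page, and here that download was cut off with "Remote end closed connection without response". flash-attn's 2.x setup.py also exposes an environment variable to skip the download and compile locally; verify it against your version before relying on it:)

  # Force a local source build instead of fetching the prebuilt wheel
  # (check that your flash-attn version's setup.py supports this variable).
  FLASH_ATTENTION_FORCE_BUILD=TRUE MAX_JOBS=4 pip install flash-attn --no-build-isolation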

rockyoung commented 9 months ago

I've tried installing from both pip and source code, but no luck 😢

$ pip install flash-attn --no-build-isolation
Collecting flash-attn
  Using cached flash_attn-2.4.2.tar.gz (2.4 MB)
  Preparing metadata (setup.py) ... done
Requirement already satisfied: torch in /home/myun/miniconda3/envs/myen/lib/python3.8/site-packages (from flash-attn) (2.1.0)
Requirement already satisfied: einops in /home/myun/miniconda3/envs/myen/lib/python3.8/site-packages (from flash-attn) (0.7.0)
Requirement already satisfied: packaging in /home/myun/miniconda3/envs/myen/lib/python3.8/site-packages (from flash-attn) (23.2)
Collecting ninja (from flash-attn)
  Using cached ninja-1.11.1.1-py2.py3-none-manylinux1_x86_64.manylinux_2_5_x86_64.whl.metadata (5.3 kB)
Requirement already satisfied: filelock in /home/myun/miniconda3/envs/myen/lib/python3.8/site-packages (from torch->flash-attn) (3.13.1)
Requirement already satisfied: typing-extensions in /home/myun/miniconda3/envs/myen/lib/python3.8/site-packages (from torch->flash-attn) (4.9.0)
Requirement already satisfied: sympy in /home/myun/miniconda3/envs/myen/lib/python3.8/site-packages (from torch->flash-attn) (1.12)
Requirement already satisfied: networkx in /home/myun/miniconda3/envs/myen/lib/python3.8/site-packages (from torch->flash-attn) (3.1)
Requirement already satisfied: jinja2 in /home/myun/miniconda3/envs/myen/lib/python3.8/site-packages (from torch->flash-attn) (3.1.2)
Requirement already satisfied: fsspec in /home/myun/miniconda3/envs/myen/lib/python3.8/site-packages (from torch->flash-attn) (2023.10.0)
Requirement already satisfied: MarkupSafe>=2.0 in /home/myun/miniconda3/envs/myen/lib/python3.8/site-packages (from jinja2->torch->flash-attn) (2.1.3)
Requirement already satisfied: mpmath>=0.19 in /home/myun/miniconda3/envs/myen/lib/python3.8/site-packages (from sympy->torch->flash-attn) (1.3.0)
Using cached ninja-1.11.1.1-py2.py3-none-manylinux1_x86_64.manylinux_2_5_x86_64.whl (307 kB)
Building wheels for collected packages: flash-attn
  Building wheel for flash-attn (setup.py) ... error
  error: subprocess-exited-with-error

  × python setup.py bdist_wheel did not run successfully.
  │ exit code: 1
  ╰─> [34 lines of output]
      No CUDA runtime is found, using CUDA_HOME='/usr/local/cuda'
      fatal: not a git repository (or any of the parent directories): .git

      torch.__version__  = 2.1.0

      running bdist_wheel
      Traceback (most recent call last):
        File "<string>", line 2, in <module>
        File "<pip-setuptools-caller>", line 34, in <module>
        File "/tmp/pip-install-5sq4q_lb/flash-attn_ace098e663d9463aad67312fd7b22387/setup.py", line 285, in <module>
          setup(
        File "/home/myun/miniconda3/envs/myen/lib/python3.8/site-packages/setuptools/__init__.py", line 103, in setup
          return distutils.core.setup(**attrs)
        File "/home/myun/miniconda3/envs/myen/lib/python3.8/site-packages/setuptools/_distutils/core.py", line 185, in setup
          return run_commands(dist)
        File "/home/myun/miniconda3/envs/myen/lib/python3.8/site-packages/setuptools/_distutils/core.py", line 201, in run_commands
          dist.run_commands()
        File "/home/myun/miniconda3/envs/myen/lib/python3.8/site-packages/setuptools/_distutils/dist.py", line 969, in run_commands
          self.run_command(cmd)
        File "/home/myun/miniconda3/envs/myen/lib/python3.8/site-packages/setuptools/dist.py", line 963, in run_command
          super().run_command(command)
        File "/home/myun/miniconda3/envs/myen/lib/python3.8/site-packages/setuptools/_distutils/dist.py", line 988, in run_command
          cmd_obj.run()
        File "/tmp/pip-install-5sq4q_lb/flash-attn_ace098e663d9463aad67312fd7b22387/setup.py", line 262, in run
          wheel_url, wheel_filename = get_wheel_url()
        File "/tmp/pip-install-5sq4q_lb/flash-attn_ace098e663d9463aad67312fd7b22387/setup.py", line 231, in get_wheel_url
          torch_cuda_version = parse(torch.version.cuda)
        File "/home/myun/miniconda3/envs/myen/lib/python3.8/site-packages/packaging/version.py", line 54, in parse
          return Version(version)
        File "/home/myun/miniconda3/envs/myen/lib/python3.8/site-packages/packaging/version.py", line 198, in __init__
          match = self._regex.search(version)
      TypeError: expected string or bytes-like object
      [end of output]

  note: This error originates from a subprocess, and is likely not a problem with pip.
  ERROR: Failed building wheel for flash-attn
  Running setup.py clean for flash-attn
Failed to build flash-attn
ERROR: Could not build wheels for flash-attn, which is required to install pyproject.toml-based projects

and checking the /usr/local/cuda dir:

$ ll /usr/local/cuda/
total 144
drwxr-xr-x 17 root root  4096 Dec 26 20:21 ./
drwxr-xr-x 12 root root  4096 Dec 26 20:20 ../
drwxr-xr-x  3 root root  4096 Dec 26 20:21 bin/
drwxr-xr-x  5 root root  4096 Dec 26 20:20 compute-sanitizer/
-rw-r--r--  1 root root   160 Dec 26 20:21 DOCS
-rw-r--r--  1 root root 61498 Dec 26 20:21 EULA.txt
drwxr-xr-x  5 root root  4096 Dec 26 20:21 extras/
drwxr-xr-x  6 root root  4096 Dec 26 20:20 gds/
drwxr-xr-x  2 root root  4096 Dec 26 20:20 gds-12.1/
lrwxrwxrwx  1 root root    28 Dec 26 20:21 include -> targets/x86_64-linux/include/
lrwxrwxrwx  1 root root    24 Dec 26 20:21 lib64 -> targets/x86_64-linux/lib/
drwxr-xr-x  7 root root  4096 Dec 26 20:21 libnvvp/
drwxr-xr-x  7 root root  4096 Dec 26 20:20 nsight-compute-2023.1.0/
drwxr-xr-x  2 root root  4096 Dec 26 20:20 nsightee_plugins/
drwxr-xr-x  6 root root  4096 Dec 26 20:21 nsight-systems-2023.1.2/
drwxr-xr-x  3 root root  4096 Dec 26 20:20 nvml/
drwxr-xr-x  7 root root  4096 Dec 26 20:21 nvvm/
-rw-r--r--  1 root root   524 Dec 26 20:21 README
drwxr-xr-x  3 root root  4096 Dec 26 20:20 share/
drwxr-xr-x  2 root root  4096 Dec 26 20:20 src/
drwxr-xr-x  3 root root  4096 Dec 26 20:20 targets/
drwxr-xr-x  2 root root  4096 Dec 26 20:21 tools/
-rw-r--r--  1 root root  2928 Dec 26 20:20 version.json

and the ninja:

$ ninja --version
1.11.1
$ echo $?
0
zhengxiaodu commented 8 months ago

Because your torch.version.cuda isn't set, your torch wasn't installed correctly. Check your torch with:

$ python
>>> import torch
>>> print(torch.version.cuda)

I had installed CUDA 11.6, different from the system CUDA 10.2, so my torch was automatically reinstalled as the CPU version. How incredible.
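
(A one-line version of that check, grounded in the traceback above: setup.py calls parse(torch.version.cuda), which raises the "expected string or bytes-like object" TypeError when torch.version.cuda is None, i.e. on a CPU-only build:)

  # Prints e.g. "2.1.0+cu121 12.1 True" for a CUDA build,
  # or "2.1.0 None False" for a CPU-only build.
  python -c "import torch; print(torch.__version__, torch.version.cuda, torch.cuda.is_available())"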

AminaAlMarri commented 6 months ago

pip install flash-attn --no-build-isolation
Collecting flash-attn
  Using cached flash_attn-2.5.6.tar.gz (2.5 MB)
  Preparing metadata (setup.py) ... error
  error: subprocess-exited-with-error

× python setup.py egg_info did not run successfully.
│ exit code: 1
╰─> [20 lines of output]
    fatal: not a git repository (or any of the parent directories): .git
    /tmp/pip-install-18brac5p/flash-attn_5e692969183644e58161eed68af0341b/setup.py:78: UserWarning: flash_attn was requested, but nvcc was not found. Are you sure your environment has nvcc available? If you're installing within a container from https://hub.docker.com/r/pytorch/pytorch, only images whose names contain 'devel' will provide nvcc.
      warnings.warn(
    Traceback (most recent call last):
      File "<string>", line 2, in <module>
      File "<pip-setuptools-caller>", line 34, in <module>
      File "/tmp/pip-install-18brac5p/flash-attn_5e692969183644e58161eed68af0341b/setup.py", line 133, in <module>
        CUDAExtension(
      File "/home/aalmarri/anaconda3/envs/geochat/lib/python3.10/site-packages/torch/utils/cpp_extension.py", line 1048, in CUDAExtension
        library_dirs += library_paths(cuda=True)
      File "/home/aalmarri/anaconda3/envs/geochat/lib/python3.10/site-packages/torch/utils/cpp_extension.py", line 1179, in library_paths
        if (not os.path.exists(_join_cuda_home(lib_dir)) and
      File "/home/aalmarri/anaconda3/envs/geochat/lib/python3.10/site-packages/torch/utils/cpp_extension.py", line 2223, in _join_cuda_home
        raise EnvironmentError('CUDA_HOME environment variable is not set. '
    OSError: CUDA_HOME environment variable is not set. Please set it to your CUDA install root.

  torch.__version__  = 2.0.1+cu117

  [end of output]

note: This error originates from a subprocess, and is likely not a problem with pip.
error: metadata-generation-failed

× Encountered error while generating package metadata.
╰─> See above for output.

note: This is an issue with the package mentioned above, not pip.
hint: See above for details.

I did check the torch version, and its CUDA is 11.7.

BDHU commented 6 months ago

Same thing happened to me

gkm0120 commented 5 months ago

I have the same question.

$ pip install flash-attn --no-build-isolation -i https://pypi.tuna.tsinghua.edu.cn/simple
Looking in indexes: https://pypi.tuna.tsinghua.edu.cn/simple
Collecting flash-attn
  Using cached https://pypi.tuna.tsinghua.edu.cn/packages/21/cb/33a1f833ac4742c8adba063715bf769831f96d99dbbbb4be1b197b637872/flash_attn-2.5.7.tar.gz (2.5 MB)
  Preparing metadata (setup.py) ... error
  error: subprocess-exited-with-error

  × python setup.py egg_info did not run successfully.
  │ exit code: 1
  ╰─> [22 lines of output]
      fatal: not a git repository (or any of the parent directories): .git
      /tmp/pip-install-39fz2cjo/flash-attn_eaede92fcb76455eab13852d3126d861/setup.py:78: UserWarning: flash_attn was requested, but nvcc was not found. Are you sure your environment has nvcc available? If you're installing within a container from https://hub.docker.com/r/pytorch/pytorch, only images whose names contain 'devel' will provide nvcc.
        warnings.warn(
      Traceback (most recent call last):
        File "/mnt/flash-attention-2.5.7/setup.py", line 134, in <module>
          CUDAExtension(
        File "/opt/conda/lib/python3.11/site-packages/torch/utils/cpp_extension.py", line 1076, in CUDAExtension
          library_dirs += library_paths(cuda=True)
                          ^^^^^^^^^^^^^^^^^^^^^^^^
        File "/opt/conda/lib/python3.11/site-packages/torch/utils/cpp_extension.py", line 1203, in library_paths
          if (not os.path.exists(_join_cuda_home(lib_dir)) and
                  ^^^^^^^^^^^^^^^^^^^^^^^^
        File "/opt/conda/lib/python3.11/site-packages/torch/utils/cpp_extension.py", line 2416, in _join_cuda_home
          raise OSError('CUDA_HOME environment variable is not set. '
      OSError: CUDA_HOME environment variable is not set. Please set it to your CUDA install root.

bogan-FMA commented 3 months ago


You can fix this by setting the environment variable CUDA_HOME=/path/to/your/cuda.
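
(For example; the path below is an assumption, so point it at wherever your CUDA toolkit is actually installed, and make sure that toolkit's nvcc is on PATH:)

  # Adjust /usr/local/cuda to your actual install root.
  export CUDA_HOME=/usr/local/cuda
  export PATH=$CUDA_HOME/bin:$PATH
  pip install flash-attn --no-build-isolation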

flybird11111 commented 2 weeks ago

gcc (Ubuntu 9.4.0-1ubuntu1~20.04.2) 9.4.0

Hi, can you try this gcc version?