donglixp opened 6 months ago
lash_attn_rocm/src/flash_bwd_runner_batched_hdim32_bf16_noncausal_gfx9x_hip.hip -o /tmp/tmp.2H7X7ZCvOC/flash-attention/build/temp.linux-x86_64-cpython-38/csrc/flash_attn_rocm/src/flash_bwd_runner_batched_hdim32_bf16_noncausal_gfx9x_hip.o -fPIC -D__HIP_PLATFORM_HCC__=1 -DUSE_ROCM=1 -DCUDA_HAS_FP16=1 -D__HIP_NO_HALF_OPERATORS__=1 -D__HIP_NO_HALF_CONVERSIONS__=1 -O3 -std=c++20 -DNDEBUG -U__CUDA_NO_HALF_OPERATORS__ -U__CUDA_NO_HALF_CONVERSIONS__ --offload-arch=gfx90a -DTORCH_API_INCLUDE_EXTENSION_H '-DPYBIND11_COMPILER_TYPE="_gcc"' '-DPYBIND11_STDLIB="_libstdcpp"' '-DPYBIND11_BUILD_ABI="_cxxabi1011"' -DTORCH_EXTENSION_NAME=flash_attn_2_cuda -D_GLIBCXX_USE_CXX11_ABI=0 -fno-gpu-rdc
[30/50] /opt/rocm-5.6.1/bin/hipcc -I/tmp/tmp.2H7X7ZCvOC/flash-attention/csrc/flash_attn_rocm -I/tmp/tmp.2H7X7ZCvOC/flash-attention/csrc/flash_attn_rocm/src -I/tmp/tmp.2H7X7ZCvOC/flash-attention/csrc/flash_attn_rocm/composable_kernel/include -I/tmp/tmp.2H7X7ZCvOC/flash-attention/csrc/flash_attn_rocm/composable_kernel/library/include -I/opt/conda/envs/ptca/lib/python3.8/site-packages/torch/include -I/opt/conda/envs/ptca/lib/python3.8/site-packages/torch/include/torch/csrc/api/include -I/opt/conda/envs/ptca/lib/python3.8/site-packages/torch/include/TH -I/opt/conda/envs/ptca/lib/python3.8/site-packages/torch/include/THC -I/opt/conda/envs/ptca/lib/python3.8/site-packages/torch/include/THH -I/opt/rocm-5.6.1/include -I/opt/conda/envs/ptca/include/python3.8 -c -c /tmp/tmp.2H7X7ZCvOC/flash-attention/csrc/flash_attn_rocm/src/flash_fwd_runner_batched_hdim64_fp16_noncausal_gfx9x_hip.hip -o /tmp/tmp.2H7X7ZCvOC/flash-attention/build/temp.linux-x86_64-cpython-38/csrc/flash_attn_rocm/src/flash_fwd_runner_batched_hdim64_fp16_noncausal_gfx9x_hip.o -fPIC -D__HIP_PLATFORM_HCC__=1 -DUSE_ROCM=1 -DCUDA_HAS_FP16=1 -D__HIP_NO_HALF_OPERATORS__=1 -D__HIP_NO_HALF_CONVERSIONS__=1 -O3 -std=c++20 -DNDEBUG -U__CUDA_NO_HALF_OPERATORS__ -U__CUDA_NO_HALF_CONVERSIONS__ --offload-arch=gfx90a -DTORCH_API_INCLUDE_EXTENSION_H '-DPYBIND11_COMPILER_TYPE="_gcc"' '-DPYBIND11_STDLIB="_libstdcpp"' '-DPYBIND11_BUILD_ABI="_cxxabi1011"' -DTORCH_EXTENSION_NAME=flash_attn_2_cuda -D_GLIBCXX_USE_CXX11_ABI=0 -fno-gpu-rdc
[31/50] /opt/rocm-5.6.1/bin/hipcc -I/tmp/tmp.2H7X7ZCvOC/flash-attention/csrc/flash_attn_rocm -I/tmp/tmp.2H7X7ZCvOC/flash-attention/csrc/flash_attn_rocm/src -I/tmp/tmp.2H7X7ZCvOC/flash-attention/csrc/flash_attn_rocm/composable_kernel/include -I/tmp/tmp.2H7X7ZCvOC/flash-attention/csrc/flash_attn_rocm/composable_kernel/library/include -I/opt/conda/envs/ptca/lib/python3.8/site-packages/torch/include -I/opt/conda/envs/ptca/lib/python3.8/site-packages/torch/include/torch/csrc/api/include -I/opt/conda/envs/ptca/lib/python3.8/site-packages/torch/include/TH -I/opt/conda/envs/ptca/lib/python3.8/site-packages/torch/include/THC -I/opt/conda/envs/ptca/lib/python3.8/site-packages/torch/include/THH -I/opt/rocm-5.6.1/include -I/opt/conda/envs/ptca/include/python3.8 -c -c /tmp/tmp.2H7X7ZCvOC/flash-attention/csrc/flash_attn_rocm/src/flash_fwd_runner_grouped_hdim64_bf16_noncausal_gfx9x_hip.hip -o /tmp/tmp.2H7X7ZCvOC/flash-attention/build/temp.linux-x86_64-cpython-38/csrc/flash_attn_rocm/src/flash_fwd_runner_grouped_hdim64_bf16_noncausal_gfx9x_hip.o -fPIC -D__HIP_PLATFORM_HCC__=1 -DUSE_ROCM=1 -DCUDA_HAS_FP16=1 -D__HIP_NO_HALF_OPERATORS__=1 -D__HIP_NO_HALF_CONVERSIONS__=1 -O3 -std=c++20 -DNDEBUG -U__CUDA_NO_HALF_OPERATORS__ -U__CUDA_NO_HALF_CONVERSIONS__ --offload-arch=gfx90a -DTORCH_API_INCLUDE_EXTENSION_H '-DPYBIND11_COMPILER_TYPE="_gcc"' '-DPYBIND11_STDLIB="_libstdcpp"' '-DPYBIND11_BUILD_ABI="_cxxabi1011"' -DTORCH_EXTENSION_NAME=flash_attn_2_cuda -D_GLIBCXX_USE_CXX11_ABI=0 -fno-gpu-rdc
[32/50] /opt/rocm-5.6.1/bin/hipcc -I/tmp/tmp.2H7X7ZCvOC/flash-attention/csrc/flash_attn_rocm -I/tmp/tmp.2H7X7ZCvOC/flash-attention/csrc/flash_attn_rocm/src -I/tmp/tmp.2H7X7ZCvOC/flash-attention/csrc/flash_attn_rocm/composable_kernel/include -I/tmp/tmp.2H7X7ZCvOC/flash-attention/csrc/flash_attn_rocm/composable_kernel/library/include -I/opt/conda/envs/ptca/lib/python3.8/site-packages/torch/include -I/opt/conda/envs/ptca/lib/python3.8/site-packages/torch/include/torch/csrc/api/include -I/opt/conda/envs/ptca/lib/python3.8/site-packages/torch/include/TH -I/opt/conda/envs/ptca/lib/python3.8/site-packages/torch/include/THC -I/opt/conda/envs/ptca/lib/python3.8/site-packages/torch/include/THH -I/opt/rocm-5.6.1/include -I/opt/conda/envs/ptca/include/python3.8 -c -c /tmp/tmp.2H7X7ZCvOC/flash-attention/csrc/flash_attn_rocm/src/flash_fwd_runner_batched_hdim64_bf16_causal_gfx9x_hip.hip -o /tmp/tmp.2H7X7ZCvOC/flash-attention/build/temp.linux-x86_64-cpython-38/csrc/flash_attn_rocm/src/flash_fwd_runner_batched_hdim64_bf16_causal_gfx9x_hip.o -fPIC -D__HIP_PLATFORM_HCC__=1 -DUSE_ROCM=1 -DCUDA_HAS_FP16=1 -D__HIP_NO_HALF_OPERATORS__=1 -D__HIP_NO_HALF_CONVERSIONS__=1 -O3 -std=c++20 -DNDEBUG -U__CUDA_NO_HALF_OPERATORS__ -U__CUDA_NO_HALF_CONVERSIONS__ --offload-arch=gfx90a -DTORCH_API_INCLUDE_EXTENSION_H '-DPYBIND11_COMPILER_TYPE="_gcc"' '-DPYBIND11_STDLIB="_libstdcpp"' '-DPYBIND11_BUILD_ABI="_cxxabi1011"' -DTORCH_EXTENSION_NAME=flash_attn_2_cuda -D_GLIBCXX_USE_CXX11_ABI=0 -fno-gpu-rdc
[33/50] /opt/rocm-5.6.1/bin/hipcc -I/tmp/tmp.2H7X7ZCvOC/flash-attention/csrc/flash_attn_rocm -I/tmp/tmp.2H7X7ZCvOC/flash-attention/csrc/flash_attn_rocm/src -I/tmp/tmp.2H7X7ZCvOC/flash-attention/csrc/flash_attn_rocm/composable_kernel/include -I/tmp/tmp.2H7X7ZCvOC/flash-attention/csrc/flash_attn_rocm/composable_kernel/library/include -I/opt/conda/envs/ptca/lib/python3.8/site-packages/torch/include -I/opt/conda/envs/ptca/lib/python3.8/site-packages/torch/include/torch/csrc/api/include -I/opt/conda/envs/ptca/lib/python3.8/site-packages/torch/include/TH -I/opt/conda/envs/ptca/lib/python3.8/site-packages/torch/include/THC -I/opt/conda/envs/ptca/lib/python3.8/site-packages/torch/include/THH -I/opt/rocm-5.6.1/include -I/opt/conda/envs/ptca/include/python3.8 -c -c /tmp/tmp.2H7X7ZCvOC/flash-attention/csrc/flash_attn_rocm/src/flash_bwd_runner_batched_hdim32_fp16_noncausal_gfx9x_hip.hip -o /tmp/tmp.2H7X7ZCvOC/flash-attention/build/temp.linux-x86_64-cpython-38/csrc/flash_attn_rocm/src/flash_bwd_runner_batched_hdim32_fp16_noncausal_gfx9x_hip.o -fPIC -D__HIP_PLATFORM_HCC__=1 -DUSE_ROCM=1 -DCUDA_HAS_FP16=1 -D__HIP_NO_HALF_OPERATORS__=1 -D__HIP_NO_HALF_CONVERSIONS__=1 -O3 -std=c++20 -DNDEBUG -U__CUDA_NO_HALF_OPERATORS__ -U__CUDA_NO_HALF_CONVERSIONS__ --offload-arch=gfx90a -DTORCH_API_INCLUDE_EXTENSION_H '-DPYBIND11_COMPILER_TYPE="_gcc"' '-DPYBIND11_STDLIB="_libstdcpp"' '-DPYBIND11_BUILD_ABI="_cxxabi1011"' -DTORCH_EXTENSION_NAME=flash_attn_2_cuda -D_GLIBCXX_USE_CXX11_ABI=0 -fno-gpu-rdc
[34/50] /opt/rocm-5.6.1/bin/hipcc -I/tmp/tmp.2H7X7ZCvOC/flash-attention/csrc/flash_attn_rocm -I/tmp/tmp.2H7X7ZCvOC/flash-attention/csrc/flash_attn_rocm/src -I/tmp/tmp.2H7X7ZCvOC/flash-attention/csrc/flash_attn_rocm/composable_kernel/include -I/tmp/tmp.2H7X7ZCvOC/flash-attention/csrc/flash_attn_rocm/composable_kernel/library/include -I/opt/conda/envs/ptca/lib/python3.8/site-packages/torch/include -I/opt/conda/envs/ptca/lib/python3.8/site-packages/torch/include/torch/csrc/api/include -I/opt/conda/envs/ptca/lib/python3.8/site-packages/torch/include/TH -I/opt/conda/envs/ptca/lib/python3.8/site-packages/torch/include/THC -I/opt/conda/envs/ptca/lib/python3.8/site-packages/torch/include/THH -I/opt/rocm-5.6.1/include -I/opt/conda/envs/ptca/include/python3.8 -c -c /tmp/tmp.2H7X7ZCvOC/flash-attention/csrc/flash_attn_rocm/src/flash_fwd_runner_grouped_hdim64_fp16_causal_gfx9x_hip.hip -o /tmp/tmp.2H7X7ZCvOC/flash-attention/build/temp.linux-x86_64-cpython-38/csrc/flash_attn_rocm/src/flash_fwd_runner_grouped_hdim64_fp16_causal_gfx9x_hip.o -fPIC -D__HIP_PLATFORM_HCC__=1 -DUSE_ROCM=1 -DCUDA_HAS_FP16=1 -D__HIP_NO_HALF_OPERATORS__=1 -D__HIP_NO_HALF_CONVERSIONS__=1 -O3 -std=c++20 -DNDEBUG -U__CUDA_NO_HALF_OPERATORS__ -U__CUDA_NO_HALF_CONVERSIONS__ --offload-arch=gfx90a -DTORCH_API_INCLUDE_EXTENSION_H '-DPYBIND11_COMPILER_TYPE="_gcc"' '-DPYBIND11_STDLIB="_libstdcpp"' '-DPYBIND11_BUILD_ABI="_cxxabi1011"' -DTORCH_EXTENSION_NAME=flash_attn_2_cuda -D_GLIBCXX_USE_CXX11_ABI=0 -fno-gpu-rdc
[35/50] /opt/rocm-5.6.1/bin/hipcc -I/tmp/tmp.2H7X7ZCvOC/flash-attention/csrc/flash_attn_rocm -I/tmp/tmp.2H7X7ZCvOC/flash-attention/csrc/flash_attn_rocm/src -I/tmp/tmp.2H7X7ZCvOC/flash-attention/csrc/flash_attn_rocm/composable_kernel/include -I/tmp/tmp.2H7X7ZCvOC/flash-attention/csrc/flash_attn_rocm/composable_kernel/library/include -I/opt/conda/envs/ptca/lib/python3.8/site-packages/torch/include -I/opt/conda/envs/ptca/lib/python3.8/site-packages/torch/include/torch/csrc/api/include -I/opt/conda/envs/ptca/lib/python3.8/site-packages/torch/include/TH -I/opt/conda/envs/ptca/lib/python3.8/site-packages/torch/include/THC -I/opt/conda/envs/ptca/lib/python3.8/site-packages/torch/include/THH -I/opt/rocm-5.6.1/include -I/opt/conda/envs/ptca/include/python3.8 -c -c /tmp/tmp.2H7X7ZCvOC/flash-attention/csrc/flash_attn_rocm/src/flash_bwd_runner_batched_hdim128_fp16_causal_gfx9x_hip.hip -o /tmp/tmp.2H7X7ZCvOC/flash-attention/build/temp.linux-x86_64-cpython-38/csrc/flash_attn_rocm/src/flash_bwd_runner_batched_hdim128_fp16_causal_gfx9x_hip.o -fPIC -D__HIP_PLATFORM_HCC__=1 -DUSE_ROCM=1 -DCUDA_HAS_FP16=1 -D__HIP_NO_HALF_OPERATORS__=1 -D__HIP_NO_HALF_CONVERSIONS__=1 -O3 -std=c++20 -DNDEBUG -U__CUDA_NO_HALF_OPERATORS__ -U__CUDA_NO_HALF_CONVERSIONS__ --offload-arch=gfx90a -DTORCH_API_INCLUDE_EXTENSION_H '-DPYBIND11_COMPILER_TYPE="_gcc"' '-DPYBIND11_STDLIB="_libstdcpp"' '-DPYBIND11_BUILD_ABI="_cxxabi1011"' -DTORCH_EXTENSION_NAME=flash_attn_2_cuda -D_GLIBCXX_USE_CXX11_ABI=0 -fno-gpu-rdc
FAILED: /tmp/tmp.2H7X7ZCvOC/flash-attention/build/temp.linux-x86_64-cpython-38/csrc/flash_attn_rocm/src/flash_bwd_runner_batched_hdim128_fp16_causal_gfx9x_hip.o
/opt/rocm-5.6.1/bin/hipcc -I/tmp/tmp.2H7X7ZCvOC/flash-attention/csrc/flash_attn_rocm -I/tmp/tmp.2H7X7ZCvOC/flash-attention/csrc/flash_attn_rocm/src -I/tmp/tmp.2H7X7ZCvOC/flash-attention/csrc/flash_attn_rocm/composable_kernel/include -I/tmp/tmp.2H7X7ZCvOC/flash-attention/csrc/flash_attn_rocm/composable_kernel/library/include -I/opt/conda/envs/ptca/lib/python3.8/site-packages/torch/include -I/opt/conda/envs/ptca/lib/python3.8/site-packages/torch/include/torch/csrc/api/include -I/opt/conda/envs/ptca/lib/python3.8/site-packages/torch/include/TH -I/opt/conda/envs/ptca/lib/python3.8/site-packages/torch/include/THC -I/opt/conda/envs/ptca/lib/python3.8/site-packages/torch/include/THH -I/opt/rocm-5.6.1/include -I/opt/conda/envs/ptca/include/python3.8 -c -c /tmp/tmp.2H7X7ZCvOC/flash-attention/csrc/flash_attn_rocm/src/flash_bwd_runner_batched_hdim128_fp16_causal_gfx9x_hip.hip -o /tmp/tmp.2H7X7ZCvOC/flash-attention/build/temp.linux-x86_64-cpython-38/csrc/flash_attn_rocm/src/flash_bwd_runner_batched_hdim128_fp16_causal_gfx9x_hip.o -fPIC -D__HIP_PLATFORM_HCC__=1 -DUSE_ROCM=1 -DCUDA_HAS_FP16=1 -D__HIP_NO_HALF_OPERATORS__=1 -D__HIP_NO_HALF_CONVERSIONS__=1 -O3 -std=c++20 -DNDEBUG -U__CUDA_NO_HALF_OPERATORS__ -U__CUDA_NO_HALF_CONVERSIONS__ --offload-arch=gfx90a -DTORCH_API_INCLUDE_EXTENSION_H '-DPYBIND11_COMPILER_TYPE="_gcc"' '-DPYBIND11_STDLIB="_libstdcpp"' '-DPYBIND11_BUILD_ABI="_cxxabi1011"' -DTORCH_EXTENSION_NAME=flash_attn_2_cuda -D_GLIBCXX_USE_CXX11_ABI=0 -fno-gpu-rdc
fatal error: error in backend: Not supported instr: <MCInst 0 <MCOperand Reg:8388> <MCOperand Reg:4846> <MCOperand Expr:(.LBB2_3)> <MCOperand Reg:4814> <MCOperand Expr:(.LBB2_-1)>>
clang-16: error: clang frontend command failed with exit code 70 (use -v to see invocation)
AMD clang version 16.0.0 (https://github.com/RadeonOpenCompute/llvm-project roc-5.6.1 23332 4f9bb99d78a4d8d9770be38b91ebd004ea4d2a3a)
Target: x86_64-unknown-linux-gnu
Thread model: posix
InstalledDir: /opt/rocm-5.6.1/llvm/bin
clang-16: note: diagnostic msg: Error generating preprocessed source(s).
[36/50] /opt/rocm-5.6.1/bin/hipcc -I/tmp/tmp.2H7X7ZCvOC/flash-attention/csrc/flash_attn_rocm -I/tmp/tmp.2H7X7ZCvOC/flash-attention/csrc/flash_attn_rocm/src -I/tmp/tmp.2H7X7ZCvOC/flash-attention/csrc/flash_attn_rocm/composable_kernel/include -I/tmp/tmp.2H7X7ZCvOC/flash-attention/csrc/flash_attn_rocm/composable_kernel/library/include -I/opt/conda/envs/ptca/lib/python3.8/site-packages/torch/include -I/opt/conda/envs/ptca/lib/python3.8/site-packages/torch/include/torch/csrc/api/include -I/opt/conda/envs/ptca/lib/python3.8/site-packages/torch/include/TH -I/opt/conda/envs/ptca/lib/python3.8/site-packages/torch/include/THC -I/opt/conda/envs/ptca/lib/python3.8/site-packages/torch/include/THH -I/opt/rocm-5.6.1/include -I/opt/conda/envs/ptca/include/python3.8 -c -c /tmp/tmp.2H7X7ZCvOC/flash-attention/csrc/flash_attn_rocm/src/flash_bwd_runner_grouped_hdim64_bf16_causal_gfx9x_hip.hip -o /tmp/tmp.2H7X7ZCvOC/flash-attention/build/temp.linux-x86_64-cpython-38/csrc/flash_attn_rocm/src/flash_bwd_runner_grouped_hdim64_bf16_causal_gfx9x_hip.o -fPIC -D__HIP_PLATFORM_HCC__=1 -DUSE_ROCM=1 -DCUDA_HAS_FP16=1 -D__HIP_NO_HALF_OPERATORS__=1 -D__HIP_NO_HALF_CONVERSIONS__=1 -O3 -std=c++20 -DNDEBUG -U__CUDA_NO_HALF_OPERATORS__ -U__CUDA_NO_HALF_CONVERSIONS__ --offload-arch=gfx90a -DTORCH_API_INCLUDE_EXTENSION_H '-DPYBIND11_COMPILER_TYPE="_gcc"' '-DPYBIND11_STDLIB="_libstdcpp"' '-DPYBIND11_BUILD_ABI="_cxxabi1011"' -DTORCH_EXTENSION_NAME=flash_attn_2_cuda -D_GLIBCXX_USE_CXX11_ABI=0 -fno-gpu-rdc
FAILED: /tmp/tmp.2H7X7ZCvOC/flash-attention/build/temp.linux-x86_64-cpython-38/csrc/flash_attn_rocm/src/flash_bwd_runner_grouped_hdim64_bf16_causal_gfx9x_hip.o
/opt/rocm-5.6.1/bin/hipcc -I/tmp/tmp.2H7X7ZCvOC/flash-attention/csrc/flash_attn_rocm -I/tmp/tmp.2H7X7ZCvOC/flash-attention/csrc/flash_attn_rocm/src -I/tmp/tmp.2H7X7ZCvOC/flash-attention/csrc/flash_attn_rocm/composable_kernel/include -I/tmp/tmp.2H7X7ZCvOC/flash-attention/csrc/flash_attn_rocm/composable_kernel/library/include -I/opt/conda/envs/ptca/lib/python3.8/site-packages/torch/include -I/opt/conda/envs/ptca/lib/python3.8/site-packages/torch/include/torch/csrc/api/include -I/opt/conda/envs/ptca/lib/python3.8/site-packages/torch/include/TH -I/opt/conda/envs/ptca/lib/python3.8/site-packages/torch/include/THC -I/opt/conda/envs/ptca/lib/python3.8/site-packages/torch/include/THH -I/opt/rocm-5.6.1/include -I/opt/conda/envs/ptca/include/python3.8 -c -c /tmp/tmp.2H7X7ZCvOC/flash-attention/csrc/flash_attn_rocm/src/flash_bwd_runner_grouped_hdim64_bf16_causal_gfx9x_hip.hip -o /tmp/tmp.2H7X7ZCvOC/flash-attention/build/temp.linux-x86_64-cpython-38/csrc/flash_attn_rocm/src/flash_bwd_runner_grouped_hdim64_bf16_causal_gfx9x_hip.o -fPIC -D__HIP_PLATFORM_HCC__=1 -DUSE_ROCM=1 -DCUDA_HAS_FP16=1 -D__HIP_NO_HALF_OPERATORS__=1 -D__HIP_NO_HALF_CONVERSIONS__=1 -O3 -std=c++20 -DNDEBUG -U__CUDA_NO_HALF_OPERATORS__ -U__CUDA_NO_HALF_CONVERSIONS__ --offload-arch=gfx90a -DTORCH_API_INCLUDE_EXTENSION_H '-DPYBIND11_COMPILER_TYPE="_gcc"' '-DPYBIND11_STDLIB="_libstdcpp"' '-DPYBIND11_BUILD_ABI="_cxxabi1011"' -DTORCH_EXTENSION_NAME=flash_attn_2_cuda -D_GLIBCXX_USE_CXX11_ABI=0 -fno-gpu-rdc
fatal error: error in backend: Not supported instr: <MCInst 0 <MCOperand Reg:4622> <MCOperand Reg:4702> <MCOperand Expr:(.LBB2_9)> <MCOperand Reg:4766> <MCOperand Expr:(.LBB2_-1)>>
clang-16: error: clang frontend command failed with exit code 70 (use -v to see invocation)
AMD clang version 16.0.0 (https://github.com/RadeonOpenCompute/llvm-project roc-5.6.1 23332 4f9bb99d78a4d8d9770be38b91ebd004ea4d2a3a)
Target: x86_64-unknown-linux-gnu
Thread model: posix
InstalledDir: /opt/rocm-5.6.1/llvm/bin
clang-16: note: diagnostic msg: Error generating preprocessed source(s).
[37/50] /opt/rocm-5.6.1/bin/hipcc -I/tmp/tmp.2H7X7ZCvOC/flash-attention/csrc/flash_attn_rocm -I/tmp/tmp.2H7X7ZCvOC/flash-attention/csrc/flash_attn_rocm/src -I/tmp/tmp.2H7X7ZCvOC/flash-attention/csrc/flash_attn_rocm/composable_kernel/include -I/tmp/tmp.2H7X7ZCvOC/flash-attention/csrc/flash_attn_rocm/composable_kernel/library/include -I/opt/conda/envs/ptca/lib/python3.8/site-packages/torch/include -I/opt/conda/envs/ptca/lib/python3.8/site-packages/torch/include/torch/csrc/api/include -I/opt/conda/envs/ptca/lib/python3.8/site-packages/torch/include/TH -I/opt/conda/envs/ptca/lib/python3.8/site-packages/torch/include/THC -I/opt/conda/envs/ptca/lib/python3.8/site-packages/torch/include/THH -I/opt/rocm-5.6.1/include -I/opt/conda/envs/ptca/include/python3.8 -c -c /tmp/tmp.2H7X7ZCvOC/flash-attention/csrc/flash_attn_rocm/src/flash_fwd_runner_batched_hdim64_fp16_causal_gfx9x_hip.hip -o /tmp/tmp.2H7X7ZCvOC/flash-attention/build/temp.linux-x86_64-cpython-38/csrc/flash_attn_rocm/src/flash_fwd_runner_batched_hdim64_fp16_causal_gfx9x_hip.o -fPIC -D__HIP_PLATFORM_HCC__=1 -DUSE_ROCM=1 -DCUDA_HAS_FP16=1 -D__HIP_NO_HALF_OPERATORS__=1 -D__HIP_NO_HALF_CONVERSIONS__=1 -O3 -std=c++20 -DNDEBUG -U__CUDA_NO_HALF_OPERATORS__ -U__CUDA_NO_HALF_CONVERSIONS__ --offload-arch=gfx90a -DTORCH_API_INCLUDE_EXTENSION_H '-DPYBIND11_COMPILER_TYPE="_gcc"' '-DPYBIND11_STDLIB="_libstdcpp"' '-DPYBIND11_BUILD_ABI="_cxxabi1011"' -DTORCH_EXTENSION_NAME=flash_attn_2_cuda -D_GLIBCXX_USE_CXX11_ABI=0 -fno-gpu-rdc
[38/50] /opt/rocm-5.6.1/bin/hipcc -I/tmp/tmp.2H7X7ZCvOC/flash-attention/csrc/flash_attn_rocm -I/tmp/tmp.2H7X7ZCvOC/flash-attention/csrc/flash_attn_rocm/src -I/tmp/tmp.2H7X7ZCvOC/flash-attention/csrc/flash_attn_rocm/composable_kernel/include -I/tmp/tmp.2H7X7ZCvOC/flash-attention/csrc/flash_attn_rocm/composable_kernel/library/include -I/opt/conda/envs/ptca/lib/python3.8/site-packages/torch/include -I/opt/conda/envs/ptca/lib/python3.8/site-packages/torch/include/torch/csrc/api/include -I/opt/conda/envs/ptca/lib/python3.8/site-packages/torch/include/TH -I/opt/conda/envs/ptca/lib/python3.8/site-packages/torch/include/THC -I/opt/conda/envs/ptca/lib/python3.8/site-packages/torch/include/THH -I/opt/rocm-5.6.1/include -I/opt/conda/envs/ptca/include/python3.8 -c -c /tmp/tmp.2H7X7ZCvOC/flash-attention/csrc/flash_attn_rocm/src/flash_bwd_runner_batched_hdim64_bf16_noncausal_gfx9x_hip.hip -o /tmp/tmp.2H7X7ZCvOC/flash-attention/build/temp.linux-x86_64-cpython-38/csrc/flash_attn_rocm/src/flash_bwd_runner_batched_hdim64_bf16_noncausal_gfx9x_hip.o -fPIC -D__HIP_PLATFORM_HCC__=1 -DUSE_ROCM=1 -DCUDA_HAS_FP16=1 -D__HIP_NO_HALF_OPERATORS__=1 -D__HIP_NO_HALF_CONVERSIONS__=1 -O3 -std=c++20 -DNDEBUG -U__CUDA_NO_HALF_OPERATORS__ -U__CUDA_NO_HALF_CONVERSIONS__ --offload-arch=gfx90a -DTORCH_API_INCLUDE_EXTENSION_H '-DPYBIND11_COMPILER_TYPE="_gcc"' '-DPYBIND11_STDLIB="_libstdcpp"' '-DPYBIND11_BUILD_ABI="_cxxabi1011"' -DTORCH_EXTENSION_NAME=flash_attn_2_cuda -D_GLIBCXX_USE_CXX11_ABI=0 -fno-gpu-rdc
[39/50] /opt/rocm-5.6.1/bin/hipcc -I/tmp/tmp.2H7X7ZCvOC/flash-attention/csrc/flash_attn_rocm -I/tmp/tmp.2H7X7ZCvOC/flash-attention/csrc/flash_attn_rocm/src -I/tmp/tmp.2H7X7ZCvOC/flash-attention/csrc/flash_attn_rocm/composable_kernel/include -I/tmp/tmp.2H7X7ZCvOC/flash-attention/csrc/flash_attn_rocm/composable_kernel/library/include -I/opt/conda/envs/ptca/lib/python3.8/site-packages/torch/include -I/opt/conda/envs/ptca/lib/python3.8/site-packages/torch/include/torch/csrc/api/include -I/opt/conda/envs/ptca/lib/python3.8/site-packages/torch/include/TH -I/opt/conda/envs/ptca/lib/python3.8/site-packages/torch/include/THC -I/opt/conda/envs/ptca/lib/python3.8/site-packages/torch/include/THH -I/opt/rocm-5.6.1/include -I/opt/conda/envs/ptca/include/python3.8 -c -c /tmp/tmp.2H7X7ZCvOC/flash-attention/csrc/flash_attn_rocm/src/flash_fwd_runner_grouped_hdim64_bf16_causal_gfx9x_hip.hip -o /tmp/tmp.2H7X7ZCvOC/flash-attention/build/temp.linux-x86_64-cpython-38/csrc/flash_attn_rocm/src/flash_fwd_runner_grouped_hdim64_bf16_causal_gfx9x_hip.o -fPIC -D__HIP_PLATFORM_HCC__=1 -DUSE_ROCM=1 -DCUDA_HAS_FP16=1 -D__HIP_NO_HALF_OPERATORS__=1 -D__HIP_NO_HALF_CONVERSIONS__=1 -O3 -std=c++20 -DNDEBUG -U__CUDA_NO_HALF_OPERATORS__ -U__CUDA_NO_HALF_CONVERSIONS__ --offload-arch=gfx90a -DTORCH_API_INCLUDE_EXTENSION_H '-DPYBIND11_COMPILER_TYPE="_gcc"' '-DPYBIND11_STDLIB="_libstdcpp"' '-DPYBIND11_BUILD_ABI="_cxxabi1011"' -DTORCH_EXTENSION_NAME=flash_attn_2_cuda -D_GLIBCXX_USE_CXX11_ABI=0 -fno-gpu-rdc
[40/50] /opt/rocm-5.6.1/bin/hipcc -I/tmp/tmp.2H7X7ZCvOC/flash-attention/csrc/flash_attn_rocm -I/tmp/tmp.2H7X7ZCvOC/flash-attention/csrc/flash_attn_rocm/src -I/tmp/tmp.2H7X7ZCvOC/flash-attention/csrc/flash_attn_rocm/composable_kernel/include -I/tmp/tmp.2H7X7ZCvOC/flash-attention/csrc/flash_attn_rocm/composable_kernel/library/include -I/opt/conda/envs/ptca/lib/python3.8/site-packages/torch/include -I/opt/conda/envs/ptca/lib/python3.8/site-packages/torch/include/torch/csrc/api/include -I/opt/conda/envs/ptca/lib/python3.8/site-packages/torch/include/TH -I/opt/conda/envs/ptca/lib/python3.8/site-packages/torch/include/THC -I/opt/conda/envs/ptca/lib/python3.8/site-packages/torch/include/THH -I/opt/rocm-5.6.1/include -I/opt/conda/envs/ptca/include/python3.8 -c -c /tmp/tmp.2H7X7ZCvOC/flash-attention/csrc/flash_attn_rocm/src/flash_bwd_runner_grouped_hdim128_fp16_causal_gfx9x_hip.hip -o /tmp/tmp.2H7X7ZCvOC/flash-attention/build/temp.linux-x86_64-cpython-38/csrc/flash_attn_rocm/src/flash_bwd_runner_grouped_hdim128_fp16_causal_gfx9x_hip.o -fPIC -D__HIP_PLATFORM_HCC__=1 -DUSE_ROCM=1 -DCUDA_HAS_FP16=1 -D__HIP_NO_HALF_OPERATORS__=1 -D__HIP_NO_HALF_CONVERSIONS__=1 -O3 -std=c++20 -DNDEBUG -U__CUDA_NO_HALF_OPERATORS__ -U__CUDA_NO_HALF_CONVERSIONS__ --offload-arch=gfx90a -DTORCH_API_INCLUDE_EXTENSION_H '-DPYBIND11_COMPILER_TYPE="_gcc"' '-DPYBIND11_STDLIB="_libstdcpp"' '-DPYBIND11_BUILD_ABI="_cxxabi1011"' -DTORCH_EXTENSION_NAME=flash_attn_2_cuda -D_GLIBCXX_USE_CXX11_ABI=0 -fno-gpu-rdc
FAILED: /tmp/tmp.2H7X7ZCvOC/flash-attention/build/temp.linux-x86_64-cpython-38/csrc/flash_attn_rocm/src/flash_bwd_runner_grouped_hdim128_fp16_causal_gfx9x_hip.o
/opt/rocm-5.6.1/bin/hipcc -I/tmp/tmp.2H7X7ZCvOC/flash-attention/csrc/flash_attn_rocm -I/tmp/tmp.2H7X7ZCvOC/flash-attention/csrc/flash_attn_rocm/src -I/tmp/tmp.2H7X7ZCvOC/flash-attention/csrc/flash_attn_rocm/composable_kernel/include -I/tmp/tmp.2H7X7ZCvOC/flash-attention/csrc/flash_attn_rocm/composable_kernel/library/include -I/opt/conda/envs/ptca/lib/python3.8/site-packages/torch/include -I/opt/conda/envs/ptca/lib/python3.8/site-packages/torch/include/torch/csrc/api/include -I/opt/conda/envs/ptca/lib/python3.8/site-packages/torch/include/TH -I/opt/conda/envs/ptca/lib/python3.8/site-packages/torch/include/THC -I/opt/conda/envs/ptca/lib/python3.8/site-packages/torch/include/THH -I/opt/rocm-5.6.1/include -I/opt/conda/envs/ptca/include/python3.8 -c -c /tmp/tmp.2H7X7ZCvOC/flash-attention/csrc/flash_attn_rocm/src/flash_bwd_runner_grouped_hdim128_fp16_causal_gfx9x_hip.hip -o /tmp/tmp.2H7X7ZCvOC/flash-attention/build/temp.linux-x86_64-cpython-38/csrc/flash_attn_rocm/src/flash_bwd_runner_grouped_hdim128_fp16_causal_gfx9x_hip.o -fPIC -D__HIP_PLATFORM_HCC__=1 -DUSE_ROCM=1 -DCUDA_HAS_FP16=1 -D__HIP_NO_HALF_OPERATORS__=1 -D__HIP_NO_HALF_CONVERSIONS__=1 -O3 -std=c++20 -DNDEBUG -U__CUDA_NO_HALF_OPERATORS__ -U__CUDA_NO_HALF_CONVERSIONS__ --offload-arch=gfx90a -DTORCH_API_INCLUDE_EXTENSION_H '-DPYBIND11_COMPILER_TYPE="_gcc"' '-DPYBIND11_STDLIB="_libstdcpp"' '-DPYBIND11_BUILD_ABI="_cxxabi1011"' -DTORCH_EXTENSION_NAME=flash_attn_2_cuda -D_GLIBCXX_USE_CXX11_ABI=0 -fno-gpu-rdc
fatal error: error in backend: Not supported instr: <MCInst 0 <MCOperand Reg:4862> <MCOperand Reg:4846> <MCOperand Expr:(.LBB2_9)> <MCOperand Reg:4814> <MCOperand Expr:(.LBB2_-1)>>
clang-16: error: clang frontend command failed with exit code 70 (use -v to see invocation)
AMD clang version 16.0.0 (https://github.com/RadeonOpenCompute/llvm-project roc-5.6.1 23332 4f9bb99d78a4d8d9770be38b91ebd004ea4d2a3a)
Target: x86_64-unknown-linux-gnu
Thread model: posix
InstalledDir: /opt/rocm-5.6.1/llvm/bin
clang-16: note: diagnostic msg: Error generating preprocessed source(s).
[41/50] /opt/rocm-5.6.1/bin/hipcc -I/tmp/tmp.2H7X7ZCvOC/flash-attention/csrc/flash_attn_rocm -I/tmp/tmp.2H7X7ZCvOC/flash-attention/csrc/flash_attn_rocm/src -I/tmp/tmp.2H7X7ZCvOC/flash-attention/csrc/flash_attn_rocm/composable_kernel/include -I/tmp/tmp.2H7X7ZCvOC/flash-attention/csrc/flash_attn_rocm/composable_kernel/library/include -I/opt/conda/envs/ptca/lib/python3.8/site-packages/torch/include -I/opt/conda/envs/ptca/lib/python3.8/site-packages/torch/include/torch/csrc/api/include -I/opt/conda/envs/ptca/lib/python3.8/site-packages/torch/include/TH -I/opt/conda/envs/ptca/lib/python3.8/site-packages/torch/include/THC -I/opt/conda/envs/ptca/lib/python3.8/site-packages/torch/include/THH -I/opt/rocm-5.6.1/include -I/opt/conda/envs/ptca/include/python3.8 -c -c /tmp/tmp.2H7X7ZCvOC/flash-attention/csrc/flash_attn_rocm/src/flash_bwd_runner_grouped_hdim64_fp16_causal_gfx9x_hip.hip -o /tmp/tmp.2H7X7ZCvOC/flash-attention/build/temp.linux-x86_64-cpython-38/csrc/flash_attn_rocm/src/flash_bwd_runner_grouped_hdim64_fp16_causal_gfx9x_hip.o -fPIC -D__HIP_PLATFORM_HCC__=1 -DUSE_ROCM=1 -DCUDA_HAS_FP16=1 -D__HIP_NO_HALF_OPERATORS__=1 -D__HIP_NO_HALF_CONVERSIONS__=1 -O3 -std=c++20 -DNDEBUG -U__CUDA_NO_HALF_OPERATORS__ -U__CUDA_NO_HALF_CONVERSIONS__ --offload-arch=gfx90a -DTORCH_API_INCLUDE_EXTENSION_H '-DPYBIND11_COMPILER_TYPE="_gcc"' '-DPYBIND11_STDLIB="_libstdcpp"' '-DPYBIND11_BUILD_ABI="_cxxabi1011"' -DTORCH_EXTENSION_NAME=flash_attn_2_cuda -D_GLIBCXX_USE_CXX11_ABI=0 -fno-gpu-rdc
FAILED: /tmp/tmp.2H7X7ZCvOC/flash-attention/build/temp.linux-x86_64-cpython-38/csrc/flash_attn_rocm/src/flash_bwd_runner_grouped_hdim64_fp16_causal_gfx9x_hip.o
/opt/rocm-5.6.1/bin/hipcc -I/tmp/tmp.2H7X7ZCvOC/flash-attention/csrc/flash_attn_rocm -I/tmp/tmp.2H7X7ZCvOC/flash-attention/csrc/flash_attn_rocm/src -I/tmp/tmp.2H7X7ZCvOC/flash-attention/csrc/flash_attn_rocm/composable_kernel/include -I/tmp/tmp.2H7X7ZCvOC/flash-attention/csrc/flash_attn_rocm/composable_kernel/library/include -I/opt/conda/envs/ptca/lib/python3.8/site-packages/torch/include -I/opt/conda/envs/ptca/lib/python3.8/site-packages/torch/include/torch/csrc/api/include -I/opt/conda/envs/ptca/lib/python3.8/site-packages/torch/include/TH -I/opt/conda/envs/ptca/lib/python3.8/site-packages/torch/include/THC -I/opt/conda/envs/ptca/lib/python3.8/site-packages/torch/include/THH -I/opt/rocm-5.6.1/include -I/opt/conda/envs/ptca/include/python3.8 -c -c /tmp/tmp.2H7X7ZCvOC/flash-attention/csrc/flash_attn_rocm/src/flash_bwd_runner_grouped_hdim64_fp16_causal_gfx9x_hip.hip -o /tmp/tmp.2H7X7ZCvOC/flash-attention/build/temp.linux-x86_64-cpython-38/csrc/flash_attn_rocm/src/flash_bwd_runner_grouped_hdim64_fp16_causal_gfx9x_hip.o -fPIC -D__HIP_PLATFORM_HCC__=1 -DUSE_ROCM=1 -DCUDA_HAS_FP16=1 -D__HIP_NO_HALF_OPERATORS__=1 -D__HIP_NO_HALF_CONVERSIONS__=1 -O3 -std=c++20 -DNDEBUG -U__CUDA_NO_HALF_OPERATORS__ -U__CUDA_NO_HALF_CONVERSIONS__ --offload-arch=gfx90a -DTORCH_API_INCLUDE_EXTENSION_H '-DPYBIND11_COMPILER_TYPE="_gcc"' '-DPYBIND11_STDLIB="_libstdcpp"' '-DPYBIND11_BUILD_ABI="_cxxabi1011"' -DTORCH_EXTENSION_NAME=flash_attn_2_cuda -D_GLIBCXX_USE_CXX11_ABI=0 -fno-gpu-rdc
fatal error: error in backend: Not supported instr: <MCInst 0 <MCOperand Reg:4622> <MCOperand Reg:4702> <MCOperand Expr:(.LBB2_9)> <MCOperand Reg:4766> <MCOperand Expr:(.LBB2_-1)>>
clang-16: error: clang frontend command failed with exit code 70 (use -v to see invocation)
AMD clang version 16.0.0 (https://github.com/RadeonOpenCompute/llvm-project roc-5.6.1 23332 4f9bb99d78a4d8d9770be38b91ebd004ea4d2a3a)
Target: x86_64-unknown-linux-gnu
Thread model: posix
InstalledDir: /opt/rocm-5.6.1/llvm/bin
clang-16: note: diagnostic msg: Error generating preprocessed source(s).
[42/50] /opt/rocm-5.6.1/bin/hipcc -I/tmp/tmp.2H7X7ZCvOC/flash-attention/csrc/flash_attn_rocm -I/tmp/tmp.2H7X7ZCvOC/flash-attention/csrc/flash_attn_rocm/src -I/tmp/tmp.2H7X7ZCvOC/flash-attention/csrc/flash_attn_rocm/composable_kernel/include -I/tmp/tmp.2H7X7ZCvOC/flash-attention/csrc/flash_attn_rocm/composable_kernel/library/include -I/opt/conda/envs/ptca/lib/python3.8/site-packages/torch/include -I/opt/conda/envs/ptca/lib/python3.8/site-packages/torch/include/torch/csrc/api/include -I/opt/conda/envs/ptca/lib/python3.8/site-packages/torch/include/TH -I/opt/conda/envs/ptca/lib/python3.8/site-packages/torch/include/THC -I/opt/conda/envs/ptca/lib/python3.8/site-packages/torch/include/THH -I/opt/rocm-5.6.1/include -I/opt/conda/envs/ptca/include/python3.8 -c -c /tmp/tmp.2H7X7ZCvOC/flash-attention/csrc/flash_attn_rocm/src/flash_bwd_runner_batched_hdim64_fp16_noncausal_gfx9x_hip.hip -o /tmp/tmp.2H7X7ZCvOC/flash-attention/build/temp.linux-x86_64-cpython-38/csrc/flash_attn_rocm/src/flash_bwd_runner_batched_hdim64_fp16_noncausal_gfx9x_hip.o -fPIC -D__HIP_PLATFORM_HCC__=1 -DUSE_ROCM=1 -DCUDA_HAS_FP16=1 -D__HIP_NO_HALF_OPERATORS__=1 -D__HIP_NO_HALF_CONVERSIONS__=1 -O3 -std=c++20 -DNDEBUG -U__CUDA_NO_HALF_OPERATORS__ -U__CUDA_NO_HALF_CONVERSIONS__ --offload-arch=gfx90a -DTORCH_API_INCLUDE_EXTENSION_H '-DPYBIND11_COMPILER_TYPE="_gcc"' '-DPYBIND11_STDLIB="_libstdcpp"' '-DPYBIND11_BUILD_ABI="_cxxabi1011"' -DTORCH_EXTENSION_NAME=flash_attn_2_cuda -D_GLIBCXX_USE_CXX11_ABI=0 -fno-gpu-rdc
[43/50] /opt/rocm-5.6.1/bin/hipcc -I/tmp/tmp.2H7X7ZCvOC/flash-attention/csrc/flash_attn_rocm -I/tmp/tmp.2H7X7ZCvOC/flash-attention/csrc/flash_attn_rocm/src -I/tmp/tmp.2H7X7ZCvOC/flash-attention/csrc/flash_attn_rocm/composable_kernel/include -I/tmp/tmp.2H7X7ZCvOC/flash-attention/csrc/flash_attn_rocm/composable_kernel/library/include -I/opt/conda/envs/ptca/lib/python3.8/site-packages/torch/include -I/opt/conda/envs/ptca/lib/python3.8/site-packages/torch/include/torch/csrc/api/include -I/opt/conda/envs/ptca/lib/python3.8/site-packages/torch/include/TH -I/opt/conda/envs/ptca/lib/python3.8/site-packages/torch/include/THC -I/opt/conda/envs/ptca/lib/python3.8/site-packages/torch/include/THH -I/opt/rocm-5.6.1/include -I/opt/conda/envs/ptca/include/python3.8 -c -c /tmp/tmp.2H7X7ZCvOC/flash-attention/csrc/flash_attn_rocm/src/flash_bwd_runner_grouped_hdim128_bf16_noncausal_gfx9x_hip.hip -o /tmp/tmp.2H7X7ZCvOC/flash-attention/build/temp.linux-x86_64-cpython-38/csrc/flash_attn_rocm/src/flash_bwd_runner_grouped_hdim128_bf16_noncausal_gfx9x_hip.o -fPIC -D__HIP_PLATFORM_HCC__=1 -DUSE_ROCM=1 -DCUDA_HAS_FP16=1 -D__HIP_NO_HALF_OPERATORS__=1 -D__HIP_NO_HALF_CONVERSIONS__=1 -O3 -std=c++20 -DNDEBUG -U__CUDA_NO_HALF_OPERATORS__ -U__CUDA_NO_HALF_CONVERSIONS__ --offload-arch=gfx90a -DTORCH_API_INCLUDE_EXTENSION_H '-DPYBIND11_COMPILER_TYPE="_gcc"' '-DPYBIND11_STDLIB="_libstdcpp"' '-DPYBIND11_BUILD_ABI="_cxxabi1011"' -DTORCH_EXTENSION_NAME=flash_attn_2_cuda -D_GLIBCXX_USE_CXX11_ABI=0 -fno-gpu-rdc
[44/50] /opt/rocm-5.6.1/bin/hipcc -I/tmp/tmp.2H7X7ZCvOC/flash-attention/csrc/flash_attn_rocm -I/tmp/tmp.2H7X7ZCvOC/flash-attention/csrc/flash_attn_rocm/src -I/tmp/tmp.2H7X7ZCvOC/flash-attention/csrc/flash_attn_rocm/composable_kernel/include -I/tmp/tmp.2H7X7ZCvOC/flash-attention/csrc/flash_attn_rocm/composable_kernel/library/include -I/opt/conda/envs/ptca/lib/python3.8/site-packages/torch/include -I/opt/conda/envs/ptca/lib/python3.8/site-packages/torch/include/torch/csrc/api/include -I/opt/conda/envs/ptca/lib/python3.8/site-packages/torch/include/TH -I/opt/conda/envs/ptca/lib/python3.8/site-packages/torch/include/THC -I/opt/conda/envs/ptca/lib/python3.8/site-packages/torch/include/THH -I/opt/rocm-5.6.1/include -I/opt/conda/envs/ptca/include/python3.8 -c -c /tmp/tmp.2H7X7ZCvOC/flash-attention/csrc/flash_attn_rocm/src/flash_bwd_runner_grouped_hdim32_bf16_noncausal_gfx9x_hip.hip -o /tmp/tmp.2H7X7ZCvOC/flash-attention/build/temp.linux-x86_64-cpython-38/csrc/flash_attn_rocm/src/flash_bwd_runner_grouped_hdim32_bf16_noncausal_gfx9x_hip.o -fPIC -D__HIP_PLATFORM_HCC__=1 -DUSE_ROCM=1 -DCUDA_HAS_FP16=1 -D__HIP_NO_HALF_OPERATORS__=1 -D__HIP_NO_HALF_CONVERSIONS__=1 -O3 -std=c++20 -DNDEBUG -U__CUDA_NO_HALF_OPERATORS__ -U__CUDA_NO_HALF_CONVERSIONS__ --offload-arch=gfx90a -DTORCH_API_INCLUDE_EXTENSION_H '-DPYBIND11_COMPILER_TYPE="_gcc"' '-DPYBIND11_STDLIB="_libstdcpp"' '-DPYBIND11_BUILD_ABI="_cxxabi1011"' -DTORCH_EXTENSION_NAME=flash_attn_2_cuda -D_GLIBCXX_USE_CXX11_ABI=0 -fno-gpu-rdc
[45/50] /opt/rocm-5.6.1/bin/hipcc -I/tmp/tmp.2H7X7ZCvOC/flash-attention/csrc/flash_attn_rocm -I/tmp/tmp.2H7X7ZCvOC/flash-attention/csrc/flash_attn_rocm/src -I/tmp/tmp.2H7X7ZCvOC/flash-attention/csrc/flash_attn_rocm/composable_kernel/include -I/tmp/tmp.2H7X7ZCvOC/flash-attention/csrc/flash_attn_rocm/composable_kernel/library/include -I/opt/conda/envs/ptca/lib/python3.8/site-packages/torch/include -I/opt/conda/envs/ptca/lib/python3.8/site-packages/torch/include/torch/csrc/api/include -I/opt/conda/envs/ptca/lib/python3.8/site-packages/torch/include/TH -I/opt/conda/envs/ptca/lib/python3.8/site-packages/torch/include/THC -I/opt/conda/envs/ptca/lib/python3.8/site-packages/torch/include/THH -I/opt/rocm-5.6.1/include -I/opt/conda/envs/ptca/include/python3.8 -c -c /tmp/tmp.2H7X7ZCvOC/flash-attention/csrc/flash_attn_rocm/src/flash_bwd_runner_batched_hdim128_bf16_noncausal_gfx9x_hip.hip -o /tmp/tmp.2H7X7ZCvOC/flash-attention/build/temp.linux-x86_64-cpython-38/csrc/flash_attn_rocm/src/flash_bwd_runner_batched_hdim128_bf16_noncausal_gfx9x_hip.o -fPIC -D__HIP_PLATFORM_HCC__=1 -DUSE_ROCM=1 -DCUDA_HAS_FP16=1 -D__HIP_NO_HALF_OPERATORS__=1 -D__HIP_NO_HALF_CONVERSIONS__=1 -O3 -std=c++20 -DNDEBUG -U__CUDA_NO_HALF_OPERATORS__ -U__CUDA_NO_HALF_CONVERSIONS__ --offload-arch=gfx90a -DTORCH_API_INCLUDE_EXTENSION_H '-DPYBIND11_COMPILER_TYPE="_gcc"' '-DPYBIND11_STDLIB="_libstdcpp"' '-DPYBIND11_BUILD_ABI="_cxxabi1011"' -DTORCH_EXTENSION_NAME=flash_attn_2_cuda -D_GLIBCXX_USE_CXX11_ABI=0 -fno-gpu-rdc
What GPU are you using? AFAIK, this repository only supports MI200-series and newer GPUs.
@AlpinDale I'm on Singularity.ND96asr_MI200_v4 (MI200).
I can successfully compile the kernels with the image rocm/pytorch:rocm5.7_ubuntu22.04_py3.10_pytorch_2.0.1, but the build fails with the image base/job/pytorch/acpt-rocm5.6.1_ubuntu20.04_py3.8_pytorch_2.0.1:20230925T234619269 (registry: singularitybase.azurecr.io).
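Since the failure only shows up with the ROCm 5.6.1 toolchain, it can help to gate the long build on the toolchain version up front. A minimal sketch (the `BROKEN_HIPCC` set and `check_rocm` helper are hypothetical, not an official compatibility list; in practice the version string would be parsed from `hipcc --version` output):

```python
# Hypothetical pre-build guard for the hipcc backend bug seen in the logs.
# The set of broken releases is an assumption based on this thread only.
BROKEN_HIPCC = {"5.6.1"}

def check_rocm(version: str) -> str:
    """Return a human-readable verdict for a given ROCm release string."""
    if version in BROKEN_HIPCC:
        return (f"ROCm {version}: known hipcc backend bug on the backward "
                "causal-mask kernels; use a ROCm 5.6 or >= 5.7 image instead")
    return f"ROCm {version}: proceeding with flash-attention build"

print(check_rocm("5.6.1"))
```

Running the check before `python setup.py install` avoids waiting ~30 translation units into the build before hitting the `fatal error: error in backend` crash.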
[14/50] /opt/rocm-5.6.1/bin/hipcc -I/tmp/tmp.4kFKFTE2zc/flash-attention/csrc/flash_attn_rocm -I/tmp/tmp.4kFKFTE2zc/flash-attention/csrc/flash_attn_rocm/src -I/tmp/tmp.4kFKFTE2zc/flash-attention/csrc/flash_attn_rocm/composable_kernel/include -I/tmp/tmp.4kFKFTE2zc/flash-attention/csrc/flash_attn_rocm/composable_kernel/library/include -I/opt/conda/envs/ptca/lib/python3.8/site-packages/torch/include -I/opt/conda/envs/ptca/lib/python3.8/site-packages/torch/include/torch/csrc/api/include -I/opt/conda/envs/ptca/lib/python3.8/site-packages/torch/include/TH -I/opt/conda/envs/ptca/lib/python3.8/site-packages/torch/include/THC -I/opt/conda/envs/ptca/lib/python3.8/site-packages/torch/include/THH -I/opt/rocm-5.6.1/include -I/opt/conda/envs/ptca/include/python3.8 -c -c /tmp/tmp.4kFKFTE2zc/flash-attention/csrc/flash_attn_rocm/src/flash_fwd_runner_grouped_hdim128_fp16_noncausal_gfx9x_hip.hip -o /tmp/tmp.4kFKFTE2zc/flash-attention/build/temp.linux-x86_64-cpython-38/csrc/flash_attn_rocm/src/flash_fwd_runner_grouped_hdim128_fp16_noncausal_gfx9x_hip.o -fPIC -D__HIP_PLATFORM_HCC__=1 -DUSE_ROCM=1 -DCUDA_HAS_FP16=1 -D__HIP_NO_HALF_OPERATORS__=1 -D__HIP_NO_HALF_CONVERSIONS__=1 -O3 -std=c++20 -DNDEBUG -U__CUDA_NO_HALF_OPERATORS__ -U__CUDA_NO_HALF_CONVERSIONS__ --offload-arch=gfx90a -DTORCH_API_INCLUDE_EXTENSION_H '-DPYBIND11_COMPILER_TYPE="_gcc"' '-DPYBIND11_STDLIB="_libstdcpp"' '-DPYBIND11_BUILD_ABI="_cxxabi1011"' -DTORCH_EXTENSION_NAME=flash_attn_2_cuda -D_GLIBCXX_USE_CXX11_ABI=0 -fno-gpu-rdc
[15/50] /opt/rocm-5.6.1/bin/hipcc -I/tmp/tmp.4kFKFTE2zc/flash-attention/csrc/flash_attn_rocm -I/tmp/tmp.4kFKFTE2zc/flash-attention/csrc/flash_attn_rocm/src -I/tmp/tmp.4kFKFTE2zc/flash-attention/csrc/flash_attn_rocm/composable_kernel/include -I/tmp/tmp.4kFKFTE2zc/flash-attention/csrc/flash_attn_rocm/composable_kernel/library/include -I/opt/conda/envs/ptca/lib/python3.8/site-packages/torch/include -I/opt/conda/envs/ptca/lib/python3.8/site-packages/torch/include/torch/csrc/api/include -I/opt/conda/envs/ptca/lib/python3.8/site-packages/torch/include/TH -I/opt/conda/envs/ptca/lib/python3.8/site-packages/torch/include/THC -I/opt/conda/envs/ptca/lib/python3.8/site-packages/torch/include/THH -I/opt/rocm-5.6.1/include -I/opt/conda/envs/ptca/include/python3.8 -c -c /tmp/tmp.4kFKFTE2zc/flash-attention/csrc/flash_attn_rocm/src/flash_bwd_runner_batched_hdim32_bf16_causal_gfx9x_hip.hip -o /tmp/tmp.4kFKFTE2zc/flash-attention/build/temp.linux-x86_64-cpython-38/csrc/flash_attn_rocm/src/flash_bwd_runner_batched_hdim32_bf16_causal_gfx9x_hip.o -fPIC -D__HIP_PLATFORM_HCC__=1 -DUSE_ROCM=1 -DCUDA_HAS_FP16=1 -D__HIP_NO_HALF_OPERATORS__=1 -D__HIP_NO_HALF_CONVERSIONS__=1 -O3 -std=c++20 -DNDEBUG -U__CUDA_NO_HALF_OPERATORS__ -U__CUDA_NO_HALF_CONVERSIONS__ --offload-arch=gfx90a -DTORCH_API_INCLUDE_EXTENSION_H '-DPYBIND11_COMPILER_TYPE="_gcc"' '-DPYBIND11_STDLIB="_libstdcpp"' '-DPYBIND11_BUILD_ABI="_cxxabi1011"' -DTORCH_EXTENSION_NAME=flash_attn_2_cuda -D_GLIBCXX_USE_CXX11_ABI=0 -fno-gpu-rdc
FAILED: /tmp/tmp.4kFKFTE2zc/flash-attention/build/temp.linux-x86_64-cpython-38/csrc/flash_attn_rocm/src/flash_bwd_runner_batched_hdim32_bf16_causal_gfx9x_hip.o
/opt/rocm-5.6.1/bin/hipcc -I/tmp/tmp.4kFKFTE2zc/flash-attention/csrc/flash_attn_rocm -I/tmp/tmp.4kFKFTE2zc/flash-attention/csrc/flash_attn_rocm/src -I/tmp/tmp.4kFKFTE2zc/flash-attention/csrc/flash_attn_rocm/composable_kernel/include -I/tmp/tmp.4kFKFTE2zc/flash-attention/csrc/flash_attn_rocm/composable_kernel/library/include -I/opt/conda/envs/ptca/lib/python3.8/site-packages/torch/include -I/opt/conda/envs/ptca/lib/python3.8/site-packages/torch/include/torch/csrc/api/include -I/opt/conda/envs/ptca/lib/python3.8/site-packages/torch/include/TH -I/opt/conda/envs/ptca/lib/python3.8/site-packages/torch/include/THC -I/opt/conda/envs/ptca/lib/python3.8/site-packages/torch/include/THH -I/opt/rocm-5.6.1/include -I/opt/conda/envs/ptca/include/python3.8 -c -c /tmp/tmp.4kFKFTE2zc/flash-attention/csrc/flash_attn_rocm/src/flash_bwd_runner_batched_hdim32_bf16_causal_gfx9x_hip.hip -o /tmp/tmp.4kFKFTE2zc/flash-attention/build/temp.linux-x86_64-cpython-38/csrc/flash_attn_rocm/src/flash_bwd_runner_batched_hdim32_bf16_causal_gfx9x_hip.o -fPIC -D__HIP_PLATFORM_HCC__=1 -DUSE_ROCM=1 -DCUDA_HAS_FP16=1 -D__HIP_NO_HALF_OPERATORS__=1 -D__HIP_NO_HALF_CONVERSIONS__=1 -O3 -std=c++20 -DNDEBUG -U__CUDA_NO_HALF_OPERATORS__ -U__CUDA_NO_HALF_CONVERSIONS__ --offload-arch=gfx90a -DTORCH_API_INCLUDE_EXTENSION_H '-DPYBIND11_COMPILER_TYPE="_gcc"' '-DPYBIND11_STDLIB="_libstdcpp"' '-DPYBIND11_BUILD_ABI="_cxxabi1011"' -DTORCH_EXTENSION_NAME=flash_attn_2_cuda -D_GLIBCXX_USE_CXX11_ABI=0 -fno-gpu-rdc
fatal error: error in backend: Not supported instr: <MCInst 0 <MCOperand Reg:519> <MCOperand Reg:527> <MCOperand Expr:(.LBB2_3)> <MCOperand Reg:487> <MCOperand Expr:(.LBB2_-1)>>
clang-16: error: clang frontend command failed with exit code 70 (use -v to see invocation)
AMD clang version 16.0.0 (https://github.com/RadeonOpenCompute/llvm-project roc-5.6.1 23332 4f9bb99d78a4d8d9770be38b91ebd004ea4d2a3a)
Target: x86_64-unknown-linux-gnu
Thread model: posix
InstalledDir: /opt/rocm-5.6.1/llvm/bin
clang-16: note: diagnostic msg: Error generating preprocessed source(s).
[16/50] /opt/rocm-5.6.1/bin/hipcc -I/tmp/tmp.4kFKFTE2zc/flash-attention/csrc/flash_attn_rocm -I/tmp/tmp.4kFKFTE2zc/flash-attention/csrc/flash_attn_rocm/src -I/tmp/tmp.4kFKFTE2zc/flash-attention/csrc/flash_attn_rocm/composable_kernel/include -I/tmp/tmp.4kFKFTE2zc/flash-attention/csrc/flash_attn_rocm/composable_kernel/library/include -I/opt/conda/envs/ptca/lib/python3.8/site-packages/torch/include -I/opt/conda/envs/ptca/lib/python3.8/site-packages/torch/include/torch/csrc/api/include -I/opt/conda/envs/ptca/lib/python3.8/site-packages/torch/include/TH -I/opt/conda/envs/ptca/lib/python3.8/site-packages/torch/include/THC -I/opt/conda/envs/ptca/lib/python3.8/site-packages/torch/include/THH -I/opt/rocm-5.6.1/include -I/opt/conda/envs/ptca/include/python3.8 -c -c /tmp/tmp.4kFKFTE2zc/flash-attention/csrc/flash_attn_rocm/src/flash_fwd_runner_grouped_hdim128_bf16_causal_gfx9x_hip.hip -o /tmp/tmp.4kFKFTE2zc/flash-attention/build/temp.linux-x86_64-cpython-38/csrc/flash_attn_rocm/src/flash_fwd_runner_grouped_hdim128_bf16_causal_gfx9x_hip.o -fPIC -D__HIP_PLATFORM_HCC__=1 -DUSE_ROCM=1 -DCUDA_HAS_FP16=1 -D__HIP_NO_HALF_OPERATORS__=1 -D__HIP_NO_HALF_CONVERSIONS__=1 -O3 -std=c++20 -DNDEBUG -U__CUDA_NO_HALF_OPERATORS__ -U__CUDA_NO_HALF_CONVERSIONS__ --offload-arch=gfx90a -DTORCH_API_INCLUDE_EXTENSION_H '-DPYBIND11_COMPILER_TYPE="_gcc"' '-DPYBIND11_STDLIB="_libstdcpp"' '-DPYBIND11_BUILD_ABI="_cxxabi1011"' -DTORCH_EXTENSION_NAME=flash_attn_2_cuda -D_GLIBCXX_USE_CXX11_ABI=0 -fno-gpu-rdc
[17/50] /opt/rocm-5.6.1/bin/hipcc -I/tmp/tmp.4kFKFTE2zc/flash-attention/csrc/flash_attn_rocm -I/tmp/tmp.4kFKFTE2zc/flash-attention/csrc/flash_attn_rocm/src -I/tmp/tmp.4kFKFTE2zc/flash-attention/csrc/flash_attn_rocm/composable_kernel/include -I/tmp/tmp.4kFKFTE2zc/flash-attention/csrc/flash_attn_rocm/composable_kernel/library/include -I/opt/conda/envs/ptca/lib/python3.8/site-packages/torch/include -I/opt/conda/envs/ptca/lib/python3.8/site-packages/torch/include/torch/csrc/api/include -I/opt/conda/envs/ptca/lib/python3.8/site-packages/torch/include/TH -I/opt/conda/envs/ptca/lib/python3.8/site-packages/torch/include/THC -I/opt/conda/envs/ptca/lib/python3.8/site-packages/torch/include/THH -I/opt/rocm-5.6.1/include -I/opt/conda/envs/ptca/include/python3.8 -c -c /tmp/tmp.4kFKFTE2zc/flash-attention/csrc/flash_attn_rocm/src/flash_fwd_runner_batched_hdim128_bf16_causal_gfx9x_hip.hip -o /tmp/tmp.4kFKFTE2zc/flash-attention/build/temp.linux-x86_64-cpython-38/csrc/flash_attn_rocm/src/flash_fwd_runner_batched_hdim128_bf16_causal_gfx9x_hip.o -fPIC -D__HIP_PLATFORM_HCC__=1 -DUSE_ROCM=1 -DCUDA_HAS_FP16=1 -D__HIP_NO_HALF_OPERATORS__=1 -D__HIP_NO_HALF_CONVERSIONS__=1 -O3 -std=c++20 -DNDEBUG -U__CUDA_NO_HALF_OPERATORS__ -U__CUDA_NO_HALF_CONVERSIONS__ --offload-arch=gfx90a -DTORCH_API_INCLUDE_EXTENSION_H '-DPYBIND11_COMPILER_TYPE="_gcc"' '-DPYBIND11_STDLIB="_libstdcpp"' '-DPYBIND11_BUILD_ABI="_cxxabi1011"' -DTORCH_EXTENSION_NAME=flash_attn_2_cuda -D_GLIBCXX_USE_CXX11_ABI=0 -fno-gpu-rdc
[18/50] /opt/rocm-5.6.1/bin/hipcc -I/tmp/tmp.4kFKFTE2zc/flash-attention/csrc/flash_attn_rocm -I/tmp/tmp.4kFKFTE2zc/flash-attention/csrc/flash_attn_rocm/src -I/tmp/tmp.4kFKFTE2zc/flash-attention/csrc/flash_attn_rocm/composable_kernel/include -I/tmp/tmp.4kFKFTE2zc/flash-attention/csrc/flash_attn_rocm/composable_kernel/library/include -I/opt/conda/envs/ptca/lib/python3.8/site-packages/torch/include -I/opt/conda/envs/ptca/lib/python3.8/site-packages/torch/include/torch/csrc/api/include -I/opt/conda/envs/ptca/lib/python3.8/site-packages/torch/include/TH -I/opt/conda/envs/ptca/lib/python3.8/site-packages/torch/include/THC -I/opt/conda/envs/ptca/lib/python3.8/site-packages/torch/include/THH -I/opt/rocm-5.6.1/include -I/opt/conda/envs/ptca/include/python3.8 -c -c /tmp/tmp.4kFKFTE2zc/flash-attention/csrc/flash_attn_rocm/src/flash_fwd_runner_grouped_hdim128_fp16_causal_gfx9x_hip.hip -o /tmp/tmp.4kFKFTE2zc/flash-attention/build/temp.linux-x86_64-cpython-38/csrc/flash_attn_rocm/src/flash_fwd_runner_grouped_hdim128_fp16_causal_gfx9x_hip.o -fPIC -D__HIP_PLATFORM_HCC__=1 -DUSE_ROCM=1 -DCUDA_HAS_FP16=1 -D__HIP_NO_HALF_OPERATORS__=1 -D__HIP_NO_HALF_CONVERSIONS__=1 -O3 -std=c++20 -DNDEBUG -U__CUDA_NO_HALF_OPERATORS__ -U__CUDA_NO_HALF_CONVERSIONS__ --offload-arch=gfx90a -DTORCH_API_INCLUDE_EXTENSION_H '-DPYBIND11_COMPILER_TYPE="_gcc"' '-DPYBIND11_STDLIB="_libstdcpp"' '-DPYBIND11_BUILD_ABI="_cxxabi1011"' -DTORCH_EXTENSION_NAME=flash_attn_2_cuda -D_GLIBCXX_USE_CXX11_ABI=0 -fno-gpu-rdc
I confirmed @donglixp's error above. The clang compiler used by ROCm 5.6.1 is shown below; it fails with the same "fatal error: error in backend: Not supported instr: <MCInst 0 ...>" message.
@AlpinDale @jeffdaily What's the workaround or fix for flash attention on ROCm 5.6.1?
/opt/rocm-5.6.1/bin/hipcc --version
HIP version: 5.6.31062-73ed8adfd
AMD clang version 16.0.0
@howiejayz please reproduce this compiler issue on ROCm 5.6.1. We will try to find a workaround in FA.
@donglixp @LiweiPeng The issue is caused by the backward causal-mask kernel not being supported by the hipcc in ROCm 5.6.1. Is this feature necessary for you? If not, I can provide a version with it disabled.
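Until the kernel builds, one way to keep training code running is to fall back to PyTorch's reference attention when the extension is unavailable. A sketch (the `attention_with_fallback` wrapper is hypothetical, not part of flash-attn; it assumes torch >= 2.0 for `scaled_dot_product_attention`):

```python
import torch
import torch.nn.functional as F

def attention_with_fallback(q, q_k, v, causal=True):
    """Hypothetical wrapper: use the flash-attn kernel when the extension
    built successfully, otherwise fall back to PyTorch's reference
    scaled_dot_product_attention. q, q_k, v: (batch, heads, seqlen, head_dim)."""
    try:
        from flash_attn import flash_attn_func  # absent if the ROCm build failed
    except ImportError:
        return F.scaled_dot_product_attention(q, q_k, v, is_causal=causal)
    # flash_attn_func expects (batch, seqlen, heads, head_dim)
    out = flash_attn_func(q.transpose(1, 2), q_k.transpose(1, 2),
                          v.transpose(1, 2), causal=causal)
    return out.transpose(1, 2)

q = torch.randn(1, 2, 4, 8)
out = attention_with_fallback(q, q, q, causal=True)
```

The fallback is much slower and uses more memory than the fused kernel, but it preserves the causal mask in both forward and backward, which the proposed workaround would drop.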
@donglixp @LiweiPeng Is it possible for you to use ROCm 5.6 instead? The WA proposed by @howiejayz kills some fundamental features.
@sabreshao Two questions: 1) By 'ROCm 5.6', do you mean the ROCm 5.6 docker image? If so, we are OK with that image. 2) Does the ROCm 5.6 docker image work with flash attention? I haven't tested it myself.
@LiweiPeng 1, yes; 2, yes.
Thanks. We'll test the ROCm 5.6 docker image.
I came across the same issue with rocm/pytorch:rocm5.7_ubuntu22.04_py3.10_pytorch_2.0.1
image: base/job/pytorch/acpt-rocm5.6.1_ubuntu20.04_py3.8_pytorch_2.0.1:20230925T234619269 registry: singularitybase.azurecr.io
The error log: