ROCm / flash-attention

Fast and memory-efficient exact attention
BSD 3-Clause "New" or "Revised" License

installation error #28

Open · donglixp opened this issue 6 months ago

donglixp commented 6 months ago

image: base/job/pytorch/acpt-rocm5.6.1_ubuntu20.04_py3.8_pytorch_2.0.1:20230925T234619269
registry: singularitybase.azurecr.io
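
For anyone reproducing this outside the cluster, pulling the same container should look roughly like the sketch below; the image and registry come from the report, while the docker invocation and device flags are assumptions (the usual ROCm container setup, access to the registry permitting):

docker pull singularitybase.azurecr.io/base/job/pytorch/acpt-rocm5.6.1_ubuntu20.04_py3.8_pytorch_2.0.1:20230925T234619269
# ROCm containers typically need the KFD/DRI devices and the video group mapped in
docker run -it --device=/dev/kfd --device=/dev/dri --group-add video \
    singularitybase.azurecr.io/base/job/pytorch/acpt-rocm5.6.1_ubuntu20.04_py3.8_pytorch_2.0.1:20230925T234619269 /bin/bash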

cd $(mktemp -d)                      # build in a throwaway directory
git clone --recursive https://github.com/ROCmSoftwarePlatform/flash-attention.git
export GPU_ARCHS="gfx90a"            # target arch for MI200-series GPUs
cd flash-attention
export PYTHON_SITE_PACKAGES=$(python -c 'import site; print(site.getsitepackages()[0])')
# apply the repo's patch to PyTorch's hipify utility before building
patch "${PYTHON_SITE_PACKAGES}/torch/utils/hipify/hipify_python.py" hipify_patch.patch
pip install .
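
Before the build, a quick toolchain sanity check along these lines can confirm the wheel is compiled against the expected ROCm stack (a sketch; the /opt/rocm-5.6.1 path is taken from the log below):

# torch should report a HIP build (torch.version.hip non-empty), not a CUDA one
python -c "import torch; print(torch.__version__, torch.version.hip)"
# the agent list should include gfx90a, matching GPU_ARCHS above
/opt/rocm-5.6.1/bin/rocminfo | grep -i gfx
/opt/rocm-5.6.1/bin/hipcc --version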

The error log:

      /tmp/tmp.2H7X7ZCvOC/flash-attention/csrc/flash_attn_rocm/src/flash_bwd_runner_grouped_hdim32_fp16_causal_gfx9x_hip.hip -> /tmp/tmp.2H7X7ZCvOC/flash-attention/csrc/flash_attn_rocm/src/flash_bwd_runner_grouped_hdim32_fp16_causal_gfx9x_hip.hip [skipped, no changes]
      /tmp/tmp.2H7X7ZCvOC/flash-attention/csrc/flash_attn_rocm/src/flash_bwd_runner_batched_hdim64_bf16_causal_gfx9x.hip -> /tmp/tmp.2H7X7ZCvOC/flash-attention/csrc/flash_attn_rocm/src/flash_bwd_runner_batched_hdim64_bf16_causal_gfx9x_hip.hip [skipped, already hipified]
      /tmp/tmp.2H7X7ZCvOC/flash-attention/csrc/flash_attn_rocm/src/flash_fwd_runner_grouped_hdim128_fp16_causal_gfx9x.hip -> /tmp/tmp.2H7X7ZCvOC/flash-attention/csrc/flash_attn_rocm/src/flash_fwd_runner_grouped_hdim128_fp16_causal_gfx9x_hip.hip [skipped, already hipified]
      /tmp/tmp.2H7X7ZCvOC/flash-attention/csrc/flash_attn_rocm/src/flash_bwd_runner_grouped_hdim64_bf16_noncausal_gfx9x.hip -> /tmp/tmp.2H7X7ZCvOC/flash-attention/csrc/flash_attn_rocm/src/flash_bwd_runner_grouped_hdim64_bf16_noncausal_gfx9x_hip.hip [skipped, already hipified]
      /tmp/tmp.2H7X7ZCvOC/flash-attention/csrc/flash_attn_rocm/src/flash_fwd_runner_batched_hdim32_fp16_causal_gfx9x_hip.hip -> /tmp/tmp.2H7X7ZCvOC/flash-attention/csrc/flash_attn_rocm/src/flash_fwd_runner_batched_hdim32_fp16_causal_gfx9x_hip.hip [skipped, no changes]
      /tmp/tmp.2H7X7ZCvOC/flash-attention/csrc/flash_attn_rocm/src/flash_fwd_runner_grouped_hdim64_fp16_causal_gfx9x_hip.hip -> /tmp/tmp.2H7X7ZCvOC/flash-attention/csrc/flash_attn_rocm/src/flash_fwd_runner_grouped_hdim64_fp16_causal_gfx9x_hip.hip [skipped, no changes]
      /tmp/tmp.2H7X7ZCvOC/flash-attention/csrc/flash_attn_rocm/src/flash_fwd_runner_batched_hdim32_fp16_noncausal_gfx9x.hip -> /tmp/tmp.2H7X7ZCvOC/flash-attention/csrc/flash_attn_rocm/src/flash_fwd_runner_batched_hdim32_fp16_noncausal_gfx9x_hip.hip [skipped, already hipified]
      /tmp/tmp.2H7X7ZCvOC/flash-attention/csrc/flash_attn_rocm/src/flash_bwd_runner_grouped_hdim128_fp16_noncausal_gfx9x.hip -> /tmp/tmp.2H7X7ZCvOC/flash-attention/csrc/flash_attn_rocm/src/flash_bwd_runner_grouped_hdim128_fp16_noncausal_gfx9x_hip.hip [skipped, already hipified]
      /tmp/tmp.2H7X7ZCvOC/flash-attention/csrc/flash_attn_rocm/src/flash_bwd_runner_batched_hdim128_bf16_causal_gfx9x.hip -> /tmp/tmp.2H7X7ZCvOC/flash-attention/csrc/flash_attn_rocm/src/flash_bwd_runner_batched_hdim128_bf16_causal_gfx9x_hip.hip [skipped, already hipified]
      /tmp/tmp.2H7X7ZCvOC/flash-attention/csrc/flash_attn_rocm/src/flash_bwd_runner_grouped_hdim64_bf16_causal_gfx9x.hip -> /tmp/tmp.2H7X7ZCvOC/flash-attention/csrc/flash_attn_rocm/src/flash_bwd_runner_grouped_hdim64_bf16_causal_gfx9x_hip.hip [skipped, already hipified]
      /tmp/tmp.2H7X7ZCvOC/flash-attention/csrc/flash_attn_rocm/src/flash_fwd_runner_batched_hdim64_fp16_causal_gfx9x_hip.hip -> /tmp/tmp.2H7X7ZCvOC/flash-attention/csrc/flash_attn_rocm/src/flash_fwd_runner_batched_hdim64_fp16_causal_gfx9x_hip.hip [skipped, no changes]
      /tmp/tmp.2H7X7ZCvOC/flash-attention/csrc/flash_attn_rocm/src/flash_bwd_runner_batched_hdim64_fp16_noncausal_gfx9x_hip.hip -> /tmp/tmp.2H7X7ZCvOC/flash-attention/csrc/flash_attn_rocm/src/flash_bwd_runner_batched_hdim64_fp16_noncausal_gfx9x_hip.hip [skipped, no changes]
      /tmp/tmp.2H7X7ZCvOC/flash-attention/csrc/flash_attn_rocm/src/flash_fwd_runner_batched_hdim64_bf16_causal_gfx9x.hip -> /tmp/tmp.2H7X7ZCvOC/flash-attention/csrc/flash_attn_rocm/src/flash_fwd_runner_batched_hdim64_bf16_causal_gfx9x_hip.hip [skipped, already hipified]
      /tmp/tmp.2H7X7ZCvOC/flash-attention/csrc/flash_attn_rocm/src/flash_fwd_runner_batched_hdim128_bf16_causal_gfx9x.hip -> /tmp/tmp.2H7X7ZCvOC/flash-attention/csrc/flash_attn_rocm/src/flash_fwd_runner_batched_hdim128_bf16_causal_gfx9x_hip.hip [skipped, already hipified]
      /tmp/tmp.2H7X7ZCvOC/flash-attention/csrc/flash_attn_rocm/src/flash_fwd_runner_grouped_hdim128_fp16_noncausal_gfx9x.hip -> /tmp/tmp.2H7X7ZCvOC/flash-attention/csrc/flash_attn_rocm/src/flash_fwd_runner_grouped_hdim128_fp16_noncausal_gfx9x_hip.hip [skipped, already hipified]
      /tmp/tmp.2H7X7ZCvOC/flash-attention/csrc/flash_attn_rocm/src/flash_bwd_runner_batched_hdim128_bf16_noncausal_gfx9x.hip -> /tmp/tmp.2H7X7ZCvOC/flash-attention/csrc/flash_attn_rocm/src/flash_bwd_runner_batched_hdim128_bf16_noncausal_gfx9x_hip.hip [skipped, already hipified]
      /tmp/tmp.2H7X7ZCvOC/flash-attention/csrc/flash_attn_rocm/src/flash_bwd_runner_grouped_hdim128_bf16_causal_gfx9x_hip.hip -> /tmp/tmp.2H7X7ZCvOC/flash-attention/csrc/flash_attn_rocm/src/flash_bwd_runner_grouped_hdim128_bf16_causal_gfx9x_hip.hip [skipped, no changes]
      /tmp/tmp.2H7X7ZCvOC/flash-attention/csrc/flash_attn_rocm/src/flash_fwd_runner_grouped_hdim64_bf16_causal_gfx9x_hip.hip -> /tmp/tmp.2H7X7ZCvOC/flash-attention/csrc/flash_attn_rocm/src/flash_fwd_runner_grouped_hdim64_bf16_causal_gfx9x_hip.hip [skipped, no changes]
      /tmp/tmp.2H7X7ZCvOC/flash-attention/csrc/flash_attn_rocm/src/flash_fwd_runner_grouped_hdim64_bf16_causal_gfx9x.hip -> /tmp/tmp.2H7X7ZCvOC/flash-attention/csrc/flash_attn_rocm/src/flash_fwd_runner_grouped_hdim64_bf16_causal_gfx9x_hip.hip [skipped, already hipified]
      /tmp/tmp.2H7X7ZCvOC/flash-attention/csrc/flash_attn_rocm/src/flash_bwd_runner_grouped_hdim64_bf16_causal_gfx9x_hip.hip -> /tmp/tmp.2H7X7ZCvOC/flash-attention/csrc/flash_attn_rocm/src/flash_bwd_runner_grouped_hdim64_bf16_causal_gfx9x_hip.hip [skipped, no changes]
      /tmp/tmp.2H7X7ZCvOC/flash-attention/csrc/flash_attn_rocm/src/flash_bwd_runner_grouped_hdim32_fp16_noncausal_gfx9x_hip.hip -> /tmp/tmp.2H7X7ZCvOC/flash-attention/csrc/flash_attn_rocm/src/flash_bwd_runner_grouped_hdim32_fp16_noncausal_gfx9x_hip.hip [skipped, no changes]
      /tmp/tmp.2H7X7ZCvOC/flash-attention/csrc/flash_attn_rocm/src/flash_fwd_runner_batched_hdim128_bf16_noncausal_gfx9x_hip.hip -> /tmp/tmp.2H7X7ZCvOC/flash-attention/csrc/flash_attn_rocm/src/flash_fwd_runner_batched_hdim128_bf16_noncausal_gfx9x_hip.hip [skipped, no changes]
      /tmp/tmp.2H7X7ZCvOC/flash-attention/csrc/flash_attn_rocm/src/flash_bwd_runner_grouped_hdim128_fp16_causal_gfx9x_hip.hip -> /tmp/tmp.2H7X7ZCvOC/flash-attention/csrc/flash_attn_rocm/src/flash_bwd_runner_grouped_hdim128_fp16_causal_gfx9x_hip.hip [skipped, no changes]
      /tmp/tmp.2H7X7ZCvOC/flash-attention/csrc/flash_attn_rocm/src/flash_bwd_runner_batched_hdim64_bf16_noncausal_gfx9x.hip -> /tmp/tmp.2H7X7ZCvOC/flash-attention/csrc/flash_attn_rocm/src/flash_bwd_runner_batched_hdim64_bf16_noncausal_gfx9x_hip.hip [skipped, already hipified]
      /tmp/tmp.2H7X7ZCvOC/flash-attention/csrc/flash_attn_rocm/src/flash_bwd_runner_grouped_hdim128_fp16_noncausal_gfx9x_hip.hip -> /tmp/tmp.2H7X7ZCvOC/flash-attention/csrc/flash_attn_rocm/src/flash_bwd_runner_grouped_hdim128_fp16_noncausal_gfx9x_hip.hip [skipped, no changes]
      /tmp/tmp.2H7X7ZCvOC/flash-attention/csrc/flash_attn_rocm/src/flash_bwd_runner_grouped_hdim64_fp16_causal_gfx9x_hip.hip -> /tmp/tmp.2H7X7ZCvOC/flash-attention/csrc/flash_attn_rocm/src/flash_bwd_runner_grouped_hdim64_fp16_causal_gfx9x_hip.hip [skipped, no changes]
      Force skipping hipification of CK file: /tmp/tmp.2H7X7ZCvOC/flash-attention/csrc/flash_attn_rocm/composable_kernel/include/ck/host_utility/hip_check_error.hpp
      /tmp/tmp.2H7X7ZCvOC/flash-attention/csrc/flash_attn_rocm/src/device_memory.hip -> /tmp/tmp.2H7X7ZCvOC/flash-attention/csrc/flash_attn_rocm/src/device_memory.hip [skipped, no changes]
      /tmp/tmp.2H7X7ZCvOC/flash-attention/csrc/flash_attn_rocm/src/flash_bwd_runner_batched_hdim128_fp16_noncausal_gfx9x_hip.hip -> /tmp/tmp.2H7X7ZCvOC/flash-attention/csrc/flash_attn_rocm/src/flash_bwd_runner_batched_hdim128_fp16_noncausal_gfx9x_hip.hip [skipped, no changes]
      /tmp/tmp.2H7X7ZCvOC/flash-attention/csrc/flash_attn_rocm/src/flash_bwd_runner_batched_hdim128_bf16_causal_gfx9x_hip.hip -> /tmp/tmp.2H7X7ZCvOC/flash-attention/csrc/flash_attn_rocm/src/flash_bwd_runner_batched_hdim128_bf16_causal_gfx9x_hip.hip [skipped, no changes]
      /tmp/tmp.2H7X7ZCvOC/flash-attention/csrc/flash_attn_rocm/src/flash_bwd_runner_grouped_hdim128_fp16_causal_gfx9x.hip -> /tmp/tmp.2H7X7ZCvOC/flash-attention/csrc/flash_attn_rocm/src/flash_bwd_runner_grouped_hdim128_fp16_causal_gfx9x_hip.hip [skipped, already hipified]
      /tmp/tmp.2H7X7ZCvOC/flash-attention/csrc/flash_attn_rocm/src/flash_bwd_runner_batched_hdim64_bf16_causal_gfx9x_hip.hip -> /tmp/tmp.2H7X7ZCvOC/flash-attention/csrc/flash_attn_rocm/src/flash_bwd_runner_batched_hdim64_bf16_causal_gfx9x_hip.hip [skipped, no changes]
      /tmp/tmp.2H7X7ZCvOC/flash-attention/csrc/flash_attn_rocm/src/flash_bwd_runner_grouped_hdim128_bf16_noncausal_gfx9x.hip -> /tmp/tmp.2H7X7ZCvOC/flash-attention/csrc/flash_attn_rocm/src/flash_bwd_runner_grouped_hdim128_bf16_noncausal_gfx9x_hip.hip [skipped, already hipified]
      /tmp/tmp.2H7X7ZCvOC/flash-attention/csrc/flash_attn_rocm/src/flash_fwd_runner_batched_hdim128_bf16_noncausal_gfx9x.hip -> /tmp/tmp.2H7X7ZCvOC/flash-attention/csrc/flash_attn_rocm/src/flash_fwd_runner_batched_hdim128_bf16_noncausal_gfx9x_hip.hip [skipped, already hipified]
      /tmp/tmp.2H7X7ZCvOC/flash-attention/csrc/flash_attn_rocm/src/flash_fwd_runner_grouped_hdim64_bf16_noncausal_gfx9x_hip.hip -> /tmp/tmp.2H7X7ZCvOC/flash-attention/csrc/flash_attn_rocm/src/flash_fwd_runner_grouped_hdim64_bf16_noncausal_gfx9x_hip.hip [skipped, no changes]
      /tmp/tmp.2H7X7ZCvOC/flash-attention/csrc/flash_attn_rocm/src/flash_bwd_runner_grouped_hdim32_bf16_causal_gfx9x_hip.hip -> /tmp/tmp.2H7X7ZCvOC/flash-attention/csrc/flash_attn_rocm/src/flash_bwd_runner_grouped_hdim32_bf16_causal_gfx9x_hip.hip [skipped, no changes]
      /tmp/tmp.2H7X7ZCvOC/flash-attention/csrc/flash_attn_rocm/src/flash_fwd_runner_batched_hdim32_bf16_causal_gfx9x.hip -> /tmp/tmp.2H7X7ZCvOC/flash-attention/csrc/flash_attn_rocm/src/flash_fwd_runner_batched_hdim32_bf16_causal_gfx9x_hip.hip [skipped, already hipified]
      /tmp/tmp.2H7X7ZCvOC/flash-attention/csrc/flash_attn_rocm/src/flash_fwd_runner_grouped_hdim64_fp16_noncausal_gfx9x.hip -> /tmp/tmp.2H7X7ZCvOC/flash-attention/csrc/flash_attn_rocm/src/flash_fwd_runner_grouped_hdim64_fp16_noncausal_gfx9x_hip.hip [skipped, already hipified]
      /tmp/tmp.2H7X7ZCvOC/flash-attention/csrc/flash_attn_rocm/src/flash_bwd_runner_grouped_hdim64_fp16_causal_gfx9x.hip -> /tmp/tmp.2H7X7ZCvOC/flash-attention/csrc/flash_attn_rocm/src/flash_bwd_runner_grouped_hdim64_fp16_causal_gfx9x_hip.hip [skipped, already hipified]
      /tmp/tmp.2H7X7ZCvOC/flash-attention/csrc/flash_attn_rocm/src/flash_bwd_runner_batched_hdim64_fp16_noncausal_gfx9x.hip -> /tmp/tmp.2H7X7ZCvOC/flash-attention/csrc/flash_attn_rocm/src/flash_bwd_runner_batched_hdim64_fp16_noncausal_gfx9x_hip.hip [skipped, already hipified]
      /tmp/tmp.2H7X7ZCvOC/flash-attention/csrc/flash_attn_rocm/src/flash_bwd_runner_batched_hdim64_fp16_causal_gfx9x_hip.hip -> /tmp/tmp.2H7X7ZCvOC/flash-attention/csrc/flash_attn_rocm/src/flash_bwd_runner_batched_hdim64_fp16_causal_gfx9x_hip.hip [skipped, no changes]
      /tmp/tmp.2H7X7ZCvOC/flash-attention/csrc/flash_attn_rocm/src/flash_bwd_runner_grouped_hdim64_fp16_noncausal_gfx9x.hip -> /tmp/tmp.2H7X7ZCvOC/flash-attention/csrc/flash_attn_rocm/src/flash_bwd_runner_grouped_hdim64_fp16_noncausal_gfx9x_hip.hip [skipped, already hipified]
      /tmp/tmp.2H7X7ZCvOC/flash-attention/csrc/flash_attn_rocm/src/flash_fwd_runner_batched_hdim64_fp16_noncausal_gfx9x.hip -> /tmp/tmp.2H7X7ZCvOC/flash-attention/csrc/flash_attn_rocm/src/flash_fwd_runner_batched_hdim64_fp16_noncausal_gfx9x_hip.hip [skipped, already hipified]
      /tmp/tmp.2H7X7ZCvOC/flash-attention/csrc/flash_attn_rocm/src/flash_bwd_runner_grouped_hdim32_bf16_noncausal_gfx9x_hip.hip -> /tmp/tmp.2H7X7ZCvOC/flash-attention/csrc/flash_attn_rocm/src/flash_bwd_runner_grouped_hdim32_bf16_noncausal_gfx9x_hip.hip [skipped, no changes]
      /tmp/tmp.2H7X7ZCvOC/flash-attention/csrc/flash_attn_rocm/src/flash_fwd_runner_batched_hdim32_bf16_causal_gfx9x_hip.hip -> /tmp/tmp.2H7X7ZCvOC/flash-attention/csrc/flash_attn_rocm/src/flash_fwd_runner_batched_hdim32_bf16_causal_gfx9x_hip.hip [skipped, no changes]
      /tmp/tmp.2H7X7ZCvOC/flash-attention/csrc/flash_attn_rocm/src/flash_bwd_runner_batched_hdim32_fp16_noncausal_gfx9x.hip -> /tmp/tmp.2H7X7ZCvOC/flash-attention/csrc/flash_attn_rocm/src/flash_bwd_runner_batched_hdim32_fp16_noncausal_gfx9x_hip.hip [skipped, already hipified]
      /tmp/tmp.2H7X7ZCvOC/flash-attention/csrc/flash_attn_rocm/src/flash_fwd_runner_grouped_hdim128_bf16_noncausal_gfx9x.hip -> /tmp/tmp.2H7X7ZCvOC/flash-attention/csrc/flash_attn_rocm/src/flash_fwd_runner_grouped_hdim128_bf16_noncausal_gfx9x_hip.hip [skipped, already hipified]
      /tmp/tmp.2H7X7ZCvOC/flash-attention/csrc/flash_attn_rocm/src/flash_fwd_runner_batched_hdim64_bf16_noncausal_gfx9x_hip.hip -> /tmp/tmp.2H7X7ZCvOC/flash-attention/csrc/flash_attn_rocm/src/flash_fwd_runner_batched_hdim64_bf16_noncausal_gfx9x_hip.hip [skipped, no changes]
      Successfully preprocessed all matching files.
      Total number of unsupported CUDA function calls: 0

      Total number of replaced kernel launches: 1
      INFO:root:running bdist_wheel
      2023/12/09 01:56:01 INFO mlflow.tracking.fluent: Autologging successfully enabled for sklearn.
      2023/12/09 01:56:01 INFO mlflow.tracking.fluent: Autologging successfully enabled for transformers.
      INFO:root:running build
      INFO:root:running build_py
      INFO:root:creating build
      INFO:root:creating build/lib.linux-x86_64-cpython-38
      INFO:root:creating build/lib.linux-x86_64-cpython-38/flash_attn
      INFO:root:copying flash_attn/flash_blocksparse_attn_interface.py -> build/lib.linux-x86_64-cpython-38/flash_attn
      INFO:root:copying flash_attn/__init__.py -> build/lib.linux-x86_64-cpython-38/flash_attn
      INFO:root:copying flash_attn/flash_attn_interface.py -> build/lib.linux-x86_64-cpython-38/flash_attn
      INFO:root:copying flash_attn/flash_attn_triton.py -> build/lib.linux-x86_64-cpython-38/flash_attn
      INFO:root:copying flash_attn/flash_attn_triton_og.py -> build/lib.linux-x86_64-cpython-38/flash_attn
      INFO:root:copying flash_attn/fused_softmax.py -> build/lib.linux-x86_64-cpython-38/flash_attn
      INFO:root:copying flash_attn/bert_padding.py -> build/lib.linux-x86_64-cpython-38/flash_attn
      INFO:root:copying flash_attn/flash_blocksparse_attention.py -> build/lib.linux-x86_64-cpython-38/flash_attn
      INFO:root:creating build/lib.linux-x86_64-cpython-38/flash_attn/models
      INFO:root:copying flash_attn/models/__init__.py -> build/lib.linux-x86_64-cpython-38/flash_attn/models
      INFO:root:copying flash_attn/models/llama.py -> build/lib.linux-x86_64-cpython-38/flash_attn/models
      INFO:root:copying flash_attn/models/gpt_neox.py -> build/lib.linux-x86_64-cpython-38/flash_attn/models
      INFO:root:copying flash_attn/models/vit.py -> build/lib.linux-x86_64-cpython-38/flash_attn/models
      INFO:root:copying flash_attn/models/falcon.py -> build/lib.linux-x86_64-cpython-38/flash_attn/models
      INFO:root:copying flash_attn/models/bert.py -> build/lib.linux-x86_64-cpython-38/flash_attn/models
      INFO:root:copying flash_attn/models/gptj.py -> build/lib.linux-x86_64-cpython-38/flash_attn/models
      INFO:root:copying flash_attn/models/opt.py -> build/lib.linux-x86_64-cpython-38/flash_attn/models
      INFO:root:copying flash_attn/models/gpt.py -> build/lib.linux-x86_64-cpython-38/flash_attn/models
      INFO:root:creating build/lib.linux-x86_64-cpython-38/flash_attn/ops
      INFO:root:copying flash_attn/ops/__init__.py -> build/lib.linux-x86_64-cpython-38/flash_attn/ops
      INFO:root:copying flash_attn/ops/activations.py -> build/lib.linux-x86_64-cpython-38/flash_attn/ops
      INFO:root:copying flash_attn/ops/rms_norm.py -> build/lib.linux-x86_64-cpython-38/flash_attn/ops
      INFO:root:copying flash_attn/ops/layer_norm.py -> build/lib.linux-x86_64-cpython-38/flash_attn/ops
      INFO:root:copying flash_attn/ops/fused_dense.py -> build/lib.linux-x86_64-cpython-38/flash_attn/ops
      INFO:root:creating build/lib.linux-x86_64-cpython-38/flash_attn/losses
      INFO:root:copying flash_attn/losses/__init__.py -> build/lib.linux-x86_64-cpython-38/flash_attn/losses
      INFO:root:copying flash_attn/losses/cross_entropy.py -> build/lib.linux-x86_64-cpython-38/flash_attn/losses
      INFO:root:creating build/lib.linux-x86_64-cpython-38/flash_attn/modules
      INFO:root:copying flash_attn/modules/mlp.py -> build/lib.linux-x86_64-cpython-38/flash_attn/modules
      INFO:root:copying flash_attn/modules/embedding.py -> build/lib.linux-x86_64-cpython-38/flash_attn/modules
      INFO:root:copying flash_attn/modules/__init__.py -> build/lib.linux-x86_64-cpython-38/flash_attn/modules
      INFO:root:copying flash_attn/modules/mha.py -> build/lib.linux-x86_64-cpython-38/flash_attn/modules
      INFO:root:copying flash_attn/modules/block.py -> build/lib.linux-x86_64-cpython-38/flash_attn/modules
      INFO:root:creating build/lib.linux-x86_64-cpython-38/flash_attn/layers
      INFO:root:copying flash_attn/layers/__init__.py -> build/lib.linux-x86_64-cpython-38/flash_attn/layers
      INFO:root:copying flash_attn/layers/rotary.py -> build/lib.linux-x86_64-cpython-38/flash_attn/layers
      INFO:root:copying flash_attn/layers/patch_embed.py -> build/lib.linux-x86_64-cpython-38/flash_attn/layers
      INFO:root:creating build/lib.linux-x86_64-cpython-38/flash_attn/utils
      INFO:root:copying flash_attn/utils/distributed.py -> build/lib.linux-x86_64-cpython-38/flash_attn/utils
      INFO:root:copying flash_attn/utils/generation.py -> build/lib.linux-x86_64-cpython-38/flash_attn/utils
      INFO:root:copying flash_attn/utils/__init__.py -> build/lib.linux-x86_64-cpython-38/flash_attn/utils
      INFO:root:copying flash_attn/utils/benchmark.py -> build/lib.linux-x86_64-cpython-38/flash_attn/utils
      INFO:root:copying flash_attn/utils/pretrained.py -> build/lib.linux-x86_64-cpython-38/flash_attn/utils
      INFO:root:running build_ext
      INFO:root:building 'flash_attn_2_cuda' extension
      INFO:root:creating /tmp/tmp.2H7X7ZCvOC/flash-attention/build/temp.linux-x86_64-cpython-38
      INFO:root:creating /tmp/tmp.2H7X7ZCvOC/flash-attention/build/temp.linux-x86_64-cpython-38/csrc
      INFO:root:creating /tmp/tmp.2H7X7ZCvOC/flash-attention/build/temp.linux-x86_64-cpython-38/csrc/flash_attn_rocm
      INFO:root:creating /tmp/tmp.2H7X7ZCvOC/flash-attention/build/temp.linux-x86_64-cpython-38/csrc/flash_attn_rocm/src
      2023/12/09 01:56:07 INFO mlflow.tracking.fluent: Autologging successfully enabled for sklearn.
      2023/12/09 01:56:07 INFO mlflow.tracking.fluent: Autologging successfully enabled for transformers.
      Emitting ninja build file /tmp/tmp.2H7X7ZCvOC/flash-attention/build/temp.linux-x86_64-cpython-38/build.ninja...
      Compiling objects...
      Allowing ninja to set a default number of workers... (overridable by setting the environment variable MAX_JOBS=N)
      2023/12/09 01:56:12 INFO mlflow.tracking.fluent: Autologging successfully enabled for sklearn.
      2023/12/09 01:56:12 INFO mlflow.tracking.fluent: Autologging successfully enabled for transformers.
      [1/50] /opt/rocm-5.6.1/bin/hipcc  -I/tmp/tmp.2H7X7ZCvOC/flash-attention/csrc/flash_attn_rocm -I/tmp/tmp.2H7X7ZCvOC/flash-attention/csrc/flash_attn_rocm/src -I/tmp/tmp.2H7X7ZCvOC/flash-attention/csrc/flash_attn_rocm/composable_kernel/include -I/tmp/tmp.2H7X7ZCvOC/flash-attention/csrc/flash_attn_rocm/composable_kernel/library/include -I/opt/conda/envs/ptca/lib/python3.8/site-packages/torch/include -I/opt/conda/envs/ptca/lib/python3.8/site-packages/torch/include/torch/csrc/api/include -I/opt/conda/envs/ptca/lib/python3.8/site-packages/torch/include/TH -I/opt/conda/envs/ptca/lib/python3.8/site-packages/torch/include/THC -I/opt/conda/envs/ptca/lib/python3.8/site-packages/torch/include/THH -I/opt/rocm-5.6.1/include -I/opt/conda/envs/ptca/include/python3.8 -c -c /tmp/tmp.2H7X7ZCvOC/flash-attention/csrc/flash_attn_rocm/src/device_memory.hip -o /tmp/tmp.2H7X7ZCvOC/flash-attention/build/temp.linux-x86_64-cpython-38/csrc/flash_attn_rocm/src/device_memory.o -fPIC -D__HIP_PLATFORM_HCC__=1 -DUSE_ROCM=1 -DCUDA_HAS_FP16=1 -D__HIP_NO_HALF_OPERATORS__=1 -D__HIP_NO_HALF_CONVERSIONS__=1 -O3 -std=c++20 -DNDEBUG -U__CUDA_NO_HALF_OPERATORS__ -U__CUDA_NO_HALF_CONVERSIONS__ --offload-arch=gfx90a -DTORCH_API_INCLUDE_EXTENSION_H '-DPYBIND11_COMPILER_TYPE="_gcc"' '-DPYBIND11_STDLIB="_libstdcpp"' '-DPYBIND11_BUILD_ABI="_cxxabi1011"' -DTORCH_EXTENSION_NAME=flash_attn_2_cuda -D_GLIBCXX_USE_CXX11_ABI=0 -fno-gpu-rdc
      [2/50] /opt/rocm-5.6.1/bin/hipcc  -I/tmp/tmp.2H7X7ZCvOC/flash-attention/csrc/flash_attn_rocm -I/tmp/tmp.2H7X7ZCvOC/flash-attention/csrc/flash_attn_rocm/src -I/tmp/tmp.2H7X7ZCvOC/flash-attention/csrc/flash_attn_rocm/composable_kernel/include -I/tmp/tmp.2H7X7ZCvOC/flash-attention/csrc/flash_attn_rocm/composable_kernel/library/include -I/opt/conda/envs/ptca/lib/python3.8/site-packages/torch/include -I/opt/conda/envs/ptca/lib/python3.8/site-packages/torch/include/torch/csrc/api/include -I/opt/conda/envs/ptca/lib/python3.8/site-packages/torch/include/TH -I/opt/conda/envs/ptca/lib/python3.8/site-packages/torch/include/THC -I/opt/conda/envs/ptca/lib/python3.8/site-packages/torch/include/THH -I/opt/rocm-5.6.1/include -I/opt/conda/envs/ptca/include/python3.8 -c -c /tmp/tmp.2H7X7ZCvOC/flash-attention/csrc/flash_attn_rocm/flash_api_hip.hip -o /tmp/tmp.2H7X7ZCvOC/flash-attention/build/temp.linux-x86_64-cpython-38/csrc/flash_attn_rocm/flash_api_hip.o -fPIC -D__HIP_PLATFORM_HCC__=1 -DUSE_ROCM=1 -DCUDA_HAS_FP16=1 -D__HIP_NO_HALF_OPERATORS__=1 -D__HIP_NO_HALF_CONVERSIONS__=1 -O3 -std=c++20 -DNDEBUG -U__CUDA_NO_HALF_OPERATORS__ -U__CUDA_NO_HALF_CONVERSIONS__ --offload-arch=gfx90a -DTORCH_API_INCLUDE_EXTENSION_H '-DPYBIND11_COMPILER_TYPE="_gcc"' '-DPYBIND11_STDLIB="_libstdcpp"' '-DPYBIND11_BUILD_ABI="_cxxabi1011"' -DTORCH_EXTENSION_NAME=flash_attn_2_cuda -D_GLIBCXX_USE_CXX11_ABI=0 -fno-gpu-rdc
      [3/50] /opt/rocm-5.6.1/bin/hipcc  -I/tmp/tmp.2H7X7ZCvOC/flash-attention/csrc/flash_attn_rocm -I/tmp/tmp.2H7X7ZCvOC/flash-attention/csrc/flash_attn_rocm/src -I/tmp/tmp.2H7X7ZCvOC/flash-attention/csrc/flash_attn_rocm/composable_kernel/include -I/tmp/tmp.2H7X7ZCvOC/flash-attention/csrc/flash_attn_rocm/composable_kernel/library/include -I/opt/conda/envs/ptca/lib/python3.8/site-packages/torch/include -I/opt/conda/envs/ptca/lib/python3.8/site-packages/torch/include/torch/csrc/api/include -I/opt/conda/envs/ptca/lib/python3.8/site-packages/torch/include/TH -I/opt/conda/envs/ptca/lib/python3.8/site-packages/torch/include/THC -I/opt/conda/envs/ptca/lib/python3.8/site-packages/torch/include/THH -I/opt/rocm-5.6.1/include -I/opt/conda/envs/ptca/include/python3.8 -c -c /tmp/tmp.2H7X7ZCvOC/flash-attention/csrc/flash_attn_rocm/src/flash_fwd_runner_batched_hdim32_fp16_noncausal_gfx9x_hip.hip -o /tmp/tmp.2H7X7ZCvOC/flash-attention/build/temp.linux-x86_64-cpython-38/csrc/flash_attn_rocm/src/flash_fwd_runner_batched_hdim32_fp16_noncausal_gfx9x_hip.o -fPIC -D__HIP_PLATFORM_HCC__=1 -DUSE_ROCM=1 -DCUDA_HAS_FP16=1 -D__HIP_NO_HALF_OPERATORS__=1 -D__HIP_NO_HALF_CONVERSIONS__=1 -O3 -std=c++20 -DNDEBUG -U__CUDA_NO_HALF_OPERATORS__ -U__CUDA_NO_HALF_CONVERSIONS__ --offload-arch=gfx90a -DTORCH_API_INCLUDE_EXTENSION_H '-DPYBIND11_COMPILER_TYPE="_gcc"' '-DPYBIND11_STDLIB="_libstdcpp"' '-DPYBIND11_BUILD_ABI="_cxxabi1011"' -DTORCH_EXTENSION_NAME=flash_attn_2_cuda -D_GLIBCXX_USE_CXX11_ABI=0 -fno-gpu-rdc
      [4/50] /opt/rocm-5.6.1/bin/hipcc  -I/tmp/tmp.2H7X7ZCvOC/flash-attention/csrc/flash_attn_rocm -I/tmp/tmp.2H7X7ZCvOC/flash-attention/csrc/flash_attn_rocm/src -I/tmp/tmp.2H7X7ZCvOC/flash-attention/csrc/flash_attn_rocm/composable_kernel/include -I/tmp/tmp.2H7X7ZCvOC/flash-attention/csrc/flash_attn_rocm/composable_kernel/library/include -I/opt/conda/envs/ptca/lib/python3.8/site-packages/torch/include -I/opt/conda/envs/ptca/lib/python3.8/site-packages/torch/include/torch/csrc/api/include -I/opt/conda/envs/ptca/lib/python3.8/site-packages/torch/include/TH -I/opt/conda/envs/ptca/lib/python3.8/site-packages/torch/include/THC -I/opt/conda/envs/ptca/lib/python3.8/site-packages/torch/include/THH -I/opt/rocm-5.6.1/include -I/opt/conda/envs/ptca/include/python3.8 -c -c /tmp/tmp.2H7X7ZCvOC/flash-attention/csrc/flash_attn_rocm/src/flash_fwd_runner_grouped_hdim32_fp16_noncausal_gfx9x_hip.hip -o /tmp/tmp.2H7X7ZCvOC/flash-attention/build/temp.linux-x86_64-cpython-38/csrc/flash_attn_rocm/src/flash_fwd_runner_grouped_hdim32_fp16_noncausal_gfx9x_hip.o -fPIC -D__HIP_PLATFORM_HCC__=1 -DUSE_ROCM=1 -DCUDA_HAS_FP16=1 -D__HIP_NO_HALF_OPERATORS__=1 -D__HIP_NO_HALF_CONVERSIONS__=1 -O3 -std=c++20 -DNDEBUG -U__CUDA_NO_HALF_OPERATORS__ -U__CUDA_NO_HALF_CONVERSIONS__ --offload-arch=gfx90a -DTORCH_API_INCLUDE_EXTENSION_H '-DPYBIND11_COMPILER_TYPE="_gcc"' '-DPYBIND11_STDLIB="_libstdcpp"' '-DPYBIND11_BUILD_ABI="_cxxabi1011"' -DTORCH_EXTENSION_NAME=flash_attn_2_cuda -D_GLIBCXX_USE_CXX11_ABI=0 -fno-gpu-rdc
      [5/50] /opt/rocm-5.6.1/bin/hipcc  -I/tmp/tmp.2H7X7ZCvOC/flash-attention/csrc/flash_attn_rocm -I/tmp/tmp.2H7X7ZCvOC/flash-attention/csrc/flash_attn_rocm/src -I/tmp/tmp.2H7X7ZCvOC/flash-attention/csrc/flash_attn_rocm/composable_kernel/include -I/tmp/tmp.2H7X7ZCvOC/flash-attention/csrc/flash_attn_rocm/composable_kernel/library/include -I/opt/conda/envs/ptca/lib/python3.8/site-packages/torch/include -I/opt/conda/envs/ptca/lib/python3.8/site-packages/torch/include/torch/csrc/api/include -I/opt/conda/envs/ptca/lib/python3.8/site-packages/torch/include/TH -I/opt/conda/envs/ptca/lib/python3.8/site-packages/torch/include/THC -I/opt/conda/envs/ptca/lib/python3.8/site-packages/torch/include/THH -I/opt/rocm-5.6.1/include -I/opt/conda/envs/ptca/include/python3.8 -c -c /tmp/tmp.2H7X7ZCvOC/flash-attention/csrc/flash_attn_rocm/src/flash_fwd_runner_grouped_hdim32_bf16_noncausal_gfx9x_hip.hip -o /tmp/tmp.2H7X7ZCvOC/flash-attention/build/temp.linux-x86_64-cpython-38/csrc/flash_attn_rocm/src/flash_fwd_runner_grouped_hdim32_bf16_noncausal_gfx9x_hip.o -fPIC -D__HIP_PLATFORM_HCC__=1 -DUSE_ROCM=1 -DCUDA_HAS_FP16=1 -D__HIP_NO_HALF_OPERATORS__=1 -D__HIP_NO_HALF_CONVERSIONS__=1 -O3 -std=c++20 -DNDEBUG -U__CUDA_NO_HALF_OPERATORS__ -U__CUDA_NO_HALF_CONVERSIONS__ --offload-arch=gfx90a -DTORCH_API_INCLUDE_EXTENSION_H '-DPYBIND11_COMPILER_TYPE="_gcc"' '-DPYBIND11_STDLIB="_libstdcpp"' '-DPYBIND11_BUILD_ABI="_cxxabi1011"' -DTORCH_EXTENSION_NAME=flash_attn_2_cuda -D_GLIBCXX_USE_CXX11_ABI=0 -fno-gpu-rdc
      [6/50] /opt/rocm-5.6.1/bin/hipcc  -I/tmp/tmp.2H7X7ZCvOC/flash-attention/csrc/flash_attn_rocm -I/tmp/tmp.2H7X7ZCvOC/flash-attention/csrc/flash_attn_rocm/src -I/tmp/tmp.2H7X7ZCvOC/flash-attention/csrc/flash_attn_rocm/composable_kernel/include -I/tmp/tmp.2H7X7ZCvOC/flash-attention/csrc/flash_attn_rocm/composable_kernel/library/include -I/opt/conda/envs/ptca/lib/python3.8/site-packages/torch/include -I/opt/conda/envs/ptca/lib/python3.8/site-packages/torch/include/torch/csrc/api/include -I/opt/conda/envs/ptca/lib/python3.8/site-packages/torch/include/TH -I/opt/conda/envs/ptca/lib/python3.8/site-packages/torch/include/THC -I/opt/conda/envs/ptca/lib/python3.8/site-packages/torch/include/THH -I/opt/rocm-5.6.1/include -I/opt/conda/envs/ptca/include/python3.8 -c -c /tmp/tmp.2H7X7ZCvOC/flash-attention/csrc/flash_attn_rocm/src/flash_fwd_runner_batched_hdim32_bf16_noncausal_gfx9x_hip.hip -o /tmp/tmp.2H7X7ZCvOC/flash-attention/build/temp.linux-x86_64-cpython-38/csrc/flash_attn_rocm/src/flash_fwd_runner_batched_hdim32_bf16_noncausal_gfx9x_hip.o -fPIC -D__HIP_PLATFORM_HCC__=1 -DUSE_ROCM=1 -DCUDA_HAS_FP16=1 -D__HIP_NO_HALF_OPERATORS__=1 -D__HIP_NO_HALF_CONVERSIONS__=1 -O3 -std=c++20 -DNDEBUG -U__CUDA_NO_HALF_OPERATORS__ -U__CUDA_NO_HALF_CONVERSIONS__ --offload-arch=gfx90a -DTORCH_API_INCLUDE_EXTENSION_H '-DPYBIND11_COMPILER_TYPE="_gcc"' '-DPYBIND11_STDLIB="_libstdcpp"' '-DPYBIND11_BUILD_ABI="_cxxabi1011"' -DTORCH_EXTENSION_NAME=flash_attn_2_cuda -D_GLIBCXX_USE_CXX11_ABI=0 -fno-gpu-rdc
      [7/50] /opt/rocm-5.6.1/bin/hipcc  -I/tmp/tmp.2H7X7ZCvOC/flash-attention/csrc/flash_attn_rocm -I/tmp/tmp.2H7X7ZCvOC/flash-attention/csrc/flash_attn_rocm/src -I/tmp/tmp.2H7X7ZCvOC/flash-attention/csrc/flash_attn_rocm/composable_kernel/include -I/tmp/tmp.2H7X7ZCvOC/flash-attention/csrc/flash_attn_rocm/composable_kernel/library/include -I/opt/conda/envs/ptca/lib/python3.8/site-packages/torch/include -I/opt/conda/envs/ptca/lib/python3.8/site-packages/torch/include/torch/csrc/api/include -I/opt/conda/envs/ptca/lib/python3.8/site-packages/torch/include/TH -I/opt/conda/envs/ptca/lib/python3.8/site-packages/torch/include/THC -I/opt/conda/envs/ptca/lib/python3.8/site-packages/torch/include/THH -I/opt/rocm-5.6.1/include -I/opt/conda/envs/ptca/include/python3.8 -c -c /tmp/tmp.2H7X7ZCvOC/flash-attention/csrc/flash_attn_rocm/src/flash_fwd_runner_batched_hdim32_bf16_causal_gfx9x_hip.hip -o /tmp/tmp.2H7X7ZCvOC/flash-attention/build/temp.linux-x86_64-cpython-38/csrc/flash_attn_rocm/src/flash_fwd_runner_batched_hdim32_bf16_causal_gfx9x_hip.o -fPIC -D__HIP_PLATFORM_HCC__=1 -DUSE_ROCM=1 -DCUDA_HAS_FP16=1 -D__HIP_NO_HALF_OPERATORS__=1 -D__HIP_NO_HALF_CONVERSIONS__=1 -O3 -std=c++20 -DNDEBUG -U__CUDA_NO_HALF_OPERATORS__ -U__CUDA_NO_HALF_CONVERSIONS__ --offload-arch=gfx90a -DTORCH_API_INCLUDE_EXTENSION_H '-DPYBIND11_COMPILER_TYPE="_gcc"' '-DPYBIND11_STDLIB="_libstdcpp"' '-DPYBIND11_BUILD_ABI="_cxxabi1011"' -DTORCH_EXTENSION_NAME=flash_attn_2_cuda -D_GLIBCXX_USE_CXX11_ABI=0 -fno-gpu-rdc
      [8/50] /opt/rocm-5.6.1/bin/hipcc  -I/tmp/tmp.2H7X7ZCvOC/flash-attention/csrc/flash_attn_rocm -I/tmp/tmp.2H7X7ZCvOC/flash-attention/csrc/flash_attn_rocm/src -I/tmp/tmp.2H7X7ZCvOC/flash-attention/csrc/flash_attn_rocm/composable_kernel/include -I/tmp/tmp.2H7X7ZCvOC/flash-attention/csrc/flash_attn_rocm/composable_kernel/library/include -I/opt/conda/envs/ptca/lib/python3.8/site-packages/torch/include -I/opt/conda/envs/ptca/lib/python3.8/site-packages/torch/include/torch/csrc/api/include -I/opt/conda/envs/ptca/lib/python3.8/site-packages/torch/include/TH -I/opt/conda/envs/ptca/lib/python3.8/site-packages/torch/include/THC -I/opt/conda/envs/ptca/lib/python3.8/site-packages/torch/include/THH -I/opt/rocm-5.6.1/include -I/opt/conda/envs/ptca/include/python3.8 -c -c /tmp/tmp.2H7X7ZCvOC/flash-attention/csrc/flash_attn_rocm/src/flash_fwd_runner_grouped_hdim32_fp16_causal_gfx9x_hip.hip -o /tmp/tmp.2H7X7ZCvOC/flash-attention/build/temp.linux-x86_64-cpython-38/csrc/flash_attn_rocm/src/flash_fwd_runner_grouped_hdim32_fp16_causal_gfx9x_hip.o -fPIC -D__HIP_PLATFORM_HCC__=1 -DUSE_ROCM=1 -DCUDA_HAS_FP16=1 -D__HIP_NO_HALF_OPERATORS__=1 -D__HIP_NO_HALF_CONVERSIONS__=1 -O3 -std=c++20 -DNDEBUG -U__CUDA_NO_HALF_OPERATORS__ -U__CUDA_NO_HALF_CONVERSIONS__ --offload-arch=gfx90a -DTORCH_API_INCLUDE_EXTENSION_H '-DPYBIND11_COMPILER_TYPE="_gcc"' '-DPYBIND11_STDLIB="_libstdcpp"' '-DPYBIND11_BUILD_ABI="_cxxabi1011"' -DTORCH_EXTENSION_NAME=flash_attn_2_cuda -D_GLIBCXX_USE_CXX11_ABI=0 -fno-gpu-rdc
      [9/50] /opt/rocm-5.6.1/bin/hipcc  -I/tmp/tmp.2H7X7ZCvOC/flash-attention/csrc/flash_attn_rocm -I/tmp/tmp.2H7X7ZCvOC/flash-attention/csrc/flash_attn_rocm/src -I/tmp/tmp.2H7X7ZCvOC/flash-attention/csrc/flash_attn_rocm/composable_kernel/include -I/tmp/tmp.2H7X7ZCvOC/flash-attention/csrc/flash_attn_rocm/composable_kernel/library/include -I/opt/conda/envs/ptca/lib/python3.8/site-packages/torch/include -I/opt/conda/envs/ptca/lib/python3.8/site-packages/torch/include/torch/csrc/api/include -I/opt/conda/envs/ptca/lib/python3.8/site-packages/torch/include/TH -I/opt/conda/envs/ptca/lib/python3.8/site-packages/torch/include/THC -I/opt/conda/envs/ptca/lib/python3.8/site-packages/torch/include/THH -I/opt/rocm-5.6.1/include -I/opt/conda/envs/ptca/include/python3.8 -c -c /tmp/tmp.2H7X7ZCvOC/flash-attention/csrc/flash_attn_rocm/src/flash_fwd_runner_batched_hdim32_fp16_causal_gfx9x_hip.hip -o /tmp/tmp.2H7X7ZCvOC/flash-attention/build/temp.linux-x86_64-cpython-38/csrc/flash_attn_rocm/src/flash_fwd_runner_batched_hdim32_fp16_causal_gfx9x_hip.o -fPIC -D__HIP_PLATFORM_HCC__=1 -DUSE_ROCM=1 -DCUDA_HAS_FP16=1 -D__HIP_NO_HALF_OPERATORS__=1 -D__HIP_NO_HALF_CONVERSIONS__=1 -O3 -std=c++20 -DNDEBUG -U__CUDA_NO_HALF_OPERATORS__ -U__CUDA_NO_HALF_CONVERSIONS__ --offload-arch=gfx90a -DTORCH_API_INCLUDE_EXTENSION_H '-DPYBIND11_COMPILER_TYPE="_gcc"' '-DPYBIND11_STDLIB="_libstdcpp"' '-DPYBIND11_BUILD_ABI="_cxxabi1011"' -DTORCH_EXTENSION_NAME=flash_attn_2_cuda -D_GLIBCXX_USE_CXX11_ABI=0 -fno-gpu-rdc
      [10/50] /opt/rocm-5.6.1/bin/hipcc  -I/tmp/tmp.2H7X7ZCvOC/flash-attention/csrc/flash_attn_rocm -I/tmp/tmp.2H7X7ZCvOC/flash-attention/csrc/flash_attn_rocm/src -I/tmp/tmp.2H7X7ZCvOC/flash-attention/csrc/flash_attn_rocm/composable_kernel/include -I/tmp/tmp.2H7X7ZCvOC/flash-attention/csrc/flash_attn_rocm/composable_kernel/library/include -I/opt/conda/envs/ptca/lib/python3.8/site-packages/torch/include -I/opt/conda/envs/ptca/lib/python3.8/site-packages/torch/include/torch/csrc/api/include -I/opt/conda/envs/ptca/lib/python3.8/site-packages/torch/include/TH -I/opt/conda/envs/ptca/lib/python3.8/site-packages/torch/include/THC -I/opt/conda/envs/ptca/lib/python3.8/site-packages/torch/include/THH -I/opt/rocm-5.6.1/include -I/opt/conda/envs/ptca/include/python3.8 -c -c /tmp/tmp.2H7X7ZCvOC/flash-attention/csrc/flash_attn_rocm/src/flash_fwd_runner_grouped_hdim32_bf16_causal_gfx9x_hip.hip -o /tmp/tmp.2H7X7ZCvOC/flash-attention/build/temp.linux-x86_64-cpython-38/csrc/flash_attn_rocm/src/flash_fwd_runner_grouped_hdim32_bf16_causal_gfx9x_hip.o -fPIC -D__HIP_PLATFORM_HCC__=1 -DUSE_ROCM=1 -DCUDA_HAS_FP16=1 -D__HIP_NO_HALF_OPERATORS__=1 -D__HIP_NO_HALF_CONVERSIONS__=1 -O3 -std=c++20 -DNDEBUG -U__CUDA_NO_HALF_OPERATORS__ -U__CUDA_NO_HALF_CONVERSIONS__ --offload-arch=gfx90a -DTORCH_API_INCLUDE_EXTENSION_H '-DPYBIND11_COMPILER_TYPE="_gcc"' '-DPYBIND11_STDLIB="_libstdcpp"' '-DPYBIND11_BUILD_ABI="_cxxabi1011"' -DTORCH_EXTENSION_NAME=flash_attn_2_cuda -D_GLIBCXX_USE_CXX11_ABI=0 -fno-gpu-rdc
      [11/50] /opt/rocm-5.6.1/bin/hipcc  -I/tmp/tmp.2H7X7ZCvOC/flash-attention/csrc/flash_attn_rocm -I/tmp/tmp.2H7X7ZCvOC/flash-attention/csrc/flash_attn_rocm/src -I/tmp/tmp.2H7X7ZCvOC/flash-attention/csrc/flash_attn_rocm/composable_kernel/include -I/tmp/tmp.2H7X7ZCvOC/flash-attention/csrc/flash_attn_rocm/composable_kernel/library/include -I/opt/conda/envs/ptca/lib/python3.8/site-packages/torch/include -I/opt/conda/envs/ptca/lib/python3.8/site-packages/torch/include/torch/csrc/api/include -I/opt/conda/envs/ptca/lib/python3.8/site-packages/torch/include/TH -I/opt/conda/envs/ptca/lib/python3.8/site-packages/torch/include/THC -I/opt/conda/envs/ptca/lib/python3.8/site-packages/torch/include/THH -I/opt/rocm-5.6.1/include -I/opt/conda/envs/ptca/include/python3.8 -c -c /tmp/tmp.2H7X7ZCvOC/flash-attention/csrc/flash_attn_rocm/src/flash_fwd_runner_batched_hdim128_bf16_noncausal_gfx9x_hip.hip -o /tmp/tmp.2H7X7ZCvOC/flash-attention/build/temp.linux-x86_64-cpython-38/csrc/flash_attn_rocm/src/flash_fwd_runner_batched_hdim128_bf16_noncausal_gfx9x_hip.o -fPIC -D__HIP_PLATFORM_HCC__=1 -DUSE_ROCM=1 -DCUDA_HAS_FP16=1 -D__HIP_NO_HALF_OPERATORS__=1 -D__HIP_NO_HALF_CONVERSIONS__=1 -O3 -std=c++20 -DNDEBUG -U__CUDA_NO_HALF_OPERATORS__ -U__CUDA_NO_HALF_CONVERSIONS__ --offload-arch=gfx90a -DTORCH_API_INCLUDE_EXTENSION_H '-DPYBIND11_COMPILER_TYPE="_gcc"' '-DPYBIND11_STDLIB="_libstdcpp"' '-DPYBIND11_BUILD_ABI="_cxxabi1011"' -DTORCH_EXTENSION_NAME=flash_attn_2_cuda -D_GLIBCXX_USE_CXX11_ABI=0 -fno-gpu-rdc
      [12/50] /opt/rocm-5.6.1/bin/hipcc  -I/tmp/tmp.2H7X7ZCvOC/flash-attention/csrc/flash_attn_rocm -I/tmp/tmp.2H7X7ZCvOC/flash-attention/csrc/flash_attn_rocm/src -I/tmp/tmp.2H7X7ZCvOC/flash-attention/csrc/flash_attn_rocm/composable_kernel/include -I/tmp/tmp.2H7X7ZCvOC/flash-attention/csrc/flash_attn_rocm/composable_kernel/library/include -I/opt/conda/envs/ptca/lib/python3.8/site-packages/torch/include -I/opt/conda/envs/ptca/lib/python3.8/site-packages/torch/include/torch/csrc/api/include -I/opt/conda/envs/ptca/lib/python3.8/site-packages/torch/include/TH -I/opt/conda/envs/ptca/lib/python3.8/site-packages/torch/include/THC -I/opt/conda/envs/ptca/lib/python3.8/site-packages/torch/include/THH -I/opt/rocm-5.6.1/include -I/opt/conda/envs/ptca/include/python3.8 -c -c /tmp/tmp.2H7X7ZCvOC/flash-attention/csrc/flash_attn_rocm/src/flash_fwd_runner_batched_hdim128_fp16_noncausal_gfx9x_hip.hip -o /tmp/tmp.2H7X7ZCvOC/flash-attention/build/temp.linux-x86_64-cpython-38/csrc/flash_attn_rocm/src/flash_fwd_runner_batched_hdim128_fp16_noncausal_gfx9x_hip.o -fPIC -D__HIP_PLATFORM_HCC__=1 -DUSE_ROCM=1 -DCUDA_HAS_FP16=1 -D__HIP_NO_HALF_OPERATORS__=1 -D__HIP_NO_HALF_CONVERSIONS__=1 -O3 -std=c++20 -DNDEBUG -U__CUDA_NO_HALF_OPERATORS__ -U__CUDA_NO_HALF_CONVERSIONS__ --offload-arch=gfx90a -DTORCH_API_INCLUDE_EXTENSION_H '-DPYBIND11_COMPILER_TYPE="_gcc"' '-DPYBIND11_STDLIB="_libstdcpp"' '-DPYBIND11_BUILD_ABI="_cxxabi1011"' -DTORCH_EXTENSION_NAME=flash_attn_2_cuda -D_GLIBCXX_USE_CXX11_ABI=0 -fno-gpu-rdc
      [13/50] /opt/rocm-5.6.1/bin/hipcc  -I/tmp/tmp.2H7X7ZCvOC/flash-attention/csrc/flash_attn_rocm -I/tmp/tmp.2H7X7ZCvOC/flash-attention/csrc/flash_attn_rocm/src -I/tmp/tmp.2H7X7ZCvOC/flash-attention/csrc/flash_attn_rocm/composable_kernel/include -I/tmp/tmp.2H7X7ZCvOC/flash-attention/csrc/flash_attn_rocm/composable_kernel/library/include -I/opt/conda/envs/ptca/lib/python3.8/site-packages/torch/include -I/opt/conda/envs/ptca/lib/python3.8/site-packages/torch/include/torch/csrc/api/include -I/opt/conda/envs/ptca/lib/python3.8/site-packages/torch/include/TH -I/opt/conda/envs/ptca/lib/python3.8/site-packages/torch/include/THC -I/opt/conda/envs/ptca/lib/python3.8/site-packages/torch/include/THH -I/opt/rocm-5.6.1/include -I/opt/conda/envs/ptca/include/python3.8 -c -c /tmp/tmp.2H7X7ZCvOC/flash-attention/csrc/flash_attn_rocm/src/flash_fwd_runner_grouped_hdim128_fp16_noncausal_gfx9x_hip.hip -o /tmp/tmp.2H7X7ZCvOC/flash-attention/build/temp.linux-x86_64-cpython-38/csrc/flash_attn_rocm/src/flash_fwd_runner_grouped_hdim128_fp16_noncausal_gfx9x_hip.o -fPIC -D__HIP_PLATFORM_HCC__=1 -DUSE_ROCM=1 -DCUDA_HAS_FP16=1 -D__HIP_NO_HALF_OPERATORS__=1 -D__HIP_NO_HALF_CONVERSIONS__=1 -O3 -std=c++20 -DNDEBUG -U__CUDA_NO_HALF_OPERATORS__ -U__CUDA_NO_HALF_CONVERSIONS__ --offload-arch=gfx90a -DTORCH_API_INCLUDE_EXTENSION_H '-DPYBIND11_COMPILER_TYPE="_gcc"' '-DPYBIND11_STDLIB="_libstdcpp"' '-DPYBIND11_BUILD_ABI="_cxxabi1011"' -DTORCH_EXTENSION_NAME=flash_attn_2_cuda -D_GLIBCXX_USE_CXX11_ABI=0 -fno-gpu-rdc
      [14/50] /opt/rocm-5.6.1/bin/hipcc  -I/tmp/tmp.2H7X7ZCvOC/flash-attention/csrc/flash_attn_rocm -I/tmp/tmp.2H7X7ZCvOC/flash-attention/csrc/flash_attn_rocm/src -I/tmp/tmp.2H7X7ZCvOC/flash-attention/csrc/flash_attn_rocm/composable_kernel/include -I/tmp/tmp.2H7X7ZCvOC/flash-attention/csrc/flash_attn_rocm/composable_kernel/library/include -I/opt/conda/envs/ptca/lib/python3.8/site-packages/torch/include -I/opt/conda/envs/ptca/lib/python3.8/site-packages/torch/include/torch/csrc/api/include -I/opt/conda/envs/ptca/lib/python3.8/site-packages/torch/include/TH -I/opt/conda/envs/ptca/lib/python3.8/site-packages/torch/include/THC -I/opt/conda/envs/ptca/lib/python3.8/site-packages/torch/include/THH -I/opt/rocm-5.6.1/include -I/opt/conda/envs/ptca/include/python3.8 -c -c /tmp/tmp.2H7X7ZCvOC/flash-attention/csrc/flash_attn_rocm/src/flash_fwd_runner_grouped_hdim128_bf16_noncausal_gfx9x_hip.hip -o /tmp/tmp.2H7X7ZCvOC/flash-attention/build/temp.linux-x86_64-cpython-38/csrc/flash_attn_rocm/src/flash_fwd_runner_grouped_hdim128_bf16_noncausal_gfx9x_hip.o -fPIC -D__HIP_PLATFORM_HCC__=1 -DUSE_ROCM=1 -DCUDA_HAS_FP16=1 -D__HIP_NO_HALF_OPERATORS__=1 -D__HIP_NO_HALF_CONVERSIONS__=1 -O3 -std=c++20 -DNDEBUG -U__CUDA_NO_HALF_OPERATORS__ -U__CUDA_NO_HALF_CONVERSIONS__ --offload-arch=gfx90a -DTORCH_API_INCLUDE_EXTENSION_H '-DPYBIND11_COMPILER_TYPE="_gcc"' '-DPYBIND11_STDLIB="_libstdcpp"' '-DPYBIND11_BUILD_ABI="_cxxabi1011"' -DTORCH_EXTENSION_NAME=flash_attn_2_cuda -D_GLIBCXX_USE_CXX11_ABI=0 -fno-gpu-rdc
      [15/50] /opt/rocm-5.6.1/bin/hipcc  -I/tmp/tmp.2H7X7ZCvOC/flash-attention/csrc/flash_attn_rocm -I/tmp/tmp.2H7X7ZCvOC/flash-attention/csrc/flash_attn_rocm/src -I/tmp/tmp.2H7X7ZCvOC/flash-attention/csrc/flash_attn_rocm/composable_kernel/include -I/tmp/tmp.2H7X7ZCvOC/flash-attention/csrc/flash_attn_rocm/composable_kernel/library/include -I/opt/conda/envs/ptca/lib/python3.8/site-packages/torch/include -I/opt/conda/envs/ptca/lib/python3.8/site-packages/torch/include/torch/csrc/api/include -I/opt/conda/envs/ptca/lib/python3.8/site-packages/torch/include/TH -I/opt/conda/envs/ptca/lib/python3.8/site-packages/torch/include/THC -I/opt/conda/envs/ptca/lib/python3.8/site-packages/torch/include/THH -I/opt/rocm-5.6.1/include -I/opt/conda/envs/ptca/include/python3.8 -c -c /tmp/tmp.2H7X7ZCvOC/flash-attention/csrc/flash_attn_rocm/src/flash_fwd_runner_batched_hdim128_bf16_causal_gfx9x_hip.hip -o /tmp/tmp.2H7X7ZCvOC/flash-attention/build/temp.linux-x86_64-cpython-38/csrc/flash_attn_rocm/src/flash_fwd_runner_batched_hdim128_bf16_causal_gfx9x_hip.o -fPIC -D__HIP_PLATFORM_HCC__=1 -DUSE_ROCM=1 -DCUDA_HAS_FP16=1 -D__HIP_NO_HALF_OPERATORS__=1 -D__HIP_NO_HALF_CONVERSIONS__=1 -O3 -std=c++20 -DNDEBUG -U__CUDA_NO_HALF_OPERATORS__ -U__CUDA_NO_HALF_CONVERSIONS__ --offload-arch=gfx90a -DTORCH_API_INCLUDE_EXTENSION_H '-DPYBIND11_COMPILER_TYPE="_gcc"' '-DPYBIND11_STDLIB="_libstdcpp"' '-DPYBIND11_BUILD_ABI="_cxxabi1011"' -DTORCH_EXTENSION_NAME=flash_attn_2_cuda -D_GLIBCXX_USE_CXX11_ABI=0 -fno-gpu-rdc
      [16/50] /opt/rocm-5.6.1/bin/hipcc  -I/tmp/tmp.2H7X7ZCvOC/flash-attention/csrc/flash_attn_rocm -I/tmp/tmp.2H7X7ZCvOC/flash-attention/csrc/flash_attn_rocm/src -I/tmp/tmp.2H7X7ZCvOC/flash-attention/csrc/flash_attn_rocm/composable_kernel/include -I/tmp/tmp.2H7X7ZCvOC/flash-attention/csrc/flash_attn_rocm/composable_kernel/library/include -I/opt/conda/envs/ptca/lib/python3.8/site-packages/torch/include -I/opt/conda/envs/ptca/lib/python3.8/site-packages/torch/include/torch/csrc/api/include -I/opt/conda/envs/ptca/lib/python3.8/site-packages/torch/include/TH -I/opt/conda/envs/ptca/lib/python3.8/site-packages/torch/include/THC -I/opt/conda/envs/ptca/lib/python3.8/site-packages/torch/include/THH -I/opt/rocm-5.6.1/include -I/opt/conda/envs/ptca/include/python3.8 -c -c /tmp/tmp.2H7X7ZCvOC/flash-attention/csrc/flash_attn_rocm/src/flash_bwd_runner_batched_hdim32_bf16_causal_gfx9x_hip.hip -o /tmp/tmp.2H7X7ZCvOC/flash-attention/build/temp.linux-x86_64-cpython-38/csrc/flash_attn_rocm/src/flash_bwd_runner_batched_hdim32_bf16_causal_gfx9x_hip.o -fPIC -D__HIP_PLATFORM_HCC__=1 -DUSE_ROCM=1 -DCUDA_HAS_FP16=1 -D__HIP_NO_HALF_OPERATORS__=1 -D__HIP_NO_HALF_CONVERSIONS__=1 -O3 -std=c++20 -DNDEBUG -U__CUDA_NO_HALF_OPERATORS__ -U__CUDA_NO_HALF_CONVERSIONS__ --offload-arch=gfx90a -DTORCH_API_INCLUDE_EXTENSION_H '-DPYBIND11_COMPILER_TYPE="_gcc"' '-DPYBIND11_STDLIB="_libstdcpp"' '-DPYBIND11_BUILD_ABI="_cxxabi1011"' -DTORCH_EXTENSION_NAME=flash_attn_2_cuda -D_GLIBCXX_USE_CXX11_ABI=0 -fno-gpu-rdc
      FAILED: /tmp/tmp.2H7X7ZCvOC/flash-attention/build/temp.linux-x86_64-cpython-38/csrc/flash_attn_rocm/src/flash_bwd_runner_batched_hdim32_bf16_causal_gfx9x_hip.o

38/csrc/flash_attn_rocm/src/flash_bwd_runner_grouped_hdim32_bf16_noncausal_gfx9x_hip.o -fPIC -D__HIP_PLATFORM_HCC__=1 -DUSE_ROCM=1 -DCUDA_HAS_FP16=1 -D__HIP_NO_HALF_OPERATORS__=1 -D__HIP_NO_HALF_CONVERSIONS__=1 -O3 -std=c++20 -DNDEBUG -U__CUDA_NO_HALF_OPERATORS__ -U__CUDA_NO_HALF_CONVERSIONS__ --offload-arch=gfx90a -DTORCH_API_INCLUDE_EXTENSION_H '-DPYBIND11_COMPILER_TYPE="_gcc"' '-DPYBIND11_STDLIB="_libstdcpp"' '-DPYBIND11_BUILD_ABI="_cxxabi1011"' -DTORCH_EXTENSION_NAME=flash_attn_2_cuda -D_GLIBCXX_USE_CXX11_ABI=0 -fno-gpu-rdc
      [45/50] /opt/rocm-5.6.1/bin/hipcc  -I/tmp/tmp.2H7X7ZCvOC/flash-attention/csrc/flash_attn_rocm -I/tmp/tmp.2H7X7ZCvOC/flash-attention/csrc/flash_attn_rocm/src -I/tmp/tmp.2H7X7ZCvOC/flash-attention/csrc/flash_attn_rocm/composable_kernel/include -I/tmp/tmp.2H7X7ZCvOC/flash-attention/csrc/flash_attn_rocm/composable_kernel/library/include -I/opt/conda/envs/ptca/lib/python3.8/site-packages/torch/include -I/opt/conda/envs/ptca/lib/python3.8/site-packages/torch/include/torch/csrc/api/include -I/opt/conda/envs/ptca/lib/python3.8/site-packages/torch/include/TH -I/opt/conda/envs/ptca/lib/python3.8/site-packages/torch/include/THC -I/opt/conda/envs/ptca/lib/python3.8/site-packages/torch/include/THH -I/opt/rocm-5.6.1/include -I/opt/conda/envs/ptca/include/python3.8 -c -c /tmp/tmp.2H7X7ZCvOC/flash-attention/csrc/flash_attn_rocm/src/flash_bwd_runner_batched_hdim128_bf16_noncausal_gfx9x_hip.hip -o /tmp/tmp.2H7X7ZCvOC/flash-attention/build/temp.linux-x86_64-cpython-38/csrc/flash_attn_rocm/src/flash_bwd_runner_batched_hdim128_bf16_noncausal_gfx9x_hip.o -fPIC -D__HIP_PLATFORM_HCC__=1 -DUSE_ROCM=1 -DCUDA_HAS_FP16=1 -D__HIP_NO_HALF_OPERATORS__=1 -D__HIP_NO_HALF_CONVERSIONS__=1 -O3 -std=c++20 -DNDEBUG -U__CUDA_NO_HALF_OPERATORS__ -U__CUDA_NO_HALF_CONVERSIONS__ --offload-arch=gfx90a -DTORCH_API_INCLUDE_EXTENSION_H '-DPYBIND11_COMPILER_TYPE="_gcc"' '-DPYBIND11_STDLIB="_libstdcpp"' '-DPYBIND11_BUILD_ABI="_cxxabi1011"' -DTORCH_EXTENSION_NAME=flash_attn_2_cuda -D_GLIBCXX_USE_CXX11_ABI=0 -fno-gpu-rdc
      [46/50] /opt/rocm-5.6.1/bin/hipcc  -I/tmp/tmp.2H7X7ZCvOC/flash-attention/csrc/flash_attn_rocm -I/tmp/tmp.2H7X7ZCvOC/flash-attention/csrc/flash_attn_rocm/src -I/tmp/tmp.2H7X7ZCvOC/flash-attention/csrc/flash_attn_rocm/composable_kernel/include -I/tmp/tmp.2H7X7ZCvOC/flash-attention/csrc/flash_attn_rocm/composable_kernel/library/include -I/opt/conda/envs/ptca/lib/python3.8/site-packages/torch/include -I/opt/conda/envs/ptca/lib/python3.8/site-packages/torch/include/torch/csrc/api/include -I/opt/conda/envs/ptca/lib/python3.8/site-packages/torch/include/TH -I/opt/conda/envs/ptca/lib/python3.8/site-packages/torch/include/THC -I/opt/conda/envs/ptca/lib/python3.8/site-packages/torch/include/THH -I/opt/rocm-5.6.1/include -I/opt/conda/envs/ptca/include/python3.8 -c -c /tmp/tmp.2H7X7ZCvOC/flash-attention/csrc/flash_attn_rocm/src/flash_bwd_runner_grouped_hdim32_fp16_noncausal_gfx9x_hip.hip -o /tmp/tmp.2H7X7ZCvOC/flash-attention/build/temp.linux-x86_64-cpython-38/csrc/flash_attn_rocm/src/flash_bwd_runner_grouped_hdim32_fp16_noncausal_gfx9x_hip.o -fPIC -D__HIP_PLATFORM_HCC__=1 -DUSE_ROCM=1 -DCUDA_HAS_FP16=1 -D__HIP_NO_HALF_OPERATORS__=1 -D__HIP_NO_HALF_CONVERSIONS__=1 -O3 -std=c++20 -DNDEBUG -U__CUDA_NO_HALF_OPERATORS__ -U__CUDA_NO_HALF_CONVERSIONS__ --offload-arch=gfx90a -DTORCH_API_INCLUDE_EXTENSION_H '-DPYBIND11_COMPILER_TYPE="_gcc"' '-DPYBIND11_STDLIB="_libstdcpp"' '-DPYBIND11_BUILD_ABI="_cxxabi1011"' -DTORCH_EXTENSION_NAME=flash_attn_2_cuda -D_GLIBCXX_USE_CXX11_ABI=0 -fno-gpu-rdc
      [47/50] /opt/rocm-5.6.1/bin/hipcc  -I/tmp/tmp.2H7X7ZCvOC/flash-attention/csrc/flash_attn_rocm -I/tmp/tmp.2H7X7ZCvOC/flash-attention/csrc/flash_attn_rocm/src -I/tmp/tmp.2H7X7ZCvOC/flash-attention/csrc/flash_attn_rocm/composable_kernel/include -I/tmp/tmp.2H7X7ZCvOC/flash-attention/csrc/flash_attn_rocm/composable_kernel/library/include -I/opt/conda/envs/ptca/lib/python3.8/site-packages/torch/include -I/opt/conda/envs/ptca/lib/python3.8/site-packages/torch/include/torch/csrc/api/include -I/opt/conda/envs/ptca/lib/python3.8/site-packages/torch/include/TH -I/opt/conda/envs/ptca/lib/python3.8/site-packages/torch/include/THC -I/opt/conda/envs/ptca/lib/python3.8/site-packages/torch/include/THH -I/opt/rocm-5.6.1/include -I/opt/conda/envs/ptca/include/python3.8 -c -c /tmp/tmp.2H7X7ZCvOC/flash-attention/csrc/flash_attn_rocm/src/flash_bwd_runner_grouped_hdim64_bf16_noncausal_gfx9x_hip.hip -o /tmp/tmp.2H7X7ZCvOC/flash-attention/build/temp.linux-x86_64-cpython-38/csrc/flash_attn_rocm/src/flash_bwd_runner_grouped_hdim64_bf16_noncausal_gfx9x_hip.o -fPIC -D__HIP_PLATFORM_HCC__=1 -DUSE_ROCM=1 -DCUDA_HAS_FP16=1 -D__HIP_NO_HALF_OPERATORS__=1 -D__HIP_NO_HALF_CONVERSIONS__=1 -O3 -std=c++20 -DNDEBUG -U__CUDA_NO_HALF_OPERATORS__ -U__CUDA_NO_HALF_CONVERSIONS__ --offload-arch=gfx90a -DTORCH_API_INCLUDE_EXTENSION_H '-DPYBIND11_COMPILER_TYPE="_gcc"' '-DPYBIND11_STDLIB="_libstdcpp"' '-DPYBIND11_BUILD_ABI="_cxxabi1011"' -DTORCH_EXTENSION_NAME=flash_attn_2_cuda -D_GLIBCXX_USE_CXX11_ABI=0 -fno-gpu-rdc
      [48/50] /opt/rocm-5.6.1/bin/hipcc  -I/tmp/tmp.2H7X7ZCvOC/flash-attention/csrc/flash_attn_rocm -I/tmp/tmp.2H7X7ZCvOC/flash-attention/csrc/flash_attn_rocm/src -I/tmp/tmp.2H7X7ZCvOC/flash-attention/csrc/flash_attn_rocm/composable_kernel/include -I/tmp/tmp.2H7X7ZCvOC/flash-attention/csrc/flash_attn_rocm/composable_kernel/library/include -I/opt/conda/envs/ptca/lib/python3.8/site-packages/torch/include -I/opt/conda/envs/ptca/lib/python3.8/site-packages/torch/include/torch/csrc/api/include -I/opt/conda/envs/ptca/lib/python3.8/site-packages/torch/include/TH -I/opt/conda/envs/ptca/lib/python3.8/site-packages/torch/include/THC -I/opt/conda/envs/ptca/lib/python3.8/site-packages/torch/include/THH -I/opt/rocm-5.6.1/include -I/opt/conda/envs/ptca/include/python3.8 -c -c /tmp/tmp.2H7X7ZCvOC/flash-attention/csrc/flash_attn_rocm/src/flash_bwd_runner_batched_hdim128_fp16_noncausal_gfx9x_hip.hip -o /tmp/tmp.2H7X7ZCvOC/flash-attention/build/temp.linux-x86_64-cpython-38/csrc/flash_attn_rocm/src/flash_bwd_runner_batched_hdim128_fp16_noncausal_gfx9x_hip.o -fPIC -D__HIP_PLATFORM_HCC__=1 -DUSE_ROCM=1 -DCUDA_HAS_FP16=1 -D__HIP_NO_HALF_OPERATORS__=1 -D__HIP_NO_HALF_CONVERSIONS__=1 -O3 -std=c++20 -DNDEBUG -U__CUDA_NO_HALF_OPERATORS__ -U__CUDA_NO_HALF_CONVERSIONS__ --offload-arch=gfx90a -DTORCH_API_INCLUDE_EXTENSION_H '-DPYBIND11_COMPILER_TYPE="_gcc"' '-DPYBIND11_STDLIB="_libstdcpp"' '-DPYBIND11_BUILD_ABI="_cxxabi1011"' -DTORCH_EXTENSION_NAME=flash_attn_2_cuda -D_GLIBCXX_USE_CXX11_ABI=0 -fno-gpu-rdc
      [49/50] /opt/rocm-5.6.1/bin/hipcc  -I/tmp/tmp.2H7X7ZCvOC/flash-attention/csrc/flash_attn_rocm -I/tmp/tmp.2H7X7ZCvOC/flash-attention/csrc/flash_attn_rocm/src -I/tmp/tmp.2H7X7ZCvOC/flash-attention/csrc/flash_attn_rocm/composable_kernel/include -I/tmp/tmp.2H7X7ZCvOC/flash-attention/csrc/flash_attn_rocm/composable_kernel/library/include -I/opt/conda/envs/ptca/lib/python3.8/site-packages/torch/include -I/opt/conda/envs/ptca/lib/python3.8/site-packages/torch/include/torch/csrc/api/include -I/opt/conda/envs/ptca/lib/python3.8/site-packages/torch/include/TH -I/opt/conda/envs/ptca/lib/python3.8/site-packages/torch/include/THC -I/opt/conda/envs/ptca/lib/python3.8/site-packages/torch/include/THH -I/opt/rocm-5.6.1/include -I/opt/conda/envs/ptca/include/python3.8 -c -c /tmp/tmp.2H7X7ZCvOC/flash-attention/csrc/flash_attn_rocm/src/flash_bwd_runner_grouped_hdim128_fp16_noncausal_gfx9x_hip.hip -o /tmp/tmp.2H7X7ZCvOC/flash-attention/build/temp.linux-x86_64-cpython-38/csrc/flash_attn_rocm/src/flash_bwd_runner_grouped_hdim128_fp16_noncausal_gfx9x_hip.o -fPIC -D__HIP_PLATFORM_HCC__=1 -DUSE_ROCM=1 -DCUDA_HAS_FP16=1 -D__HIP_NO_HALF_OPERATORS__=1 -D__HIP_NO_HALF_CONVERSIONS__=1 -O3 -std=c++20 -DNDEBUG -U__CUDA_NO_HALF_OPERATORS__ -U__CUDA_NO_HALF_CONVERSIONS__ --offload-arch=gfx90a -DTORCH_API_INCLUDE_EXTENSION_H '-DPYBIND11_COMPILER_TYPE="_gcc"' '-DPYBIND11_STDLIB="_libstdcpp"' '-DPYBIND11_BUILD_ABI="_cxxabi1011"' -DTORCH_EXTENSION_NAME=flash_attn_2_cuda -D_GLIBCXX_USE_CXX11_ABI=0 -fno-gpu-rdc
      [50/50] /opt/rocm-5.6.1/bin/hipcc  -I/tmp/tmp.2H7X7ZCvOC/flash-attention/csrc/flash_attn_rocm -I/tmp/tmp.2H7X7ZCvOC/flash-attention/csrc/flash_attn_rocm/src -I/tmp/tmp.2H7X7ZCvOC/flash-attention/csrc/flash_attn_rocm/composable_kernel/include -I/tmp/tmp.2H7X7ZCvOC/flash-attention/csrc/flash_attn_rocm/composable_kernel/library/include -I/opt/conda/envs/ptca/lib/python3.8/site-packages/torch/include -I/opt/conda/envs/ptca/lib/python3.8/site-packages/torch/include/torch/csrc/api/include -I/opt/conda/envs/ptca/lib/python3.8/site-packages/torch/include/TH -I/opt/conda/envs/ptca/lib/python3.8/site-packages/torch/include/THC -I/opt/conda/envs/ptca/lib/python3.8/site-packages/torch/include/THH -I/opt/rocm-5.6.1/include -I/opt/conda/envs/ptca/include/python3.8 -c -c /tmp/tmp.2H7X7ZCvOC/flash-attention/csrc/flash_attn_rocm/src/flash_bwd_runner_grouped_hdim64_fp16_noncausal_gfx9x_hip.hip -o /tmp/tmp.2H7X7ZCvOC/flash-attention/build/temp.linux-x86_64-cpython-38/csrc/flash_attn_rocm/src/flash_bwd_runner_grouped_hdim64_fp16_noncausal_gfx9x_hip.o -fPIC -D__HIP_PLATFORM_HCC__=1 -DUSE_ROCM=1 -DCUDA_HAS_FP16=1 -D__HIP_NO_HALF_OPERATORS__=1 -D__HIP_NO_HALF_CONVERSIONS__=1 -O3 -std=c++20 -DNDEBUG -U__CUDA_NO_HALF_OPERATORS__ -U__CUDA_NO_HALF_CONVERSIONS__ --offload-arch=gfx90a -DTORCH_API_INCLUDE_EXTENSION_H '-DPYBIND11_COMPILER_TYPE="_gcc"' '-DPYBIND11_STDLIB="_libstdcpp"' '-DPYBIND11_BUILD_ABI="_cxxabi1011"' -DTORCH_EXTENSION_NAME=flash_attn_2_cuda -D_GLIBCXX_USE_CXX11_ABI=0 -fno-gpu-rdc
      ninja: build stopped: subcommand failed.
      Traceback (most recent call last):
        File "/opt/conda/envs/ptca/lib/python3.8/site-packages/torch/utils/cpp_extension.py", line 1888, in _run_ninja_build
          subprocess.run(
        File "/opt/conda/envs/ptca/lib/python3.8/subprocess.py", line 516, in run
          raise CalledProcessError(retcode, process.args,
      subprocess.CalledProcessError: Command '['ninja', '-v']' returned non-zero exit status 1.

      The above exception was the direct cause of the following exception:

      Traceback (most recent call last):
        File "<string>", line 2, in <module>
        File "<pip-setuptools-caller>", line 34, in <module>
        File "/tmp/tmp.2H7X7ZCvOC/flash-attention/setup.py", line 312, in <module>
          setup(
        File "/opt/conda/envs/ptca/lib/python3.8/site-packages/setuptools/__init__.py", line 108, in setup
          return distutils.core.setup(**attrs)
        File "/opt/conda/envs/ptca/lib/python3.8/site-packages/setuptools/_distutils/core.py", line 185, in setup
          return run_commands(dist)
        File "/opt/conda/envs/ptca/lib/python3.8/site-packages/setuptools/_distutils/core.py", line 201, in run_commands
          dist.run_commands()
        File "/opt/conda/envs/ptca/lib/python3.8/site-packages/setuptools/_distutils/dist.py", line 969, in run_commands
          self.run_command(cmd)
        File "/opt/conda/envs/ptca/lib/python3.8/site-packages/setuptools/dist.py", line 1221, in run_command
          super().run_command(command)
        File "/opt/conda/envs/ptca/lib/python3.8/site-packages/setuptools/_distutils/dist.py", line 988, in run_command
          cmd_obj.run()
        File "/opt/conda/envs/ptca/lib/python3.8/site-packages/wheel/bdist_wheel.py", line 343, in run
          self.run_command("build")
        File "/opt/conda/envs/ptca/lib/python3.8/site-packages/setuptools/_distutils/cmd.py", line 318, in run_command
          self.distribution.run_command(command)
        File "/opt/conda/envs/ptca/lib/python3.8/site-packages/setuptools/dist.py", line 1221, in run_command
          super().run_command(command)
        File "/opt/conda/envs/ptca/lib/python3.8/site-packages/setuptools/_distutils/dist.py", line 988, in run_command
          cmd_obj.run()
        File "/opt/conda/envs/ptca/lib/python3.8/site-packages/setuptools/_distutils/command/build.py", line 131, in run
          self.run_command(cmd_name)
        File "/opt/conda/envs/ptca/lib/python3.8/site-packages/setuptools/_distutils/cmd.py", line 318, in run_command
          self.distribution.run_command(command)
        File "/opt/conda/envs/ptca/lib/python3.8/site-packages/setuptools/dist.py", line 1221, in run_command
          super().run_command(command)
        File "/opt/conda/envs/ptca/lib/python3.8/site-packages/setuptools/_distutils/dist.py", line 988, in run_command
          cmd_obj.run()
        File "/opt/conda/envs/ptca/lib/python3.8/site-packages/setuptools/command/build_ext.py", line 84, in run
          _build_ext.run(self)
        File "/opt/conda/envs/ptca/lib/python3.8/site-packages/setuptools/_distutils/command/build_ext.py", line 345, in run
          self.build_extensions()
        File "/opt/conda/envs/ptca/lib/python3.8/site-packages/torch/utils/cpp_extension.py", line 842, in build_extensions
          build_ext.build_extensions(self)
        File "/opt/conda/envs/ptca/lib/python3.8/site-packages/setuptools/_distutils/command/build_ext.py", line 467, in build_extensions
          self._build_extensions_serial()
        File "/opt/conda/envs/ptca/lib/python3.8/site-packages/setuptools/_distutils/command/build_ext.py", line 493, in _build_extensions_serial
          self.build_extension(ext)
        File "/opt/conda/envs/ptca/lib/python3.8/site-packages/setuptools/command/build_ext.py", line 246, in build_extension
          _build_ext.build_extension(self, ext)
        File "/opt/conda/envs/ptca/lib/python3.8/site-packages/Cython/Distutils/build_ext.py", line 135, in build_extension
          super(build_ext, self).build_extension(ext)
        File "/opt/conda/envs/ptca/lib/python3.8/site-packages/setuptools/_distutils/command/build_ext.py", line 548, in build_extension
          objects = self.compiler.compile(
        File "/opt/conda/envs/ptca/lib/python3.8/site-packages/torch/utils/cpp_extension.py", line 657, in unix_wrap_ninja_compile
          _write_ninja_file_and_compile_objects(
        File "/opt/conda/envs/ptca/lib/python3.8/site-packages/torch/utils/cpp_extension.py", line 1569, in _write_ninja_file_and_compile_objects
          _run_ninja_build(
        File "/opt/conda/envs/ptca/lib/python3.8/site-packages/torch/utils/cpp_extension.py", line 1904, in _run_ninja_build
          raise RuntimeError(message) from e
      RuntimeError: Error compiling objects for extension
      [end of output]

  note: This error originates from a subprocess, and is likely not a problem with pip.
  ERROR: Failed building wheel for flash-attn
  Running setup.py clean for flash-attn
Failed to build flash-attn
ERROR: Could not build wheels for flash-attn, which is required to install pyproject.toml-based projects
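
When ninja runs many jobs in parallel, the first real compiler diagnostic is easy to lose in the log. A minimal sketch of a quieter, verbose rebuild from inside the flash-attention checkout (MAX_JOBS is honored by PyTorch's C++-extension builder):

    # Serialize compilation so each failure sits next to its diagnostics,
    # and keep the full output for inspection.
    cd flash-attention
    MAX_JOBS=1 pip install -v . 2>&1 | tee build.log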
donglixp commented 6 months ago
lash_attn_rocm/src/flash_bwd_runner_batched_hdim32_bf16_noncausal_gfx9x_hip.hip -o /tmp/tmp.2H7X7ZCvOC/flash-attention/build/temp.linux-x86_64-cpython-38/csrc/flash_attn_rocm/src/flash_bwd_runner_batched_hdim32_bf16_noncausal_gfx9x_hip.o -fPIC -D__HIP_PLATFORM_HCC__=1 -DUSE_ROCM=1 -DCUDA_HAS_FP16=1 -D__HIP_NO_HALF_OPERATORS__=1 -D__HIP_NO_HALF_CONVERSIONS__=1 -O3 -std=c++20 -DNDEBUG -U__CUDA_NO_HALF_OPERATORS__ -U__CUDA_NO_HALF_CONVERSIONS__ --offload-arch=gfx90a -DTORCH_API_INCLUDE_EXTENSION_H '-DPYBIND11_COMPILER_TYPE="_gcc"' '-DPYBIND11_STDLIB="_libstdcpp"' '-DPYBIND11_BUILD_ABI="_cxxabi1011"' -DTORCH_EXTENSION_NAME=flash_attn_2_cuda -D_GLIBCXX_USE_CXX11_ABI=0 -fno-gpu-rdc
      [30/50] /opt/rocm-5.6.1/bin/hipcc  -I/tmp/tmp.2H7X7ZCvOC/flash-attention/csrc/flash_attn_rocm -I/tmp/tmp.2H7X7ZCvOC/flash-attention/csrc/flash_attn_rocm/src -I/tmp/tmp.2H7X7ZCvOC/flash-attention/csrc/flash_attn_rocm/composable_kernel/include -I/tmp/tmp.2H7X7ZCvOC/flash-attention/csrc/flash_attn_rocm/composable_kernel/library/include -I/opt/conda/envs/ptca/lib/python3.8/site-packages/torch/include -I/opt/conda/envs/ptca/lib/python3.8/site-packages/torch/include/torch/csrc/api/include -I/opt/conda/envs/ptca/lib/python3.8/site-packages/torch/include/TH -I/opt/conda/envs/ptca/lib/python3.8/site-packages/torch/include/THC -I/opt/conda/envs/ptca/lib/python3.8/site-packages/torch/include/THH -I/opt/rocm-5.6.1/include -I/opt/conda/envs/ptca/include/python3.8 -c -c /tmp/tmp.2H7X7ZCvOC/flash-attention/csrc/flash_attn_rocm/src/flash_fwd_runner_batched_hdim64_fp16_noncausal_gfx9x_hip.hip -o /tmp/tmp.2H7X7ZCvOC/flash-attention/build/temp.linux-x86_64-cpython-38/csrc/flash_attn_rocm/src/flash_fwd_runner_batched_hdim64_fp16_noncausal_gfx9x_hip.o -fPIC -D__HIP_PLATFORM_HCC__=1 -DUSE_ROCM=1 -DCUDA_HAS_FP16=1 -D__HIP_NO_HALF_OPERATORS__=1 -D__HIP_NO_HALF_CONVERSIONS__=1 -O3 -std=c++20 -DNDEBUG -U__CUDA_NO_HALF_OPERATORS__ -U__CUDA_NO_HALF_CONVERSIONS__ --offload-arch=gfx90a -DTORCH_API_INCLUDE_EXTENSION_H '-DPYBIND11_COMPILER_TYPE="_gcc"' '-DPYBIND11_STDLIB="_libstdcpp"' '-DPYBIND11_BUILD_ABI="_cxxabi1011"' -DTORCH_EXTENSION_NAME=flash_attn_2_cuda -D_GLIBCXX_USE_CXX11_ABI=0 -fno-gpu-rdc
      [31/50] /opt/rocm-5.6.1/bin/hipcc  -I/tmp/tmp.2H7X7ZCvOC/flash-attention/csrc/flash_attn_rocm -I/tmp/tmp.2H7X7ZCvOC/flash-attention/csrc/flash_attn_rocm/src -I/tmp/tmp.2H7X7ZCvOC/flash-attention/csrc/flash_attn_rocm/composable_kernel/include -I/tmp/tmp.2H7X7ZCvOC/flash-attention/csrc/flash_attn_rocm/composable_kernel/library/include -I/opt/conda/envs/ptca/lib/python3.8/site-packages/torch/include -I/opt/conda/envs/ptca/lib/python3.8/site-packages/torch/include/torch/csrc/api/include -I/opt/conda/envs/ptca/lib/python3.8/site-packages/torch/include/TH -I/opt/conda/envs/ptca/lib/python3.8/site-packages/torch/include/THC -I/opt/conda/envs/ptca/lib/python3.8/site-packages/torch/include/THH -I/opt/rocm-5.6.1/include -I/opt/conda/envs/ptca/include/python3.8 -c -c /tmp/tmp.2H7X7ZCvOC/flash-attention/csrc/flash_attn_rocm/src/flash_fwd_runner_grouped_hdim64_bf16_noncausal_gfx9x_hip.hip -o /tmp/tmp.2H7X7ZCvOC/flash-attention/build/temp.linux-x86_64-cpython-38/csrc/flash_attn_rocm/src/flash_fwd_runner_grouped_hdim64_bf16_noncausal_gfx9x_hip.o -fPIC -D__HIP_PLATFORM_HCC__=1 -DUSE_ROCM=1 -DCUDA_HAS_FP16=1 -D__HIP_NO_HALF_OPERATORS__=1 -D__HIP_NO_HALF_CONVERSIONS__=1 -O3 -std=c++20 -DNDEBUG -U__CUDA_NO_HALF_OPERATORS__ -U__CUDA_NO_HALF_CONVERSIONS__ --offload-arch=gfx90a -DTORCH_API_INCLUDE_EXTENSION_H '-DPYBIND11_COMPILER_TYPE="_gcc"' '-DPYBIND11_STDLIB="_libstdcpp"' '-DPYBIND11_BUILD_ABI="_cxxabi1011"' -DTORCH_EXTENSION_NAME=flash_attn_2_cuda -D_GLIBCXX_USE_CXX11_ABI=0 -fno-gpu-rdc
      [32/50] /opt/rocm-5.6.1/bin/hipcc  -I/tmp/tmp.2H7X7ZCvOC/flash-attention/csrc/flash_attn_rocm -I/tmp/tmp.2H7X7ZCvOC/flash-attention/csrc/flash_attn_rocm/src -I/tmp/tmp.2H7X7ZCvOC/flash-attention/csrc/flash_attn_rocm/composable_kernel/include -I/tmp/tmp.2H7X7ZCvOC/flash-attention/csrc/flash_attn_rocm/composable_kernel/library/include -I/opt/conda/envs/ptca/lib/python3.8/site-packages/torch/include -I/opt/conda/envs/ptca/lib/python3.8/site-packages/torch/include/torch/csrc/api/include -I/opt/conda/envs/ptca/lib/python3.8/site-packages/torch/include/TH -I/opt/conda/envs/ptca/lib/python3.8/site-packages/torch/include/THC -I/opt/conda/envs/ptca/lib/python3.8/site-packages/torch/include/THH -I/opt/rocm-5.6.1/include -I/opt/conda/envs/ptca/include/python3.8 -c -c /tmp/tmp.2H7X7ZCvOC/flash-attention/csrc/flash_attn_rocm/src/flash_fwd_runner_batched_hdim64_bf16_causal_gfx9x_hip.hip -o /tmp/tmp.2H7X7ZCvOC/flash-attention/build/temp.linux-x86_64-cpython-38/csrc/flash_attn_rocm/src/flash_fwd_runner_batched_hdim64_bf16_causal_gfx9x_hip.o -fPIC -D__HIP_PLATFORM_HCC__=1 -DUSE_ROCM=1 -DCUDA_HAS_FP16=1 -D__HIP_NO_HALF_OPERATORS__=1 -D__HIP_NO_HALF_CONVERSIONS__=1 -O3 -std=c++20 -DNDEBUG -U__CUDA_NO_HALF_OPERATORS__ -U__CUDA_NO_HALF_CONVERSIONS__ --offload-arch=gfx90a -DTORCH_API_INCLUDE_EXTENSION_H '-DPYBIND11_COMPILER_TYPE="_gcc"' '-DPYBIND11_STDLIB="_libstdcpp"' '-DPYBIND11_BUILD_ABI="_cxxabi1011"' -DTORCH_EXTENSION_NAME=flash_attn_2_cuda -D_GLIBCXX_USE_CXX11_ABI=0 -fno-gpu-rdc
      [33/50] /opt/rocm-5.6.1/bin/hipcc  -I/tmp/tmp.2H7X7ZCvOC/flash-attention/csrc/flash_attn_rocm -I/tmp/tmp.2H7X7ZCvOC/flash-attention/csrc/flash_attn_rocm/src -I/tmp/tmp.2H7X7ZCvOC/flash-attention/csrc/flash_attn_rocm/composable_kernel/include -I/tmp/tmp.2H7X7ZCvOC/flash-attention/csrc/flash_attn_rocm/composable_kernel/library/include -I/opt/conda/envs/ptca/lib/python3.8/site-packages/torch/include -I/opt/conda/envs/ptca/lib/python3.8/site-packages/torch/include/torch/csrc/api/include -I/opt/conda/envs/ptca/lib/python3.8/site-packages/torch/include/TH -I/opt/conda/envs/ptca/lib/python3.8/site-packages/torch/include/THC -I/opt/conda/envs/ptca/lib/python3.8/site-packages/torch/include/THH -I/opt/rocm-5.6.1/include -I/opt/conda/envs/ptca/include/python3.8 -c -c /tmp/tmp.2H7X7ZCvOC/flash-attention/csrc/flash_attn_rocm/src/flash_bwd_runner_batched_hdim32_fp16_noncausal_gfx9x_hip.hip -o /tmp/tmp.2H7X7ZCvOC/flash-attention/build/temp.linux-x86_64-cpython-38/csrc/flash_attn_rocm/src/flash_bwd_runner_batched_hdim32_fp16_noncausal_gfx9x_hip.o -fPIC -D__HIP_PLATFORM_HCC__=1 -DUSE_ROCM=1 -DCUDA_HAS_FP16=1 -D__HIP_NO_HALF_OPERATORS__=1 -D__HIP_NO_HALF_CONVERSIONS__=1 -O3 -std=c++20 -DNDEBUG -U__CUDA_NO_HALF_OPERATORS__ -U__CUDA_NO_HALF_CONVERSIONS__ --offload-arch=gfx90a -DTORCH_API_INCLUDE_EXTENSION_H '-DPYBIND11_COMPILER_TYPE="_gcc"' '-DPYBIND11_STDLIB="_libstdcpp"' '-DPYBIND11_BUILD_ABI="_cxxabi1011"' -DTORCH_EXTENSION_NAME=flash_attn_2_cuda -D_GLIBCXX_USE_CXX11_ABI=0 -fno-gpu-rdc
      [34/50] /opt/rocm-5.6.1/bin/hipcc  -I/tmp/tmp.2H7X7ZCvOC/flash-attention/csrc/flash_attn_rocm -I/tmp/tmp.2H7X7ZCvOC/flash-attention/csrc/flash_attn_rocm/src -I/tmp/tmp.2H7X7ZCvOC/flash-attention/csrc/flash_attn_rocm/composable_kernel/include -I/tmp/tmp.2H7X7ZCvOC/flash-attention/csrc/flash_attn_rocm/composable_kernel/library/include -I/opt/conda/envs/ptca/lib/python3.8/site-packages/torch/include -I/opt/conda/envs/ptca/lib/python3.8/site-packages/torch/include/torch/csrc/api/include -I/opt/conda/envs/ptca/lib/python3.8/site-packages/torch/include/TH -I/opt/conda/envs/ptca/lib/python3.8/site-packages/torch/include/THC -I/opt/conda/envs/ptca/lib/python3.8/site-packages/torch/include/THH -I/opt/rocm-5.6.1/include -I/opt/conda/envs/ptca/include/python3.8 -c -c /tmp/tmp.2H7X7ZCvOC/flash-attention/csrc/flash_attn_rocm/src/flash_fwd_runner_grouped_hdim64_fp16_causal_gfx9x_hip.hip -o /tmp/tmp.2H7X7ZCvOC/flash-attention/build/temp.linux-x86_64-cpython-38/csrc/flash_attn_rocm/src/flash_fwd_runner_grouped_hdim64_fp16_causal_gfx9x_hip.o -fPIC -D__HIP_PLATFORM_HCC__=1 -DUSE_ROCM=1 -DCUDA_HAS_FP16=1 -D__HIP_NO_HALF_OPERATORS__=1 -D__HIP_NO_HALF_CONVERSIONS__=1 -O3 -std=c++20 -DNDEBUG -U__CUDA_NO_HALF_OPERATORS__ -U__CUDA_NO_HALF_CONVERSIONS__ --offload-arch=gfx90a -DTORCH_API_INCLUDE_EXTENSION_H '-DPYBIND11_COMPILER_TYPE="_gcc"' '-DPYBIND11_STDLIB="_libstdcpp"' '-DPYBIND11_BUILD_ABI="_cxxabi1011"' -DTORCH_EXTENSION_NAME=flash_attn_2_cuda -D_GLIBCXX_USE_CXX11_ABI=0 -fno-gpu-rdc
      [35/50] /opt/rocm-5.6.1/bin/hipcc  -I/tmp/tmp.2H7X7ZCvOC/flash-attention/csrc/flash_attn_rocm -I/tmp/tmp.2H7X7ZCvOC/flash-attention/csrc/flash_attn_rocm/src -I/tmp/tmp.2H7X7ZCvOC/flash-attention/csrc/flash_attn_rocm/composable_kernel/include -I/tmp/tmp.2H7X7ZCvOC/flash-attention/csrc/flash_attn_rocm/composable_kernel/library/include -I/opt/conda/envs/ptca/lib/python3.8/site-packages/torch/include -I/opt/conda/envs/ptca/lib/python3.8/site-packages/torch/include/torch/csrc/api/include -I/opt/conda/envs/ptca/lib/python3.8/site-packages/torch/include/TH -I/opt/conda/envs/ptca/lib/python3.8/site-packages/torch/include/THC -I/opt/conda/envs/ptca/lib/python3.8/site-packages/torch/include/THH -I/opt/rocm-5.6.1/include -I/opt/conda/envs/ptca/include/python3.8 -c -c /tmp/tmp.2H7X7ZCvOC/flash-attention/csrc/flash_attn_rocm/src/flash_bwd_runner_batched_hdim128_fp16_causal_gfx9x_hip.hip -o /tmp/tmp.2H7X7ZCvOC/flash-attention/build/temp.linux-x86_64-cpython-38/csrc/flash_attn_rocm/src/flash_bwd_runner_batched_hdim128_fp16_causal_gfx9x_hip.o -fPIC -D__HIP_PLATFORM_HCC__=1 -DUSE_ROCM=1 -DCUDA_HAS_FP16=1 -D__HIP_NO_HALF_OPERATORS__=1 -D__HIP_NO_HALF_CONVERSIONS__=1 -O3 -std=c++20 -DNDEBUG -U__CUDA_NO_HALF_OPERATORS__ -U__CUDA_NO_HALF_CONVERSIONS__ --offload-arch=gfx90a -DTORCH_API_INCLUDE_EXTENSION_H '-DPYBIND11_COMPILER_TYPE="_gcc"' '-DPYBIND11_STDLIB="_libstdcpp"' '-DPYBIND11_BUILD_ABI="_cxxabi1011"' -DTORCH_EXTENSION_NAME=flash_attn_2_cuda -D_GLIBCXX_USE_CXX11_ABI=0 -fno-gpu-rdc
      FAILED: /tmp/tmp.2H7X7ZCvOC/flash-attention/build/temp.linux-x86_64-cpython-38/csrc/flash_attn_rocm/src/flash_bwd_runner_batched_hdim128_fp16_causal_gfx9x_hip.o
      /opt/rocm-5.6.1/bin/hipcc  -I/tmp/tmp.2H7X7ZCvOC/flash-attention/csrc/flash_attn_rocm -I/tmp/tmp.2H7X7ZCvOC/flash-attention/csrc/flash_attn_rocm/src -I/tmp/tmp.2H7X7ZCvOC/flash-attention/csrc/flash_attn_rocm/composable_kernel/include -I/tmp/tmp.2H7X7ZCvOC/flash-attention/csrc/flash_attn_rocm/composable_kernel/library/include -I/opt/conda/envs/ptca/lib/python3.8/site-packages/torch/include -I/opt/conda/envs/ptca/lib/python3.8/site-packages/torch/include/torch/csrc/api/include -I/opt/conda/envs/ptca/lib/python3.8/site-packages/torch/include/TH -I/opt/conda/envs/ptca/lib/python3.8/site-packages/torch/include/THC -I/opt/conda/envs/ptca/lib/python3.8/site-packages/torch/include/THH -I/opt/rocm-5.6.1/include -I/opt/conda/envs/ptca/include/python3.8 -c -c /tmp/tmp.2H7X7ZCvOC/flash-attention/csrc/flash_attn_rocm/src/flash_bwd_runner_batched_hdim128_fp16_causal_gfx9x_hip.hip -o /tmp/tmp.2H7X7ZCvOC/flash-attention/build/temp.linux-x86_64-cpython-38/csrc/flash_attn_rocm/src/flash_bwd_runner_batched_hdim128_fp16_causal_gfx9x_hip.o -fPIC -D__HIP_PLATFORM_HCC__=1 -DUSE_ROCM=1 -DCUDA_HAS_FP16=1 -D__HIP_NO_HALF_OPERATORS__=1 -D__HIP_NO_HALF_CONVERSIONS__=1 -O3 -std=c++20 -DNDEBUG -U__CUDA_NO_HALF_OPERATORS__ -U__CUDA_NO_HALF_CONVERSIONS__ --offload-arch=gfx90a -DTORCH_API_INCLUDE_EXTENSION_H '-DPYBIND11_COMPILER_TYPE="_gcc"' '-DPYBIND11_STDLIB="_libstdcpp"' '-DPYBIND11_BUILD_ABI="_cxxabi1011"' -DTORCH_EXTENSION_NAME=flash_attn_2_cuda -D_GLIBCXX_USE_CXX11_ABI=0 -fno-gpu-rdc
      fatal error: error in backend: Not supported instr: <MCInst 0 <MCOperand Reg:8388> <MCOperand Reg:4846> <MCOperand Expr:(.LBB2_3)> <MCOperand Reg:4814> <MCOperand Expr:(.LBB2_-1)>>
      clang-16: error: clang frontend command failed with exit code 70 (use -v to see invocation)
      AMD clang version 16.0.0 (https://github.com/RadeonOpenCompute/llvm-project roc-5.6.1 23332 4f9bb99d78a4d8d9770be38b91ebd004ea4d2a3a)
      Target: x86_64-unknown-linux-gnu
      Thread model: posix
      InstalledDir: /opt/rocm-5.6.1/llvm/bin
      clang-16: note: diagnostic msg: Error generating preprocessed source(s).
      [36/50] /opt/rocm-5.6.1/bin/hipcc  -I/tmp/tmp.2H7X7ZCvOC/flash-attention/csrc/flash_attn_rocm -I/tmp/tmp.2H7X7ZCvOC/flash-attention/csrc/flash_attn_rocm/src -I/tmp/tmp.2H7X7ZCvOC/flash-attention/csrc/flash_attn_rocm/composable_kernel/include -I/tmp/tmp.2H7X7ZCvOC/flash-attention/csrc/flash_attn_rocm/composable_kernel/library/include -I/opt/conda/envs/ptca/lib/python3.8/site-packages/torch/include -I/opt/conda/envs/ptca/lib/python3.8/site-packages/torch/include/torch/csrc/api/include -I/opt/conda/envs/ptca/lib/python3.8/site-packages/torch/include/TH -I/opt/conda/envs/ptca/lib/python3.8/site-packages/torch/include/THC -I/opt/conda/envs/ptca/lib/python3.8/site-packages/torch/include/THH -I/opt/rocm-5.6.1/include -I/opt/conda/envs/ptca/include/python3.8 -c -c /tmp/tmp.2H7X7ZCvOC/flash-attention/csrc/flash_attn_rocm/src/flash_bwd_runner_grouped_hdim64_bf16_causal_gfx9x_hip.hip -o /tmp/tmp.2H7X7ZCvOC/flash-attention/build/temp.linux-x86_64-cpython-38/csrc/flash_attn_rocm/src/flash_bwd_runner_grouped_hdim64_bf16_causal_gfx9x_hip.o -fPIC -D__HIP_PLATFORM_HCC__=1 -DUSE_ROCM=1 -DCUDA_HAS_FP16=1 -D__HIP_NO_HALF_OPERATORS__=1 -D__HIP_NO_HALF_CONVERSIONS__=1 -O3 -std=c++20 -DNDEBUG -U__CUDA_NO_HALF_OPERATORS__ -U__CUDA_NO_HALF_CONVERSIONS__ --offload-arch=gfx90a -DTORCH_API_INCLUDE_EXTENSION_H '-DPYBIND11_COMPILER_TYPE="_gcc"' '-DPYBIND11_STDLIB="_libstdcpp"' '-DPYBIND11_BUILD_ABI="_cxxabi1011"' -DTORCH_EXTENSION_NAME=flash_attn_2_cuda -D_GLIBCXX_USE_CXX11_ABI=0 -fno-gpu-rdc
      FAILED: /tmp/tmp.2H7X7ZCvOC/flash-attention/build/temp.linux-x86_64-cpython-38/csrc/flash_attn_rocm/src/flash_bwd_runner_grouped_hdim64_bf16_causal_gfx9x_hip.o
      /opt/rocm-5.6.1/bin/hipcc  -I/tmp/tmp.2H7X7ZCvOC/flash-attention/csrc/flash_attn_rocm -I/tmp/tmp.2H7X7ZCvOC/flash-attention/csrc/flash_attn_rocm/src -I/tmp/tmp.2H7X7ZCvOC/flash-attention/csrc/flash_attn_rocm/composable_kernel/include -I/tmp/tmp.2H7X7ZCvOC/flash-attention/csrc/flash_attn_rocm/composable_kernel/library/include -I/opt/conda/envs/ptca/lib/python3.8/site-packages/torch/include -I/opt/conda/envs/ptca/lib/python3.8/site-packages/torch/include/torch/csrc/api/include -I/opt/conda/envs/ptca/lib/python3.8/site-packages/torch/include/TH -I/opt/conda/envs/ptca/lib/python3.8/site-packages/torch/include/THC -I/opt/conda/envs/ptca/lib/python3.8/site-packages/torch/include/THH -I/opt/rocm-5.6.1/include -I/opt/conda/envs/ptca/include/python3.8 -c -c /tmp/tmp.2H7X7ZCvOC/flash-attention/csrc/flash_attn_rocm/src/flash_bwd_runner_grouped_hdim64_bf16_causal_gfx9x_hip.hip -o /tmp/tmp.2H7X7ZCvOC/flash-attention/build/temp.linux-x86_64-cpython-38/csrc/flash_attn_rocm/src/flash_bwd_runner_grouped_hdim64_bf16_causal_gfx9x_hip.o -fPIC -D__HIP_PLATFORM_HCC__=1 -DUSE_ROCM=1 -DCUDA_HAS_FP16=1 -D__HIP_NO_HALF_OPERATORS__=1 -D__HIP_NO_HALF_CONVERSIONS__=1 -O3 -std=c++20 -DNDEBUG -U__CUDA_NO_HALF_OPERATORS__ -U__CUDA_NO_HALF_CONVERSIONS__ --offload-arch=gfx90a -DTORCH_API_INCLUDE_EXTENSION_H '-DPYBIND11_COMPILER_TYPE="_gcc"' '-DPYBIND11_STDLIB="_libstdcpp"' '-DPYBIND11_BUILD_ABI="_cxxabi1011"' -DTORCH_EXTENSION_NAME=flash_attn_2_cuda -D_GLIBCXX_USE_CXX11_ABI=0 -fno-gpu-rdc
      fatal error: error in backend: Not supported instr: <MCInst 0 <MCOperand Reg:4622> <MCOperand Reg:4702> <MCOperand Expr:(.LBB2_9)> <MCOperand Reg:4766> <MCOperand Expr:(.LBB2_-1)>>
      clang-16: error: clang frontend command failed with exit code 70 (use -v to see invocation)
      AMD clang version 16.0.0 (https://github.com/RadeonOpenCompute/llvm-project roc-5.6.1 23332 4f9bb99d78a4d8d9770be38b91ebd004ea4d2a3a)
      Target: x86_64-unknown-linux-gnu
      Thread model: posix
      InstalledDir: /opt/rocm-5.6.1/llvm/bin
      clang-16: note: diagnostic msg: Error generating preprocessed source(s).
      [37/50] /opt/rocm-5.6.1/bin/hipcc  -I/tmp/tmp.2H7X7ZCvOC/flash-attention/csrc/flash_attn_rocm -I/tmp/tmp.2H7X7ZCvOC/flash-attention/csrc/flash_attn_rocm/src -I/tmp/tmp.2H7X7ZCvOC/flash-attention/csrc/flash_attn_rocm/composable_kernel/include -I/tmp/tmp.2H7X7ZCvOC/flash-attention/csrc/flash_attn_rocm/composable_kernel/library/include -I/opt/conda/envs/ptca/lib/python3.8/site-packages/torch/include -I/opt/conda/envs/ptca/lib/python3.8/site-packages/torch/include/torch/csrc/api/include -I/opt/conda/envs/ptca/lib/python3.8/site-packages/torch/include/TH -I/opt/conda/envs/ptca/lib/python3.8/site-packages/torch/include/THC -I/opt/conda/envs/ptca/lib/python3.8/site-packages/torch/include/THH -I/opt/rocm-5.6.1/include -I/opt/conda/envs/ptca/include/python3.8 -c -c /tmp/tmp.2H7X7ZCvOC/flash-attention/csrc/flash_attn_rocm/src/flash_fwd_runner_batched_hdim64_fp16_causal_gfx9x_hip.hip -o /tmp/tmp.2H7X7ZCvOC/flash-attention/build/temp.linux-x86_64-cpython-38/csrc/flash_attn_rocm/src/flash_fwd_runner_batched_hdim64_fp16_causal_gfx9x_hip.o -fPIC -D__HIP_PLATFORM_HCC__=1 -DUSE_ROCM=1 -DCUDA_HAS_FP16=1 -D__HIP_NO_HALF_OPERATORS__=1 -D__HIP_NO_HALF_CONVERSIONS__=1 -O3 -std=c++20 -DNDEBUG -U__CUDA_NO_HALF_OPERATORS__ -U__CUDA_NO_HALF_CONVERSIONS__ --offload-arch=gfx90a -DTORCH_API_INCLUDE_EXTENSION_H '-DPYBIND11_COMPILER_TYPE="_gcc"' '-DPYBIND11_STDLIB="_libstdcpp"' '-DPYBIND11_BUILD_ABI="_cxxabi1011"' -DTORCH_EXTENSION_NAME=flash_attn_2_cuda -D_GLIBCXX_USE_CXX11_ABI=0 -fno-gpu-rdc
      [38/50] /opt/rocm-5.6.1/bin/hipcc  -I/tmp/tmp.2H7X7ZCvOC/flash-attention/csrc/flash_attn_rocm -I/tmp/tmp.2H7X7ZCvOC/flash-attention/csrc/flash_attn_rocm/src -I/tmp/tmp.2H7X7ZCvOC/flash-attention/csrc/flash_attn_rocm/composable_kernel/include -I/tmp/tmp.2H7X7ZCvOC/flash-attention/csrc/flash_attn_rocm/composable_kernel/library/include -I/opt/conda/envs/ptca/lib/python3.8/site-packages/torch/include -I/opt/conda/envs/ptca/lib/python3.8/site-packages/torch/include/torch/csrc/api/include -I/opt/conda/envs/ptca/lib/python3.8/site-packages/torch/include/TH -I/opt/conda/envs/ptca/lib/python3.8/site-packages/torch/include/THC -I/opt/conda/envs/ptca/lib/python3.8/site-packages/torch/include/THH -I/opt/rocm-5.6.1/include -I/opt/conda/envs/ptca/include/python3.8 -c -c /tmp/tmp.2H7X7ZCvOC/flash-attention/csrc/flash_attn_rocm/src/flash_bwd_runner_batched_hdim64_bf16_noncausal_gfx9x_hip.hip -o /tmp/tmp.2H7X7ZCvOC/flash-attention/build/temp.linux-x86_64-cpython-38/csrc/flash_attn_rocm/src/flash_bwd_runner_batched_hdim64_bf16_noncausal_gfx9x_hip.o -fPIC -D__HIP_PLATFORM_HCC__=1 -DUSE_ROCM=1 -DCUDA_HAS_FP16=1 -D__HIP_NO_HALF_OPERATORS__=1 -D__HIP_NO_HALF_CONVERSIONS__=1 -O3 -std=c++20 -DNDEBUG -U__CUDA_NO_HALF_OPERATORS__ -U__CUDA_NO_HALF_CONVERSIONS__ --offload-arch=gfx90a -DTORCH_API_INCLUDE_EXTENSION_H '-DPYBIND11_COMPILER_TYPE="_gcc"' '-DPYBIND11_STDLIB="_libstdcpp"' '-DPYBIND11_BUILD_ABI="_cxxabi1011"' -DTORCH_EXTENSION_NAME=flash_attn_2_cuda -D_GLIBCXX_USE_CXX11_ABI=0 -fno-gpu-rdc
      [39/50] /opt/rocm-5.6.1/bin/hipcc  -I/tmp/tmp.2H7X7ZCvOC/flash-attention/csrc/flash_attn_rocm -I/tmp/tmp.2H7X7ZCvOC/flash-attention/csrc/flash_attn_rocm/src -I/tmp/tmp.2H7X7ZCvOC/flash-attention/csrc/flash_attn_rocm/composable_kernel/include -I/tmp/tmp.2H7X7ZCvOC/flash-attention/csrc/flash_attn_rocm/composable_kernel/library/include -I/opt/conda/envs/ptca/lib/python3.8/site-packages/torch/include -I/opt/conda/envs/ptca/lib/python3.8/site-packages/torch/include/torch/csrc/api/include -I/opt/conda/envs/ptca/lib/python3.8/site-packages/torch/include/TH -I/opt/conda/envs/ptca/lib/python3.8/site-packages/torch/include/THC -I/opt/conda/envs/ptca/lib/python3.8/site-packages/torch/include/THH -I/opt/rocm-5.6.1/include -I/opt/conda/envs/ptca/include/python3.8 -c -c /tmp/tmp.2H7X7ZCvOC/flash-attention/csrc/flash_attn_rocm/src/flash_fwd_runner_grouped_hdim64_bf16_causal_gfx9x_hip.hip -o /tmp/tmp.2H7X7ZCvOC/flash-attention/build/temp.linux-x86_64-cpython-38/csrc/flash_attn_rocm/src/flash_fwd_runner_grouped_hdim64_bf16_causal_gfx9x_hip.o -fPIC -D__HIP_PLATFORM_HCC__=1 -DUSE_ROCM=1 -DCUDA_HAS_FP16=1 -D__HIP_NO_HALF_OPERATORS__=1 -D__HIP_NO_HALF_CONVERSIONS__=1 -O3 -std=c++20 -DNDEBUG -U__CUDA_NO_HALF_OPERATORS__ -U__CUDA_NO_HALF_CONVERSIONS__ --offload-arch=gfx90a -DTORCH_API_INCLUDE_EXTENSION_H '-DPYBIND11_COMPILER_TYPE="_gcc"' '-DPYBIND11_STDLIB="_libstdcpp"' '-DPYBIND11_BUILD_ABI="_cxxabi1011"' -DTORCH_EXTENSION_NAME=flash_attn_2_cuda -D_GLIBCXX_USE_CXX11_ABI=0 -fno-gpu-rdc
      [40/50] /opt/rocm-5.6.1/bin/hipcc  -I/tmp/tmp.2H7X7ZCvOC/flash-attention/csrc/flash_attn_rocm -I/tmp/tmp.2H7X7ZCvOC/flash-attention/csrc/flash_attn_rocm/src -I/tmp/tmp.2H7X7ZCvOC/flash-attention/csrc/flash_attn_rocm/composable_kernel/include -I/tmp/tmp.2H7X7ZCvOC/flash-attention/csrc/flash_attn_rocm/composable_kernel/library/include -I/opt/conda/envs/ptca/lib/python3.8/site-packages/torch/include -I/opt/conda/envs/ptca/lib/python3.8/site-packages/torch/include/torch/csrc/api/include -I/opt/conda/envs/ptca/lib/python3.8/site-packages/torch/include/TH -I/opt/conda/envs/ptca/lib/python3.8/site-packages/torch/include/THC -I/opt/conda/envs/ptca/lib/python3.8/site-packages/torch/include/THH -I/opt/rocm-5.6.1/include -I/opt/conda/envs/ptca/include/python3.8 -c -c /tmp/tmp.2H7X7ZCvOC/flash-attention/csrc/flash_attn_rocm/src/flash_bwd_runner_grouped_hdim128_fp16_causal_gfx9x_hip.hip -o /tmp/tmp.2H7X7ZCvOC/flash-attention/build/temp.linux-x86_64-cpython-38/csrc/flash_attn_rocm/src/flash_bwd_runner_grouped_hdim128_fp16_causal_gfx9x_hip.o -fPIC -D__HIP_PLATFORM_HCC__=1 -DUSE_ROCM=1 -DCUDA_HAS_FP16=1 -D__HIP_NO_HALF_OPERATORS__=1 -D__HIP_NO_HALF_CONVERSIONS__=1 -O3 -std=c++20 -DNDEBUG -U__CUDA_NO_HALF_OPERATORS__ -U__CUDA_NO_HALF_CONVERSIONS__ --offload-arch=gfx90a -DTORCH_API_INCLUDE_EXTENSION_H '-DPYBIND11_COMPILER_TYPE="_gcc"' '-DPYBIND11_STDLIB="_libstdcpp"' '-DPYBIND11_BUILD_ABI="_cxxabi1011"' -DTORCH_EXTENSION_NAME=flash_attn_2_cuda -D_GLIBCXX_USE_CXX11_ABI=0 -fno-gpu-rdc
      FAILED: /tmp/tmp.2H7X7ZCvOC/flash-attention/build/temp.linux-x86_64-cpython-38/csrc/flash_attn_rocm/src/flash_bwd_runner_grouped_hdim128_fp16_causal_gfx9x_hip.o
      /opt/rocm-5.6.1/bin/hipcc  -I/tmp/tmp.2H7X7ZCvOC/flash-attention/csrc/flash_attn_rocm -I/tmp/tmp.2H7X7ZCvOC/flash-attention/csrc/flash_attn_rocm/src -I/tmp/tmp.2H7X7ZCvOC/flash-attention/csrc/flash_attn_rocm/composable_kernel/include -I/tmp/tmp.2H7X7ZCvOC/flash-attention/csrc/flash_attn_rocm/composable_kernel/library/include -I/opt/conda/envs/ptca/lib/python3.8/site-packages/torch/include -I/opt/conda/envs/ptca/lib/python3.8/site-packages/torch/include/torch/csrc/api/include -I/opt/conda/envs/ptca/lib/python3.8/site-packages/torch/include/TH -I/opt/conda/envs/ptca/lib/python3.8/site-packages/torch/include/THC -I/opt/conda/envs/ptca/lib/python3.8/site-packages/torch/include/THH -I/opt/rocm-5.6.1/include -I/opt/conda/envs/ptca/include/python3.8 -c -c /tmp/tmp.2H7X7ZCvOC/flash-attention/csrc/flash_attn_rocm/src/flash_bwd_runner_grouped_hdim128_fp16_causal_gfx9x_hip.hip -o /tmp/tmp.2H7X7ZCvOC/flash-attention/build/temp.linux-x86_64-cpython-38/csrc/flash_attn_rocm/src/flash_bwd_runner_grouped_hdim128_fp16_causal_gfx9x_hip.o -fPIC -D__HIP_PLATFORM_HCC__=1 -DUSE_ROCM=1 -DCUDA_HAS_FP16=1 -D__HIP_NO_HALF_OPERATORS__=1 -D__HIP_NO_HALF_CONVERSIONS__=1 -O3 -std=c++20 -DNDEBUG -U__CUDA_NO_HALF_OPERATORS__ -U__CUDA_NO_HALF_CONVERSIONS__ --offload-arch=gfx90a -DTORCH_API_INCLUDE_EXTENSION_H '-DPYBIND11_COMPILER_TYPE="_gcc"' '-DPYBIND11_STDLIB="_libstdcpp"' '-DPYBIND11_BUILD_ABI="_cxxabi1011"' -DTORCH_EXTENSION_NAME=flash_attn_2_cuda -D_GLIBCXX_USE_CXX11_ABI=0 -fno-gpu-rdc
      fatal error: error in backend: Not supported instr: <MCInst 0 <MCOperand Reg:4862> <MCOperand Reg:4846> <MCOperand Expr:(.LBB2_9)> <MCOperand Reg:4814> <MCOperand Expr:(.LBB2_-1)>>
      clang-16: error: clang frontend command failed with exit code 70 (use -v to see invocation)
      AMD clang version 16.0.0 (https://github.com/RadeonOpenCompute/llvm-project roc-5.6.1 23332 4f9bb99d78a4d8d9770be38b91ebd004ea4d2a3a)
      Target: x86_64-unknown-linux-gnu
      Thread model: posix
      InstalledDir: /opt/rocm-5.6.1/llvm/bin
      clang-16: note: diagnostic msg: Error generating preprocessed source(s).
      [41/50] /opt/rocm-5.6.1/bin/hipcc  -I/tmp/tmp.2H7X7ZCvOC/flash-attention/csrc/flash_attn_rocm -I/tmp/tmp.2H7X7ZCvOC/flash-attention/csrc/flash_attn_rocm/src -I/tmp/tmp.2H7X7ZCvOC/flash-attention/csrc/flash_attn_rocm/composable_kernel/include -I/tmp/tmp.2H7X7ZCvOC/flash-attention/csrc/flash_attn_rocm/composable_kernel/library/include -I/opt/conda/envs/ptca/lib/python3.8/site-packages/torch/include -I/opt/conda/envs/ptca/lib/python3.8/site-packages/torch/include/torch/csrc/api/include -I/opt/conda/envs/ptca/lib/python3.8/site-packages/torch/include/TH -I/opt/conda/envs/ptca/lib/python3.8/site-packages/torch/include/THC -I/opt/conda/envs/ptca/lib/python3.8/site-packages/torch/include/THH -I/opt/rocm-5.6.1/include -I/opt/conda/envs/ptca/include/python3.8 -c -c /tmp/tmp.2H7X7ZCvOC/flash-attention/csrc/flash_attn_rocm/src/flash_bwd_runner_grouped_hdim64_fp16_causal_gfx9x_hip.hip -o /tmp/tmp.2H7X7ZCvOC/flash-attention/build/temp.linux-x86_64-cpython-38/csrc/flash_attn_rocm/src/flash_bwd_runner_grouped_hdim64_fp16_causal_gfx9x_hip.o -fPIC -D__HIP_PLATFORM_HCC__=1 -DUSE_ROCM=1 -DCUDA_HAS_FP16=1 -D__HIP_NO_HALF_OPERATORS__=1 -D__HIP_NO_HALF_CONVERSIONS__=1 -O3 -std=c++20 -DNDEBUG -U__CUDA_NO_HALF_OPERATORS__ -U__CUDA_NO_HALF_CONVERSIONS__ --offload-arch=gfx90a -DTORCH_API_INCLUDE_EXTENSION_H '-DPYBIND11_COMPILER_TYPE="_gcc"' '-DPYBIND11_STDLIB="_libstdcpp"' '-DPYBIND11_BUILD_ABI="_cxxabi1011"' -DTORCH_EXTENSION_NAME=flash_attn_2_cuda -D_GLIBCXX_USE_CXX11_ABI=0 -fno-gpu-rdc
      FAILED: /tmp/tmp.2H7X7ZCvOC/flash-attention/build/temp.linux-x86_64-cpython-38/csrc/flash_attn_rocm/src/flash_bwd_runner_grouped_hdim64_fp16_causal_gfx9x_hip.o
      /opt/rocm-5.6.1/bin/hipcc  -I/tmp/tmp.2H7X7ZCvOC/flash-attention/csrc/flash_attn_rocm -I/tmp/tmp.2H7X7ZCvOC/flash-attention/csrc/flash_attn_rocm/src -I/tmp/tmp.2H7X7ZCvOC/flash-attention/csrc/flash_attn_rocm/composable_kernel/include -I/tmp/tmp.2H7X7ZCvOC/flash-attention/csrc/flash_attn_rocm/composable_kernel/library/include -I/opt/conda/envs/ptca/lib/python3.8/site-packages/torch/include -I/opt/conda/envs/ptca/lib/python3.8/site-packages/torch/include/torch/csrc/api/include -I/opt/conda/envs/ptca/lib/python3.8/site-packages/torch/include/TH -I/opt/conda/envs/ptca/lib/python3.8/site-packages/torch/include/THC -I/opt/conda/envs/ptca/lib/python3.8/site-packages/torch/include/THH -I/opt/rocm-5.6.1/include -I/opt/conda/envs/ptca/include/python3.8 -c -c /tmp/tmp.2H7X7ZCvOC/flash-attention/csrc/flash_attn_rocm/src/flash_bwd_runner_grouped_hdim64_fp16_causal_gfx9x_hip.hip -o /tmp/tmp.2H7X7ZCvOC/flash-attention/build/temp.linux-x86_64-cpython-38/csrc/flash_attn_rocm/src/flash_bwd_runner_grouped_hdim64_fp16_causal_gfx9x_hip.o -fPIC -D__HIP_PLATFORM_HCC__=1 -DUSE_ROCM=1 -DCUDA_HAS_FP16=1 -D__HIP_NO_HALF_OPERATORS__=1 -D__HIP_NO_HALF_CONVERSIONS__=1 -O3 -std=c++20 -DNDEBUG -U__CUDA_NO_HALF_OPERATORS__ -U__CUDA_NO_HALF_CONVERSIONS__ --offload-arch=gfx90a -DTORCH_API_INCLUDE_EXTENSION_H '-DPYBIND11_COMPILER_TYPE="_gcc"' '-DPYBIND11_STDLIB="_libstdcpp"' '-DPYBIND11_BUILD_ABI="_cxxabi1011"' -DTORCH_EXTENSION_NAME=flash_attn_2_cuda -D_GLIBCXX_USE_CXX11_ABI=0 -fno-gpu-rdc
      fatal error: error in backend: Not supported instr: <MCInst 0 <MCOperand Reg:4622> <MCOperand Reg:4702> <MCOperand Expr:(.LBB2_9)> <MCOperand Reg:4766> <MCOperand Expr:(.LBB2_-1)>>
      clang-16: error: clang frontend command failed with exit code 70 (use -v to see invocation)
      AMD clang version 16.0.0 (https://github.com/RadeonOpenCompute/llvm-project roc-5.6.1 23332 4f9bb99d78a4d8d9770be38b91ebd004ea4d2a3a)
      Target: x86_64-unknown-linux-gnu
      Thread model: posix
      InstalledDir: /opt/rocm-5.6.1/llvm/bin
      clang-16: note: diagnostic msg: Error generating preprocessed source(s).
      [42/50] /opt/rocm-5.6.1/bin/hipcc  -I/tmp/tmp.2H7X7ZCvOC/flash-attention/csrc/flash_attn_rocm -I/tmp/tmp.2H7X7ZCvOC/flash-attention/csrc/flash_attn_rocm/src -I/tmp/tmp.2H7X7ZCvOC/flash-attention/csrc/flash_attn_rocm/composable_kernel/include -I/tmp/tmp.2H7X7ZCvOC/flash-attention/csrc/flash_attn_rocm/composable_kernel/library/include -I/opt/conda/envs/ptca/lib/python3.8/site-packages/torch/include -I/opt/conda/envs/ptca/lib/python3.8/site-packages/torch/include/torch/csrc/api/include -I/opt/conda/envs/ptca/lib/python3.8/site-packages/torch/include/TH -I/opt/conda/envs/ptca/lib/python3.8/site-packages/torch/include/THC -I/opt/conda/envs/ptca/lib/python3.8/site-packages/torch/include/THH -I/opt/rocm-5.6.1/include -I/opt/conda/envs/ptca/include/python3.8 -c -c /tmp/tmp.2H7X7ZCvOC/flash-attention/csrc/flash_attn_rocm/src/flash_bwd_runner_batched_hdim64_fp16_noncausal_gfx9x_hip.hip -o /tmp/tmp.2H7X7ZCvOC/flash-attention/build/temp.linux-x86_64-cpython-38/csrc/flash_attn_rocm/src/flash_bwd_runner_batched_hdim64_fp16_noncausal_gfx9x_hip.o -fPIC -D__HIP_PLATFORM_HCC__=1 -DUSE_ROCM=1 -DCUDA_HAS_FP16=1 -D__HIP_NO_HALF_OPERATORS__=1 -D__HIP_NO_HALF_CONVERSIONS__=1 -O3 -std=c++20 -DNDEBUG -U__CUDA_NO_HALF_OPERATORS__ -U__CUDA_NO_HALF_CONVERSIONS__ --offload-arch=gfx90a -DTORCH_API_INCLUDE_EXTENSION_H '-DPYBIND11_COMPILER_TYPE="_gcc"' '-DPYBIND11_STDLIB="_libstdcpp"' '-DPYBIND11_BUILD_ABI="_cxxabi1011"' -DTORCH_EXTENSION_NAME=flash_attn_2_cuda -D_GLIBCXX_USE_CXX11_ABI=0 -fno-gpu-rdc
      [43/50] /opt/rocm-5.6.1/bin/hipcc  -I/tmp/tmp.2H7X7ZCvOC/flash-attention/csrc/flash_attn_rocm -I/tmp/tmp.2H7X7ZCvOC/flash-attention/csrc/flash_attn_rocm/src -I/tmp/tmp.2H7X7ZCvOC/flash-attention/csrc/flash_attn_rocm/composable_kernel/include -I/tmp/tmp.2H7X7ZCvOC/flash-attention/csrc/flash_attn_rocm/composable_kernel/library/include -I/opt/conda/envs/ptca/lib/python3.8/site-packages/torch/include -I/opt/conda/envs/ptca/lib/python3.8/site-packages/torch/include/torch/csrc/api/include -I/opt/conda/envs/ptca/lib/python3.8/site-packages/torch/include/TH -I/opt/conda/envs/ptca/lib/python3.8/site-packages/torch/include/THC -I/opt/conda/envs/ptca/lib/python3.8/site-packages/torch/include/THH -I/opt/rocm-5.6.1/include -I/opt/conda/envs/ptca/include/python3.8 -c -c /tmp/tmp.2H7X7ZCvOC/flash-attention/csrc/flash_attn_rocm/src/flash_bwd_runner_grouped_hdim128_bf16_noncausal_gfx9x_hip.hip -o /tmp/tmp.2H7X7ZCvOC/flash-attention/build/temp.linux-x86_64-cpython-38/csrc/flash_attn_rocm/src/flash_bwd_runner_grouped_hdim128_bf16_noncausal_gfx9x_hip.o -fPIC -D__HIP_PLATFORM_HCC__=1 -DUSE_ROCM=1 -DCUDA_HAS_FP16=1 -D__HIP_NO_HALF_OPERATORS__=1 -D__HIP_NO_HALF_CONVERSIONS__=1 -O3 -std=c++20 -DNDEBUG -U__CUDA_NO_HALF_OPERATORS__ -U__CUDA_NO_HALF_CONVERSIONS__ --offload-arch=gfx90a -DTORCH_API_INCLUDE_EXTENSION_H '-DPYBIND11_COMPILER_TYPE="_gcc"' '-DPYBIND11_STDLIB="_libstdcpp"' '-DPYBIND11_BUILD_ABI="_cxxabi1011"' -DTORCH_EXTENSION_NAME=flash_attn_2_cuda -D_GLIBCXX_USE_CXX11_ABI=0 -fno-gpu-rdc
      [44/50] /opt/rocm-5.6.1/bin/hipcc  -I/tmp/tmp.2H7X7ZCvOC/flash-attention/csrc/flash_attn_rocm -I/tmp/tmp.2H7X7ZCvOC/flash-attention/csrc/flash_attn_rocm/src -I/tmp/tmp.2H7X7ZCvOC/flash-attention/csrc/flash_attn_rocm/composable_kernel/include -I/tmp/tmp.2H7X7ZCvOC/flash-attention/csrc/flash_attn_rocm/composable_kernel/library/include -I/opt/conda/envs/ptca/lib/python3.8/site-packages/torch/include -I/opt/conda/envs/ptca/lib/python3.8/site-packages/torch/include/torch/csrc/api/include -I/opt/conda/envs/ptca/lib/python3.8/site-packages/torch/include/TH -I/opt/conda/envs/ptca/lib/python3.8/site-packages/torch/include/THC -I/opt/conda/envs/ptca/lib/python3.8/site-packages/torch/include/THH -I/opt/rocm-5.6.1/include -I/opt/conda/envs/ptca/include/python3.8 -c -c /tmp/tmp.2H7X7ZCvOC/flash-attention/csrc/flash_attn_rocm/src/flash_bwd_runner_grouped_hdim32_bf16_noncausal_gfx9x_hip.hip -o /tmp/tmp.2H7X7ZCvOC/flash-attention/build/temp.linux-x86_64-cpython-38/csrc/flash_attn_rocm/src/flash_bwd_runner_grouped_hdim32_bf16_noncausal_gfx9x_hip.o -fPIC -D__HIP_PLATFORM_HCC__=1 -DUSE_ROCM=1 -DCUDA_HAS_FP16=1 -D__HIP_NO_HALF_OPERATORS__=1 -D__HIP_NO_HALF_CONVERSIONS__=1 -O3 -std=c++20 -DNDEBUG -U__CUDA_NO_HALF_OPERATORS__ -U__CUDA_NO_HALF_CONVERSIONS__ --offload-arch=gfx90a -DTORCH_API_INCLUDE_EXTENSION_H '-DPYBIND11_COMPILER_TYPE="_gcc"' '-DPYBIND11_STDLIB="_libstdcpp"' '-DPYBIND11_BUILD_ABI="_cxxabi1011"' -DTORCH_EXTENSION_NAME=flash_attn_2_cuda -D_GLIBCXX_USE_CXX11_ABI=0 -fno-gpu-rdc
      [45/50] /opt/rocm-5.6.1/bin/hipcc  -I/tmp/tmp.2H7X7ZCvOC/flash-attention/csrc/flash_attn_rocm -I/tmp/tmp.2H7X7ZCvOC/flash-attention/csrc/flash_attn_rocm/src -I/tmp/tmp.2H7X7ZCvOC/flash-attention/csrc/flash_attn_rocm/composable_kernel/include -I/tmp/tmp.2H7X7ZCvOC/flash-attention/csrc/flash_attn_rocm/composable_kernel/library/include -I/opt/conda/envs/ptca/lib/python3.8/site-packages/torch/include -I/opt/conda/envs/ptca/lib/python3.8/site-packages/torch/include/torch/csrc/api/include -I/opt/conda/envs/ptca/lib/python3.8/site-packages/torch/include/TH -I/opt/conda/envs/ptca/lib/python3.8/site-packages/torch/include/THC -I/opt/conda/envs/ptca/lib/python3.8/site-packages/torch/include/THH -I/opt/rocm-5.6.1/include -I/opt/conda/envs/ptca/include/python3.8 -c -c /tmp/tmp.2H7X7ZCvOC/flash-attention/csrc/flash_attn_rocm/src/flash_bwd_runner_batched_hdim128_bf16_noncausal_gfx9x_hip.hip -o /tmp/tmp.2H7X7ZCvOC/flash-attention/build/temp.linux-x86_64-cpython-38/csrc/flash_attn_rocm/src/flash_bwd_runner_batched_hdim128_bf16_noncausal_gfx9x_hip.o -fPIC -D__HIP_PLATFORM_HCC__=1 -DUSE_ROCM=1 -DCUDA_HAS_FP16=1 -D__HIP_NO_HALF_OPERATORS__=1 -D__HIP_NO_HALF_CONVERSIONS__=1 -O3 -std=c++20 -DNDEBUG -U__CUDA_NO_HALF_OPERATORS__ -U__CUDA_NO_HALF_CONVERSIONS__ --offload-arch=gfx90a -DTORCH_API_INCLUDE_EXTENSION_H '-DPYBIND11_COMPILER_TYPE="_gcc"' '-DPYBIND11_STDLIB="_libstdcpp"' '-DPYBIND11_BUILD_ABI="_cxxabi1011"' -DTORCH_EXTENSION_NAME=flash_attn_2_cuda -D_GLIBCXX_USE_CXX11_ABI=0 -fno-gpu-rdc
AlpinDale commented 6 months ago

What GPU are you using? AFAIK, this repository only supports MI200 and newer GPUs.

donglixp commented 6 months ago

@AlpinDale Singularity.ND96asr_MI200_v4 (MI200)

donglixp commented 6 months ago

I can successfully compile the kernel with image rocm/pytorch:rocm5.7_ubuntu22.04_py3.10_pytorch_2.0.1, but it fails with image base/job/pytorch/acpt-rocm5.6.1_ubuntu20.04_py3.8_pytorch_2.0.1:20230925T234619269 from registry singularitybase.azurecr.io.
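
For anyone reproducing the working setup, a minimal sketch of launching that image; the device and group flags are the usual ones for ROCm containers and may need adjusting for a managed cluster:

    docker run -it --rm \
      --device=/dev/kfd --device=/dev/dri \
      --group-add video --ipc=host \
      rocm/pytorch:rocm5.7_ubuntu22.04_py3.10_pytorch_2.0.1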

donglixp commented 6 months ago
 [14/50] /opt/rocm-5.6.1/bin/hipcc  -I/tmp/tmp.4kFKFTE2zc/flash-attention/csrc/flash_attn_rocm -I/tmp/tmp.4kFKFTE2zc/flash-attention/csrc/flash_attn_rocm/src -I/tmp/tmp.4kFKFTE2zc/flash-attention/csrc/flash_attn_rocm/composable_kernel/include -I/tmp/tmp.4kFKFTE2zc/flash-attention/csrc/flash_attn_rocm/composable_kernel/library/include -I/opt/conda/envs/ptca/lib/python3.8/site-packages/torch/include -I/opt/conda/envs/ptca/lib/python3.8/site-packages/torch/include/torch/csrc/api/include -I/opt/conda/envs/ptca/lib/python3.8/site-packages/torch/include/TH -I/opt/conda/envs/ptca/lib/python3.8/site-packages/torch/include/THC -I/opt/conda/envs/ptca/lib/python3.8/site-packages/torch/include/THH -I/opt/rocm-5.6.1/include -I/opt/conda/envs/ptca/include/python3.8 -c -c /tmp/tmp.4kFKFTE2zc/flash-attention/csrc/flash_attn_rocm/src/flash_fwd_runner_grouped_hdim128_fp16_noncausal_gfx9x_hip.hip -o /tmp/tmp.4kFKFTE2zc/flash-attention/build/temp.linux-x86_64-cpython-38/csrc/flash_attn_rocm/src/flash_fwd_runner_grouped_hdim128_fp16_noncausal_gfx9x_hip.o -fPIC -D__HIP_PLATFORM_HCC__=1 -DUSE_ROCM=1 -DCUDA_HAS_FP16=1 -D__HIP_NO_HALF_OPERATORS__=1 -D__HIP_NO_HALF_CONVERSIONS__=1 -O3 -std=c++20 -DNDEBUG -U__CUDA_NO_HALF_OPERATORS__ -U__CUDA_NO_HALF_CONVERSIONS__ --offload-arch=gfx90a -DTORCH_API_INCLUDE_EXTENSION_H '-DPYBIND11_COMPILER_TYPE="_gcc"' '-DPYBIND11_STDLIB="_libstdcpp"' '-DPYBIND11_BUILD_ABI="_cxxabi1011"' -DTORCH_EXTENSION_NAME=flash_attn_2_cuda -D_GLIBCXX_USE_CXX11_ABI=0 -fno-gpu-rdc
      [15/50] /opt/rocm-5.6.1/bin/hipcc  -I/tmp/tmp.4kFKFTE2zc/flash-attention/csrc/flash_attn_rocm -I/tmp/tmp.4kFKFTE2zc/flash-attention/csrc/flash_attn_rocm/src -I/tmp/tmp.4kFKFTE2zc/flash-attention/csrc/flash_attn_rocm/composable_kernel/include -I/tmp/tmp.4kFKFTE2zc/flash-attention/csrc/flash_attn_rocm/composable_kernel/library/include -I/opt/conda/envs/ptca/lib/python3.8/site-packages/torch/include -I/opt/conda/envs/ptca/lib/python3.8/site-packages/torch/include/torch/csrc/api/include -I/opt/conda/envs/ptca/lib/python3.8/site-packages/torch/include/TH -I/opt/conda/envs/ptca/lib/python3.8/site-packages/torch/include/THC -I/opt/conda/envs/ptca/lib/python3.8/site-packages/torch/include/THH -I/opt/rocm-5.6.1/include -I/opt/conda/envs/ptca/include/python3.8 -c -c /tmp/tmp.4kFKFTE2zc/flash-attention/csrc/flash_attn_rocm/src/flash_bwd_runner_batched_hdim32_bf16_causal_gfx9x_hip.hip -o /tmp/tmp.4kFKFTE2zc/flash-attention/build/temp.linux-x86_64-cpython-38/csrc/flash_attn_rocm/src/flash_bwd_runner_batched_hdim32_bf16_causal_gfx9x_hip.o -fPIC -D__HIP_PLATFORM_HCC__=1 -DUSE_ROCM=1 -DCUDA_HAS_FP16=1 -D__HIP_NO_HALF_OPERATORS__=1 -D__HIP_NO_HALF_CONVERSIONS__=1 -O3 -std=c++20 -DNDEBUG -U__CUDA_NO_HALF_OPERATORS__ -U__CUDA_NO_HALF_CONVERSIONS__ --offload-arch=gfx90a -DTORCH_API_INCLUDE_EXTENSION_H '-DPYBIND11_COMPILER_TYPE="_gcc"' '-DPYBIND11_STDLIB="_libstdcpp"' '-DPYBIND11_BUILD_ABI="_cxxabi1011"' -DTORCH_EXTENSION_NAME=flash_attn_2_cuda -D_GLIBCXX_USE_CXX11_ABI=0 -fno-gpu-rdc
      FAILED: /tmp/tmp.4kFKFTE2zc/flash-attention/build/temp.linux-x86_64-cpython-38/csrc/flash_attn_rocm/src/flash_bwd_runner_batched_hdim32_bf16_causal_gfx9x_hip.o
      /opt/rocm-5.6.1/bin/hipcc  -I/tmp/tmp.4kFKFTE2zc/flash-attention/csrc/flash_attn_rocm -I/tmp/tmp.4kFKFTE2zc/flash-attention/csrc/flash_attn_rocm/src -I/tmp/tmp.4kFKFTE2zc/flash-attention/csrc/flash_attn_rocm/composable_kernel/include -I/tmp/tmp.4kFKFTE2zc/flash-attention/csrc/flash_attn_rocm/composable_kernel/library/include -I/opt/conda/envs/ptca/lib/python3.8/site-packages/torch/include -I/opt/conda/envs/ptca/lib/python3.8/site-packages/torch/include/torch/csrc/api/include -I/opt/conda/envs/ptca/lib/python3.8/site-packages/torch/include/TH -I/opt/conda/envs/ptca/lib/python3.8/site-packages/torch/include/THC -I/opt/conda/envs/ptca/lib/python3.8/site-packages/torch/include/THH -I/opt/rocm-5.6.1/include -I/opt/conda/envs/ptca/include/python3.8 -c -c /tmp/tmp.4kFKFTE2zc/flash-attention/csrc/flash_attn_rocm/src/flash_bwd_runner_batched_hdim32_bf16_causal_gfx9x_hip.hip -o /tmp/tmp.4kFKFTE2zc/flash-attention/build/temp.linux-x86_64-cpython-38/csrc/flash_attn_rocm/src/flash_bwd_runner_batched_hdim32_bf16_causal_gfx9x_hip.o -fPIC -D__HIP_PLATFORM_HCC__=1 -DUSE_ROCM=1 -DCUDA_HAS_FP16=1 -D__HIP_NO_HALF_OPERATORS__=1 -D__HIP_NO_HALF_CONVERSIONS__=1 -O3 -std=c++20 -DNDEBUG -U__CUDA_NO_HALF_OPERATORS__ -U__CUDA_NO_HALF_CONVERSIONS__ --offload-arch=gfx90a -DTORCH_API_INCLUDE_EXTENSION_H '-DPYBIND11_COMPILER_TYPE="_gcc"' '-DPYBIND11_STDLIB="_libstdcpp"' '-DPYBIND11_BUILD_ABI="_cxxabi1011"' -DTORCH_EXTENSION_NAME=flash_attn_2_cuda -D_GLIBCXX_USE_CXX11_ABI=0 -fno-gpu-rdc
      fatal error: error in backend: Not supported instr: <MCInst 0 <MCOperand Reg:519> <MCOperand Reg:527> <MCOperand Expr:(.LBB2_3)> <MCOperand Reg:487> <MCOperand Expr:(.LBB2_-1)>>
      clang-16: error: clang frontend command failed with exit code 70 (use -v to see invocation)
      AMD clang version 16.0.0 (https://github.com/RadeonOpenCompute/llvm-project roc-5.6.1 23332 4f9bb99d78a4d8d9770be38b91ebd004ea4d2a3a)
      Target: x86_64-unknown-linux-gnu
      Thread model: posix
      InstalledDir: /opt/rocm-5.6.1/llvm/bin
      clang-16: note: diagnostic msg: Error generating preprocessed source(s).
      [16/50] /opt/rocm-5.6.1/bin/hipcc  -I/tmp/tmp.4kFKFTE2zc/flash-attention/csrc/flash_attn_rocm -I/tmp/tmp.4kFKFTE2zc/flash-attention/csrc/flash_attn_rocm/src -I/tmp/tmp.4kFKFTE2zc/flash-attention/csrc/flash_attn_rocm/composable_kernel/include -I/tmp/tmp.4kFKFTE2zc/flash-attention/csrc/flash_attn_rocm/composable_kernel/library/include -I/opt/conda/envs/ptca/lib/python3.8/site-packages/torch/include -I/opt/conda/envs/ptca/lib/python3.8/site-packages/torch/include/torch/csrc/api/include -I/opt/conda/envs/ptca/lib/python3.8/site-packages/torch/include/TH -I/opt/conda/envs/ptca/lib/python3.8/site-packages/torch/include/THC -I/opt/conda/envs/ptca/lib/python3.8/site-packages/torch/include/THH -I/opt/rocm-5.6.1/include -I/opt/conda/envs/ptca/include/python3.8 -c -c /tmp/tmp.4kFKFTE2zc/flash-attention/csrc/flash_attn_rocm/src/flash_fwd_runner_grouped_hdim128_bf16_causal_gfx9x_hip.hip -o /tmp/tmp.4kFKFTE2zc/flash-attention/build/temp.linux-x86_64-cpython-38/csrc/flash_attn_rocm/src/flash_fwd_runner_grouped_hdim128_bf16_causal_gfx9x_hip.o -fPIC -D__HIP_PLATFORM_HCC__=1 -DUSE_ROCM=1 -DCUDA_HAS_FP16=1 -D__HIP_NO_HALF_OPERATORS__=1 -D__HIP_NO_HALF_CONVERSIONS__=1 -O3 -std=c++20 -DNDEBUG -U__CUDA_NO_HALF_OPERATORS__ -U__CUDA_NO_HALF_CONVERSIONS__ --offload-arch=gfx90a -DTORCH_API_INCLUDE_EXTENSION_H '-DPYBIND11_COMPILER_TYPE="_gcc"' '-DPYBIND11_STDLIB="_libstdcpp"' '-DPYBIND11_BUILD_ABI="_cxxabi1011"' -DTORCH_EXTENSION_NAME=flash_attn_2_cuda -D_GLIBCXX_USE_CXX11_ABI=0 -fno-gpu-rdc
      [17/50] /opt/rocm-5.6.1/bin/hipcc  -I/tmp/tmp.4kFKFTE2zc/flash-attention/csrc/flash_attn_rocm -I/tmp/tmp.4kFKFTE2zc/flash-attention/csrc/flash_attn_rocm/src -I/tmp/tmp.4kFKFTE2zc/flash-attention/csrc/flash_attn_rocm/composable_kernel/include -I/tmp/tmp.4kFKFTE2zc/flash-attention/csrc/flash_attn_rocm/composable_kernel/library/include -I/opt/conda/envs/ptca/lib/python3.8/site-packages/torch/include -I/opt/conda/envs/ptca/lib/python3.8/site-packages/torch/include/torch/csrc/api/include -I/opt/conda/envs/ptca/lib/python3.8/site-packages/torch/include/TH -I/opt/conda/envs/ptca/lib/python3.8/site-packages/torch/include/THC -I/opt/conda/envs/ptca/lib/python3.8/site-packages/torch/include/THH -I/opt/rocm-5.6.1/include -I/opt/conda/envs/ptca/include/python3.8 -c -c /tmp/tmp.4kFKFTE2zc/flash-attention/csrc/flash_attn_rocm/src/flash_fwd_runner_batched_hdim128_bf16_causal_gfx9x_hip.hip -o /tmp/tmp.4kFKFTE2zc/flash-attention/build/temp.linux-x86_64-cpython-38/csrc/flash_attn_rocm/src/flash_fwd_runner_batched_hdim128_bf16_causal_gfx9x_hip.o -fPIC -D__HIP_PLATFORM_HCC__=1 -DUSE_ROCM=1 -DCUDA_HAS_FP16=1 -D__HIP_NO_HALF_OPERATORS__=1 -D__HIP_NO_HALF_CONVERSIONS__=1 -O3 -std=c++20 -DNDEBUG -U__CUDA_NO_HALF_OPERATORS__ -U__CUDA_NO_HALF_CONVERSIONS__ --offload-arch=gfx90a -DTORCH_API_INCLUDE_EXTENSION_H '-DPYBIND11_COMPILER_TYPE="_gcc"' '-DPYBIND11_STDLIB="_libstdcpp"' '-DPYBIND11_BUILD_ABI="_cxxabi1011"' -DTORCH_EXTENSION_NAME=flash_attn_2_cuda -D_GLIBCXX_USE_CXX11_ABI=0 -fno-gpu-rdc
      [18/50] /opt/rocm-5.6.1/bin/hipcc  -I/tmp/tmp.4kFKFTE2zc/flash-attention/csrc/flash_attn_rocm -I/tmp/tmp.4kFKFTE2zc/flash-attention/csrc/flash_attn_rocm/src -I/tmp/tmp.4kFKFTE2zc/flash-attention/csrc/flash_attn_rocm/composable_kernel/include -I/tmp/tmp.4kFKFTE2zc/flash-attention/csrc/flash_attn_rocm/composable_kernel/library/include -I/opt/conda/envs/ptca/lib/python3.8/site-packages/torch/include -I/opt/conda/envs/ptca/lib/python3.8/site-packages/torch/include/torch/csrc/api/include -I/opt/conda/envs/ptca/lib/python3.8/site-packages/torch/include/TH -I/opt/conda/envs/ptca/lib/python3.8/site-packages/torch/include/THC -I/opt/conda/envs/ptca/lib/python3.8/site-packages/torch/include/THH -I/opt/rocm-5.6.1/include -I/opt/conda/envs/ptca/include/python3.8 -c -c /tmp/tmp.4kFKFTE2zc/flash-attention/csrc/flash_attn_rocm/src/flash_fwd_runner_grouped_hdim128_fp16_causal_gfx9x_hip.hip -o /tmp/tmp.4kFKFTE2zc/flash-attention/build/temp.linux-x86_64-cpython-38/csrc/flash_attn_rocm/src/flash_fwd_runner_grouped_hdim128_fp16_causal_gfx9x_hip.o -fPIC -D__HIP_PLATFORM_HCC__=1 -DUSE_ROCM=1 -DCUDA_HAS_FP16=1 -D__HIP_NO_HALF_OPERATORS__=1 -D__HIP_NO_HALF_CONVERSIONS__=1 -O3 -std=c++20 -DNDEBUG -U__CUDA_NO_HALF_OPERATORS__ -U__CUDA_NO_HALF_CONVERSIONS__ --offload-arch=gfx90a -DTORCH_API_INCLUDE_EXTENSION_H '-DPYBIND11_COMPILER_TYPE="_gcc"' '-DPYBIND11_STDLIB="_libstdcpp"' '-DPYBIND11_BUILD_ABI="_cxxabi1011"' -DTORCH_EXTENSION_NAME=flash_attn_2_cuda -D_GLIBCXX_USE_CXX11_ABI=0 -fno-gpu-rdc
LiweiPeng commented 6 months ago

I confirmed @donglixp's error above. The clang compiler used in ROCm 5.6.1 is shown below. The message "fatal error: error in backend: Not supported instr: <MCInst 0 ..." means that this compiler does not support the flash-attention source code.

@AlpinDale @jeffdaily What's the workaround or fix for flash attention on ROCm 5.6.1?

    /opt/rocm-5.6.1/bin/hipcc --version
    HIP version: 5.6.31062-73ed8adfd
    AMD clang version 16.0.0
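
To double-check that the build's offload target (--offload-arch=gfx90a) matches the hardware, the agent ISA can be queried too; a small sketch, assuming rocminfo is on the PATH:

    # MI200-class GPUs report gfx90a here
    rocminfo | grep -i gfx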

sabreshao commented 6 months ago

@howiejayz reproduced this compiler issue on ROCm 5.6.1. We will try to find a workaround (WA) in FA.

howiejayz commented 6 months ago

@donglixp @LiweiPeng The issue is caused by the backward causal mask not being supported by the hipcc in rocm-5.6.1. Is this feature necessary for you? If not, I can provide a version with the feature disabled.
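
For context, only the causal backward path would be affected, so a build with that feature disabled should still train with causal=False. A hypothetical smoke test, assuming the flash_attn Python package imported successfully, a ROCm GPU is visible, and the flash_attn_func signature matches upstream:

    python - <<'EOF'
    # Forward + backward on the non-causal path only; causal=True would
    # exercise the backward mask that rocm-5.6.1's hipcc cannot compile.
    import torch
    from flash_attn import flash_attn_func

    q, k, v = (torch.randn(2, 128, 4, 64, device="cuda",
                           dtype=torch.float16, requires_grad=True)
               for _ in range(3))
    out = flash_attn_func(q, k, v, causal=False)
    out.sum().backward()
    print("non-causal forward/backward OK")
    EOF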

sabreshao commented 6 months ago

@donglixp @LiweiPeng Is it possible for you to use ROCm 5.6 instead? The WA proposed by @howiejayz disables some fundamental features.
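
A sketch of what switching would look like; the rocm5.6 tag below is illustrative only, so check the available rocm/pytorch tags first:

    # Hypothetical tag -- verify against https://hub.docker.com/r/rocm/pytorch/tags
    docker pull rocm/pytorch:rocm5.6_ubuntu20.04_py3.8_pytorch_2.0.1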

LiweiPeng commented 6 months ago

@sabreshao Two questions: 1) By 'ROCm 5.6', do you mean the 'ROCm 5.6' docker image? If so, we are OK with the ROCm 5.6 docker image. 2) Does the 'ROCm 5.6' docker image work with flash attention? I haven't tested it myself.

sabreshao commented 6 months ago

@LiweiPeng 1) yes; 2) yes.

LiweiPeng commented 6 months ago

Thanks. We'll test the ROCm 5.6 docker image.

JTWang2000 commented 2 weeks ago

I came across the same issue with rocm/pytorch:rocm5.7_ubuntu22.04_py3.10_pytorch_2.0.1.