ROCm / TransformerEngine


[Issue]: MI300X fused_attn CK Backend Broken: HIP runtime error: invalid device function, 3rdparty/composable_kernel/include/ck_tile/host/hip_check_error.hpp:18 in function: hip_check_error #74

Open OrenLeung opened 1 week ago

OrenLeung commented 1 week ago

Problem Description

For fused attention, the CK backend is broken and fails with the following error:

Command to reproduce: NVTE_FUSED_ATTN=1 NVTE_FUSED_ATTN_CK=1 NVTE_FUSED_ATTN_AOTRITON=0 python ./reprod.py

  File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/transformer_engine/pytorch/attention.py", line 4383, in forward
    output = FusedAttnFunc.apply(
  File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/autograd/function.py", line 575, in apply
    return super().apply(*args, **kwargs)  # type: ignore[misc]
  File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/transformer_engine/pytorch/attention.py", line 3732, in forward
    out_ret, aux_ctx_tensors = fused_attn_fwd(
  File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/transformer_engine/pytorch/cpp_extensions/fused_attn.py", line 1035, in fused_attn_fwd
    output_tensors = tex.fused_attn_fwd(
RuntimeError: HIP runtime error: invalid device function. /workspace/TransformerEngine/transformer_engine/common/ck_fused_attn/../../../3rdparty/composable_kernel/include/ck_tile/host/hip_check_error.hpp: 18in function: hip_check_error

The workaround I am using is to disable the CK backend and use the AOTriton backend instead: NVTE_FUSED_ATTN=1 NVTE_FUSED_ATTN_CK=0 NVTE_FUSED_ATTN_AOTRITON=1 python ./reprod.py
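
The backend selection can also be done from inside the script; a minimal sketch, assuming (as the command line above suggests) that TransformerEngine reads these variables at runtime, so they only need to be set before transformer_engine is imported:

import os

# Force the AOTriton fused-attention backend and disable CK
# (equivalent to the command-line workaround above).
os.environ["NVTE_FUSED_ATTN"] = "1"
os.environ["NVTE_FUSED_ATTN_CK"] = "0"
os.environ["NVTE_FUSED_ATTN_AOTRITON"] = "1"

import transformer_engine.pytorch as te  # import after setting the variables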

Operating System

Ubuntu

CPU

AMD CPU

GPU

AMD Instinct MI300X

ROCm Version

ROCm 6.2.0

ROCm Component

No response

Steps to Reproduce

Versions

root@NODENAME:/workspace/llm-train-bench# pip list | grep torch
pytorch-triton-rocm     3.1.0+cf34004b8a
torch                   2.6.0.dev20241012+rocm6.2
torchvision             0.18.0a0+68ba7ec
root@NODENAME:/workspace/llm-train-bench# pip list | grep transformer
transformer_engine      1.8.0.dev0+691dc23

Install Instructions

FROM rocm/pytorch:rocm6.2_ubuntu22.04_py3.10_pytorch_release_2.3.0

RUN apt install -y nano

RUN pip install uv

RUN uv pip install --system ipython pytest fire pydantic pybind11

RUN pip3 uninstall -y torch

RUN pip3 install --pre torch --index-url https://download.pytorch.org/whl/nightly/rocm6.2

WORKDIR /workspace/

RUN git clone --recursive https://github.com/ROCm/TransformerEngine.git
ENV NVTE_USE_HIPBLASLT=1
ENV NVTE_FRAMEWORK=pytorch
ENV PYTORCH_ROCM_ARCH=gfx942

RUN cd TransformerEngine && pip install .

WORKDIR /workspace/llm-train-bench/

CMD ["/usr/bin/bash"]

Repro script (reprod.py): GPT-2 XL 1.5B training

import contextlib

import torch
import torch.nn.functional as F
import torch.nn as nn

from pydantic.dataclasses import dataclass

@dataclass
class GPTConfig:
    n_layers: int    # L
    n_heads: int     # H
    d_embd: int      # E
    max_seq_len: int = 1024
    vocab_size: int  = 50304 # V
    arch_name: str = 'gpt'

    @staticmethod
    def estimate_flops_per_token(model, config):
        # get param count
        N = sum(p.numel() for p in model.parameters())

        # print param count in B
        print(f"Param count: {N/1e9}B")

        head_dim = config['d_embd'] // config['n_heads'] 

        flops_per_token = 6 * N + 12 * config['n_layers'] * config['n_heads'] * head_dim * config['max_seq_len']

        return flops_per_token

    def __post_init__(self):
        assert self.d_embd % self.n_heads == 0, 'd_embd must be a multiple of n_heads.'

class GPT(nn.Module):
    def __init__(self, vocab_size, max_seq_len, n_layers, d_embd, **kwargs):
        super().__init__()
        self.tok_embd = nn.Embedding(vocab_size, d_embd)
        self.pos_embd = nn.Embedding(max_seq_len, d_embd)

        # self.tsfmr_blks = nn.ModuleList(GPTBlock(d_embd, **kwargs) for _ in range(n_layers))
        import transformer_engine.pytorch as te
        self.tsfmr_blks = nn.ModuleList(te.TransformerLayer(
                    d_embd,
                    d_embd * 4,
                    kwargs['n_heads'],
                    layer_number=i+1,
                    # Optional, for speedups
                    fuse_qkv_params=True,
                    attn_input_format='bshd'
                ) 
                for i in range(n_layers)                       
                )

        self.out_norm = nn.LayerNorm(d_embd)

    def forward(self, idx_BT):
        pos_T = torch.arange(idx_BT.size(1), dtype=torch.int64, device=idx_BT.device)
        x_BTE = self.tok_embd(idx_BT) + self.pos_embd(pos_T).unsqueeze(0)

        for tsfmr_blk in self.tsfmr_blks:
            x_BTE = tsfmr_blk(x_BTE)

        x_BTE = self.out_norm(x_BTE)
        logits_BTV = x_BTE @ self.tok_embd.weight.T  # Weight tying

        return logits_BTV

def train(
    gpu_id: int = 0,
    bsz: int = 8,
    grad_acc_steps: int = 8,
):
    torch.manual_seed(3985)
    torch.cuda.set_device(gpu_id)

    cfg_json = {
        "n_layers": 48,
        "n_heads": 25,
        "d_embd": 1600,
        "max_seq_len": 1024,
        "vocab_size": 50304,
    }

    cfg_m = GPTConfig(**cfg_json)
    model = GPT(**cfg_json).to(gpu_id)

    optimizer = torch.optim.AdamW(model.parameters(), fused=True)
    scheduler = torch.optim.lr_scheduler.LambdaLR(optimizer, lambda t: 1.0)

    flops_per_token = cfg_m.estimate_flops_per_token(model, cfg_json)
    flops_per_iter = flops_per_token * (bsz * cfg_m.max_seq_len)

    flops_promised = 2600e12

    model.train()

    import transformer_engine.pytorch as te
    from transformer_engine.common.recipe import Format, DelayedScaling
    fp8_format = Format.HYBRID
    # Reasonable default setting
    fp8_recipe = DelayedScaling(fp8_format=fp8_format, amax_history_len=16, amax_compute_algo="max")
    # Note: wrapped ctx in a function because the te.fp8_autocast object cannot be reused as a context for some reason.
    @contextlib.contextmanager
    def ctx():
        with te.fp8_autocast(enabled=True, fp8_recipe=fp8_recipe):
            with torch.amp.autocast(device_type='cuda', dtype=torch.bfloat16):
                yield

    with ctx():
        for step_idx in range(100):
            input_BT = torch.randint(50304, [8, 1024], dtype=torch.int64).to('cuda:0')
            label_BT = torch.randint(50304, [8, 1024], dtype=torch.int64).to('cuda:0')
            start = torch.cuda.Event(enable_timing=True)
            end = torch.cuda.Event(enable_timing=True)
            start.record()

            logits_BTV = model(input_BT)
            loss = F.cross_entropy(logits_BTV.flatten(0, 1), label_BT.flatten())
            loss /= grad_acc_steps
            loss.backward()

            if (step_idx + 1) % grad_acc_steps == 0:  # Assume n_steps % grad_acc_steps == 0
                torch.nn.utils.clip_grad_norm_(model.parameters(), 1.0)
                optimizer.step()
                scheduler.step()
                optimizer.zero_grad(set_to_none=True)

            end.record()
            torch.cuda.synchronize()

            t = start.elapsed_time(end) / 1e3
            flops_per_sec = flops_per_iter / t
            mfu = flops_per_sec / flops_promised

            print(f'{(flops_per_sec/1e12):.2f} TFLOP/s  MFU={mfu:.2%}')

if __name__ == '__main__':
    import fire
    fire.Fire(train)

(Optional for Linux users) Output of /opt/rocm/bin/rocminfo --support

No response

Additional Information

No response

daltunay commented 1 week ago

Related issue: https://github.com/ROCm/ROCm/issues/2536

You can try adding this env variable: HSA_OVERRIDE_GFX_VERSION=10.3.0

OrenLeung commented 1 week ago

HSA_OVERRIDE_GFX_VERSION=10.3.0

Unfortunately this flag turns it into a core dump :(

cc: @hliuca

$ HSA_OVERRIDE_GFX_VERSION=10.3.0 NVTE_FUSED_ATTN=1 NVTE_FUSED_ATTN_CK=1 NVTE_FUSED_ATTN_AOTRITON=0 python ./reprod.py
Memory access fault by GPU node-2 (Agent handle: 0x8eb5bd0) on address (nil). Reason: Unknown.
GPU core dump failed
HW Exception by GPU node-2 (Agent handle: 0x97fd960) reason :GPU Hang
Aborted (core dumped)

hliuca commented 1 week ago

Hi @OrenLeung, this has been reported internally. Thanks.

wenchenvincent commented 1 week ago

Thanks for reporting this issue. We have identified the root cause to be that a CMake module that we used to build the CK Flash attention would require access to GPUs to determine the architecture targets to build for. This would fail when building with a Dockerfile even if you're on a machine with GPUs. And we should not rely on access to GPUs when building anyway. We will have a fix for this soon.

A workaround for now is to build within the docker container on a MI300X machine.
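
Roughly, that workaround looks like this (the image tag matches the Dockerfile above; the device/ipc flags are the usual ROCm container options and may need adjusting for your setup):

# Start the base container with the MI300X GPUs visible, so the CK build
# can query the device architecture at compile time.
docker run -it --device=/dev/kfd --device=/dev/dri --group-add video \
    --ipc=host --shm-size=16G \
    rocm/pytorch:rocm6.2_ubuntu22.04_py3.10_pytorch_release_2.3.0

# Inside the container:
git clone --recursive https://github.com/ROCm/TransformerEngine.git
cd TransformerEngine
NVTE_USE_HIPBLASLT=1 NVTE_FRAMEWORK=pytorch pip install .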

OrenLeung commented 1 week ago

Thanks for reporting this issue. We have identified the root cause to be that a CMake module that we used to build the CK Flash attention would require access to GPUs to determine the architecture targets to build for. This would fail when building with a Dockerfile even if you're on a machine with GPUs. And we should not rely on access to GPUs when building anyway. We will have a fix for this soon.

Thanks @wenchenvincent! Do you have a timeline for when the fix that allows building with a Dockerfile will land? I would really prefer to build this library inside a Dockerfile, since it takes more than an hour to build.

A workaround for now is to build within the docker container on a MI300X machine.

Thanks! I will do this in the meantime.

cc: @hliuca

OrenLeung commented 1 week ago

hi @wenchenvincent ,

I can confirm that the workaround fixes this issue, though it is a very time-consuming workaround.

cc: @hliuca

wenchenvincent commented 1 week ago

@OrenLeung We have a PR in review (https://github.com/ROCm/TransformerEngine/pull/77). I expect it to be merged into the dev branch today or tomorrow.

hliuca commented 1 week ago

Thank you @wenchenvincent for looking into this and fixing it.

OrenLeung commented 6 days ago

Hi @wenchenvincent ,

Thank you for the fix! I can confirm that it resolves the issue: I can now successfully build TE with the following Dockerfile and no longer run into this bug.

Please let me know if there are any recommended changes to my Dockerfile to improve performance.

cc: @hliuca

Dockerfile


FROM rocm/pytorch:rocm6.2_ubuntu22.04_py3.10_pytorch_release_2.3.0

RUN apt install -y nano

RUN pip install uv

RUN uv pip install --system ipython pytest fire pydantic pybind11

RUN pip3 uninstall -y torch

RUN pip3 install --pre torch --index-url https://download.pytorch.org/whl/nightly/rocm6.2

WORKDIR /workspace/

RUN git clone --recursive https://github.com/ROCm/TransformerEngine.git
ENV NVTE_USE_HIPBLASLT=1
ENV NVTE_FRAMEWORK=pytorch
ENV NVTE_ROCM_ARCH=gfx942

RUN cd TransformerEngine && pip install .

WORKDIR /workspace/llm-train-bench/

CMD ["/usr/bin/bash"]
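
For reference, building and launching the image then looks roughly like this (the te-rocm62 tag and the bind mount are just examples):

docker build -t te-rocm62 .

# GPU device nodes are only needed at run time now; the build itself
# no longer requires access to the GPUs.
docker run -it --device=/dev/kfd --device=/dev/dri --group-add video \
    --ipc=host -v "$PWD":/workspace/llm-train-bench te-rocm62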