Open OrenLeung opened 1 week ago
Related issue: https://github.com/ROCm/ROCm/issues/2536
You can try adding this env variable: HSA_OVERRIDE_GFX_VERSION=10.3.0
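Note that HSA_OVERRIDE_GFX_VERSION=10.3.0 makes the runtime report gfx1030, while MI300X is gfx942, so it may be worth first checking what architecture the GPUs actually report:
$ /opt/rocm/bin/rocminfo | grep -i gfx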
Unfortunately this flag turns it into a core dump :(
cc: @hliuca
$ HSA_OVERRIDE_GFX_VERSION=10.3.0 NVTE_FUSED_ATTN=1 NVTE_FUSED_ATTN_CK=1 NVTE_FUSED_ATTN_AOTRITON=0 python ./reprod.py
Memory access fault by GPU node-2 (Agent handle: 0x8eb5bd0) on address (nil). Reason: Unknown.
GPU core dump failed
HW Exception by GPU node-2 (Agent handle: 0x97fd960) reason :GPU Hang
Aborted (core dumped)
Hi @OrenLeung this has been reported internally. Thanks.
Thanks for reporting this issue. We have identified the root cause to be that a CMake module that we used to build the CK Flash attention would require access to GPUs to determine the architecture targets to build for. This would fail when building with a Dockerfile even if you're on a machine with GPUs. And we should not rely on access to GPUs when building anyway. We will have a fix for this soon.
A workaround for now is to build within the docker container on a MI300X machine.
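For reference, once the fix lands, the build can be pinned to an explicit target architecture so it never needs to query a GPU. A minimal sketch, assuming the NVTE_ROCM_ARCH variable (the one used in the Dockerfile later in this thread) is honored by the build:
# Pin the CK Flash Attention build to MI300X (gfx942) so CMake
# does not need a visible GPU to detect the architecture targets.
export NVTE_ROCM_ARCH=gfx942
export NVTE_FRAMEWORK=pytorch
cd TransformerEngine && pip install .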
Thanks @wenchenvincent! Do you have a timeline for the fix that enables building with a Dockerfile? I'd really prefer building these libraries inside the Dockerfile, as the build takes more than an hour.
Thanks! In the meantime I will use the workaround of building within the docker container on the MI300X machine.
cc: @hliuca
Hi @wenchenvincent,
I can confirm that the workaround fixes this issue, though it is a very time-consuming one.
cc: @hliuca
@OrenLeung We have a PR in review (https://github.com/ROCm/TransformerEngine/pull/77). I expect it should be merged into dev branch today or tomorrow.
Thank you @wenchenvincent for looking into this and fixing it.
Hi @wenchenvincent,
Thank you for the fix! I can confirm that it resolved the issue: I can now successfully build TE using the following Dockerfile and no longer run into this bug.
Please let me know if there are any recommended changes to my Dockerfile to improve performance.
cc: @hliuca
FROM rocm/pytorch:rocm6.2_ubuntu22.04_py3.10_pytorch_release_2.3.0

# Basic tooling
RUN apt-get update && apt-get install -y nano
RUN pip install uv
RUN uv pip install --system ipython pytest fire pydantic pybind11

# Replace the preinstalled torch with the nightly ROCm 6.2 wheel
RUN pip3 uninstall -y torch
RUN pip3 install --pre torch --index-url https://download.pytorch.org/whl/nightly/rocm6.2

WORKDIR /workspace/

RUN git clone --recursive https://github.com/ROCm/TransformerEngine.git

# Build TE for MI300X (gfx942) with hipBLASLt enabled
ENV NVTE_USE_HIPBLASLT=1
ENV NVTE_FRAMEWORK=pytorch
ENV NVTE_ROCM_ARCH=gfx942
RUN cd TransformerEngine && pip install .

WORKDIR /workspace/llm-train-bench/

CMD ["/usr/bin/bash"]
Problem Description
For fused attention, the CK backend is broken and crashes at runtime.
Command to reproduce:
NVTE_FUSED_ATTN=1 NVTE_FUSED_ATTN_CK=1 NVTE_FUSED_ATTN_AOTRITON=0 python ./reprod.py
The workaround I am using is to disable the CK backend and fall back to the AOTriton backend:
NVTE_FUSED_ATTN=1 NVTE_FUSED_ATTN_CK=0 NVTE_FUSED_ATTN_AOTRITON=1 python ./reprod.py
Operating System
Ubuntu
CPU
AMD CPU
GPU
AMD Instinct MI300X
ROCm Version
ROCm 6.2.0
ROCm Component
No response
Steps to Reproduce
Versions
Install Instructions
Reprod GPT2 XL 1.5B Training
(Optional for Linux users) Output of /opt/rocm/bin/rocminfo --support
No response
Additional Information
No response