ROCm / rocBLAS

Next generation BLAS implementation for ROCm platform
https://rocm.docs.amd.com/projects/rocBLAS/en/latest/
Other
339 stars 161 forks source link

[Bug]: Illegal Instruction Crash in librocblas.so.3 due to AVX Instructions on Non-AVX CPU #1422

Closed gjs278 closed 5 months ago

gjs278 commented 5 months ago

Describe the bug

Problem: When running a Python script that imports PyTorch on a system with an Intel Xeon X5690 CPU, which does not support AVX instructions, the program crashes with an "Illegal Instruction" error. Investigation using GDB revealed that the crash is caused by AVX instructions (vxorps, vmovups) in the librocblas.so.3 library.

To Reproduce

rocBLAS-5.7.1 Steps to reproduce the behavior:

  1. Set build flags to "-O2 -march=native -pipe -mno-avx -mno-avx2"
  2. Compile and install rocBLAS-5.7.1-r2 on Gentoo with a CPU that does not support AVX (e.g., Intel Xeon X5690).
  3. Run a Python script that utilizes PyTorch to check CUDA availability.
  4. Observe the crash with the "Illegal Instruction" error.

Expected behavior

The library should not contain AVX instructions when compiled for a processor that does not support them, to ensure compatibility with all processors that meet the documented minimum requirements.

Actual behavior

The library contains AVX instructions, leading to crashes on processors that do not support AVX.

Log-files

build.log

Environment

Hardware description
CPU Intel Xeon X5690 (no AVX support)
GPU AMD Radeon VII
Software version
rocm-core v5.7.1
rocblas v5.7.1

environment.txt

Additional context

import torch print("Is CUDA available:", torch.cuda.is_available())

python test.py

Illegal instruction

gdb --args python test.py

GNU gdb (Gentoo 14.2 vanilla) 14.2 Copyright (C) 2023 Free Software Foundation, Inc. License GPLv3+: GNU GPL version 3 or later http://gnu.org/licenses/gpl.html This is free software: you are free to change and redistribute it. There is NO WARRANTY, to the extent permitted by law. Type "show copying" and "show warranty" for details. This GDB was configured as "x86_64-pc-linux-gnu". Type "show configuration" for configuration details. For bug reporting instructions, please see: https://bugs.gentoo.org/. Find the GDB manual and other documentation resources online at: http://www.gnu.org/software/gdb/documentation/.

For help, type "help". Type "apropos word" to search for commands related to "word"... Reading symbols from python... (No debugging symbols found in python) (gdb) run Starting program: /usr/bin/python test.py [Thread debugging using libthread_db enabled] Using host libthread_db library "/usr/lib64/libthread_db.so.1". process 7763 is executing new program: /usr/bin/python3.11 [Thread debugging using libthread_db enabled] Using host libthread_db library "/usr/lib64/libthread_db.so.1".

Program received signal SIGILL, Illegal instruction. 0x00007fffc4ec5c18 in ?? () from /usr/lib64/librocblas.so.3 (gdb) bt

0 0x00007fffc4ec5c18 in ?? () from /usr/lib64/librocblas.so.3

1 0x00007fffc4ecf16c in ?? () from /usr/lib64/librocblas.so.3

2 0x00007ffff7fcf1ae in call_init () from /lib64/ld-linux-x86-64.so.2

3 0x00007ffff7fcf29c in _dl_init () from /lib64/ld-linux-x86-64.so.2

4 0x00007ffff7fcb56e in _dl_catch_exception () from /lib64/ld-linux-x86-64.so.2

5 0x00007ffff7fd5d56 in dl_open_worker () from /lib64/ld-linux-x86-64.so.2

6 0x00007ffff7fcb4e1 in _dl_catch_exception () from /lib64/ld-linux-x86-64.so.2

7 0x00007ffff7fd60d0 in _dl_open () from /lib64/ld-linux-x86-64.so.2

8 0x00007ffff78a7fcc in ?? () from /usr/lib64/libc.so.6

9 0x00007ffff7fcb4e1 in _dl_catch_exception () from /lib64/ld-linux-x86-64.so.2

10 0x00007ffff7fcb603 in _dl_catch_error () from /lib64/ld-linux-x86-64.so.2

11 0x00007ffff78a7ac7 in ?? () from /usr/lib64/libc.so.6

12 0x00007ffff78a8081 in dlopen () from /usr/lib64/libc.so.6

13 0x00007ffff7c8842c in ?? () from /usr/lib64/libpython3.11.so.1.0

14 0x00007ffff7c508f0 in ?? () from /usr/lib64/libpython3.11.so.1.0

15 0x00007ffff7c4da05 in ?? () from /usr/lib64/libpython3.11.so.1.0

16 0x00007ffff7b96a8e in ?? () from /usr/lib64/libpython3.11.so.1.0

17 0x00007ffff7b05972 in _PyEval_EvalFrameDefault () from /usr/lib64/libpython3.11.so.1.0

18 0x00007ffff7c27718 in ?? () from /usr/lib64/libpython3.11.so.1.0

19 0x00007ffff7b53a3d in ?? () from /usr/lib64/libpython3.11.so.1.0

20 0x00007ffff7b53c3c in PyObject_CallMethodObjArgs () from /usr/lib64/libpython3.11.so.1.0

21 0x00007ffff7c4f7e9 in PyImport_ImportModuleLevelObject () from /usr/lib64/libpython3.11.so.1.0

22 0x00007ffff7b07846 in _PyEval_EvalFrameDefault () from /usr/lib64/libpython3.11.so.1.0

23 0x00007ffff7c27718 in ?? () from /usr/lib64/libpython3.11.so.1.0

24 0x00007ffff7c277c3 in PyEval_EvalCode () from /usr/lib64/libpython3.11.so.1.0

25 0x00007ffff7c1f9f0 in ?? () from /usr/lib64/libpython3.11.so.1.0

26 0x00007ffff7b963f6 in ?? () from /usr/lib64/libpython3.11.so.1.0

27 0x00007ffff7b05972 in _PyEval_EvalFrameDefault () from /usr/lib64/libpython3.11.so.1.0

28 0x00007ffff7c27718 in ?? () from /usr/lib64/libpython3.11.so.1.0

29 0x00007ffff7b53a3d in ?? () from /usr/lib64/libpython3.11.so.1.0

30 0x00007ffff7b53c3c in PyObject_CallMethodObjArgs () from /usr/lib64/libpython3.11.so.1.0

31 0x00007ffff7c4f7e9 in PyImport_ImportModuleLevelObject () from /usr/lib64/libpython3.11.so.1.0

32 0x00007ffff7b07846 in _PyEval_EvalFrameDefault () from /usr/lib64/libpython3.11.so.1.0

33 0x00007ffff7c27718 in ?? () from /usr/lib64/libpython3.11.so.1.0

34 0x00007ffff7c277c3 in PyEval_EvalCode () from /usr/lib64/libpython3.11.so.1.0

35 0x00007ffff7c68973 in ?? () from /usr/lib64/libpython3.11.so.1.0

36 0x00007ffff7c68b96 in ?? () from /usr/lib64/libpython3.11.so.1.0

37 0x00007ffff7c68c70 in ?? () from /usr/lib64/libpython3.11.so.1.0

38 0x00007ffff7c6b6ff in _PyRun_SimpleFileObject () from /usr/lib64/libpython3.11.so.1.0

39 0x00007ffff7c6bc74 in _PyRun_AnyFileObject () from /usr/lib64/libpython3.11.so.1.0

40 0x00007ffff7c89450 in Py_RunMain () from /usr/lib64/libpython3.11.so.1.0

41 0x00007ffff7c8999e in Py_BytesMain () from /usr/lib64/libpython3.11.so.1.0

42 0x00007ffff78480d0 in ?? () from /usr/lib64/libc.so.6

43 0x00007ffff7848189 in __libc_start_main () from /usr/lib64/libc.so.6

44 0x0000555555555095 in _start ()

(gdb) disassemble $pc-32,$pc+32 Dump of assembler code from 0x7fffc4ec5bf8 to 0x7fffc4ec5c38: 0x00007fffc4ec5bf8: and $0x20,%al 0x00007fffc4ec5bfa: push 0x30(%rsp) 0x00007fffc4ec5bfe: call 0x7fffc4e6e1e0 hipLaunchKernel@plt 0x00007fffc4ec5c03: add $0x118,%rsp 0x00007fffc4ec5c0a: ret 0x00007fffc4ec5c0b: nopl 0x0(%rax,%rax,1) 0x00007fffc4ec5c10: push %rbx 0x00007fffc4ec5c11: mov %rdi,%rbx 0x00007fffc4ec5c14: sub $0x20,%rsp => 0x00007fffc4ec5c18: vxorps %xmm0,%xmm0,%xmm0 0x00007fffc4ec5c1c: vmovups %ymm0,(%rsp) 0x00007fffc4ec5c21: lea 0x296fe88(%rip),%rsi # 0x7fffc7835ab0 0x00007fffc4ec5c28: lea 0x94e090(%rip),%rcx # 0x7fffc5813cbf 0x00007fffc4ec5c2f: mov %rcx,%rdx 0x00007fffc4ec5c32: mov $0xffffffff,%r8d End of assembler dump.

amcamd commented 5 months ago

@gjs278 Do you know what compile flag can be used to remove AVX instructions?

Compiling with -mno-avx -mno-avx2 does not remove AVX instruction vxorps on my machine.

Below is what I tried:

cd /home/anchapman/repos/issue1422/rocBLAS/library/src/blas1

/opt/rocm/bin/hipcc -DBUILD_WITH_TENSILE -DROCBLAS_BETA_FEATURES_API -DROCBLAS_INTERNAL_API -DROCBLAS_TENSILE_LAZY_LOAD=1 -DROCBLAS_TENSILE_SEPARATE_ARCH=1 -DROCM_USE_FLOAT16 -DTENSILE_DEFAULT_SERIALIZATION -DTENSILE_MSGPACK=1 -DTENSILE_USE_HIP -DUSE_PROF_API=1 -D__HIP_PLATFORM_AMD__=1 -Drocblas_EXPORTS -I/home/anchapman/repos/issue1422/rocBLAS/library/src -I/home/anchapman/repos/issue1422/rocBLAS/library/include -I/home/anchapman/repos/issue1422/rocBLAS/library/include/internal -I/home/anchapman/repos/issue1422/rocBLAS/library/src/include -I/home/anchapman/repos/issue1422/rocBLAS/build_tensile/release/include/rocblas/internal -I/home/anchapman/repos/issue1422/rocBLAS/build_tensile/release/include/rocblas -I/home/anchapman/repos/issue1422/rocBLAS/build_tensile/release/include -I/home/anchapman/repos/issue1422/rocBLAS/library/src/blas3/Tensile -I/home/anchapman/repos/issue1422/rocBLAS/build_tensile/release/virtualenv/lib/python3.10/site-packages/Tensile/Source/lib/include -mno-avx -mno-avx2 -D__HIP_HCC_COMPAT_MODE__=1 -O3 -DNDEBUG -fPIC -fvisibility=hidden -fvisibility-inlines-hidden -Wno-unused-result -mf16c -Werror=vla -x hip --offload-arch=gfx942 -MD -MT rocblas_scal.cpp.o -MF rocblas_scal.cpp.o.d -o rocblas_scal.cpp.o -c rocblas_scal.cpp

objdump -d rocblas_scal.cpp.o > rocblas_scal.cpp.objdump

grep -l vxorps  rocblas_scal.cpp.objdump

If there is a compile flag that removes AVX instructions it can be added to rocBLAS/cmake/toolchain-options.cmake

cgmb commented 5 months ago

The F16C ISA extension was not supported by Intel processors until Ivy Bridge (3rd Gen), but the Intel Xeon X5690 is Westmere (1st Gen). The -mf16c flag is going to cause some instructions to be included that are not available Westmere. I wouldn't expect enabling F16C instructions to result in AVX instructions being emitted, but it's perhaps worth trying without that flag as it would need to be disabled anyway on those CPUs.

gjs278 commented 5 months ago

If I compile without -mf16c the program does not have illegal instruction issues anymore. Sample code compiles and the python tests respond with this.

I've attached the build.log

build.log.gz

python test.py

PyTorch built with:
  - GCC 13.2
  - C++ Version: 201703
  - OpenMP 201511 (a.k.a. OpenMP 4.5)
  - LAPACK is enabled (usually provided by MKL)
  - CPU capability usage: NO AVX
  - HIP Runtime 5.7.31921
  - MIOpen 2.20.0
  - Build settings: BLAS_INFO=generic, BUILD_TYPE=RelWithDebInfo, CXX_COMPILER=/usr/lib/ccache/bin/x86_64-pc-linux-gnu-g++, CXX_FLAGS=-O2 -march=native -pipe -mno-avx -mno-avx2 -D_GLIBCXX_USE_CXX11_ABI=1 -fvisibility-inlines-hidden -DSYMBOLICATE_MOBILE_DEBUG_HANDLE -O2 -Wall -Wextra -Werror=non-virtual-dtor -Werror=range-loop-construct -Werror=bool-operation -Wnarrowing -Wno-missing-field-initializers -Wno-type-limits -Wno-array-bounds -Wno-unknown-pragmas -Wno-unused-parameter -Wno-unused-function -Wno-unused-result -Wno-strict-overflow -Wno-strict-aliasing -Wno-stringop-overflow -Wsuggest-override -Wno-psabi -Wno-error=pedantic -Wno-error=old-style-cast -Wno-missing-braces -fdiagnostics-color=always -faligned-new -Wno-unused-but-set-variable -Wno-maybe-uninitialized -fno-math-errno -fno-trapping-math -Wno-stringop-overflow, TORCH_VERSION=2.2.0, USE_CUDA=no, USE_CUDNN=OFF, USE_EXCEPTION_PTR=1, USE_GFLAGS=ON, USE_GLOG=ON, USE_MKLDNN=no, USE_MPI=OFF, USE_NCCL=ON, USE_NNPACK=no, USE_OPENMP=yes, USE_ROCM=ON, USE_ROCM_KERNEL_ASSERT=OFF, 

Is CUDA available: True

python test-rocm.py

Checking ROCM support...
GOOD: ROCM devices found:  2
Checking PyTorch...
GOOD: PyTorch is working fine.
Checking user groups...
GOOD: The user root is in RENDER and VIDEO groups.
GOOD: PyTorch ROCM support found.
Testing PyTorch ROCM support...
Everything fine! You can run PyTorch code inside of: 
--->  Intel(R) Xeon(R) CPU           X5690  @ 3.47GHz  
--->  gfx906

However, when loading ComfyUI or Stable Diffusion, I get

ComfyUI: python main.py

Total VRAM 16368 MB, total RAM 48177 MB
Set vram state to: NORMAL_VRAM
Device: cuda:0 AMD Radeon VII : native
VAE dtype: torch.float32
python: /var/tmp/portage/dev-util/hip-5.7.1-r2/work/clr-rocm-5.7.1/hipamd/src/hip_code_object.cpp:841: hip::FatBinaryInfo** hip::StatCO::addFatBinary(const void*, bool): Assertion err == hipSuccess' failed.
Aborted

Stable DIffusion: python launch.py --precision full --no-half

Python 3.11.9 (main, Apr 17 2024, 05:51:00) [GCC 13.2.1 20240210]
Version: v1.9.3
Commit hash: 1c0a0c4c26f78c32095ebc7f8af82f5c04fca8c0
Launching Web UI with arguments: --precision full --no-half
python: /var/tmp/portage/dev-util/hip-5.7.1-r2/work/clr-rocm-5.7.1/hipamd/src/hip_code_object.cpp:841: hip::FatBinaryInfo** hip::StatCO::addFatBinary(const void*, bool): Assertion err == hipSuccess' failed.
Aborted

What side effect of not having the -mf16c compiled with rocBLAS will there be? Is there some sort of setting change that won't cause the addFatBinary errors? Is there a sample code I can use or can I somehow compile the binary files it's trying to load to be compatible with that flag not being present?

cgmb commented 5 months ago

What side effect of not having the -mf16c compiled with rocBLAS will there be?

I'd be surprised if it has a significant impact, but I'm not an expert. Some half-precision operations on the host may be slower, but I wouldn't expect there to be many of those.

Is there some sort of setting change that won't cause the addFatBinary errors?

I think your system is based on PCIe 2.0, which might not support some of the PCIe features that ROCm depends on (e.g., PCIe atomics). I would suggest setting an environment variable to print more information (export AMD_LOG_LEVEL=3). See HIP Debugging.

gjs278 commented 5 months ago

When I used Intel's SDE to emulate a modern CPU, I was able to get these applications to start, even to the point of loading in checkpoints / models, generating prompts, it was just painstakingly slow. The Radeon VII is supposed to be one of the cards that works without PCIe atomics, any other card I tested wouldn't even be able to get output from rocminfo. After trying the export variables, attached are the logs from this code.

rocblas_test

rocblas_test.log

#include <iostream>
#include <rocblas.h>
#include <vector>

int main() {
    rocblas_status status;
    rocblas_handle handle;
    status = rocblas_create_handle(&handle);

    if (status != rocblas_status_success) {
        std::cerr << "rocBLAS handle creation failed" << std::endl;
        return 1;
    }

    const int N = 10;
    float alpha = 1.0f; // Scalar for the saxpy operation
    float a[N], b[N]; // Example data for demonstration purposes

    // Initialize data
    for (int i = 0; i < N; i++) {
        a[i] = static_cast<float>(i);
        b[i] = static_cast<float>(2 * i);
    }

    float *d_a, *d_b;

    // Allocate memory on the device (using pseudo-code; replace with actual memory management)
    // hipMalloc((void**)&d_a, N * sizeof(float));
    // hipMalloc((void**)&d_b, N * sizeof(float));

    // Copy data to device
    // hipMemcpy(d_a, a, N * sizeof(float), hipMemcpyHostToDevice);
    // hipMemcpy(d_b, b, N * sizeof(float), hipMemcpyHostToDevice);

    // Call the rocblas function
    rocblas_saxpy(handle, N, &alpha, d_a, 1, d_b, 1);

    // Copy result back to host
    // hipMemcpy(b, d_b, N * sizeof(float), hipMemcpyDeviceToHost);

    std::cout << "Result vector b after saxpy operation:" << std::endl;
    for (int i = 0; i < N; i++) {
        std::cout << "b[" << i << "] = " << b[i] << std::endl;
    }

    // Free device memory
    // hipFree(d_a);
    // hipFree(d_b);

    rocblas_destroy_handle(handle);
    return 0;
}

ComfyUI

comfyui_hip_export.log

Stable Diffusion

stable-diffusion-verbose.log

Let me know if I should be exporting any other gdb outputs or if these errors

:1:hip_code_object.cpp      :594 : 1145783083127 us: [pid:26452 tid:0x7fac8c8a8c00] hipErrorNoBinaryForGpu: Unable to find code object for all current devices!
:1:hip_code_object.cpp      :596 : 1145783083131 us: [pid:26452 tid:0x7fac8c8a8c00]   Devices:
:1:hip_code_object.cpp      :598 : 1145783083133 us: [pid:26452 tid:0x7fac8c8a8c00]     amdgcn-amd-amdhsa--gfx906:sramecc+:xnack+ - [Not Found]
:1:hip_code_object.cpp      :603 : 1145783083136 us: [pid:26452 tid:0x7fac8c8a8c00]   Bundled Code Objects:
:1:hip_code_object.cpp      :619 : 1145783083139 us: [pid:26452 tid:0x7fac8c8a8c00]     host-x86_64-unknown-linux-- - [Unsupported]
:1:hip_code_object.cpp      :616 : 1145783083142 us: [pid:26452 tid:0x7fac8c8a8c00]     hipv4-amdgcn-amd-amdhsa--gfx906:xnack- - [code object targetID is amdgcn-amd-amdhsa--gfx906:xnack-]
:1:hip_code_object.cpp      :623 : 1145783083145 us: [pid:26452 tid:0x7fac8c8a8c00] hipErrorNoBinaryForGpu: Unable to find code object for all current devices! - 209
:1:hip_fatbin.cpp           :274 : 1145783083148 us: [pid:26452 tid:0x7fac8c8a8c00] hipErrorNoBinaryForGpu: Couldn't find binary for ptr: 0x8f22f000
:3:hip_platform.cpp         :672 : 1145783083153 us: [pid:26452 tid:0x7fac8c8a8c00] init: Returned hipErrorNoBinaryForGpu : continue parsing remaining modules

are enough to point in the right direction.

For reference, I've also attached the verbose output of Stable Diffusion using the packaged 5.7 rocm with Intel's SDE emulating a Meteor Lake cpu. This prepackage doesn't work due to the f16c instructions being present.

Intel SDE Stable Diffusion (ROCM 5.7 packaged from pip)

intel-sde-packaged-torch-rocm5.7-stable-diffusion.log

amcamd commented 5 months ago

I do not think removing -mf16c is not causing your fat binary errors.

I am removing the -mf16c compile flag from the rocBLAS code. In the past the following two intrinsics were used:

cgmb commented 5 months ago

The fat binary error appears to be because your GPU is configured in xnack+ mode, but rocblas is built for gfx906:xnack- rather than gfx906 or gfx906:xnack+. IIRC, the choice of defaulting to gfx906:xnack- was made at a time when xnack+ did not function correctly on gfx906 anyway, so there was no point in supporting it. I think that xnack on gfx906 eventually got fixed, but the choice of defaulting to gfx906:xnack- was never revisited.

The fat binary errors can likely be resolved by configuring your GPU to boot in xnack- mode. Or, possibly by changing the rocBLAS GPU build target to gfx906 instead of gfx906:xnack-. Although, given that gfx906:xnack+ mode was never officially supported, there may be bugs. You would likely want to build and run the rocBLAS test suite to verify it works.

If you do choose to build as gfx906, please do report back how it went. I'm curious whether Tensile will handle the architecture correctly without xnack- specified.

gjs278 commented 5 months ago

ComfyUI works is as it should now when I run it with

export HSA_XNACK=0

screenshot

The only patch I am currently compiling rocBLAS with is

remove-mf16c.txt

All other packages from Gentoo work as-is. My target / compile options through portage don't seem to have something to specify xnack+/- mode, it's simply:

AMDGPU_TARGETS="gfx906"

I think detecting if f16c support is available at compile time should resolve this.

amcamd commented 5 months ago

@gjs278 rocBLAS commit c6bc090 removes the compile flag -mf16c.

Thank you for detecting this stale compile flag.

Is your application working? if so, can I close this issue?

gjs278 commented 5 months ago

@amcamd yes, I was able to get Stable Diffusion and ComfyUI both working, the issue can be closed.

amcamd commented 5 months ago

Thank you @gjs278 , closing issue.