Closed gjs278 closed 5 months ago
@gjs278 Do you know what compile flag can be used to remove AVX instructions?
Compiling with -mno-avx -mno-avx2 does not remove the AVX instruction vxorps on my machine.
Below is what I tried:
cd /home/anchapman/repos/issue1422/rocBLAS/library/src/blas1
/opt/rocm/bin/hipcc -DBUILD_WITH_TENSILE -DROCBLAS_BETA_FEATURES_API -DROCBLAS_INTERNAL_API -DROCBLAS_TENSILE_LAZY_LOAD=1 -DROCBLAS_TENSILE_SEPARATE_ARCH=1 -DROCM_USE_FLOAT16 -DTENSILE_DEFAULT_SERIALIZATION -DTENSILE_MSGPACK=1 -DTENSILE_USE_HIP -DUSE_PROF_API=1 -D__HIP_PLATFORM_AMD__=1 -Drocblas_EXPORTS -I/home/anchapman/repos/issue1422/rocBLAS/library/src -I/home/anchapman/repos/issue1422/rocBLAS/library/include -I/home/anchapman/repos/issue1422/rocBLAS/library/include/internal -I/home/anchapman/repos/issue1422/rocBLAS/library/src/include -I/home/anchapman/repos/issue1422/rocBLAS/build_tensile/release/include/rocblas/internal -I/home/anchapman/repos/issue1422/rocBLAS/build_tensile/release/include/rocblas -I/home/anchapman/repos/issue1422/rocBLAS/build_tensile/release/include -I/home/anchapman/repos/issue1422/rocBLAS/library/src/blas3/Tensile -I/home/anchapman/repos/issue1422/rocBLAS/build_tensile/release/virtualenv/lib/python3.10/site-packages/Tensile/Source/lib/include -mno-avx -mno-avx2 -D__HIP_HCC_COMPAT_MODE__=1 -O3 -DNDEBUG -fPIC -fvisibility=hidden -fvisibility-inlines-hidden -Wno-unused-result -mf16c -Werror=vla -x hip --offload-arch=gfx942 -MD -MT rocblas_scal.cpp.o -MF rocblas_scal.cpp.o.d -o rocblas_scal.cpp.o -c rocblas_scal.cpp
objdump -d rocblas_scal.cpp.o > rocblas_scal.cpp.objdump
grep -l vxorps rocblas_scal.cpp.objdump
If there is a compile flag that removes AVX instructions, it can be added to rocBLAS/cmake/toolchain-options.cmake.
The F16C ISA extension was not supported by Intel processors until Ivy Bridge (3rd Gen), but the Intel Xeon X5690 is Westmere (1st Gen). The -mf16c flag will cause some instructions to be included that are not available on Westmere. I wouldn't expect enabling F16C instructions to result in AVX instructions being emitted, but it's perhaps worth trying without that flag, as it would need to be disabled anyway on those CPUs.
If I compile without -mf16c, the program no longer has illegal instruction issues. The sample code compiles, and the Python tests respond with this:
I've attached the build.log
PyTorch built with:
- GCC 13.2
- C++ Version: 201703
- OpenMP 201511 (a.k.a. OpenMP 4.5)
- LAPACK is enabled (usually provided by MKL)
- CPU capability usage: NO AVX
- HIP Runtime 5.7.31921
- MIOpen 2.20.0
- Build settings: BLAS_INFO=generic, BUILD_TYPE=RelWithDebInfo, CXX_COMPILER=/usr/lib/ccache/bin/x86_64-pc-linux-gnu-g++, CXX_FLAGS=-O2 -march=native -pipe -mno-avx -mno-avx2 -D_GLIBCXX_USE_CXX11_ABI=1 -fvisibility-inlines-hidden -DSYMBOLICATE_MOBILE_DEBUG_HANDLE -O2 -Wall -Wextra -Werror=non-virtual-dtor -Werror=range-loop-construct -Werror=bool-operation -Wnarrowing -Wno-missing-field-initializers -Wno-type-limits -Wno-array-bounds -Wno-unknown-pragmas -Wno-unused-parameter -Wno-unused-function -Wno-unused-result -Wno-strict-overflow -Wno-strict-aliasing -Wno-stringop-overflow -Wsuggest-override -Wno-psabi -Wno-error=pedantic -Wno-error=old-style-cast -Wno-missing-braces -fdiagnostics-color=always -faligned-new -Wno-unused-but-set-variable -Wno-maybe-uninitialized -fno-math-errno -fno-trapping-math -Wno-stringop-overflow, TORCH_VERSION=2.2.0, USE_CUDA=no, USE_CUDNN=OFF, USE_EXCEPTION_PTR=1, USE_GFLAGS=ON, USE_GLOG=ON, USE_MKLDNN=no, USE_MPI=OFF, USE_NCCL=ON, USE_NNPACK=no, USE_OPENMP=yes, USE_ROCM=ON, USE_ROCM_KERNEL_ASSERT=OFF,
Is CUDA available: True
Checking ROCM support...
GOOD: ROCM devices found: 2
Checking PyTorch...
GOOD: PyTorch is working fine.
Checking user groups...
GOOD: The user root is in RENDER and VIDEO groups.
GOOD: PyTorch ROCM support found.
Testing PyTorch ROCM support...
Everything fine! You can run PyTorch code inside of:
---> Intel(R) Xeon(R) CPU X5690 @ 3.47GHz
---> gfx906
However, when loading ComfyUI or Stable Diffusion, I get
Total VRAM 16368 MB, total RAM 48177 MB
Set vram state to: NORMAL_VRAM
Device: cuda:0 AMD Radeon VII : native
VAE dtype: torch.float32
python: /var/tmp/portage/dev-util/hip-5.7.1-r2/work/clr-rocm-5.7.1/hipamd/src/hip_code_object.cpp:841: hip::FatBinaryInfo** hip::StatCO::addFatBinary(const void*, bool): Assertion `err == hipSuccess' failed.
Aborted
Python 3.11.9 (main, Apr 17 2024, 05:51:00) [GCC 13.2.1 20240210]
Version: v1.9.3
Commit hash: 1c0a0c4c26f78c32095ebc7f8af82f5c04fca8c0
Launching Web UI with arguments: --precision full --no-half
python: /var/tmp/portage/dev-util/hip-5.7.1-r2/work/clr-rocm-5.7.1/hipamd/src/hip_code_object.cpp:841: hip::FatBinaryInfo** hip::StatCO::addFatBinary(const void*, bool): Assertion `err == hipSuccess' failed.
Aborted
What side effect will there be from rocBLAS not being compiled with -mf16c? Is there some sort of setting change that won't cause the addFatBinary errors? Is there sample code I can use, or can I somehow compile the binary files it's trying to load to be compatible with that flag not being present?
What side effect will there be from rocBLAS not being compiled with -mf16c?
I'd be surprised if it has a significant impact, but I'm not an expert. Some half-precision operations on the host may be slower, but I wouldn't expect there to be many of those.
Is there some sort of setting change that won't cause the addFatBinary errors?
I think your system is based on PCIe 2.0, which might not support some of the PCIe features that ROCm depends on (e.g., PCIe atomics). I would suggest setting an environment variable to print more information (export AMD_LOG_LEVEL=3). See HIP Debugging.
When I used Intel's SDE to emulate a modern CPU, I was able to get these applications to start, even to the point of loading checkpoints / models and generating from prompts; it was just painstakingly slow. The Radeon VII is supposed to be one of the cards that works without PCIe atomics; any other card I tested couldn't even get output from rocminfo. After trying the export variables, attached are the logs from this code.
#include <hip/hip_runtime.h>
#include <iostream>
#include <rocblas.h>

int main() {
    rocblas_handle handle;
    rocblas_status status = rocblas_create_handle(&handle);
    if (status != rocblas_status_success) {
        std::cerr << "rocBLAS handle creation failed" << std::endl;
        return 1;
    }

    const int N = 10;
    float alpha = 1.0f; // Scalar for the saxpy operation
    float a[N], b[N];   // Example data for demonstration purposes

    // Initialize data
    for (int i = 0; i < N; i++) {
        a[i] = static_cast<float>(i);
        b[i] = static_cast<float>(2 * i);
    }

    float *d_a, *d_b;

    // Allocate memory on the device
    hipMalloc((void**)&d_a, N * sizeof(float));
    hipMalloc((void**)&d_b, N * sizeof(float));

    // Copy data to device
    hipMemcpy(d_a, a, N * sizeof(float), hipMemcpyHostToDevice);
    hipMemcpy(d_b, b, N * sizeof(float), hipMemcpyHostToDevice);

    // Call the rocblas function: b = alpha * a + b
    rocblas_saxpy(handle, N, &alpha, d_a, 1, d_b, 1);

    // Copy result back to host
    hipMemcpy(b, d_b, N * sizeof(float), hipMemcpyDeviceToHost);

    std::cout << "Result vector b after saxpy operation:" << std::endl;
    for (int i = 0; i < N; i++) {
        std::cout << "b[" << i << "] = " << b[i] << std::endl;
    }

    // Free device memory
    hipFree(d_a);
    hipFree(d_b);

    rocblas_destroy_handle(handle);
    return 0;
}
Let me know if I should be exporting any other gdb outputs or if these errors
:1:hip_code_object.cpp :594 : 1145783083127 us: [pid:26452 tid:0x7fac8c8a8c00] hipErrorNoBinaryForGpu: Unable to find code object for all current devices!
:1:hip_code_object.cpp :596 : 1145783083131 us: [pid:26452 tid:0x7fac8c8a8c00] Devices:
:1:hip_code_object.cpp :598 : 1145783083133 us: [pid:26452 tid:0x7fac8c8a8c00] amdgcn-amd-amdhsa--gfx906:sramecc+:xnack+ - [Not Found]
:1:hip_code_object.cpp :603 : 1145783083136 us: [pid:26452 tid:0x7fac8c8a8c00] Bundled Code Objects:
:1:hip_code_object.cpp :619 : 1145783083139 us: [pid:26452 tid:0x7fac8c8a8c00] host-x86_64-unknown-linux-- - [Unsupported]
:1:hip_code_object.cpp :616 : 1145783083142 us: [pid:26452 tid:0x7fac8c8a8c00] hipv4-amdgcn-amd-amdhsa--gfx906:xnack- - [code object targetID is amdgcn-amd-amdhsa--gfx906:xnack-]
:1:hip_code_object.cpp :623 : 1145783083145 us: [pid:26452 tid:0x7fac8c8a8c00] hipErrorNoBinaryForGpu: Unable to find code object for all current devices! - 209
:1:hip_fatbin.cpp :274 : 1145783083148 us: [pid:26452 tid:0x7fac8c8a8c00] hipErrorNoBinaryForGpu: Couldn't find binary for ptr: 0x8f22f000
:3:hip_platform.cpp :672 : 1145783083153 us: [pid:26452 tid:0x7fac8c8a8c00] init: Returned hipErrorNoBinaryForGpu : continue parsing remaining modules
are enough to point in the right direction.
For reference, I've also attached the verbose output of Stable Diffusion using the packaged 5.7 ROCm with Intel's SDE emulating a Meteor Lake CPU. This prepackaged build doesn't work due to the F16C instructions being present.
I do not think removing -mf16c is causing your fat binary errors.
I am removing the -mf16c compile flag from the rocBLAS code. In the past the following two intrinsics were used:
The fat binary error appears to be because your GPU is configured in xnack+ mode, but rocBLAS is built for gfx906:xnack- rather than gfx906 or gfx906:xnack+. IIRC, the choice of defaulting to gfx906:xnack- was made at a time when xnack+ did not function correctly on gfx906 anyway, so there was no point in supporting it. I think that xnack on gfx906 eventually got fixed, but the choice of defaulting to gfx906:xnack- was never revisited.
The fat binary errors can likely be resolved by configuring your GPU to boot in xnack- mode, or possibly by changing the rocBLAS GPU build target to gfx906 instead of gfx906:xnack-. Although, given that gfx906:xnack+ mode was never officially supported, there may be bugs. You would likely want to build and run the rocBLAS test suite to verify it works.
If you do choose to build as gfx906, please do report back how it went. I'm curious whether Tensile will handle the architecture correctly without xnack- specified.
ComfyUI works as it should now when I run it with export HSA_XNACK=0.
The only patch I am currently compiling rocBLAS with is
All other packages from Gentoo work as-is. My target / compile options through portage don't seem to have a way to specify xnack+/- mode; it's simply:
AMDGPU_TARGETS="gfx906"
I think detecting if f16c support is available at compile time should resolve this.
@gjs278 rocBLAS commit c6bc090 removes the compile flag -mf16c.
Thank you for detecting this stale compile flag.
Is your application working? If so, can I close this issue?
@amcamd yes, I was able to get Stable Diffusion and ComfyUI both working, the issue can be closed.
Thank you @gjs278 , closing issue.
Describe the bug
Problem: When running a Python script that imports PyTorch on a system with an Intel Xeon X5690 CPU, which does not support AVX instructions, the program crashes with an "Illegal Instruction" error. Investigation using GDB revealed that the crash is caused by AVX instructions (vxorps, vmovups) in the librocblas.so.3 library.
To Reproduce
rocBLAS-5.7.1
Steps to reproduce the behavior:
Expected behavior
The library should not contain AVX instructions when compiled for a processor that does not support them, to ensure compatibility with all processors that meet the documented minimum requirements.
Actual behavior
The library contains AVX instructions, leading to crashes on processors that do not support AVX.
Log-files
build.log
Environment
environment.txt
Additional context
import torch
print("Is CUDA available:", torch.cuda.is_available())
python test.py
Illegal instruction
gdb --args python test.py
GNU gdb (Gentoo 14.2 vanilla) 14.2
Copyright (C) 2023 Free Software Foundation, Inc.
License GPLv3+: GNU GPL version 3 or later http://gnu.org/licenses/gpl.html
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law.
Type "show copying" and "show warranty" for details.
This GDB was configured as "x86_64-pc-linux-gnu".
Type "show configuration" for configuration details.
For bug reporting instructions, please see: https://bugs.gentoo.org/.
Find the GDB manual and other documentation resources online at: http://www.gnu.org/software/gdb/documentation/.
For help, type "help".
Type "apropos word" to search for commands related to "word"...
Reading symbols from python...
(No debugging symbols found in python)
(gdb) run
Starting program: /usr/bin/python test.py
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/usr/lib64/libthread_db.so.1".
process 7763 is executing new program: /usr/bin/python3.11
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/usr/lib64/libthread_db.so.1".
Program received signal SIGILL, Illegal instruction.
0x00007fffc4ec5c18 in ?? () from /usr/lib64/librocblas.so.3
(gdb) bt
#0  0x00007fffc4ec5c18 in ?? () from /usr/lib64/librocblas.so.3
#1  0x00007fffc4ecf16c in ?? () from /usr/lib64/librocblas.so.3
#2  0x00007ffff7fcf1ae in call_init () from /lib64/ld-linux-x86-64.so.2
#3  0x00007ffff7fcf29c in _dl_init () from /lib64/ld-linux-x86-64.so.2
#4  0x00007ffff7fcb56e in _dl_catch_exception () from /lib64/ld-linux-x86-64.so.2
#5  0x00007ffff7fd5d56 in dl_open_worker () from /lib64/ld-linux-x86-64.so.2
#6  0x00007ffff7fcb4e1 in _dl_catch_exception () from /lib64/ld-linux-x86-64.so.2
#7  0x00007ffff7fd60d0 in _dl_open () from /lib64/ld-linux-x86-64.so.2
#8  0x00007ffff78a7fcc in ?? () from /usr/lib64/libc.so.6
#9  0x00007ffff7fcb4e1 in _dl_catch_exception () from /lib64/ld-linux-x86-64.so.2
#10 0x00007ffff7fcb603 in _dl_catch_error () from /lib64/ld-linux-x86-64.so.2
#11 0x00007ffff78a7ac7 in ?? () from /usr/lib64/libc.so.6
#12 0x00007ffff78a8081 in dlopen () from /usr/lib64/libc.so.6
#13 0x00007ffff7c8842c in ?? () from /usr/lib64/libpython3.11.so.1.0
#14 0x00007ffff7c508f0 in ?? () from /usr/lib64/libpython3.11.so.1.0
#15 0x00007ffff7c4da05 in ?? () from /usr/lib64/libpython3.11.so.1.0
#16 0x00007ffff7b96a8e in ?? () from /usr/lib64/libpython3.11.so.1.0
#17 0x00007ffff7b05972 in _PyEval_EvalFrameDefault () from /usr/lib64/libpython3.11.so.1.0
#18 0x00007ffff7c27718 in ?? () from /usr/lib64/libpython3.11.so.1.0
#19 0x00007ffff7b53a3d in ?? () from /usr/lib64/libpython3.11.so.1.0
#20 0x00007ffff7b53c3c in PyObject_CallMethodObjArgs () from /usr/lib64/libpython3.11.so.1.0
#21 0x00007ffff7c4f7e9 in PyImport_ImportModuleLevelObject () from /usr/lib64/libpython3.11.so.1.0
#22 0x00007ffff7b07846 in _PyEval_EvalFrameDefault () from /usr/lib64/libpython3.11.so.1.0
#23 0x00007ffff7c27718 in ?? () from /usr/lib64/libpython3.11.so.1.0
#24 0x00007ffff7c277c3 in PyEval_EvalCode () from /usr/lib64/libpython3.11.so.1.0
#25 0x00007ffff7c1f9f0 in ?? () from /usr/lib64/libpython3.11.so.1.0
#26 0x00007ffff7b963f6 in ?? () from /usr/lib64/libpython3.11.so.1.0
#27 0x00007ffff7b05972 in _PyEval_EvalFrameDefault () from /usr/lib64/libpython3.11.so.1.0
#28 0x00007ffff7c27718 in ?? () from /usr/lib64/libpython3.11.so.1.0
#29 0x00007ffff7b53a3d in ?? () from /usr/lib64/libpython3.11.so.1.0
#30 0x00007ffff7b53c3c in PyObject_CallMethodObjArgs () from /usr/lib64/libpython3.11.so.1.0
#31 0x00007ffff7c4f7e9 in PyImport_ImportModuleLevelObject () from /usr/lib64/libpython3.11.so.1.0
#32 0x00007ffff7b07846 in _PyEval_EvalFrameDefault () from /usr/lib64/libpython3.11.so.1.0
#33 0x00007ffff7c27718 in ?? () from /usr/lib64/libpython3.11.so.1.0
#34 0x00007ffff7c277c3 in PyEval_EvalCode () from /usr/lib64/libpython3.11.so.1.0
#35 0x00007ffff7c68973 in ?? () from /usr/lib64/libpython3.11.so.1.0
#36 0x00007ffff7c68b96 in ?? () from /usr/lib64/libpython3.11.so.1.0
#37 0x00007ffff7c68c70 in ?? () from /usr/lib64/libpython3.11.so.1.0
#38 0x00007ffff7c6b6ff in _PyRun_SimpleFileObject () from /usr/lib64/libpython3.11.so.1.0
#39 0x00007ffff7c6bc74 in _PyRun_AnyFileObject () from /usr/lib64/libpython3.11.so.1.0
#40 0x00007ffff7c89450 in Py_RunMain () from /usr/lib64/libpython3.11.so.1.0
#41 0x00007ffff7c8999e in Py_BytesMain () from /usr/lib64/libpython3.11.so.1.0
#42 0x00007ffff78480d0 in ?? () from /usr/lib64/libc.so.6
#43 0x00007ffff7848189 in __libc_start_main () from /usr/lib64/libc.so.6
#44 0x0000555555555095 in _start ()
(gdb) disassemble $pc-32,$pc+32
Dump of assembler code from 0x7fffc4ec5bf8 to 0x7fffc4ec5c38:
   0x00007fffc4ec5bf8:  and    $0x20,%al
   0x00007fffc4ec5bfa:  push   0x30(%rsp)
   0x00007fffc4ec5bfe:  call   0x7fffc4e6e1e0 <hipLaunchKernel@plt>
   0x00007fffc4ec5c03:  add    $0x118,%rsp
   0x00007fffc4ec5c0a:  ret
   0x00007fffc4ec5c0b:  nopl   0x0(%rax,%rax,1)
   0x00007fffc4ec5c10:  push   %rbx
   0x00007fffc4ec5c11:  mov    %rdi,%rbx
   0x00007fffc4ec5c14:  sub    $0x20,%rsp
=> 0x00007fffc4ec5c18:  vxorps %xmm0,%xmm0,%xmm0
   0x00007fffc4ec5c1c:  vmovups %ymm0,(%rsp)
   0x00007fffc4ec5c21:  lea    0x296fe88(%rip),%rsi        # 0x7fffc7835ab0
   0x00007fffc4ec5c28:  lea    0x94e090(%rip),%rcx         # 0x7fffc5813cbf
   0x00007fffc4ec5c2f:  mov    %rcx,%rdx
   0x00007fffc4ec5c32:  mov    $0xffffffff,%r8d
End of assembler dump.