hipamd: SIGSEGV when code for particular device architecture is absent

shibe2 commented 1 year ago

ROCm 5.6.0

This bug has 2 parts.

PlatformState::init returns immediately if digestFatBinary fails, leaving not only the failed binary uninitialized, but also all binaries that happen to be further in the list. There is no indication of this condition to the application, and by default, no diagnostic message.

hip::Function::getStatFunc and other functions use null pointer from modules_, and the program crashes.
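The two parts can be modelled without HIP at all. Below is a minimal sketch in plain C++; `FatBinary`, `Module`, `initEarlyReturn`, and `lookup` are hypothetical stand-ins, not the actual clr types or functions. It only shows how an early return leaves later slots null, and how a guarded lookup could report an error instead of dereferencing them.

```cpp
#include <cassert>
#include <cstddef>
#include <memory>
#include <string>
#include <vector>

// Hypothetical stand-ins for illustration; these are not the clr types.
struct FatBinary {
    std::string name;
    bool hasCodeForDevice;  // false models a failed digest step
};

struct Module {
    std::string name;
};

enum class Status { Ok, InvalidDeviceFunction };

// Models the early-return pattern: the first failure aborts the loop,
// so the failed slot and every later slot keep their default null pointer.
bool initEarlyReturn(const std::vector<FatBinary>& binaries,
                     std::vector<std::unique_ptr<Module>>& modules) {
    modules.clear();
    modules.resize(binaries.size());
    for (std::size_t i = 0; i < binaries.size(); ++i) {
        if (!binaries[i].hasCodeForDevice) {
            return false;  // slots i, i+1, ... stay null
        }
        modules[i] = std::make_unique<Module>(Module{binaries[i].name});
    }
    return true;
}

// A guarded lookup: report an error instead of dereferencing a null slot.
Status lookup(const std::vector<std::unique_ptr<Module>>& modules,
              std::size_t index) {
    assert(index < modules.size());
    if (!modules[index]) {
        return Status::InvalidDeviceFunction;
    }
    return Status::Ok;
}
```

In this model, a list of `native1`, `gfx801`, `native2` leaves both the second and third slots null, even though `native2` has code for the device.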

cjatin commented 1 year ago

Can you share the code, the compile command, and your system config (GPU name)?

shibe2 commented 1 year ago

Reproduction: https://github.com/shibe2/hipamd-crash-4

Tested on multiple systems with different AMD GPUs, each with a single GPU.

A real-world occurrence of this bug: PyTorch crashes if it was compiled with ROCm but without code for the particular GPU that the end user has. See AUTOMATIC1111/stable-diffusion-webui#11712

Epliz commented 1 year ago

IMO, the desired behaviour would be that the GPU for which there are missing kernels is not detected as a device, but no crash happens and other GPUs can be used (same effect as masking out the GPU with ROCR_VISIBLE_DEVICES).

This seems particularly relevant to me for scenarios where a user might have an unsupported APU but a supported discrete GPU.

cjatin commented 1 year ago

Can you run the example with the AMD_LOG_LEVEL=7 environment variable set and share the logs?

Also, you might need -fPIC with -shared.

shibe2 commented 1 year ago

@Epliz It must be noted that multiple fat binaries may be loaded in a single process, each with different supported architectures.
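To illustrate why this matters for the device-hiding idea, here is a small sketch in plain C++. `FatBinary` and `usableOn` are hypothetical names, and real fat binaries carry more detailed target IDs than a bare architecture string. The point is only that whether a device is usable depends on which binary is asked, so it cannot be decided once per process.

```cpp
#include <cassert>
#include <set>
#include <string>
#include <vector>

// Hypothetical model: each fat binary lists the architectures it has code for.
struct FatBinary {
    std::string name;
    std::set<std::string> archs;
};

// Returns the names of the binaries that have code for the given device
// architecture, in their original order.
std::vector<std::string> usableOn(const std::vector<FatBinary>& binaries,
                                  const std::string& deviceArch) {
    assert(!deviceArch.empty());
    std::vector<std::string> usable;
    for (const FatBinary& b : binaries) {
        if (b.archs.count(deviceArch) != 0) {
            usable.push_back(b.name);
        }
    }
    return usable;
}
```

With one binary built for two architectures and another built for only one of them, the same device is usable by the first binary and unusable by the second.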

@cjatin I believe that, in my case, PIC is enabled automatically when needed. I used AMD_LOG_LEVEL when I was investigating the crash, and I put my findings in the original report. Whoever works on this issue can play with my reproduction code and set any options they like.

cjatin commented 1 year ago

After adding -fPIC to the Makefile

./app native1.so gfx801.so native2.so
native1.so: ok
gfx801.so: hipErrorInvalidDeviceFunction
native2.so: ok

It might be a HIP version difference. Can you tell me the HIP version you are using?

You can check it with hipcc -v or apt show hip-dev.

My makefile changes:

native%.so: lib.cpp
    hipcc -o $@ -fPIC -shared $<

gfx%.so: lib.cpp
    hipcc --offload-arch=gfx$* -o $@ -fPIC -shared $<

shibe2 commented 1 year ago

For me -fPIC makes no difference.

./app native1.so gfx801.so native2.so
native1.so: Segmentation fault (core dumped)

hipcc -v
clang version 16.0.0
Target: x86_64-pc-linux-gnu
Thread model: posix
InstalledDir: /opt/rocm/llvm/bin
Found candidate GCC installation: /usr/lib/gcc/x86_64-pc-linux-gnu/13.2.1
Found candidate GCC installation: /usr/lib64/gcc/x86_64-pc-linux-gnu/13.2.1
Selected GCC installation: /usr/lib64/gcc/x86_64-pc-linux-gnu/13.2.1
Candidate multilib: .;@ m64
Candidate multilib: 32;@ m32
Selected multilib: .;@ m64
Found HIP installation: /opt/rocm, version 5.6.31061

WeeBull commented 1 year ago

I get the same behaviour (with or without -fPIC). My setup:

GPU: gfx1102
CPU: 5900X
Kernel: 6.5.3
HIP: 5.6.31062 (hipcc -v)

> PlatformState::init returns immediately if digestFatBinary fails, leaving not only the failed binary uninitialized, but also all binaries that happen to be further in the list. There is no indication of this condition to the application, and by default, no diagnostic message.
>
> hip::Function::getStatFunc and other functions use null pointer from modules_, and the program crashes.

I too had traced it to that null pointer from modules_, but I hadn't discovered why it was null.

cjatin commented 10 months ago

I think the issue might be the iGPU present in the system. Can someone who is seeing the failure share the logs from a run with AMD_LOG_LEVEL=7?

WeeBull commented 10 months ago

> I think the issue might be the iGPU present in the system.

That certainly can be a problem. I helped somebody out on Discord who was having that issue with a Ryzen 7000 series CPU and a 7900 XTX. The software found the integrated GPU before the discrete GPU. We had to use environment variables to get it to ignore the integrated one.

It's not my situation though; I only have a discrete GPU in the system.

shibe2 commented 5 months ago

I tested it with ROCm 6.0.2. It no longer crashes, but it fails with hipErrorSharedObjectInitFailed. For example:

./app native1.so
native1.so: ok

but

./app native1.so gfx908.so
native1.so: hipErrorSharedObjectInitFailed
gfx908.so: hipErrorSharedObjectInitFailed

That is, the presence of a kernel with a missing architecture causes all other kernels to fail. It would be better if, in my example, native1 continued to work.
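A sketch of the behaviour suggested here, in plain C++ with hypothetical names (`FatBinary`, `LoadStatus`, `initAll`); it is not how clr is implemented. Each binary gets its own status, and a failure for one does not affect the others.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical stand-ins for illustration; these are not the clr types.
struct FatBinary {
    std::string name;
    bool hasCodeForDevice;  // false models a binary with no code for the GPU
};

enum class LoadStatus { Ok, NoCodeForDevice };

// Processes every binary and records a status per binary, instead of
// stopping (or failing everything) at the first one that cannot be loaded.
std::vector<LoadStatus> initAll(const std::vector<FatBinary>& binaries) {
    std::vector<LoadStatus> status;
    status.reserve(binaries.size());
    for (const FatBinary& b : binaries) {
        status.push_back(b.hasCodeForDevice ? LoadStatus::Ok
                                            : LoadStatus::NoCodeForDevice);
    }
    assert(status.size() == binaries.size());
    return status;
}
```

In this model, loading `native1` together with `gfx908` would report an error only for `gfx908`, and `native1` would stay usable.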

This report is specifically about a crash, and that seems to be fixed, so I'm closing this.

Also, it may only affect cases when modules are loaded before HIP initialization.
