ROCm / clr


[Issue]: CLR asserts when a code object is loaded for which no implementation is available for the GPUs in the system. #102

Open IMbackK opened 1 week ago

IMbackK commented 1 week ago

Problem Description

Loading a shared object containing GPU code that was not compiled for one of the GPUs in the system causes CLR to assert here https://github.com/ROCm/clr/blob/65d174c3e35423bf25c9c369513780c9d9ca760c/hipamd/src/hip_code_object.cpp#L1152 when assertions are enabled.

In release builds the assertions in CLR are disabled. In that case:

When none of the GPUs in the system has an implementation available in the loaded code object, CLR fails silently, and the application crashes if the code object is used.

When the system contains both GPUs for which an implementation is available and GPUs for which none is, CLR will, depending on the order in which amdgpu.ko initialized the GPUs, return without having loaded any GPU code for the code objects in question, even for the GPUs that do have an implementation available.
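To make the second case concrete, here is a minimal C++ sketch that contrasts an all-or-nothing load with a per-device load. It is a toy model, not CLR's actual code: `CodeObject`, `has_arch`, `load_all_or_nothing` and `load_per_device` are hypothetical names, a code object is reduced to a list of architecture strings, and the dependence on initialization order is not modelled.

```cpp
#include <algorithm>
#include <string>
#include <vector>

// Hypothetical model: a code object is just the list of architectures it contains.
using CodeObject = std::vector<std::string>;

// True if the code object carries an implementation for the given architecture.
bool has_arch(const CodeObject& co, const std::string& arch) {
    return std::find(co.begin(), co.end(), arch) != co.end();
}

// All-or-nothing: if any device lacks a match, nothing is loaded for any device.
// This mirrors the reported symptom, not CLR's real control flow.
std::vector<bool> load_all_or_nothing(const CodeObject& co,
                                      const std::vector<std::string>& devices) {
    std::vector<bool> loaded(devices.size(), false);
    const bool all_match = std::all_of(
        devices.begin(), devices.end(),
        [&](const std::string& d) { return has_arch(co, d); });
    if (!all_match) {
        return loaded;  // silent failure: every entry stays false
    }
    std::fill(loaded.begin(), loaded.end(), true);
    return loaded;
}

// Per-device: each device is handled independently, so supported devices
// still get their code even when an unsupported device is present.
std::vector<bool> load_per_device(const CodeObject& co,
                                  const std::vector<std::string>& devices) {
    std::vector<bool> loaded;
    loaded.reserve(devices.size());
    for (const auto& d : devices) {
        loaded.push_back(has_arch(co, d));
    }
    return loaded;
}
```

With a code object built for gfx908 and gfx90a on a system containing gfx900 and gfx908, the all-or-nothing variant loads nothing, while the per-device variant still loads the gfx908 code.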

I would like to note that I DON'T think this is a bug in CLR; rather, linking against a shared library that contains no GPU code for one of the GPUs in the system is a bug in the client. However, this practice has become extremely common in ROCm's libraries, mainly centered around hipBLASLt, and the client code owners have instructed me to file a bug against CLR.

hipBLASLt only supports, and thus is only compiled for, gfx908 (the documentation is incorrect here), gfx90a, gfx94x and gfx11x. In ROCm, it has become common practice in projects such as PyTorch and MIOpen to link against hipBLASLt unconditionally. This causes CLR to assert when these projects are used on a system that contains any other GPU or, if assertions are disabled, to leave the code objects unloaded even for the supported devices whenever an unsupported GPU is present, causing a crash when they are used.
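As an illustration of what a client-side guard could look like, the sketch below checks a list of device architecture names against the supported list given above. It is pure C++ with hypothetical helper names (`base_arch`, `matches`, `unsupported_devices`). Treating a trailing `x` in entries such as gfx94x as a prefix wildcard is an assumption made for this example. On a real system the names would come from `hipDeviceProp_t::gcnArchName`, which can carry feature suffixes such as `:sramecc+:xnack-`.

```cpp
#include <string>
#include <vector>

// Strip target feature suffixes such as ":sramecc+:xnack-" from an arch name.
std::string base_arch(const std::string& name) {
    return name.substr(0, name.find(':'));
}

// Assumption: a pattern ending in 'x' (e.g. "gfx94x") matches any arch that
// starts with the part before the 'x'; any other pattern must match exactly.
bool matches(const std::string& pattern, const std::string& arch) {
    if (!pattern.empty() && pattern.back() == 'x') {
        const std::string prefix = pattern.substr(0, pattern.size() - 1);
        return arch.size() > prefix.size() &&
               arch.compare(0, prefix.size(), prefix) == 0;
    }
    return pattern == arch;
}

// Return the devices (base arch names) that no pattern in `supported` covers.
std::vector<std::string> unsupported_devices(
    const std::vector<std::string>& device_archs,
    const std::vector<std::string>& supported) {
    std::vector<std::string> missing;
    for (const auto& raw : device_archs) {
        const std::string arch = base_arch(raw);
        bool found = false;
        for (const auto& p : supported) {
            if (matches(p, arch)) {
                found = true;
                break;
            }
        }
        if (!found) {
            missing.push_back(arch);
        }
    }
    return missing;
}
```

A client that finds this list non-empty could, for example, skip loading the library or fall back to another backend. Whether that is the right policy is up to the client project.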

Operating System

Any Linux

CPU

Epyc 7552

GPU

GFX900, GFX906, GFX908, GFX1030

ROCm Version

ROCm 6.2.3

ROCm Component

No response

Steps to Reproduce

Compile CLR with assertions enabled.

On a system with a GPU not supported by hipBLASLt, link any binary against hipblaslt.so (there is no need to use hipBLASLt for anything) and observe the assert when the binary is run.

(Optional for Linux users) Output of /opt/rocm/bin/rocminfo --support

No response

Additional Information

No response

ppanchad-amd commented 1 week ago

Hi @IMbackK. Internal ticket has been created to investigate your issue. Thanks!

schung-amd commented 1 week ago

Hi @IMbackK, thanks for keeping up with this. We're working on replacing this assertion and others with proper error handling. I'll reach out to the internal team to see if we have a timeline for this, as well as what the expected behavior will be in this use case (i.e. heterogeneous systems with one or more unsupported GPUs). Let me know if you have any additional questions and I'll pass those on as well.