Open IMbackK opened 3 weeks ago
Hi @IMbackK. Internal ticket has been created to investigate your issue. Thanks!
Hi @IMbackK, thanks for keeping up with this. We're working on replacing this assertion and others with proper error handling. I'll reach out to the internal team to see if we have a timeline for this, as well as what the expected behavior will be in this use case (i.e. heterogeneous systems with one or more unsupported GPUs). Let me know if you have any additional questions and I'll pass those on as well.
Problem Description
Loading a shared object whose GPU code was not compiled for one of the GPUs in the system causes CLR to assert here: https://github.com/ROCm/clr/blob/65d174c3e35423bf25c9c369513780c9d9ca760c/hipamd/src/hip_code_object.cpp#L1152 when assertions are enabled.
In release builds the assertions in CLR are disabled. In that case:
When none of the GPUs in the system has an implementation available in the loaded code object, CLR fails silently and the application crashes if the code object is used.
When the system contains both GPUs for which an implementation is available and GPUs for which none is, CLR will, depending on the order in which amdgpu.ko initialized the GPUs, return without having loaded any GPU code for the code objects in question, even for the GPUs that do have an implementation available.
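The cases above can be summarized as a small decision function. This is only a model of the behavior as reported in this issue, not CLR's actual code: the names `Outcome` and `reported_outcome` are hypothetical, and the "everything loads" case is an assumption implied by the report rather than stated in it.

```python
from enum import Enum


class Outcome(Enum):
    LOADED = "code loaded for every GPU"
    ASSERT = "CLR assertion fires"
    NOTHING_LOADED = "silent failure; crash when the code object is used"
    ORDER_DEPENDENT = "may load nothing, even for supported GPUs"


def reported_outcome(has_impl, assertions_enabled):
    """Model the behavior described in this report (hypothetical helper).

    has_impl: one bool per GPU in the system, True if the code object
              contains an implementation for that GPU.
    """
    if all(has_impl):
        # Assumption: with code for every GPU there is nothing to report.
        return Outcome.LOADED
    if assertions_enabled:
        return Outcome.ASSERT
    if not any(has_impl):
        return Outcome.NOTHING_LOADED
    # Mixed system: the result depends on GPU initialization order.
    return Outcome.ORDER_DEPENDENT
```

For example, a release build on a system where only one of four GPUs is covered by the code object falls into the order-dependent case.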
I would like to note that I DON'T think this is a bug in CLR; rather, linking against a shared library that has no GPU code for one of the GPUs in the system is a bug in the client. However, this practice has become extremely common in ROCm's libraries, mainly centered around hipBLASLt, and the client code owners have instructed me to file a bug against CLR.
hipBLASLt only supports, and thus is only compiled for, gfx908 (the documentation is incorrect here), gfx90a, gfx94x and gfx11x. In ROCm, projects such as PyTorch and MIOpen have made it common practice to link against hipBLASLt unconditionally. This causes CLR to assert when these projects are used on a system that contains any other GPU. Alternatively, if assertions are disabled, CLR leaves the code objects unloaded even for the supported devices whenever an unsupported GPU is present, causing a crash when they are used.
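To make the mismatch concrete, here is a minimal sketch of the kind of check a client could run before deciding to load such a library. It is plain string matching, not a ROCm API: `matches`, `unsupported_gpus` and `HIPBLASLT_TARGETS` are hypothetical names, the target list is copied from this report, and treating a trailing `x` as "any suffix" is an assumption about what gfx94x and gfx11x mean. In a real client the architecture names would have to come from the runtime (for instance the `gcnArchName` field of `hipDeviceProp_t`, which can carry feature suffixes after a colon).

```python
# Targets hipBLASLt is compiled for, as listed in this report.
HIPBLASLT_TARGETS = ["gfx908", "gfx90a", "gfx94x", "gfx11x"]


def base_arch(name):
    """Drop feature suffixes such as ':sramecc+:xnack-' and normalize case."""
    return name.split(":", 1)[0].lower()


def matches(arch, target):
    """True if arch is covered by target (hypothetical matching rule)."""
    arch, target = base_arch(arch), target.lower()
    if target.endswith("x"):
        # Assumption: a trailing 'x' is a wildcard for the rest of the name.
        return arch.startswith(target[:-1])
    return arch == target


def unsupported_gpus(device_archs, targets=HIPBLASLT_TARGETS):
    """Return the architectures that no target covers."""
    return [a for a in device_archs
            if not any(matches(a, t) for t in targets)]


# The GPUs from the system described in this report.
system = ["GFX900", "GFX906", "GFX908", "GFX1030"]
print(unsupported_gpus(system))  # ['GFX900', 'GFX906', 'GFX1030']
```

On the reporter's system only GFX908 is covered, so three of the four GPUs would trigger the behavior described above.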
Operating System
Any Linux
CPU
Epyc 7552
GPU
GFX900, GFX906, GFX908, GFX1030
ROCm Version
ROCm 6.2.3
ROCm Component
No response
Steps to Reproduce
Compile CLR with assertions enabled.
On a system with a GPU not supported by hipBLASLt, link any binary against hipblaslt.so (there is no need to use hipBLASLt for anything) and observe the assert when the binary is run.
(Optional for Linux users) Output of /opt/rocm/bin/rocminfo --support
No response
Additional Information
No response