Hi @joanbm, could you please run your test with the following patch? It fixes the address used for the pad_size write:
+++ b/opensrc/hsa-runtime/core/runtime/amd_blit_sdma.cpp
@@ -471,0 +472 @@ hsa_status_t BlitSdma<RingIndexTy, HwIndexMonotonic, SizeToCountOffset, useGCR>:
+ command_addr += trap_command_size_;
@@ -482 +483 @@ hsa_status_t BlitSdma<RingIndexTy, HwIndexMonotonic, SizeToCountOffset, useGCR>:
- dword_command_addr[total_command_size/4] = (pad_size/4 - 1) << 16;
+ dword_command_addr[0] = (pad_size/4 - 1) << 16;
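To illustrate why the index change matters, here is a minimal sketch, not the ROCR-Runtime code itself. It assumes (my reading of the patch, not confirmed by the source) that command_addr has already been advanced past the commands when the padding header is written, so indexing it again by total_command_size/4 lands beyond the reserved space. The helper names PadHeader, PadWriteIndex, InBounds and RunExample are hypothetical, and the sizes are made-up example values.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Header value written for the padding, copied from the patched statement:
//   (pad_size/4 - 1) << 16
constexpr std::uint32_t PadHeader(std::uint32_t pad_size_bytes) {
  return (pad_size_bytes / 4 - 1) << 16;
}

// Dword offset, relative to the start of the reservation, that receives the
// pad header when the write pointer has already moved `advanced_dwords`
// forward and is then indexed with `index`. (Hypothetical model.)
constexpr std::size_t PadWriteIndex(std::size_t advanced_dwords,
                                    std::size_t index) {
  return advanced_dwords + index;
}

// True if the write stays inside a reservation of `reserved_dwords` dwords.
constexpr bool InBounds(std::size_t write_index, std::size_t reserved_dwords) {
  return write_index < reserved_dwords;
}

// Example values (assumptions): 64 bytes of commands, 32 bytes of padding.
inline void RunExample() {
  constexpr std::size_t total_command_size = 64;  // bytes
  constexpr std::size_t pad_size = 32;            // bytes
  constexpr std::size_t reserved = (total_command_size + pad_size) / 4;
  constexpr std::size_t advanced = total_command_size / 4;

  // Before the patch: pointer already advanced, then indexed again.
  assert(!InBounds(PadWriteIndex(advanced, total_command_size / 4), reserved));
  // After the patch: pointer advanced, pad header written at index 0.
  assert(InBounds(PadWriteIndex(advanced, 0), reserved));
}
```

With these example values the pre-patch write would land at dword 32 of a 24-dword reservation, while the patched write lands at dword 16, the first padding dword. Whether the real offsets match this model depends on code not shown in this thread.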
@shwetagkhatri Everything seems to work fine with your patch applied (I tried the test case above, hashcat, and some other simple OpenCL programs), thanks.
Thanks for confirming, @joanbm. This fix will be available in the ROCm 6.1 release.
Problem Description
After upgrading my Arch Linux system with a Ryzen 5700G APU to ROCm 6.0.0, running pretty much any OpenCL application (see below for an example) fails.
In particular, creating an OpenCL command queue, building a kernel, etc. works fine, but when trying to run operations over OpenCL buffers (i.e. clEnqueueReadBuffer, clEnqueueWriteBuffer), the application in the best scenario hangs (when running it on a TTY), and in the worst scenario causes a GPU reset (when running it on a graphical session). An error message *ERROR* ring sdma0 timeout is also logged to dmesg along with more logs related to the GPU reset (full logs included below).
If I downgrade ROCm to 5.7.1, OpenCL applications run fine again.
Another workaround I found is setting the environment variable HSA_ENABLE_SDMA=0. This appears to avoid the problematic code path and OpenCL applications are able to run again.
Bisection
I have tried to bisect the issue and it appears to be related to this new condition introduced between ROCm 5.7.1 and 6.0.0 in src/core/runtime/amd_blit_sdma.cpp:
In particular, this later causes this also newly introduced code to run:
Commenting out the last statement, dword_command_addr[total_command_size/4] = (pad_size/4 - 1) << 16;, appears to fix the problem as well. I'm not sure if I'm understanding the code correctly, but I think it is potentially an out-of-bounds write? That would also be consistent with it causing a full GPU reset.
Operating System
Arch Linux updated as of 2024-02-08, Linux kernel 6.7.4.
CPU
AMD Ryzen 7 5700G with Radeon Graphics
GPU
Other
Other
AMD Ryzen 7 5700G with Radeon Graphics - i.e. the integrated APU (gfx90c)
ROCm Version
ROCm 6.0.0
ROCm Component
ROCR-Runtime
Steps to Reproduce
Save the following OpenCL application:
Build it:
Run it on a machine with a Ryzen 5700G processor and ROCm 6.0.0:
Expected Behavior: The program should run correctly and print the following output:
Actual Behavior: The program prints:
Then it hangs, and likely causes a GPU reset.
(Optional for Linux users) Output of /opt/rocm/bin/rocminfo --support
Additional Information
dmesg output at the time of the crash:
clinfo output: