clinfo hangs on configurations with two AMD GPUs and open source ROCm

NTMan commented 2 years ago

clinfo hangs in what looks like a busy loop: it fully occupies one processor core. I observe the same symptoms when launching "DaVinci Resolve". On a desktop with a single Radeon 6900XT GPU, this problem does not occur.
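
One way to confirm that a hung process is spinning rather than blocked is to sample its CPU time from /proc. The following is a minimal sketch, assuming a Linux system with /proc mounted; the helper names cpu_ticks and cpu_percent are hypothetical and are not part of clinfo or ROCm.

```shell
# Hypothetical helpers (not part of clinfo or ROCm).

# Print the total CPU time (user + system, in clock ticks) used by a PID.
cpu_ticks() {
    [ -r "/proc/$1/stat" ] || return 1
    # Drop everything up to the closing ")" of the command name, which may
    # itself contain spaces. After that, utime and stime (fields 14 and 15
    # of the full line) are fields 12 and 13.
    sed 's/.*) //' "/proc/$1/stat" | awk '{ print $12 + $13 }'
}

# Print the share of one core (in percent) that a PID used over N seconds.
cpu_percent() {
    pid="$1"
    secs="${2:-2}"                 # whole seconds, default 2
    hz=$(getconf CLK_TCK)          # clock ticks per second
    before=$(cpu_ticks "$pid") || return 1
    sleep "$secs"
    after=$(cpu_ticks "$pid") || return 1
    echo $(( (after - before) * 100 / (hz * secs) ))
}

# Example: cpu_percent "$(pidof clinfo)" 2
```

A value close to 100 is consistent with a busy loop on one core, while a value near 0 would point to a process blocked in a system call or waiting on a lock.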

My configuration: one GPU is integrated in the RENOIR processor, and the other is a discrete AMD Radeon 6800M (the machine is an ASUS G513QY laptop). The BIOS offers no option to turn off the integrated GPU, so there is no way to check this configuration with each GPU separately.
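
Even without a BIOS option, it may be possible to expose one GPU at a time from user space. The following is a minimal sketch, assuming the installed ROCm runtime honors the ROCR_VISIBLE_DEVICES environment variable and that GNU coreutils timeout is available. The helper name probe_device, the device indices 0 and 1, and the 30 second limit are assumptions for illustration.

```shell
# Hypothetical helper: run a command with only one ROCm device visible and
# give up after a time limit, so a busy loop does not block the terminal.
probe_device() {
    idx="$1"
    limit="$2"
    shift 2
    # Send TERM after $limit seconds, then KILL 5 seconds later if needed.
    ROCR_VISIBLE_DEVICES="$idx" timeout -k 5 "$limit" "$@" >/dev/null 2>&1
    rc=$?
    case "$rc" in
        0)       echo "device $idx: ok" ;;
        124|137) echo "device $idx: timed out after ${limit}s" ;;
        *)       echo "device $idx: exited with status $rc" ;;
    esac
}

# Example (indices are assumptions; check which index maps to which GPU):
#   probe_device 0 30 clinfo
#   probe_device 1 30 clinfo
```

If the hang appears with only one of the two indices, that would narrow the problem to a single device; if it appears with both, or only when both are visible, the interaction between the devices becomes the more likely suspect.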

The kernel log shows no errors, so it is most likely a user-space issue, but I am not sure about that.

But when I forcibly terminate clinfo (pressing <Ctrl + C> until the terminal prompt returns), the following messages appear in the kernel log:

[ 1962.000909] amdgpu: HIQ MQD's queue_doorbell_id0 is not 0, Queue preemption time out
[ 1962.000912] amdgpu: Failed to evict process queues
[ 1962.000918] amdgpu: Failed to quiesce KFD
[ 1966.010395] amdgpu: HIQ MQD's queue_doorbell_id0 is not 0, Queue preemption time out
[ 1966.010406] amdgpu: Resetting wave fronts (cpsch) on dev 00000000b40e7982

I am using the open source ROCm stack from the rocm-opencl package [1], which has passed review and has already been pushed to the official Fedora repository [2].

The clinfo output ends with the line:

Max work group size (AMD) 1024

The full clinfo output is available here [3], and the clinfo backtrace is here [4].

The clinfo developer says that the problem lies deeper, in ROCm or the kernel [5].

Versions:

# rpm -qa | grep clinfo
clinfo-3.0.21.02.21-3.fc36.x86_64

# rpm -qa | grep rocm
rocm-comgr-5.2.0-1.fc37.x86_64
hsakmt-1.0.6-23.rocm5.2.0.fc37.x86_64
rocm-runtime-5.2.0-1.fc37.x86_64
rocm-opencl-5.2.0-1.fc37.x86_64

[1] https://copr.fedorainfracloud.org/coprs/mystro256/rocm-opencl/
[2] https://bugzilla.redhat.com/show_bug.cgi?id=2090823
[3] https://pastebin.com/TR5zy30Z
[4] https://pastebin.com/wv5iGibi
[5] https://github.com/Oblomov/clinfo/issues/81

b-sumner commented 2 years ago

Does the same happen with /opt/rocm/bin/clinfo?

NTMan commented 2 years ago

> Does the same happen with /opt/rocm/bin/clinfo?

Excuse me, but why should clinfo be placed in /opt/rocm/bin?

$ whereis clinfo 
clinfo: /usr/bin/clinfo /usr/share/man/man1/clinfo.1.gz

$ locate clinfo
/home/mikhail/clinfo-backtrace.txt
/usr/bin/clinfo
/usr/lib/debug/usr/bin/clinfo-3.0.21.02.21-3.fc36.x86_64.debug
/usr/share/doc/clinfo
/usr/share/doc/clinfo/README.md
/usr/share/licenses/clinfo
/usr/share/licenses/clinfo/LICENSE
/usr/share/licenses/clinfo/legalcode.txt
/usr/share/man/man1/clinfo.1.gz
/usr/src/debug/clinfo-3.0.21.02.21-3.fc36.x86_64
/usr/src/debug/clinfo-3.0.21.02.21-3.fc36.x86_64/src
/usr/src/debug/clinfo-3.0.21.02.21-3.fc36.x86_64/src/clinfo.c
/usr/src/debug/clinfo-3.0.21.02.21-3.fc36.x86_64/src/error.h
/usr/src/debug/clinfo-3.0.21.02.21-3.fc36.x86_64/src/ext.h
/usr/src/debug/clinfo-3.0.21.02.21-3.fc36.x86_64/src/info_loc.h
/usr/src/debug/clinfo-3.0.21.02.21-3.fc36.x86_64/src/info_ret.h
/usr/src/debug/clinfo-3.0.21.02.21-3.fc36.x86_64/src/opt_out.h
/usr/src/debug/clinfo-3.0.21.02.21-3.fc36.x86_64/src/strbuf.h
/usr/src/debug/rocm-opencl-5.2.0-1.fc37.x86_64/redhat-linux-build/tools/clinfo
/usr/src/debug/rocm-opencl-5.2.0-1.fc37.x86_64/tools/clinfo
/usr/src/debug/rocm-opencl-5.2.0-1.fc37.x86_64/tools/clinfo/clinfo.cpp

b-sumner commented 2 years ago

One reason is that AMD wrote its own clinfo back in the days of OpenCL 1.0, long before any other implementations appeared on github and were picked up by the distros, and has maintained it since.

NTMan commented 2 years ago

> One reason is that AMD wrote its own clinfo back in the days of OpenCL 1.0, long before any other implementations appeared on github and were picked up by the distros, and has maintained it since.

"DaVinci Resolve" shows the same symptoms (it looks like an infinite loop that eats 100% CPU).

ROCm / ROCm-OpenCL-Runtime

Issue #148