Open NTMan opened 2 years ago
Does the same happen with /opt/rocm/bin/clinfo?
Excuse me, but why should clinfo be placed in /opt/rocm/bin?
$ whereis clinfo
clinfo: /usr/bin/clinfo /usr/share/man/man1/clinfo.1.gz
$ locate clinfo
/home/mikhail/clinfo-backtrace.txt
/usr/bin/clinfo
/usr/lib/debug/usr/bin/clinfo-3.0.21.02.21-3.fc36.x86_64.debug
/usr/share/doc/clinfo
/usr/share/doc/clinfo/README.md
/usr/share/licenses/clinfo
/usr/share/licenses/clinfo/LICENSE
/usr/share/licenses/clinfo/legalcode.txt
/usr/share/man/man1/clinfo.1.gz
/usr/src/debug/clinfo-3.0.21.02.21-3.fc36.x86_64
/usr/src/debug/clinfo-3.0.21.02.21-3.fc36.x86_64/src
/usr/src/debug/clinfo-3.0.21.02.21-3.fc36.x86_64/src/clinfo.c
/usr/src/debug/clinfo-3.0.21.02.21-3.fc36.x86_64/src/error.h
/usr/src/debug/clinfo-3.0.21.02.21-3.fc36.x86_64/src/ext.h
/usr/src/debug/clinfo-3.0.21.02.21-3.fc36.x86_64/src/info_loc.h
/usr/src/debug/clinfo-3.0.21.02.21-3.fc36.x86_64/src/info_ret.h
/usr/src/debug/clinfo-3.0.21.02.21-3.fc36.x86_64/src/opt_out.h
/usr/src/debug/clinfo-3.0.21.02.21-3.fc36.x86_64/src/strbuf.h
/usr/src/debug/rocm-opencl-5.2.0-1.fc37.x86_64/redhat-linux-build/tools/clinfo
/usr/src/debug/rocm-opencl-5.2.0-1.fc37.x86_64/tools/clinfo
/usr/src/debug/rocm-opencl-5.2.0-1.fc37.x86_64/tools/clinfo/clinfo.cpp
One reason is that AMD wrote its own clinfo back in the days of OpenCL 1.0, long before any other implementations appeared on github and were picked up by the distros, and has maintained it since.
"DaVinci Resolve" has the same symptoms (it looks like an infinite loop that eats 100% CPU)
clinfo hangs in a loop, completely occupying one processor core. I observed the same symptoms when launching "DaVinci Resolve". On a desktop with a single Radeon 6900XT GPU, this problem does not occur.
My configuration: one GPU is integrated into the RENOIR processor, and the other is a discrete AMD Radeon 6800M (the laptop is an ASUS G513QY). The BIOS has no option to turn off the integrated GPU in the processor, so there is no way to test this configuration with each GPU separately.
There is no error in the kernel log, so it is most likely a user-space issue, but I am not sure about that.
But when I forcibly terminate clinfo (pressing Ctrl+C until the terminal prompt returns), the following messages appear in the kernel log:
[ 1962.000909] amdgpu: HIQ MQD's queue_doorbell_id0 is not 0, Queue preemption time out
[ 1962.000912] amdgpu: Failed to evict process queues
[ 1962.000918] amdgpu: Failed to quiesce KFD
[ 1966.010395] amdgpu: HIQ MQD's queue_doorbell_id0 is not 0, Queue preemption time out
[ 1966.010406] amdgpu: Resetting wave fronts (cpsch) on dev 00000000b40e7982
I am using the open-source ROCm stack from the rocm-opencl package [1], which passed review and has already been pushed to the official Fedora repository [2].
The clinfo output ended with the line:
Max work group size (AMD) 1024
The full clinfo output is available here [3], and the clinfo backtrace here [4]. The clinfo developer says that the problem lies deeper, in ROCm or the kernel [5].
Versions:
[1] https://copr.fedorainfracloud.org/coprs/mystro256/rocm-opencl/
[2] https://bugzilla.redhat.com/show_bug.cgi?id=2090823
[3] https://pastebin.com/TR5zy30Z
[4] https://pastebin.com/wv5iGibi
[5] https://github.com/Oblomov/clinfo/issues/81