The test tries to allocate a maximum amount of system memory for GPU access. It looks like it ends up invoking the OOM killer. The log snippet in your report only shows it killing kfdtest. That should not have killed your whole login session though. Something else must have broken due to the memory pressure.
KFD limits memory usage by ROCm applications to try and prevent putting so much memory pressure on the system that the OOM killer has to step in. However, we do want applications to be able to use most system memory. So this limit is quite high. Most of the time we test on headless compute-servers. Your situation on a workstation with a display server and other applications running is probably quite different.
It may be worth spending more time to understand the problem. Ideally the OOM killer should only kill kfdtest and leave the rest of the system running and responsive.
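One way to start narrowing this down is to record how much memory the kernel considers available right before the test starts. The sketch below is not part of kfdtest. It only parses the standard MemTotal and MemAvailable fields of /proc/meminfo, and the helper names are made up for illustration.

```python
def parse_meminfo(text):
    """Parse /proc/meminfo-style text into {field: number} (kB for most fields)."""
    fields = {}
    for line in text.splitlines():
        name, sep, rest = line.partition(":")
        if not sep:
            continue  # skip lines that are not "Name: value" pairs
        parts = rest.split()
        if parts and parts[0].isdigit():
            fields[name.strip()] = int(parts[0])
    return fields


def available_fraction(fields):
    """Fraction of total RAM the kernel estimates can be used without swapping."""
    return fields["MemAvailable"] / fields["MemTotal"]


if __name__ == "__main__":
    # Linux only: read the live values from procfs.
    with open("/proc/meminfo") as f:
        info = parse_meminfo(f.read())
    print(f"MemAvailable: {info['MemAvailable']} kB "
          f"({available_fraction(info):.0%} of MemTotal)")
```

Comparing this number on a headless machine and on a workstation with a display server running would show how much less headroom the desktop session leaves for the test.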
In the meantime, you can blacklist the problematic test in kfdtest.exclude so it doesn't crash your system every time you run the test.
> Most of the time we test on headless compute-servers.
I see. That may very well explain the failure of certain tests on my machine. In case it matters: I'm running sway on Wayland. EDIT: It also crashes when run without a display server.
> you can blacklist the problematic test
I did so (by passing them through the -e flag). In addition to KFDMemoryTest.LargestSysBufferTest, KFDMemoryTest.BigSysBufferStressTest and KFDQMTest.CreateQueueStressSingleThreaded also cause the system to crash. The display manager is killed and I have to restart or log in again.
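For reference, kfdtest is built on Google Test, so an exclusion list boils down to a negative --gtest_filter pattern. Below is a minimal sketch of assembling one from the three tests named above. build_gtest_filter is a hypothetical helper, and whether the -e option builds its filter the same way is an assumption.

```python
def build_gtest_filter(excluded, included=("*",)):
    """Build a Google Test filter: positive patterns, then '-' and negative ones."""
    positive = ":".join(included)
    if not excluded:
        return positive
    return positive + "-" + ":".join(excluded)


crashing_tests = [
    "KFDMemoryTest.LargestSysBufferTest",
    "KFDMemoryTest.BigSysBufferStressTest",
    "KFDQMTest.CreateQueueStressSingleThreaded",
]

# Hypothetical invocation; the binary name and path depend on the build.
print("kfdtest --gtest_filter=" + build_gtest_filter(crashing_tests))
```

Patterns are separated by colons, and everything after the single minus sign is excluded from the run.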
One performance test in KFDQMTest.BasicCuMaskingEven was outside the 15% margin. In total, 115 tests passed on my system and one failed.
If CreateQueueStressSingleThreaded causes a crash, the problem is probably that graphics command submissions are timing out because compute is causing too much stress. CreateQueueStressSingleThreaded doesn't use a lot of the compute units, its memory usage shouldn't be extreme, and it shouldn't affect the execution of the graphics queue directly. It does cause lots of TLB and cache flushes and maybe DMA engine load from page table management and memory initialization.
Thanks for your detailed explanation! So you think it's not HSA related but caused by amdgpu?
KFD is technically part of amdgpu.ko. KFD shares the GPU compute resources and VRAM with graphics. So it is possible that using the GPU for compute affects graphics usage negatively. So if you see crashes or hangs in graphics applications while running kfdtest, I would say that HSA is causing the problems.
How can I help you to fix these problems? Do you need better logs?
Hi @tpkessler, sorry for the delay. Are you still experiencing this issue?
Closing this for now. Feel free to comment if you are still experiencing this issue or want further guidance, and we can reopen it.
Hi! I'm the main contributor to a community-driven effort to build ROCm for Arch Linux (https://github.com/rocm-arch/rocm-arch). Recently, we've started to add the tests provided in the repos to our package building scheme. Running the KFD tests, I experience a system crash at KFDMemoryTest.LargestSysBufferTest. All applications are killed and, after a short pause with a black screen, I'm greeted by the login manager.
First the cmake output
Kernel error log
rocminfo
Kernel version