Open bxkx opened 2 years ago
nvtop uses the fdinfo interface, but since this is a kernel stack trace, a series of events most likely triggered a deadlock inside the kernel. This may have started within amdgpu_show_fdinfo (from the kernel DRM code), which is triggered when nvtop reads files in /proc/
My bet is that since you were running benchmarks, the process finished before amdgpu_show_fdinfo was done doing its thing: the kernel de-allocated the structs associated with the process while amdgpu_show_fdinfo tried to access them from another thread.
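For context, here is a rough userspace sketch of the kind of read that ends up in the driver's fdinfo callback. This is not nvtop's actual code (nvtop is written in C); the helper names `parse_fdinfo` and `read_fdinfo` and the sample text are assumptions for illustration, and the exact keys depend on the kernel version and driver.

```python
from pathlib import Path
from typing import Dict, Optional


def parse_fdinfo(text: str) -> Dict[str, str]:
    """Split fdinfo text into a {key: value} dict, one 'key: value' per line."""
    fields = {}
    for line in text.splitlines():
        key, sep, value = line.partition(":")
        if sep:  # skip lines that have no colon
            fields[key.strip()] = value.strip()
    return fields


def read_fdinfo(pid: int, fd: int) -> Optional[Dict[str, str]]:
    """Read /proc/<pid>/fdinfo/<fd>; return None if it is gone or unreadable."""
    try:
        return parse_fdinfo(Path(f"/proc/{pid}/fdinfo/{fd}").read_text())
    except OSError:
        # The process may exit between listing and reading; userspace just
        # sees an error here and moves on.
        return None


# Illustrative sample only; real output depends on kernel and driver.
SAMPLE = "pos:\t0\nflags:\t02100002\nmnt_id:\t23\ndrm-driver:\tamdgpu\n"
fields = parse_fdinfo(SAMPLE)
```

From userspace, a process disappearing mid-read is an ordinary error. If the suspected race exists, it is inside the kernel, which is why it has to be fixed there.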
Could you please open a bug report on the GitLab for DRM/AMD?
So do you think this is most likely a driver bug? I just recently got the GPU and now I'm concerned it might be faulty.
I found a few commits around the amdgpu_vm_get_memory stuff that aren't in the kernel yet, like "Use vm status_lock to protect relocated list", here: https://cgit.freedesktop.org/drm/drm/log/ Could those be related? It's really annoying that I can't find anyone else with this kind of crash log.
I'm planning to file a bug report on the GitLab, but they haven't activated my account yet.
It's not that there's a sequence of events leading to a deadlock; rather, the NULL pointer deref forcibly killed the task holding the lock, and no other task will ever unlock it. And since it's a spinlock, there won't be preemption and a soft lockup occurs; the CPUs stuck waiting for the lock are practically dead other than processing interrupts.
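As a loose userspace analogy (not kernel code), the situation looks roughly like this in Python: a thread takes a lock, dies from an exception before releasing it, and every later attempt to take the lock fails. The `RuntimeError` stands in for the NULL pointer dereference, and the timeout is an assumption added only so the demo terminates; a kernel spinlock has no timeout and simply keeps spinning.

```python
import threading

lock = threading.Lock()


def crashing_holder():
    """Take the lock, then die from an exception before releasing it."""
    lock.acquire()
    raise RuntimeError("stand-in for the NULL pointer dereference")


def run_holder():
    try:
        crashing_holder()
    except RuntimeError:
        pass  # the thread ends here; nobody ever calls lock.release()


t = threading.Thread(target=run_holder)
t.start()
t.join()

# A second "task" now tries to take the same lock. The timeout exists only
# so this demo finishes; a spinning CPU would wait indefinitely.
got_lock = lock.acquire(timeout=0.5)
print("second task acquired lock:", got_lock)
```

The second acquire cannot succeed because the only task that could have released the lock is gone.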
Anyway, there isn't much nvtop can do about this. A NULL pointer deref in kernel mode should not happen, and if it does, it's a kernel/driver bug.
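The dereference can be read straight off the oops posted in this issue. The marked instruction bytes `<8b> 40 10` decode on x86-64 as a 32-bit load from `rax+0x10`, and RAX is 0, which matches the reported fault address of 0x10. Below is a small sketch that pulls those fields out of the log text. The function name `summarize_oops` is hypothetical, and the three sample lines are copied from the log with the journal prefix removed.

```python
import re

# Lines copied from the oops in this report (journal prefix removed).
OOPS = """\
BUG: kernel NULL pointer dereference, address: 0000000000000010
RIP: 0010:amdgpu_bo_get_memory+0x17/0x50 [amdgpu]
RAX: 0000000000000000 RBX: ffff9f0143f63c98 RCX: ffff9f0143f63ca8
"""


def summarize_oops(text: str) -> dict:
    """Extract the faulting symbol, its offset, the fault address, and RAX."""
    rip = re.search(r"RIP: \w+:(\w+)\+0x([0-9a-f]+)/0x([0-9a-f]+)", text)
    addr = re.search(r"address: ([0-9a-f]+)", text)
    rax = re.search(r"RAX: ([0-9a-f]+)", text)
    return {
        "symbol": rip.group(1),
        "offset": int(rip.group(2), 16),
        "fault_address": int(addr.group(1), 16),
        "rax": int(rax.group(1), 16),
    }


info = summarize_oops(OOPS)
# Fault address minus the base register gives the offset that was read.
field_offset = info["fault_address"] - info["rax"]
```

A zero base register plus a small offset is the usual signature of reading a member through a NULL struct pointer. Which pointer that is in amdgpu_bo_get_memory cannot be determined from the log alone.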
I started having the same issue on RX7900XT with 6.4.2-zen1-1-zen recently.
I experienced a GPU crash with nvtop running in the background while I was running a GPU benchmark. The screen froze but the system was still running. I tried switching TTYs; that changed what I saw on the screen, but it froze as well. The system seemed to keep running until I hit the reset button (it was logging up until that point).
I'm not entirely sure if it is related to nvtop, but since nvtop was in the log and I couldn't find anyone else with a crash log containing amdgpu_bo_get_memory, amdgpu_vm_get_memory, and amdgpu_show_fdinfo in the call trace when I googled, I thought I'd try here too. I'll file a kernel bug report, too.
Using an RX 6600.
Here's the journal log:
```
Oct 01 19:43:27 pc kernel: BUG: kernel NULL pointer dereference, address: 0000000000000010
Oct 01 19:43:27 pc kernel: #PF: supervisor read access in kernel mode
Oct 01 19:43:27 pc kernel: #PF: error_code(0x0000) - not-present page
Oct 01 19:43:27 pc kernel: PGD 0 P4D 0
Oct 01 19:43:27 pc kernel: Oops: 0000 [#1] PREEMPT SMP NOPTI
Oct 01 19:43:27 pc kernel: CPU: 10 PID: 5844 Comm: nvtop Not tainted 5.19.12-arch1-1 #1 2183db5e2ff49b915549bc42a3e56ec968f6996b
Oct 01 19:43:27 pc kernel: Hardware name: Gigabyte Technology Co., Ltd. B660M DS3H DDR4/B660M DS3H DDR4, BIOS F5 01/17/2022
Oct 01 19:43:27 pc kernel: RIP: 0010:amdgpu_bo_get_memory+0x17/0x50 [amdgpu]
Oct 01 19:43:27 pc kernel: Code: 44 55 c6 e9 6c ff ff ff 66 66 2e 0f 1f 84 00 00 00 00 00 66 0f 1f 00 0f 1f 44 00 00 48 8b 87 b0 01 00 00 48 8b bf 30 01 00 00 <8b> 40 10 83 f8 05 77 19 8b 04 85 c0 e2 0c c1 83 f8 02 74 15 83 f8
Oct 01 19:43:27 pc kernel: RSP: 0018:ffff9f0143f63c38 EFLAGS: 00010286
Oct 01 19:43:27 pc kernel: RAX: 0000000000000000 RBX: ffff9f0143f63c98 RCX: ffff9f0143f63ca8
Oct 01 19:43:27 pc kernel: RDX: ffff9f0143f63ca0 RSI: ffff9f0143f63c98 RDI: 0000000000200000
Oct 01 19:43:27 pc kernel: RBP: ffff9f0143f63ca0 R08: 0000000000ffff0a R09: 0000000000000002
Oct 01 19:43:27 pc kernel: R10: 0000000000000007 R11: ffff93e95597102a R12: ffff9f0143f63ca8
Oct 01 19:43:27 pc kernel: R13: ffff93e97a7370a0 R14: ffff93e97a737060 R15: ffff93e944956200
Oct 01 19:43:27 pc kernel: FS: 00007f042b983b80(0000) GS:ffff93ecefa80000(0000) knlGS:0000000000000000
Oct 01 19:43:27 pc kernel: CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
Oct 01 19:43:27 pc kernel: CR2: 0000000000000010 CR3: 000000027db18006 CR4: 0000000000f70ee0
Oct 01 19:43:27 pc kernel: PKRU: 55555554
Oct 01 19:43:27 pc kernel: Call Trace:
Oct 01 19:43:27 pc kernel: