Open ghost opened 1 year ago
This looks a lot like nvbug 3806304.
I hit this on 525.85.12 with an A30. This issue seems to exist on 525.60.13 with the A40 and A100 as well. Please fix ASAP!
bug-nv-smi.txt
dmesg.txt
Hi @aritger, is there any solution for this?
I think "Timeout waiting for RPC from GSP!" is a pretty generic symptom, with many possible causes. The reproduction steps that lead up to it will matter to help distinguish different bugs, as will the other NVRM dmesg spew around it.
I don't have specific reason to expect it is already fixed, but it may be worth testing the most recent 525.89.02 driver. 530.xx drivers will hopefully be released soon, and they will have a lot of changes relative to 525.xx, so that will also be worth testing.
Beyond that, if you see similar "Timeout waiting for RPC from GSP!" messages, it is worth attaching a complete nvidia-bug-report.log.gz, and describing the steps that led to it, so that we can compare instances of the symptom.
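Since the same NVRM log line recurs throughout this thread, one quick way to compare instances of the symptom is to pull the Xid 119 events out of dmesg programmatically. A minimal sketch (the function and field names are my own, not part of any NVIDIA tooling):

```python
import re

# Matches the NVRM lines quoted later in this thread, e.g.:
# NVRM: Xid (PCI:0000:61:00): 119, pid=1344368, name=nvidia-smi,
#   Timeout waiting for RPC from GSP! Expected function 10 (FREE) (0xa55a00a0 0x0).
XID_RE = re.compile(
    r"NVRM: Xid \((PCI:[0-9a-fA-F:\.]+)\): (\d+), pid=(\d+), name=([\w\-\.]+), "
    r"Timeout waiting for RPC from GSP\d*! Expected function (\d+) \((\w+)\)"
)

def parse_gsp_timeouts(dmesg_text):
    """Return one dict per 'Timeout waiting for RPC from GSP' line."""
    events = []
    for m in XID_RE.finditer(dmesg_text):
        pci, xid, pid, name, func, func_name = m.groups()
        events.append({
            "pci": pci,
            "xid": int(xid),
            "pid": int(pid),
            "process": name,
            "function": int(func),
            "function_name": func_name,
        })
    return events
```

Grouping the resulting events by PCI address and expected function makes it easy to see whether different machines are stuck on the same RPC (FREE vs. GSP_RM_CONTROL vs. GSP_RM_ALLOC).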
Thanks @aritger, nvidia-bug-report.log.gz attached.
The problem occurs after running in a Kubernetes environment for a period of time; nvidia-smi then gets stuck for a while. The specific error phenomenon is similar to @jelmd's in https://github.com/NVIDIA/open-gpu-kernel-modules/issues/446#issuecomment-1445190598. FYI, I got some help from the nvidia-docker community (https://github.com/NVIDIA/nvidia-docker/issues/1648#issuecomment-1441139460), but I'm not sure whether the root cause is the driver or NVIDIA-docker.
FWIW: We do not use nvidia-docker or similar bloat, just plain lxc, and we pass the devices through to the related zones alias containers as needed. So IMHO the nvidia-container-toolkit is not really related to the problem.
Happened again on another machine:
...
[ +0.000094] ? _nv011159rm+0x62/0x2e0 [nvidia]
[ +0.000090] ? _nv039897rm+0xdb/0x140 [nvidia]
[ +0.000073] ? _nv041022rm+0x2ce/0x3a0 [nvidia]
[ +0.000103] ? _nv015438rm+0x788/0x800 [nvidia]
[ +0.000064] ? _nv039416rm+0xac/0xe0 [nvidia]
[ +0.000092] ? _nv041024rm+0xac/0x140 [nvidia]
[ +0.000095] ? _nv041023rm+0x37a/0x4d0 [nvidia]
[ +0.000070] ? _nv039319rm+0xc9/0x150 [nvidia]
[ +0.000151] ? _nv039320rm+0x42/0x70 [nvidia]
[ +0.000180] ? _nv000552rm+0x49/0x60 [nvidia]
[ +0.000219] ? _nv000694rm+0x7fb/0xc80 [nvidia]
[ +0.000195] ? rm_ioctl+0x54/0xb0 [nvidia]
[ +0.000132] ? nvidia_ioctl+0x6e3/0x850 [nvidia]
[ +0.000003] ? get_max_files+0x20/0x20
[ +0.000134] ? nvidia_frontend_unlocked_ioctl+0x3b/0x50 [nvidia]
[ +0.000002] ? do_vfs_ioctl+0x407/0x670
[ +0.000003] ? __secure_computing+0xa4/0x110
[ +0.000002] ? ksys_ioctl+0x67/0x90
[ +0.000002] ? __x64_sys_ioctl+0x1a/0x20
[ +0.000002] ? do_syscall_64+0x57/0x190
[ +0.000002] ? entry_SYSCALL_64_after_hwframe+0x5c/0xc1
[ +6.010643] NVRM: Xid (PCI:0000:83:00): 119, pid=2710030, name=nvidia-smi, Timeout waiting for RPC from GSP! Expected function 10 (FREE) (0xa55a0050 0x0).
...
Mon Feb 27 16:08:33 2023
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 525.60.13 Driver Version: 525.60.13 CUDA Version: 12.0 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|===============================+======================+======================|
| 0 NVIDIA A40 On | 00000000:03:00.0 Off | 0 |
| 0% 24C P8 14W / 300W | 0MiB / 46068MiB | 0% Default |
| | | N/A |
+-------------------------------+----------------------+----------------------+
| 1 NVIDIA A40 On | 00000000:04:00.0 Off | 0 |
| 0% 25C P8 13W / 300W | 0MiB / 46068MiB | 0% Default |
| | | N/A |
+-------------------------------+----------------------+----------------------+
| 2 NVIDIA A40 On | 00000000:43:00.0 Off | 0 |
| 0% 40C P0 81W / 300W | 821MiB / 46068MiB | 0% Default |
| | | N/A |
+-------------------------------+----------------------+----------------------+
| 3 NVIDIA A40 On | 00000000:44:00.0 Off | 0 |
| 0% 30C P8 14W / 300W | 0MiB / 46068MiB | 0% Default |
| | | N/A |
+-------------------------------+----------------------+----------------------+
| 4 ERR! On | 00000000:83:00.0 Off | ERR! |
|ERR! ERR! ERR! ERR! / ERR! | 0MiB / 46068MiB | ERR! Default |
| | | ERR! |
+-------------------------------+----------------------+----------------------+
| 5 NVIDIA A40 On | 00000000:84:00.0 Off | 0 |
| 0% 23C P8 14W / 300W | 3MiB / 46068MiB | 0% Default |
| | | N/A |
+-------------------------------+----------------------+----------------------+
| 6 NVIDIA A40 On | 00000000:C3:00.0 Off | 0 |
| 0% 22C P8 15W / 300W | 3MiB / 46068MiB | 0% Default |
| | | N/A |
+-------------------------------+----------------------+----------------------+
| 7 NVIDIA A40 On | 00000000:C4:00.0 Off | 0 |
| 0% 21C P8 14W / 300W | 3MiB / 46068MiB | 0% Default |
| | | N/A |
+-------------------------------+----------------------+----------------------+
+-----------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=============================================================================|
| 2 N/A N/A 2111354 C ...eratornet/venv/bin/python 818MiB |
+-----------------------------------------------------------------------------+
@jelmd +1, I ran into this problem again. Hi @aritger @Joshua-Ashton, maybe this is a driver issue; please take a look.
Fri Mar 3 14:41:55 2023
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 525.85.12 Driver Version: 525.85.12 CUDA Version: 12.0 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|===============================+======================+======================|
| 0 NVIDIA A30 Off | 00000000:01:00.0 Off | 0 |
| N/A 26C P0 29W / 165W | 0MiB / 24576MiB | 0% Default |
| | | Disabled |
+-------------------------------+----------------------+----------------------+
| 1 NVIDIA A30 Off | 00000000:22:00.0 Off | 0 |
| N/A 40C P0 92W / 165W | 16096MiB / 24576MiB | 88% Default |
| | | Disabled |
+-------------------------------+----------------------+----------------------+
| 2 NVIDIA A30 Off | 00000000:41:00.0 Off | 0 |
| N/A 42C P0 141W / 165W | 14992MiB / 24576MiB | 33% Default |
| | | Disabled |
+-------------------------------+----------------------+----------------------+
| 3 ERR! Off | 00000000:61:00.0 Off | ERR! |
|ERR! ERR! ERR! ERR! / ERR! | 23025MiB / 24576MiB | ERR! Default |
| | | ERR! |
+-------------------------------+----------------------+----------------------+
| 4 NVIDIA A30 Off | 00000000:81:00.0 Off | 0 |
| N/A 26C P0 26W / 165W | 0MiB / 24576MiB | 0% Default |
| | | Disabled |
+-------------------------------+----------------------+----------------------+
| 5 NVIDIA A30 Off | 00000000:A1:00.0 Off | 0 |
| N/A 26C P0 28W / 165W | 0MiB / 24576MiB | 0% Default |
| | | Disabled |
+-------------------------------+----------------------+----------------------+
| 6 NVIDIA A30 Off | 00000000:C1:00.0 Off | 0 |
| N/A 25C P0 29W / 165W | 0MiB / 24576MiB | 0% Default |
| | | Disabled |
+-------------------------------+----------------------+----------------------+
| 7 NVIDIA A30 Off | 00000000:E1:00.0 Off | 0 |
| N/A 24C P0 25W / 165W | 0MiB / 24576MiB | 0% Default |
| | | Disabled |
+-------------------------------+----------------------+----------------------+
[Fri Mar 3 04:23:22 2023] NVRM: GPU at PCI:0000:61:00: GPU-e59ce3f9-af53-a0dd-1d2c-8beaa74aa635
[Fri Mar 3 04:23:22 2023] NVRM: GPU Board Serial Number: 1322621149782
[Fri Mar 3 04:23:22 2023] NVRM: Xid (PCI:0000:61:00): 119, pid=1344368, name=nvidia-smi, Timeout waiting for RPC from GSP! Expected function 10 (FREE) (0xa55a00a0 0x0).
[Fri Mar 3 04:23:22 2023] CPU: 72 PID: 1344368 Comm: nvidia-smi Tainted: P OE 5.10.0-20-amd64 #1 Debian 5.10.158-2
[Fri Mar 3 04:23:22 2023] Hardware name: Inspur NF5468A5/YZMB-02382-101, BIOS 4.02.12 01/28/2022
[Fri Mar 3 04:23:22 2023] Call Trace:
[Fri Mar 3 04:23:22 2023] dump_stack+0x6b/0x83
[Fri Mar 3 04:23:22 2023] _nv011231rm+0x39d/0x470 [nvidia]
[Fri Mar 3 04:23:22 2023] ? _nv011168rm+0x62/0x2e0 [nvidia]
[Fri Mar 3 04:23:22 2023] ? _nv040022rm+0xdb/0x140 [nvidia]
[Fri Mar 3 04:23:22 2023] ? _nv041148rm+0x2ce/0x3a0 [nvidia]
[Fri Mar 3 04:23:22 2023] ? _nv015451rm+0x788/0x800 [nvidia]
[Fri Mar 3 04:23:22 2023] ? _nv039541rm+0xac/0xe0 [nvidia]
[Fri Mar 3 04:23:22 2023] ? _nv041150rm+0xac/0x140 [nvidia]
[Fri Mar 3 04:23:22 2023] ? _nv041149rm+0x37a/0x4d0 [nvidia]
[Fri Mar 3 04:23:22 2023] ? _nv039443rm+0xc9/0x150 [nvidia]
[Fri Mar 3 04:23:22 2023] ? _nv039444rm+0x42/0x70 [nvidia]
[Fri Mar 3 04:23:22 2023] ? _nv000554rm+0x49/0x60 [nvidia]
[Fri Mar 3 04:23:22 2023] ? _nv000694rm+0x7fb/0xc80 [nvidia]
[Fri Mar 3 04:23:22 2023] ? rm_ioctl+0x54/0xb0 [nvidia]
[Fri Mar 3 04:23:22 2023] ? nvidia_ioctl+0x6cd/0x830 [nvidia]
[Fri Mar 3 04:23:22 2023] ? nvidia_frontend_unlocked_ioctl+0x37/0x50 [nvidia]
[Fri Mar 3 04:23:22 2023] ? __x64_sys_ioctl+0x8b/0xc0
[Fri Mar 3 04:23:22 2023] ? do_syscall_64+0x33/0x80
[Fri Mar 3 04:23:22 2023] ? entry_SYSCALL_64_after_hwframe+0x61/0xc6
[Fri Mar 3 04:24:07 2023] NVRM: Xid (PCI:0000:61:00): 119, pid=1344368, name=nvidia-smi, Timeout waiting for RPC from GSP! Expected function 10 (FREE) (0xa55a0030 0x0).
[Fri Mar 3 04:24:52 2023] NVRM: Xid (PCI:0000:61:00): 119, pid=1344368, name=nvidia-smi, Timeout waiting for RPC from GSP! Expected function 10 (FREE) (0xa55a0020 0x0).
[Fri Mar 3 04:25:38 2023] NVRM: Xid (PCI:0000:61:00): 119, pid=1346198, name=nvidia-smi, Timeout waiting for RPC from GSP! Expected function 76 (GSP_RM_CONTROL) (0x20801348 0x410).
[Fri Mar 3 04:26:23 2023] NVRM: Xid (PCI:0000:61:00): 119, pid=1346198, name=nvidia-smi, Timeout waiting for RPC from GSP! Expected function 103 (GSP_RM_ALLOC) (0x0 0x6c).
[Fri Mar 3 04:27:08 2023] NVRM: Xid (PCI:0000:61:00): 119, pid=1346198, name=nvidia-smi, Timeout waiting for RPC from GSP! Expected function 103 (GSP_RM_ALLOC) (0x80 0x38).
[Fri Mar 3 04:27:53 2023] NVRM: Xid (PCI:0000:61:00): 119, pid=1346198, name=nvidia-smi, Timeout waiting for RPC from GSP! Expected function 103 (GSP_RM_ALLOC) (0x2080 0x4).
[Fri Mar 3 04:28:38 2023] NVRM: Xid (PCI:0000:61:00): 119, pid=1346198, name=nvidia-smi, Timeout waiting for RPC from GSP! Expected function 76 (GSP_RM_CONTROL) (0x20800110 0x84).
[Fri Mar 3 04:29:23 2023] NVRM: Xid (PCI:0000:61:00): 119, pid=1346198, name=nvidia-smi, Timeout waiting for RPC from GSP! Expected function 76 (GSP_RM_CONTROL) (0x2080014b 0x5).
[Fri Mar 3 04:30:08 2023] NVRM: Xid (PCI:0000:61:00): 119, pid=1346198, name=nvidia-smi, Timeout waiting for RPC from GSP! Expected function 76 (GSP_RM_CONTROL) (0x20800157 0x0).
[Fri Mar 3 04:30:53 2023] NVRM: Xid (PCI:0000:61:00): 119, pid=1346198, name=nvidia-smi, Timeout waiting for RPC from GSP! Expected function 76 (GSP_RM_CONTROL) (0x2080014b 0x5).
[Fri Mar 3 04:31:38 2023] NVRM: Xid (PCI:0000:61:00): 119, pid=1346198, name=nvidia-smi, Timeout waiting for RPC from GSP! Expected function 76 (GSP_RM_CONTROL) (0x2080014b 0x5).
[Fri Mar 3 04:32:23 2023] NVRM: Xid (PCI:0000:61:00): 119, pid=1346198, name=nvidia-smi, Timeout waiting for RPC from GSP! Expected function 76 (GSP_RM_CONTROL) (0x2080014b 0x5).
[Fri Mar 3 04:33:08 2023] NVRM: Xid (PCI:0000:61:00): 119, pid=1346198, name=nvidia-smi, Timeout waiting for RPC from GSP! Expected function 76 (GSP_RM_CONTROL) (0x2080014b 0x5).
[Fri Mar 3 04:33:53 2023] NVRM: Xid (PCI:0000:61:00): 119, pid=1346198, name=nvidia-smi, Timeout waiting for RPC from GSP! Expected function 76 (GSP_RM_CONTROL) (0x2080852e 0x208).
[Fri Mar 3 04:34:38 2023] NVRM: Xid (PCI:0000:61:00): 119, pid=1346198, name=nvidia-smi, Timeout waiting for RPC from GSP! Expected function 76 (GSP_RM_CONTROL) (0x20808513 0x598).
[Fri Mar 3 04:35:23 2023] NVRM: Xid (PCI:0000:61:00): 119, pid=1346198, name=nvidia-smi, Timeout waiting for RPC from GSP! Expected function 76 (GSP_RM_CONTROL) (0x2080a068 0x4).
[Fri Mar 3 04:36:03 2023] INFO: task nvidia-smi:1346229 blocked for more than 120 seconds.
[Fri Mar 3 04:36:03 2023] Tainted: P OE 5.10.0-20-amd64 #1 Debian 5.10.158-2
[Fri Mar 3 04:36:03 2023] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[Fri Mar 3 04:36:03 2023] task:nvidia-smi state:D stack: 0 pid:1346229 ppid:1346228 flags:0x00000000
[Fri Mar 3 04:36:03 2023] Call Trace:
[Fri Mar 3 04:36:03 2023] __schedule+0x282/0x880
[Fri Mar 3 04:36:03 2023] ? rwsem_spin_on_owner+0x74/0xd0
[Fri Mar 3 04:36:03 2023] schedule+0x46/0xb0
[Fri Mar 3 04:36:03 2023] rwsem_down_write_slowpath+0x246/0x4d0
[Fri Mar 3 04:36:03 2023] os_acquire_rwlock_write+0x31/0x40 [nvidia]
[Fri Mar 3 04:36:03 2023] _nv038505rm+0xc/0x30 [nvidia]
[Fri Mar 3 04:36:03 2023] ? _nv039453rm+0x18d/0x1d0 [nvidia]
[Fri Mar 3 04:36:03 2023] ? _nv041182rm+0x45/0xd0 [nvidia]
[Fri Mar 3 04:36:03 2023] ? _nv041127rm+0x142/0x2b0 [nvidia]
[Fri Mar 3 04:36:03 2023] ? _nv039415rm+0x15a/0x2e0 [nvidia]
[Fri Mar 3 04:36:03 2023] ? _nv039416rm+0x5b/0x90 [nvidia]
[Fri Mar 3 04:36:03 2023] ? _nv039416rm+0x31/0x90 [nvidia]
[Fri Mar 3 04:36:03 2023] ? _nv000559rm+0x5a/0x70 [nvidia]
[Fri Mar 3 04:36:03 2023] ? _nv000559rm+0x33/0x70 [nvidia]
[Fri Mar 3 04:36:03 2023] ? _nv000694rm+0x94a/0xc80 [nvidia]
[Fri Mar 3 04:36:03 2023] ? rm_ioctl+0x54/0xb0 [nvidia]
[Fri Mar 3 04:36:03 2023] ? nvidia_ioctl+0x6cd/0x830 [nvidia]
[Fri Mar 3 04:36:03 2023] ? nvidia_frontend_unlocked_ioctl+0x37/0x50 [nvidia]
[Fri Mar 3 04:36:03 2023] ? __x64_sys_ioctl+0x8b/0xc0
[Fri Mar 3 04:36:03 2023] ? do_syscall_64+0x33/0x80
[Fri Mar 3 04:36:03 2023] ? entry_SYSCALL_64_after_hwframe+0x61/0xc6
[Fri Mar 3 04:36:08 2023] NVRM: Xid (PCI:0000:61:00): 119, pid=1346198, name=nvidia-smi, Timeout waiting for RPC from GSP! Expected function 76 (GSP_RM_CONTROL) (0x2080a618 0x181c).
[Fri Mar 3 04:36:53 2023] NVRM: Xid (PCI:0000:61:00): 119, pid=1346198, name=nvidia-smi, Timeout waiting for RPC from GSP! Expected function 76 (GSP_RM_CONTROL) (0x2080a612 0xd98).
[Fri Mar 3 04:37:38 2023] NVRM: Xid (PCI:0000:61:00): 119, pid=1346198, name=nvidia-smi, Timeout waiting for RPC from GSP! Expected function 76 (GSP_RM_CONTROL) (0x20809009 0x8).
[Fri Mar 3 04:38:23 2023] NVRM: Xid (PCI:0000:61:00): 119, pid=1346198, name=nvidia-smi, Timeout waiting for RPC from GSP! Expected function 76 (GSP_RM_CONTROL) (0x20800a4c 0x4).
[Fri Mar 3 04:39:08 2023] NVRM: Xid (PCI:0000:61:00): 119, pid=1346198, name=nvidia-smi, Timeout waiting for RPC from GSP! Expected function 76 (GSP_RM_CONTROL) (0x20800a4c 0x4).
[Fri Mar 3 04:39:53 2023] NVRM: Xid (PCI:0000:61:00): 119, pid=1346198, name=nvidia-smi, Timeout waiting for RPC from GSP! Expected function 76 (GSP_RM_CONTROL) (0x20800a4c 0x4).
[Fri Mar 3 04:40:38 2023] NVRM: Xid (PCI:0000:61:00): 119, pid=1346198, name=nvidia-smi, Timeout waiting for RPC from GSP! Expected function 103 (GSP_RM_ALLOC) (0x208f 0x0).
[Fri Mar 3 04:41:23 2023] NVRM: Xid (PCI:0000:61:00): 119, pid=1346198, name=nvidia-smi, Timeout waiting for RPC from GSP! Expected function 76 (GSP_RM_CONTROL) (0x208f1105 0x8).
[Fri Mar 3 04:42:08 2023] NVRM: Xid (PCI:0000:61:00): 119, pid=1346198, name=nvidia-smi, Timeout waiting for RPC from GSP! Expected function 10 (FREE) (0xa55a00a0 0x0).
[Fri Mar 3 04:42:53 2023] NVRM: Xid (PCI:0000:61:00): 119, pid=1346198, name=nvidia-smi, Timeout waiting for RPC from GSP! Expected function 10 (FREE) (0xa55a0030 0x0).
[Fri Mar 3 04:43:38 2023] NVRM: Xid (PCI:0000:61:00): 119, pid=1346198, name=nvidia-smi, Timeout waiting for RPC from GSP! Expected function 10 (FREE) (0xa55a0020 0x0).
[Fri Mar 3 04:44:24 2023] NVRM: Xid (PCI:0000:61:00): 119, pid=1365232, name=nvidia-smi, Timeout waiting for RPC from GSP! Expected function 76 (GSP_RM_CONTROL) (0x20801348 0x410).
[Fri Mar 3 04:45:09 2023] NVRM: Xid (PCI:0000:61:00): 119, pid=1365232, name=nvidia-smi, Timeout waiting for RPC from GSP! Expected function 103 (GSP_RM_ALLOC) (0x0 0x6c).
[Fri Mar 3 04:45:54 2023] NVRM: Xid (PCI:0000:61:00): 119, pid=1365232, name=nvidia-smi, Timeout waiting for RPC from GSP! Expected function 103 (GSP_RM_ALLOC) (0x80 0x38).
[Fri Mar 3 04:46:39 2023] NVRM: Xid (PCI:0000:61:00): 119, pid=1365232, name=nvidia-smi, Timeout waiting for RPC from GSP! Expected function 103 (GSP_RM_ALLOC) (0x2080 0x4).
[Fri Mar 3 04:47:24 2023] NVRM: Xid (PCI:0000:61:00): 119, pid=1365232, name=nvidia-smi, Timeout waiting for RPC from GSP! Expected function 76 (GSP_RM_CONTROL) (0x20800110 0x84).
[Fri Mar 3 04:48:09 2023] NVRM: Xid (PCI:0000:61:00): 119, pid=1365232, name=nvidia-smi, Timeout waiting for RPC from GSP! Expected function 76 (GSP_RM_CONTROL) (0x2080014b 0x5).
[Fri Mar 3 04:48:54 2023] NVRM: Xid (PCI:0000:61:00): 119, pid=1365232, name=nvidia-smi, Timeout waiting for RPC from GSP! Expected function 76 (GSP_RM_CONTROL) (0x20800157 0x0).
[Fri Mar 3 04:49:39 2023] NVRM: Xid (PCI:0000:61:00): 119, pid=1365232, name=nvidia-smi, Timeout waiting for RPC from GSP! Expected function 76 (GSP_RM_CONTROL) (0x2080014b 0x5).
[Fri Mar 3 04:50:24 2023] NVRM: Xid (PCI:0000:61:00): 119, pid=1365232, name=nvidia-smi, Timeout waiting for RPC from GSP! Expected function 76 (GSP_RM_CONTROL) (0x2080014b 0x5).
[Fri Mar 3 04:51:09 2023] NVRM: Xid (PCI:0000:61:00): 119, pid=1365232, name=nvidia-smi, Timeout waiting for RPC from GSP! Expected function 76 (GSP_RM_CONTROL) (0x2080014b 0x5).
[Fri Mar 3 04:51:54 2023] NVRM: Xid (PCI:0000:61:00): 119, pid=1365232, name=nvidia-smi, Timeout waiting for RPC from GSP! Expected function 76 (GSP_RM_CONTROL) (0x2080014b 0x5).
[Fri Mar 3 04:52:39 2023] NVRM: Xid (PCI:0000:61:00): 119, pid=1365232, name=nvidia-smi, Timeout waiting for RPC from GSP! Expected function 76 (GSP_RM_CONTROL) (0x2080852e 0x208).
[Fri Mar 3 04:53:24 2023] NVRM: Xid (PCI:0000:61:00): 119, pid=1365232, name=nvidia-smi, Timeout waiting for RPC from GSP! Expected function 76 (GSP_RM_CONTROL) (0x20808513 0x598).
[Fri Mar 3 04:54:09 2023] NVRM: Xid (PCI:0000:61:00): 119, pid=1365232, name=nvidia-smi, Timeout waiting for RPC from GSP! Expected function 76 (GSP_RM_CONTROL) (0x2080a068 0x4).
[Fri Mar 3 04:54:54 2023] NVRM: Xid (PCI:0000:61:00): 119, pid=1365232, name=nvidia-smi, Timeout waiting for RPC from GSP! Expected function 76 (GSP_RM_CONTROL) (0x2080a618 0x181c).
[Fri Mar 3 04:55:39 2023] NVRM: Xid (PCI:0000:61:00): 119, pid=1365232, name=nvidia-smi, Timeout waiting for RPC from GSP! Expected function 76 (GSP_RM_CONTROL) (0x2080a612 0xd98).
[Fri Mar 3 04:56:24 2023] NVRM: Xid (PCI:0000:61:00): 119, pid=1365232, name=nvidia-smi, Timeout waiting for RPC from GSP! Expected function 76 (GSP_RM_CONTROL) (0x20809009 0x8).
[Fri Mar 3 04:57:09 2023] NVRM: Xid (PCI:0000:61:00): 119, pid=1365232, name=nvidia-smi, Timeout waiting for RPC from GSP! Expected function 76 (GSP_RM_CONTROL) (0x20800a4c 0x4).
[Fri Mar 3 04:57:54 2023] NVRM: Xid (PCI:0000:61:00): 119, pid=1365232, name=nvidia-smi, Timeout waiting for RPC from GSP! Expected function 76 (GSP_RM_CONTROL) (0x20800a4c 0x4).
[Fri Mar 3 04:58:39 2023] NVRM: Xid (PCI:0000:61:00): 119, pid=1365232, name=nvidia-smi, Timeout waiting for RPC from GSP! Expected function 76 (GSP_RM_CONTROL) (0x20800a4c 0x4).
[Fri Mar 3 04:59:24 2023] NVRM: Xid (PCI:0000:61:00): 119, pid=1365232, name=nvidia-smi, Timeout waiting for RPC from GSP! Expected function 103 (GSP_RM_ALLOC) (0x208f 0x0).
[Fri Mar 3 05:00:09 2023] NVRM: Xid (PCI:0000:61:00): 119, pid=1365232, name=nvidia-smi, Timeout waiting for RPC from GSP! Expected function 76 (GSP_RM_CONTROL) (0x208f1105 0x8).
[Fri Mar 3 05:00:54 2023] NVRM: Xid (PCI:0000:61:00): 119, pid=1365232, name=nvidia-smi, Timeout waiting for RPC from GSP! Expected function 10 (FREE) (0xa55a00a0 0x0).
[Fri Mar 3 05:01:39 2023] NVRM: Xid (PCI:0000:61:00): 119, pid=1365232, name=nvidia-smi, Timeout waiting for RPC from GSP! Expected function 10 (FREE) (0xa55a0030 0x0).
[Fri Mar 3 05:02:24 2023] NVRM: Xid (PCI:0000:61:00): 119, pid=1365232, name=nvidia-smi, Timeout waiting for RPC from GSP! Expected function 10 (FREE) (0xa55a0020 0x0).
[Fri Mar 3 05:03:11 2023] NVRM: Xid (PCI:0000:61:00): 119, pid=1385412, name=nvidia-smi, Timeout waiting for RPC from GSP! Expected function 76 (GSP_RM_CONTROL) (0x20801348 0x410).
[Fri Mar 3 05:03:56 2023] NVRM: Xid (PCI:0000:61:00): 119, pid=1385412, name=nvidia-smi, Timeout waiting for RPC from GSP! Expected function 103 (GSP_RM_ALLOC) (0x0 0x6c).
[Fri Mar 3 05:04:41 2023] NVRM: Xid (PCI:0000:61:00): 119, pid=1385412, name=nvidia-smi, Timeout waiting for RPC from GSP! Expected function 103 (GSP_RM_ALLOC) (0x80 0x38).
[Fri Mar 3 05:05:26 2023] NVRM: Xid (PCI:0000:61:00): 119, pid=1385412, name=nvidia-smi, Timeout waiting for RPC from GSP! Expected function 103 (GSP_RM_ALLOC) (0x2080 0x4).
[Fri Mar 3 05:06:11 2023] NVRM: Xid (PCI:0000:61:00): 119, pid=1385412, name=nvidia-smi, Timeout waiting for RPC from GSP! Expected function 76 (GSP_RM_CONTROL) (0x20800110 0x84).
[Fri Mar 3 05:06:56 2023] NVRM: Xid (PCI:0000:61:00): 119, pid=1385412, name=nvidia-smi, Timeout waiting for RPC from GSP! Expected function 76 (GSP_RM_CONTROL) (0x2080014b 0x5).
[Fri Mar 3 05:07:41 2023] NVRM: Xid (PCI:0000:61:00): 119, pid=1385412, name=nvidia-smi, Timeout waiting for RPC from GSP! Expected function 76 (GSP_RM_CONTROL) (0x20800157 0x0).
Also happening here on a A100-PCIE-40GB using driver 530.30.02 and CUDA 12.1.
Hi @lpla, what's your use-case environment? Kubernetes?
There is no particular environment; the bug triggered several times with both the 525 and 530 drivers. It is a machine-learning inference command-line tool written in PyTorch.
Have you tried the 520.* driver? Does it work?
FWIW: Most of our users use PyTorch as well. Perhaps it tortures GPUs too hard ;-)
We also use PyTorch on the GPUs, but the 470 driver we used before was more stable.
Yepp, no problems with 470 here either.
Have you tried the 520.* driver? Does it work?
That's my next test. In fact, that's exactly the version I was using before upgrading last month from Ubuntu 20.04 with kernel 5.15 and driver 520 to Ubuntu 22.04 with kernel 5.19 and driver 525. It was working perfectly with that previous setup.
Same thing on another machine. FWIW: I removed /usr/lib/firmware/nvidia/525.60.13 - perhaps this fixes the problem.
UPDATE: They have confirmed this Xid 119 bug. They said that the GSP feature was introduced in version 510 but has not been fixed yet. They only offered the method of disabling it below, or suggested downgrading to a version < 510 (e.g. 470), which is more stable.

Hi @jelmd @lpla, as an NVIDIA customer we communicated with the NVIDIA support team today, and based on the nvidia-bug-report.log.gz they advised us to disable GSP-RM:

1. Disable GSP-RM:
```
sudo su -c 'echo options nvidia NVreg_EnableGpuFirmware=0 > /etc/modprobe.d/nvidia-gsp.conf'
```
2. Regenerate the initramfs:
```
# if ubuntu
sudo update-initramfs -u
# if centos
dracut -f
```
3. Reboot.
4. Check whether it worked. If `EnableGpuFirmware: 0`, then GSP-RM is disabled:
```
cat /proc/driver/nvidia/params | grep EnableGpuFirmware
```
Since our problem node still has tasks running, I haven't tried it yet; I will try this method tonight or tomorrow morning, just for reference. :)
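If you want to script the check in step 4 across many nodes, a small helper along these lines works (a sketch; the function name is my own, and the params path defaults to the standard /proc location):

```shell
#!/bin/sh
# Hypothetical helper: succeed (exit 0) when GSP-RM is disabled,
# given the path to the driver params file (normally
# /proc/driver/nvidia/params). The file contains lines such as
# "EnableGpuFirmware: 0" when the GSP firmware is turned off.
gsp_disabled() {
    params_file="${1:-/proc/driver/nvidia/params}"
    grep -q '^EnableGpuFirmware: 0$' "$params_file"
}
```

Usage: `gsp_disabled && echo "GSP-RM off" || echo "GSP-RM on"`.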
I'm also seeing XID 119s on some 510 drivers. Have not tried 525 or 520.
Driver 525.60.13 with an A40 and GSP disabled. But nvidia-bug-report also shows GSP timeout errors.
Hi @stephenroller @liming5619, maybe it's better to downgrade the driver version. On one hand, the GSP feature has shipped since driver 510 but this bug has not been fixed yet. On the other hand, 470 is an LTS branch that has been running stably in our production environment for a long time. I have already downgraded the driver on the problematic node to 470.82.01 to match our other production nodes, just for your reference. :)
So far disabling GSP seems to have mitigated the issue, but maybe I've just been lucky since. Will report back if I see counter-evidence.
Yepp, removing /usr/lib/firmware/nvidia/5xx.* seems to fix the problem, too (did not use NVreg_EnableGpuFirmware=0).
We disabled it on 2 hosts with 8x A100 GPUs each. If this workaround works, I will also give feedback.
Feedback: After a week, I can say all servers with the A100 boards are running stably after we disabled the GSP. No GPU crashes anymore.
@fighterhit thank you for sharing the workaround with us.
I have a similar issue: after disabling GSP, it took more than 5 minutes for the CUDA availability check to output "True".
# cat /etc/modprobe.d/nvidia-gsp.conf
options nvidia NVreg_EnableGpuFirmware=0
# cat /proc/driver/nvidia/params | grep EnableGpuFirmware
EnableGpuFirmware: 0
EnableGpuFirmwareLogs: 2
The strange thing is, I'm booting up VMs from images that have the GPU driver pre-installed; on a host with 4 cards, 2 out of the 4 cards end up with a similar issue.
Please suggest a fix, as it's hampering our prod environments. Please let me know if there is any additional command or log output I should provide.
We also have a few requirements based on CUDA 11.8, so we cannot roll back to driver 470.
@mdrasheek it could be that the driver config inside the VMs is overriding your changes to GSP. As it stands, disabling GSP should resolve the problem. Check in your logs which Xid error you are getting.
Before disabling GSP the error was the same as in this post:
NVRM: Xid (PCI:0000:01:01): 119, pid=8019, name=python3, Timeout waiting for RPC from GSP0! Expected function 103 (GSP_RM_ALLOC) (0x0 0x6c).
But after disabling it, I couldn't find any logs, though it takes a long time to report CUDA as "True".
Is there some way I can enable tracing, or increase the log level, to find the cause of this delay?
Hello everyone, I found that the NVIDIA official driver 525.105.17 recently released seems to have fixed this bug. Anyone who is interested could try it and give feedback on whether it works. :)
Nope, facing the same error:
modinfo nvidia|grep version
version: 525.105.17
rhelversion: 8.8
srcversion: 98F82D76E0EF3952EEE57A7
vermagic: 4.18.0-448.el8.x86_64 SMP mod_unload modversions
Apr 14 04:20:41 e2e-105-143 kernel: NVRM: Xid (PCI:0000:01:01): 119, pid=26071, name=python, Timeout waiting for RPC from GSP! Expected function 10 (FREE) (0x5c000001 0x0).
Hi @mdrasheek, what's your use-case environment? Kubernetes?
Our customers run ML/DL workloads using PyTorch. While running `import torch; torch.cuda.is_available()` they face the above Xid 119 error. After applying the workaround to disable the GSP firmware, it takes a few minutes to output "True".
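To put a number on that delay when comparing cards or driver versions, a small timing helper can wrap the check. A sketch (`timed` is my own helper; the PyTorch call appears only in the usage comment, since it needs a GPU machine):

```python
import time

def timed(fn, *args):
    """Return (result, elapsed_seconds) for a single call to fn."""
    start = time.monotonic()
    result = fn(*args)
    return result, time.monotonic() - start

# Usage on a GPU node (assumes PyTorch is installed):
#   ok, seconds = timed(lambda: __import__("torch").cuda.is_available())
#   print(ok, f"{seconds:.1f}s")
# A healthy node answers in well under a second; multi-minute times
# match the GSP-related delay described above.
```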
Did you reboot after applying the mitigation? I haven't seen any issues in pytorch since the first reboot without GSP.
Yes, a reboot was done; you can also see the output posted earlier:
cat /proc/driver/nvidia/params | grep EnableGpuFirmware
EnableGpuFirmware: 0
EnableGpuFirmwareLogs: 2
As I have already mentioned, this happens on 2 cards out of 4. The base machine has 4 cards, and each VM has 1 card. 2 cards work as expected, but the other 2 show the delay before reporting "True". I agree this is strange, but this is what I have observed.
May I know which mitigation you mean?
Hello
The new driver 525.105.17 (Linux) has the bug fixed by NVIDIA: "Fixed an issue specific to GSP-RM that could lead to GSP RPC timeout errors (Xid 119). The issue was introduced in the first 525 driver release and was not present in earlier drivers." https://docs.nvidia.com/datacenter/tesla/pdf/NVIDIA_Data_Center_GPU_Driver_Release_Notes_525_v3.0.pdf
Yes @pankajsahtech, I noticed this, but @mdrasheek said he still has the same problem (https://github.com/NVIDIA/open-gpu-kernel-modules/issues/446#issuecomment-1508127902).
@mdrasheek, @fighterhit: is it a DGX box? Please check the matrix for Xid errors; it could be that the GPU has a hardware error. https://docs.nvidia.com/deploy/xid-errors/index.html#topic_4
The same issues were raised when using Docker with GPU passthrough with the following packages on an NVIDIA A40.
Looks like you are using the 530.30.02 beta version; that version does not support the A40 card. https://www.nvidia.com/download/driverResults.aspx/199985/en-us/
Please check version 525.105.17.
No, it's not a DGX. It's an A100 80GB PCIe, and a VM is using it through passthrough.
The same Xid 119 issue occurs with the 520.105 driver.
I still encounter this problem on 1 out of 2 A100s after upgrading the driver to 530.41.03.
I want to highlight a point here: after enabling MIG and creating a full MIG slice, the GPU works without lag.
```
nvidia-smi -mig 1
nvidia-smi mig -cgi 0 -C
```
The above commands create slice 0, which is the entire 80GB MIG slice. But with MIG mode disabled, the lag comes back. I hope this may point the way to a resolution for someone.
We recently got a ThinkSystem SR670 V2 with 8 A100 80GB cards. We are running 530.30.02 from the yum CUDA repo under SLURM. We have had these Xid 119 errors on 3 different GPUs so far, over about 3 weeks of typical TensorFlow jobs.
Going to try disabling GSP as mentioned above.
@mikelou2012 maybe give 525.125.06 a try with GSP enabled.
Hi everyone, please try with the following driver:
wget https://us.download.nvidia.com/tesla/535.54.03/NVIDIA-Linux-x86_64-535.54.03.run
In our testing, it works as expected without any changes to the GSP firmware and with MIG disabled.
Unfortunately, we ended up with a similar issue on 535.54 on one of our machines:

```
Jul 11 11:42:08 e2e-85-84 kernel: [ 1412.890184] NVRM: Xid (PCI:0000:01:01): 119, pid=3395, name=python3, Timeout waiting for RPC from GSP0! Expected function 103 (GSP_RM_ALLOC) (0x80 0x38).
Jul 11 11:42:14 e2e-85-84 kernel: [ 1418.894227] NVRM: Rate limiting GSP RPC error prints for GPU at PCI:0000:01:01 (printing 1 of every 30). The GPU likely needs to be reset.
Jul 11 11:42:38 e2e-85-84 kernel: [ 1442.922569] NVRM: Xid (PCI:0000:01:02): 119, pid=3395, name=python3, Timeout waiting for RPC from GSP1! Expected function 10 (FREE) (0x5c00000c 0x0).
Jul 11 11:42:44 e2e-85-84 kernel: [ 1448.926199] NVRM: Xid (PCI:0000:01:02): 119, pid=3395, name=python3, Timeout waiting for RPC from GSP1! Expected function 10 (FREE) (0x5c00000b 0x0).
Jul 11 11:42:50 e2e-85-84 kernel: [ 1454.930190] NVRM: Rate limiting GSP RPC error prints for GPU at PCI:0000:01:02 (printing 1 of every 30). The GPU likely needs to be reset.
```
Even after running a GPU reset, I'm facing a similar error:
```
Jul 11 12:53:46 e2e-85-84 kernel: [ 5710.457563] NVRM: GPU at PCI:0000:01:02: GPU-6e4d9996-6edf-b18c-debb-269edc01a143
Jul 11 12:53:46 e2e-85-84 kernel: [ 5710.457570] NVRM: GPU Board Serial Number: 1324220003349
Jul 11 12:53:46 e2e-85-84 kernel: [ 5710.457571] NVRM: Xid (PCI:0000:01:02): 119, pid=10274, name=python3, Timeout waiting for RPC from GSP1! Expected function 76 (GSP_RM_CONTROL) (0x20803032 0x58c).
Jul 11 12:53:46 e2e-85-84 kernel: [ 5710.457585] CPU: 11 PID: 10274 Comm: python3 Tainted: P OE 5.15.0-69-generic #76~20.04.1-Ubuntu
Jul 11 12:53:46 e2e-85-84 kernel: [ 5710.457588] Hardware name: Red Hat KVM, BIOS 1.11.0-2.el7 04/01/2014
Jul 11 12:53:46 e2e-85-84 kernel: [ 5710.457590] Call Trace:
Jul 11 12:53:46 e2e-85-84 kernel: [ 5710.457592] <TASK>
Jul 11 12:53:46 e2e-85-84 kernel: [ 5710.457595] dump_stack_lvl+0x4a/0x63
Jul 11 12:53:46 e2e-85-84 kernel: [ 5710.457601] dump_stack+0x10/0x16
Jul 11 12:53:46 e2e-85-84 kernel: [ 5710.457605] os_dump_stack+0xe/0x14 [nvidia]
Jul 11 12:53:46 e2e-85-84 kernel: [ 5710.457883] _nv011486rm+0x3ff/0x480 [nvidia]
Jul 11 12:53:46 e2e-85-84 kernel: [ 5710.458325] ? _nv011409rm+0x5d/0x310 [nvidia]
Jul 11 12:53:46 e2e-85-84 kernel: [ 5710.458654] ? _nv043706rm+0x4b4/0x6e0 [nvidia]
Jul 11 12:53:46 e2e-85-84 kernel: [ 5710.458996] ? _nv033719rm+0x64/0x130 [nvidia]
Jul 11 12:53:46 e2e-85-84 kernel: [ 5710.459416] ? _nv045949rm+0x10d/0xbe0 [nvidia]
Jul 11 12:53:46 e2e-85-84 kernel: [ 5710.459841] ? _nv042963rm+0x1a9/0x1b0 [nvidia]
Jul 11 12:53:46 e2e-85-84 kernel: [ 5710.460154] ? _nv044912rm+0x1f1/0x300 [nvidia]
Jul 11 12:53:46 e2e-85-84 kernel: [ 5710.460464] ? _nv013130rm+0x335/0x630 [nvidia]
Jul 11 12:53:46 e2e-85-84 kernel: [ 5710.460745] ? _nv043107rm+0x69/0xd0 [nvidia]
Jul 11 12:53:46 e2e-85-84 kernel: [ 5710.461030] ? _nv011651rm+0x86/0xa0 [nvidia]
Jul 11 12:53:46 e2e-85-84 kernel: [ 5710.461305] ? _nv000714rm+0x9c1/0xe70 [nvidia]
Jul 11 12:53:46 e2e-85-84 kernel: [ 5710.461578] ? rm_ioctl+0x58/0xb0 [nvidia]
Jul 11 12:53:46 e2e-85-84 kernel: [ 5710.461849] ? nvidia_ioctl+0x710/0x870 [nvidia]
Jul 11 12:53:46 e2e-85-84 kernel: [ 5710.462097] ? do_syscall_64+0x69/0xc0
Jul 11 12:53:46 e2e-85-84 kernel: [ 5710.462102] ? nvidia_frontend_unlocked_ioctl+0x58/0x90 [nvidia]
Jul 11 12:53:46 e2e-85-84 kernel: [ 5710.462353] ? __x64_sys_ioctl+0x95/0xd0
Jul 11 12:53:46 e2e-85-84 kernel: [ 5710.462357] ? do_syscall_64+0x5c/0xc0
Jul 11 12:53:46 e2e-85-84 kernel: [ 5710.462359] ? entry_SYSCALL_64_after_hwframe+0x61/0xcb
Jul 11 12:53:46 e2e-85-84 kernel: [ 5710.462363] </TASK>
Jul 11 12:53:52 e2e-85-84 kernel: [ 5716.465188] NVRM: Xid (PCI:0000:01:01): 119, pid=10274, name=python3, Timeout waiting for RPC from GSP0! Expected function 103 (GSP_RM_ALLOC) (0x0 0x6c).
Jul 11 12:53:58 e2e-85-84 kernel: [ 5722.468809] NVRM: Xid (PCI:0000:01:01): 119, pid=10274, name=python3, Timeout waiting for RPC from GSP0! Expected function 103 (GSP_RM_ALLOC) (0x80 0x38).
Jul 11 12:54:04 e2e-85-84 kernel: [ 5728.472326] NVRM: Rate limiting GSP RPC error prints for GPU at PCI:0000:01:01 (printing 1 of every 30). The GPU likely needs to be reset.
```
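For anyone triaging similar reports across a fleet, a quick way to see which GPUs have logged the GSP RPC timeout is to filter the kernel log for Xid 119 lines. A minimal sketch (the script name is hypothetical; it reads log text from stdin, e.g. `dmesg | sh check-xid119.sh`):

```shell
#!/bin/sh
# check-xid119.sh (hypothetical name): print the unique PCI addresses of GPUs
# that logged Xid 119 "Timeout waiting for RPC from GSP" errors.
grep -E 'NVRM: Xid \(PCI:[^)]+\): 119' |
  sed -E 's/.*Xid \((PCI:[^)]+)\): 119.*/\1/' |
  sort -u
```

Each reported address can then be cross-checked against `nvidia-smi` to decide which boards need a reset.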
NVIDIA Open GPU Kernel Modules Version
525.85.05
Does this happen with the proprietary driver (of the same version) as well?
I cannot test this
Operating System and Version
Arch Linux
Kernel Release
Linux [HOSTNAME] 6.1.8-arch1-1 #1 SMP PREEMPT_DYNAMIC Tue, 24 Jan 2023 21:07:04 +0000 x86_64 GNU/Linux
Hardware: GPU
GPU 0: NVIDIA GeForce RTX 3050 Laptop GPU (UUID: GPU-071149ae-386e-0017-3b5b-7ea80801f725)
Describe the bug
When I run an OpenGL application, such as Yamagi Quake II, at some point the whole system freezes and runs at about 1 FPS. I usually have to REISUB when this happens.
To Reproduce
Bug Incidence
Always
nvidia-bug-report.log.gz
More Info
Related: #272