Open ghost opened 1 year ago
This looks a lot like nvbug 3806304.
I hit this on 525.85.12 with an A30. This issue seems to exist on 525.60.13 with the A40 and A100 as well. Please fix ASAP!
bug-nv-smi.txt
dmesg.txt
Hi @aritger, is there any solution for this?
I think "Timeout waiting for RPC from GSP!" is a pretty generic symptom, with many possible causes. The reproduction steps that lead up to it will matter to help distinguish different bugs, as will the other NVRM dmesg spew around it.
I don't have specific reason to expect it is already fixed, but it may be worth testing the most recent 525.89.02 driver. 530.xx drivers will hopefully be released soon, and they will have a lot of changes relative to 525.xx, so that will also be worth testing.
Beyond that, if you see similar "Timeout waiting for RPC from GSP!" messages, it is worth attaching a complete nvidia-bug-report.log.gz, and describing the steps that led to it, so that we can compare instances of the symptom.
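Since the same NVRM log line recurs throughout this thread, one quick way to compare instances of the symptom is to pull the Xid 119 events out of dmesg programmatically. A minimal sketch (the function and field names are my own, not part of any NVIDIA tooling):

```python
import re

# Matches the NVRM lines quoted later in this thread, e.g.:
# NVRM: Xid (PCI:0000:61:00): 119, pid=1344368, name=nvidia-smi,
#   Timeout waiting for RPC from GSP! Expected function 10 (FREE) (0xa55a00a0 0x0).
XID_RE = re.compile(
    r"NVRM: Xid \((PCI:[0-9a-fA-F:\.]+)\): (\d+), pid=(\d+), name=([\w\-\.]+), "
    r"Timeout waiting for RPC from GSP\d*! Expected function (\d+) \((\w+)\)"
)

def parse_gsp_timeouts(dmesg_text):
    """Return one dict per 'Timeout waiting for RPC from GSP' line."""
    events = []
    for m in XID_RE.finditer(dmesg_text):
        pci, xid, pid, name, func, func_name = m.groups()
        events.append({
            "pci": pci,
            "xid": int(xid),
            "pid": int(pid),
            "process": name,
            "function": int(func),
            "function_name": func_name,
        })
    return events
```

Grouping the resulting events by PCI address and expected function makes it easy to see whether different machines are stuck on the same RPC (FREE vs. GSP_RM_CONTROL vs. GSP_RM_ALLOC).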
Thanks @aritger, nvidia-bug-report.log.gz attached.
The problem occurs after running in a Kubernetes environment for a period of time; nvidia-smi then gets stuck for a while. The specific error phenomenon is similar to @jelmd's in https://github.com/NVIDIA/open-gpu-kernel-modules/issues/446#issuecomment-1445190598. FYI, I got some help from the nvidia-docker community (https://github.com/NVIDIA/nvidia-docker/issues/1648#issuecomment-1441139460), but I'm not sure whether the root cause is the driver or NVIDIA-docker.
FWIW: We do not use nvidia-docker or similar bloat, just plain lxc, and we pass the devices through to the related zones alias containers as needed. So IMHO the nvidia-container-toolkit is not really related to the problem.
Happened again on another machine:
...
[ +0.000094] ? _nv011159rm+0x62/0x2e0 [nvidia]
[ +0.000090] ? _nv039897rm+0xdb/0x140 [nvidia]
[ +0.000073] ? _nv041022rm+0x2ce/0x3a0 [nvidia]
[ +0.000103] ? _nv015438rm+0x788/0x800 [nvidia]
[ +0.000064] ? _nv039416rm+0xac/0xe0 [nvidia]
[ +0.000092] ? _nv041024rm+0xac/0x140 [nvidia]
[ +0.000095] ? _nv041023rm+0x37a/0x4d0 [nvidia]
[ +0.000070] ? _nv039319rm+0xc9/0x150 [nvidia]
[ +0.000151] ? _nv039320rm+0x42/0x70 [nvidia]
[ +0.000180] ? _nv000552rm+0x49/0x60 [nvidia]
[ +0.000219] ? _nv000694rm+0x7fb/0xc80 [nvidia]
[ +0.000195] ? rm_ioctl+0x54/0xb0 [nvidia]
[ +0.000132] ? nvidia_ioctl+0x6e3/0x850 [nvidia]
[ +0.000003] ? get_max_files+0x20/0x20
[ +0.000134] ? nvidia_frontend_unlocked_ioctl+0x3b/0x50 [nvidia]
[ +0.000002] ? do_vfs_ioctl+0x407/0x670
[ +0.000003] ? __secure_computing+0xa4/0x110
[ +0.000002] ? ksys_ioctl+0x67/0x90
[ +0.000002] ? __x64_sys_ioctl+0x1a/0x20
[ +0.000002] ? do_syscall_64+0x57/0x190
[ +0.000002] ? entry_SYSCALL_64_after_hwframe+0x5c/0xc1
[ +6.010643] NVRM: Xid (PCI:0000:83:00): 119, pid=2710030, name=nvidia-smi, Timeout waiting for RPC from GSP! Expected function 10 (FREE) (0xa55a0050 0x0).
...
Mon Feb 27 16:08:33 2023
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 525.60.13 Driver Version: 525.60.13 CUDA Version: 12.0 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|===============================+======================+======================|
| 0 NVIDIA A40 On | 00000000:03:00.0 Off | 0 |
| 0% 24C P8 14W / 300W | 0MiB / 46068MiB | 0% Default |
| | | N/A |
+-------------------------------+----------------------+----------------------+
| 1 NVIDIA A40 On | 00000000:04:00.0 Off | 0 |
| 0% 25C P8 13W / 300W | 0MiB / 46068MiB | 0% Default |
| | | N/A |
+-------------------------------+----------------------+----------------------+
| 2 NVIDIA A40 On | 00000000:43:00.0 Off | 0 |
| 0% 40C P0 81W / 300W | 821MiB / 46068MiB | 0% Default |
| | | N/A |
+-------------------------------+----------------------+----------------------+
| 3 NVIDIA A40 On | 00000000:44:00.0 Off | 0 |
| 0% 30C P8 14W / 300W | 0MiB / 46068MiB | 0% Default |
| | | N/A |
+-------------------------------+----------------------+----------------------+
| 4 ERR! On | 00000000:83:00.0 Off | ERR! |
|ERR! ERR! ERR! ERR! / ERR! | 0MiB / 46068MiB | ERR! Default |
| | | ERR! |
+-------------------------------+----------------------+----------------------+
| 5 NVIDIA A40 On | 00000000:84:00.0 Off | 0 |
| 0% 23C P8 14W / 300W | 3MiB / 46068MiB | 0% Default |
| | | N/A |
+-------------------------------+----------------------+----------------------+
| 6 NVIDIA A40 On | 00000000:C3:00.0 Off | 0 |
| 0% 22C P8 15W / 300W | 3MiB / 46068MiB | 0% Default |
| | | N/A |
+-------------------------------+----------------------+----------------------+
| 7 NVIDIA A40 On | 00000000:C4:00.0 Off | 0 |
| 0% 21C P8 14W / 300W | 3MiB / 46068MiB | 0% Default |
| | | N/A |
+-------------------------------+----------------------+----------------------+
+-----------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=============================================================================|
| 2 N/A N/A 2111354 C ...eratornet/venv/bin/python 818MiB |
+-----------------------------------------------------------------------------+
@jelmd +1, I ran into this problem again. Hi @aritger @Joshua-Ashton, maybe this is a driver issue; please take a look.
Fri Mar 3 14:41:55 2023
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 525.85.12 Driver Version: 525.85.12 CUDA Version: 12.0 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|===============================+======================+======================|
| 0 NVIDIA A30 Off | 00000000:01:00.0 Off | 0 |
| N/A 26C P0 29W / 165W | 0MiB / 24576MiB | 0% Default |
| | | Disabled |
+-------------------------------+----------------------+----------------------+
| 1 NVIDIA A30 Off | 00000000:22:00.0 Off | 0 |
| N/A 40C P0 92W / 165W | 16096MiB / 24576MiB | 88% Default |
| | | Disabled |
+-------------------------------+----------------------+----------------------+
| 2 NVIDIA A30 Off | 00000000:41:00.0 Off | 0 |
| N/A 42C P0 141W / 165W | 14992MiB / 24576MiB | 33% Default |
| | | Disabled |
+-------------------------------+----------------------+----------------------+
| 3 ERR! Off | 00000000:61:00.0 Off | ERR! |
|ERR! ERR! ERR! ERR! / ERR! | 23025MiB / 24576MiB | ERR! Default |
| | | ERR! |
+-------------------------------+----------------------+----------------------+
| 4 NVIDIA A30 Off | 00000000:81:00.0 Off | 0 |
| N/A 26C P0 26W / 165W | 0MiB / 24576MiB | 0% Default |
| | | Disabled |
+-------------------------------+----------------------+----------------------+
| 5 NVIDIA A30 Off | 00000000:A1:00.0 Off | 0 |
| N/A 26C P0 28W / 165W | 0MiB / 24576MiB | 0% Default |
| | | Disabled |
+-------------------------------+----------------------+----------------------+
| 6 NVIDIA A30 Off | 00000000:C1:00.0 Off | 0 |
| N/A 25C P0 29W / 165W | 0MiB / 24576MiB | 0% Default |
| | | Disabled |
+-------------------------------+----------------------+----------------------+
| 7 NVIDIA A30 Off | 00000000:E1:00.0 Off | 0 |
| N/A 24C P0 25W / 165W | 0MiB / 24576MiB | 0% Default |
| | | Disabled |
+-------------------------------+----------------------+----------------------+
[Fri Mar 3 04:23:22 2023] NVRM: GPU at PCI:0000:61:00: GPU-e59ce3f9-af53-a0dd-1d2c-8beaa74aa635
[Fri Mar 3 04:23:22 2023] NVRM: GPU Board Serial Number: 1322621149782
[Fri Mar 3 04:23:22 2023] NVRM: Xid (PCI:0000:61:00): 119, pid=1344368, name=nvidia-smi, Timeout waiting for RPC from GSP! Expected function 10 (FREE) (0xa55a00a0 0x0).
[Fri Mar 3 04:23:22 2023] CPU: 72 PID: 1344368 Comm: nvidia-smi Tainted: P OE 5.10.0-20-amd64 #1 Debian 5.10.158-2
[Fri Mar 3 04:23:22 2023] Hardware name: Inspur NF5468A5/YZMB-02382-101, BIOS 4.02.12 01/28/2022
[Fri Mar 3 04:23:22 2023] Call Trace:
[Fri Mar 3 04:23:22 2023] dump_stack+0x6b/0x83
[Fri Mar 3 04:23:22 2023] _nv011231rm+0x39d/0x470 [nvidia]
[Fri Mar 3 04:23:22 2023] ? _nv011168rm+0x62/0x2e0 [nvidia]
[Fri Mar 3 04:23:22 2023] ? _nv040022rm+0xdb/0x140 [nvidia]
[Fri Mar 3 04:23:22 2023] ? _nv041148rm+0x2ce/0x3a0 [nvidia]
[Fri Mar 3 04:23:22 2023] ? _nv015451rm+0x788/0x800 [nvidia]
[Fri Mar 3 04:23:22 2023] ? _nv039541rm+0xac/0xe0 [nvidia]
[Fri Mar 3 04:23:22 2023] ? _nv041150rm+0xac/0x140 [nvidia]
[Fri Mar 3 04:23:22 2023] ? _nv041149rm+0x37a/0x4d0 [nvidia]
[Fri Mar 3 04:23:22 2023] ? _nv039443rm+0xc9/0x150 [nvidia]
[Fri Mar 3 04:23:22 2023] ? _nv039444rm+0x42/0x70 [nvidia]
[Fri Mar 3 04:23:22 2023] ? _nv000554rm+0x49/0x60 [nvidia]
[Fri Mar 3 04:23:22 2023] ? _nv000694rm+0x7fb/0xc80 [nvidia]
[Fri Mar 3 04:23:22 2023] ? rm_ioctl+0x54/0xb0 [nvidia]
[Fri Mar 3 04:23:22 2023] ? nvidia_ioctl+0x6cd/0x830 [nvidia]
[Fri Mar 3 04:23:22 2023] ? nvidia_frontend_unlocked_ioctl+0x37/0x50 [nvidia]
[Fri Mar 3 04:23:22 2023] ? __x64_sys_ioctl+0x8b/0xc0
[Fri Mar 3 04:23:22 2023] ? do_syscall_64+0x33/0x80
[Fri Mar 3 04:23:22 2023] ? entry_SYSCALL_64_after_hwframe+0x61/0xc6
[Fri Mar 3 04:24:07 2023] NVRM: Xid (PCI:0000:61:00): 119, pid=1344368, name=nvidia-smi, Timeout waiting for RPC from GSP! Expected function 10 (FREE) (0xa55a0030 0x0).
[Fri Mar 3 04:24:52 2023] NVRM: Xid (PCI:0000:61:00): 119, pid=1344368, name=nvidia-smi, Timeout waiting for RPC from GSP! Expected function 10 (FREE) (0xa55a0020 0x0).
[Fri Mar 3 04:25:38 2023] NVRM: Xid (PCI:0000:61:00): 119, pid=1346198, name=nvidia-smi, Timeout waiting for RPC from GSP! Expected function 76 (GSP_RM_CONTROL) (0x20801348 0x410).
[Fri Mar 3 04:26:23 2023] NVRM: Xid (PCI:0000:61:00): 119, pid=1346198, name=nvidia-smi, Timeout waiting for RPC from GSP! Expected function 103 (GSP_RM_ALLOC) (0x0 0x6c).
[Fri Mar 3 04:27:08 2023] NVRM: Xid (PCI:0000:61:00): 119, pid=1346198, name=nvidia-smi, Timeout waiting for RPC from GSP! Expected function 103 (GSP_RM_ALLOC) (0x80 0x38).
[Fri Mar 3 04:27:53 2023] NVRM: Xid (PCI:0000:61:00): 119, pid=1346198, name=nvidia-smi, Timeout waiting for RPC from GSP! Expected function 103 (GSP_RM_ALLOC) (0x2080 0x4).
[Fri Mar 3 04:28:38 2023] NVRM: Xid (PCI:0000:61:00): 119, pid=1346198, name=nvidia-smi, Timeout waiting for RPC from GSP! Expected function 76 (GSP_RM_CONTROL) (0x20800110 0x84).
[Fri Mar 3 04:29:23 2023] NVRM: Xid (PCI:0000:61:00): 119, pid=1346198, name=nvidia-smi, Timeout waiting for RPC from GSP! Expected function 76 (GSP_RM_CONTROL) (0x2080014b 0x5).
[Fri Mar 3 04:30:08 2023] NVRM: Xid (PCI:0000:61:00): 119, pid=1346198, name=nvidia-smi, Timeout waiting for RPC from GSP! Expected function 76 (GSP_RM_CONTROL) (0x20800157 0x0).
[Fri Mar 3 04:30:53 2023] NVRM: Xid (PCI:0000:61:00): 119, pid=1346198, name=nvidia-smi, Timeout waiting for RPC from GSP! Expected function 76 (GSP_RM_CONTROL) (0x2080014b 0x5).
[Fri Mar 3 04:31:38 2023] NVRM: Xid (PCI:0000:61:00): 119, pid=1346198, name=nvidia-smi, Timeout waiting for RPC from GSP! Expected function 76 (GSP_RM_CONTROL) (0x2080014b 0x5).
[Fri Mar 3 04:32:23 2023] NVRM: Xid (PCI:0000:61:00): 119, pid=1346198, name=nvidia-smi, Timeout waiting for RPC from GSP! Expected function 76 (GSP_RM_CONTROL) (0x2080014b 0x5).
[Fri Mar 3 04:33:08 2023] NVRM: Xid (PCI:0000:61:00): 119, pid=1346198, name=nvidia-smi, Timeout waiting for RPC from GSP! Expected function 76 (GSP_RM_CONTROL) (0x2080014b 0x5).
[Fri Mar 3 04:33:53 2023] NVRM: Xid (PCI:0000:61:00): 119, pid=1346198, name=nvidia-smi, Timeout waiting for RPC from GSP! Expected function 76 (GSP_RM_CONTROL) (0x2080852e 0x208).
[Fri Mar 3 04:34:38 2023] NVRM: Xid (PCI:0000:61:00): 119, pid=1346198, name=nvidia-smi, Timeout waiting for RPC from GSP! Expected function 76 (GSP_RM_CONTROL) (0x20808513 0x598).
[Fri Mar 3 04:35:23 2023] NVRM: Xid (PCI:0000:61:00): 119, pid=1346198, name=nvidia-smi, Timeout waiting for RPC from GSP! Expected function 76 (GSP_RM_CONTROL) (0x2080a068 0x4).
[Fri Mar 3 04:36:03 2023] INFO: task nvidia-smi:1346229 blocked for more than 120 seconds.
[Fri Mar 3 04:36:03 2023] Tainted: P OE 5.10.0-20-amd64 #1 Debian 5.10.158-2
[Fri Mar 3 04:36:03 2023] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[Fri Mar 3 04:36:03 2023] task:nvidia-smi state:D stack: 0 pid:1346229 ppid:1346228 flags:0x00000000
[Fri Mar 3 04:36:03 2023] Call Trace:
[Fri Mar 3 04:36:03 2023] __schedule+0x282/0x880
[Fri Mar 3 04:36:03 2023] ? rwsem_spin_on_owner+0x74/0xd0
[Fri Mar 3 04:36:03 2023] schedule+0x46/0xb0
[Fri Mar 3 04:36:03 2023] rwsem_down_write_slowpath+0x246/0x4d0
[Fri Mar 3 04:36:03 2023] os_acquire_rwlock_write+0x31/0x40 [nvidia]
[Fri Mar 3 04:36:03 2023] _nv038505rm+0xc/0x30 [nvidia]
[Fri Mar 3 04:36:03 2023] ? _nv039453rm+0x18d/0x1d0 [nvidia]
[Fri Mar 3 04:36:03 2023] ? _nv041182rm+0x45/0xd0 [nvidia]
[Fri Mar 3 04:36:03 2023] ? _nv041127rm+0x142/0x2b0 [nvidia]
[Fri Mar 3 04:36:03 2023] ? _nv039415rm+0x15a/0x2e0 [nvidia]
[Fri Mar 3 04:36:03 2023] ? _nv039416rm+0x5b/0x90 [nvidia]
[Fri Mar 3 04:36:03 2023] ? _nv039416rm+0x31/0x90 [nvidia]
[Fri Mar 3 04:36:03 2023] ? _nv000559rm+0x5a/0x70 [nvidia]
[Fri Mar 3 04:36:03 2023] ? _nv000559rm+0x33/0x70 [nvidia]
[Fri Mar 3 04:36:03 2023] ? _nv000694rm+0x94a/0xc80 [nvidia]
[Fri Mar 3 04:36:03 2023] ? rm_ioctl+0x54/0xb0 [nvidia]
[Fri Mar 3 04:36:03 2023] ? nvidia_ioctl+0x6cd/0x830 [nvidia]
[Fri Mar 3 04:36:03 2023] ? nvidia_frontend_unlocked_ioctl+0x37/0x50 [nvidia]
[Fri Mar 3 04:36:03 2023] ? __x64_sys_ioctl+0x8b/0xc0
[Fri Mar 3 04:36:03 2023] ? do_syscall_64+0x33/0x80
[Fri Mar 3 04:36:03 2023] ? entry_SYSCALL_64_after_hwframe+0x61/0xc6
[Fri Mar 3 04:36:08 2023] NVRM: Xid (PCI:0000:61:00): 119, pid=1346198, name=nvidia-smi, Timeout waiting for RPC from GSP! Expected function 76 (GSP_RM_CONTROL) (0x2080a618 0x181c).
[Fri Mar 3 04:36:53 2023] NVRM: Xid (PCI:0000:61:00): 119, pid=1346198, name=nvidia-smi, Timeout waiting for RPC from GSP! Expected function 76 (GSP_RM_CONTROL) (0x2080a612 0xd98).
[Fri Mar 3 04:37:38 2023] NVRM: Xid (PCI:0000:61:00): 119, pid=1346198, name=nvidia-smi, Timeout waiting for RPC from GSP! Expected function 76 (GSP_RM_CONTROL) (0x20809009 0x8).
[Fri Mar 3 04:38:23 2023] NVRM: Xid (PCI:0000:61:00): 119, pid=1346198, name=nvidia-smi, Timeout waiting for RPC from GSP! Expected function 76 (GSP_RM_CONTROL) (0x20800a4c 0x4).
[Fri Mar 3 04:39:08 2023] NVRM: Xid (PCI:0000:61:00): 119, pid=1346198, name=nvidia-smi, Timeout waiting for RPC from GSP! Expected function 76 (GSP_RM_CONTROL) (0x20800a4c 0x4).
[Fri Mar 3 04:39:53 2023] NVRM: Xid (PCI:0000:61:00): 119, pid=1346198, name=nvidia-smi, Timeout waiting for RPC from GSP! Expected function 76 (GSP_RM_CONTROL) (0x20800a4c 0x4).
[Fri Mar 3 04:40:38 2023] NVRM: Xid (PCI:0000:61:00): 119, pid=1346198, name=nvidia-smi, Timeout waiting for RPC from GSP! Expected function 103 (GSP_RM_ALLOC) (0x208f 0x0).
[Fri Mar 3 04:41:23 2023] NVRM: Xid (PCI:0000:61:00): 119, pid=1346198, name=nvidia-smi, Timeout waiting for RPC from GSP! Expected function 76 (GSP_RM_CONTROL) (0x208f1105 0x8).
[Fri Mar 3 04:42:08 2023] NVRM: Xid (PCI:0000:61:00): 119, pid=1346198, name=nvidia-smi, Timeout waiting for RPC from GSP! Expected function 10 (FREE) (0xa55a00a0 0x0).
[Fri Mar 3 04:42:53 2023] NVRM: Xid (PCI:0000:61:00): 119, pid=1346198, name=nvidia-smi, Timeout waiting for RPC from GSP! Expected function 10 (FREE) (0xa55a0030 0x0).
[Fri Mar 3 04:43:38 2023] NVRM: Xid (PCI:0000:61:00): 119, pid=1346198, name=nvidia-smi, Timeout waiting for RPC from GSP! Expected function 10 (FREE) (0xa55a0020 0x0).
[Fri Mar 3 04:44:24 2023] NVRM: Xid (PCI:0000:61:00): 119, pid=1365232, name=nvidia-smi, Timeout waiting for RPC from GSP! Expected function 76 (GSP_RM_CONTROL) (0x20801348 0x410).
[Fri Mar 3 04:45:09 2023] NVRM: Xid (PCI:0000:61:00): 119, pid=1365232, name=nvidia-smi, Timeout waiting for RPC from GSP! Expected function 103 (GSP_RM_ALLOC) (0x0 0x6c).
[Fri Mar 3 04:45:54 2023] NVRM: Xid (PCI:0000:61:00): 119, pid=1365232, name=nvidia-smi, Timeout waiting for RPC from GSP! Expected function 103 (GSP_RM_ALLOC) (0x80 0x38).
[Fri Mar 3 04:46:39 2023] NVRM: Xid (PCI:0000:61:00): 119, pid=1365232, name=nvidia-smi, Timeout waiting for RPC from GSP! Expected function 103 (GSP_RM_ALLOC) (0x2080 0x4).
[Fri Mar 3 04:47:24 2023] NVRM: Xid (PCI:0000:61:00): 119, pid=1365232, name=nvidia-smi, Timeout waiting for RPC from GSP! Expected function 76 (GSP_RM_CONTROL) (0x20800110 0x84).
[Fri Mar 3 04:48:09 2023] NVRM: Xid (PCI:0000:61:00): 119, pid=1365232, name=nvidia-smi, Timeout waiting for RPC from GSP! Expected function 76 (GSP_RM_CONTROL) (0x2080014b 0x5).
[Fri Mar 3 04:48:54 2023] NVRM: Xid (PCI:0000:61:00): 119, pid=1365232, name=nvidia-smi, Timeout waiting for RPC from GSP! Expected function 76 (GSP_RM_CONTROL) (0x20800157 0x0).
[Fri Mar 3 04:49:39 2023] NVRM: Xid (PCI:0000:61:00): 119, pid=1365232, name=nvidia-smi, Timeout waiting for RPC from GSP! Expected function 76 (GSP_RM_CONTROL) (0x2080014b 0x5).
[Fri Mar 3 04:50:24 2023] NVRM: Xid (PCI:0000:61:00): 119, pid=1365232, name=nvidia-smi, Timeout waiting for RPC from GSP! Expected function 76 (GSP_RM_CONTROL) (0x2080014b 0x5).
[Fri Mar 3 04:51:09 2023] NVRM: Xid (PCI:0000:61:00): 119, pid=1365232, name=nvidia-smi, Timeout waiting for RPC from GSP! Expected function 76 (GSP_RM_CONTROL) (0x2080014b 0x5).
[Fri Mar 3 04:51:54 2023] NVRM: Xid (PCI:0000:61:00): 119, pid=1365232, name=nvidia-smi, Timeout waiting for RPC from GSP! Expected function 76 (GSP_RM_CONTROL) (0x2080014b 0x5).
[Fri Mar 3 04:52:39 2023] NVRM: Xid (PCI:0000:61:00): 119, pid=1365232, name=nvidia-smi, Timeout waiting for RPC from GSP! Expected function 76 (GSP_RM_CONTROL) (0x2080852e 0x208).
[Fri Mar 3 04:53:24 2023] NVRM: Xid (PCI:0000:61:00): 119, pid=1365232, name=nvidia-smi, Timeout waiting for RPC from GSP! Expected function 76 (GSP_RM_CONTROL) (0x20808513 0x598).
[Fri Mar 3 04:54:09 2023] NVRM: Xid (PCI:0000:61:00): 119, pid=1365232, name=nvidia-smi, Timeout waiting for RPC from GSP! Expected function 76 (GSP_RM_CONTROL) (0x2080a068 0x4).
[Fri Mar 3 04:54:54 2023] NVRM: Xid (PCI:0000:61:00): 119, pid=1365232, name=nvidia-smi, Timeout waiting for RPC from GSP! Expected function 76 (GSP_RM_CONTROL) (0x2080a618 0x181c).
[Fri Mar 3 04:55:39 2023] NVRM: Xid (PCI:0000:61:00): 119, pid=1365232, name=nvidia-smi, Timeout waiting for RPC from GSP! Expected function 76 (GSP_RM_CONTROL) (0x2080a612 0xd98).
[Fri Mar 3 04:56:24 2023] NVRM: Xid (PCI:0000:61:00): 119, pid=1365232, name=nvidia-smi, Timeout waiting for RPC from GSP! Expected function 76 (GSP_RM_CONTROL) (0x20809009 0x8).
[Fri Mar 3 04:57:09 2023] NVRM: Xid (PCI:0000:61:00): 119, pid=1365232, name=nvidia-smi, Timeout waiting for RPC from GSP! Expected function 76 (GSP_RM_CONTROL) (0x20800a4c 0x4).
[Fri Mar 3 04:57:54 2023] NVRM: Xid (PCI:0000:61:00): 119, pid=1365232, name=nvidia-smi, Timeout waiting for RPC from GSP! Expected function 76 (GSP_RM_CONTROL) (0x20800a4c 0x4).
[Fri Mar 3 04:58:39 2023] NVRM: Xid (PCI:0000:61:00): 119, pid=1365232, name=nvidia-smi, Timeout waiting for RPC from GSP! Expected function 76 (GSP_RM_CONTROL) (0x20800a4c 0x4).
[Fri Mar 3 04:59:24 2023] NVRM: Xid (PCI:0000:61:00): 119, pid=1365232, name=nvidia-smi, Timeout waiting for RPC from GSP! Expected function 103 (GSP_RM_ALLOC) (0x208f 0x0).
[Fri Mar 3 05:00:09 2023] NVRM: Xid (PCI:0000:61:00): 119, pid=1365232, name=nvidia-smi, Timeout waiting for RPC from GSP! Expected function 76 (GSP_RM_CONTROL) (0x208f1105 0x8).
[Fri Mar 3 05:00:54 2023] NVRM: Xid (PCI:0000:61:00): 119, pid=1365232, name=nvidia-smi, Timeout waiting for RPC from GSP! Expected function 10 (FREE) (0xa55a00a0 0x0).
[Fri Mar 3 05:01:39 2023] NVRM: Xid (PCI:0000:61:00): 119, pid=1365232, name=nvidia-smi, Timeout waiting for RPC from GSP! Expected function 10 (FREE) (0xa55a0030 0x0).
[Fri Mar 3 05:02:24 2023] NVRM: Xid (PCI:0000:61:00): 119, pid=1365232, name=nvidia-smi, Timeout waiting for RPC from GSP! Expected function 10 (FREE) (0xa55a0020 0x0).
[Fri Mar 3 05:03:11 2023] NVRM: Xid (PCI:0000:61:00): 119, pid=1385412, name=nvidia-smi, Timeout waiting for RPC from GSP! Expected function 76 (GSP_RM_CONTROL) (0x20801348 0x410).
[Fri Mar 3 05:03:56 2023] NVRM: Xid (PCI:0000:61:00): 119, pid=1385412, name=nvidia-smi, Timeout waiting for RPC from GSP! Expected function 103 (GSP_RM_ALLOC) (0x0 0x6c).
[Fri Mar 3 05:04:41 2023] NVRM: Xid (PCI:0000:61:00): 119, pid=1385412, name=nvidia-smi, Timeout waiting for RPC from GSP! Expected function 103 (GSP_RM_ALLOC) (0x80 0x38).
[Fri Mar 3 05:05:26 2023] NVRM: Xid (PCI:0000:61:00): 119, pid=1385412, name=nvidia-smi, Timeout waiting for RPC from GSP! Expected function 103 (GSP_RM_ALLOC) (0x2080 0x4).
[Fri Mar 3 05:06:11 2023] NVRM: Xid (PCI:0000:61:00): 119, pid=1385412, name=nvidia-smi, Timeout waiting for RPC from GSP! Expected function 76 (GSP_RM_CONTROL) (0x20800110 0x84).
[Fri Mar 3 05:06:56 2023] NVRM: Xid (PCI:0000:61:00): 119, pid=1385412, name=nvidia-smi, Timeout waiting for RPC from GSP! Expected function 76 (GSP_RM_CONTROL) (0x2080014b 0x5).
[Fri Mar 3 05:07:41 2023] NVRM: Xid (PCI:0000:61:00): 119, pid=1385412, name=nvidia-smi, Timeout waiting for RPC from GSP! Expected function 76 (GSP_RM_CONTROL) (0x20800157 0x0).
Also happening here on a A100-PCIE-40GB using driver 530.30.02 and CUDA 12.1.
Hi @lpla, what's your use-case environment? Kubernetes?
There is no particular environment; the bug triggered several times with both the 525 and 530 drivers. It is a machine-learning inference command-line tool written in PyTorch.
Have you tried the 520.* driver? Does it work?
FWIW: Most of our users use PyTorch as well. Perhaps it tortures GPUs too hard ;-)
We also use PyTorch on the GPUs, but the 470 driver we used before was more stable.
Yepp, no problems with 470 here either.
Have you tried the 520.* driver? Does it work?
That's my next test. In fact, that's exactly the version I was using before upgrading last month from Ubuntu 20.04 with kernel 5.15 and driver 520 to Ubuntu 22.04 with kernel 5.19 and driver 525. It was working perfectly with that previous setup.
Same thing on another machine. FWIW: I removed /usr/lib/firmware/nvidia/525.60.13 - perhaps this fixes the problem.
UPDATE: They have confirmed this Xid 119 bug. They said that the GSP feature was introduced in version 510 but has not been fixed yet. They only offered the method of disabling it below, or suggested downgrading to a version < 510 (e.g. 470), which is more stable.

Hi @jelmd @lpla, as an NVIDIA customer we communicated with the NVIDIA support team today, and based on the nvidia-bug-report.log.gz they advised us to disable GSP-RM:

1. Disable GSP-RM:
```
sudo su -c 'echo options nvidia NVreg_EnableGpuFirmware=0 > /etc/modprobe.d/nvidia-gsp.conf'
```
2. Regenerate the initramfs:
```
# if ubuntu
sudo update-initramfs -u
# if centos
dracut -f
```
3. Reboot.
4. Check whether it worked. If `EnableGpuFirmware: 0`, then GSP-RM is disabled:
```
cat /proc/driver/nvidia/params | grep EnableGpuFirmware
```
Since our problem node still has tasks running, I haven't tried it yet; I will try this method tonight or tomorrow morning, just for reference. :)
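If you want to script the check in step 4 across many nodes, a small helper along these lines works (a sketch; the function name is my own, and the params path defaults to the standard /proc location):

```shell
#!/bin/sh
# Hypothetical helper: succeed (exit 0) when GSP-RM is disabled,
# given the path to the driver params file (normally
# /proc/driver/nvidia/params). The file contains lines such as
# "EnableGpuFirmware: 0" when the GSP firmware is turned off.
gsp_disabled() {
    params_file="${1:-/proc/driver/nvidia/params}"
    grep -q '^EnableGpuFirmware: 0$' "$params_file"
}
```

Usage: `gsp_disabled && echo "GSP-RM off" || echo "GSP-RM on"`.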
I'm also seeing XID 119s on some 510 drivers. Have not tried 525 or 520.
Driver 525.60.13 with an A40 and GSP disabled. But nvidia-bug-report also shows GSP timeout errors.
Hi @stephenroller @liming5619, maybe it's better to downgrade the driver version. On one hand, the GSP feature has shipped since driver 510 but this bug has not been fixed yet. On the other hand, 470 is an LTS branch that has been running stably in our production environment for a long time. I have already downgraded the driver on the problematic node to 470.82.01 to match our other production nodes, just for your reference. :)
So far disabling GSP seems to have mitigated the issue, but maybe I've just been lucky since. Will report back if I see counter-evidence.
Yepp, removing /usr/lib/firmware/nvidia/5xx.* seems to fix the problem, too (did not use NVreg_EnableGpuFirmware=0).
We disabled it on 2 hosts with 8x A100 GPUs each. If this workaround works, I will also give feedback.
Feedback: After a week, I can say all servers with the A100 boards are running stably after we disabled the GSP. No GPU crashes anymore.
@fighterhit thank you for sharing the workaround with us.
I have a similar issue: after disabling GSP, it took more than 5 minutes for the CUDA availability check to output "True".
# cat /etc/modprobe.d/nvidia-gsp.conf
options nvidia NVreg_EnableGpuFirmware=0
# cat /proc/driver/nvidia/params | grep EnableGpuFirmware
EnableGpuFirmware: 0
EnableGpuFirmwareLogs: 2
The strange thing is, I'm booting up VMs from images that have the GPU driver pre-installed; on a host with 4 cards, 2 out of the 4 cards end up with a similar issue.
Please suggest a fix, as it's hampering our prod environments. Please let me know if there is any additional command or log output I should provide.
We also have a few requirements based on CUDA 11.8, so we cannot roll back to driver 470.
@mdrasheek it could be that the driver config inside the VMs is overriding your changes to GSP. As it stands, disabling GSP should resolve the problem. Check in your logs which Xid error you are getting.
Before disabling GSP the error was the same as in this post:
NVRM: Xid (PCI:0000:01:01): 119, pid=8019, name=python3, Timeout waiting for RPC from GSP0! Expected function 103 (GSP_RM_ALLOC) (0x0 0x6c).
But after disabling it, I couldn't find any logs, though it takes a long time to report CUDA as "True".
Is there some way I can enable tracing, or increase the log level, to find the cause of this delay?
Hello everyone, I found that the NVIDIA official driver 525.105.17 recently released seems to have fixed this bug. Anyone who is interested could try it and give feedback on whether it works. :)
Nope, facing the same error:
modinfo nvidia|grep version
version: 525.105.17
rhelversion: 8.8
srcversion: 98F82D76E0EF3952EEE57A7
vermagic: 4.18.0-448.el8.x86_64 SMP mod_unload modversions
Apr 14 04:20:41 e2e-105-143 kernel: NVRM: Xid (PCI:0000:01:01): 119, pid=26071, name=python, Timeout waiting for RPC from GSP! Expected function 10 (FREE) (0x5c000001 0x0).
Hi @mdrasheek, what's your use-case environment? Kubernetes?
Our customers run ML/DL workloads using PyTorch. While running `import torch; torch.cuda.is_available()` they face the above Xid 119 error. After applying the workaround to disable the GSP firmware, it takes a few minutes to output "True".
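To put a number on that delay when comparing cards or driver versions, a small timing helper can wrap the check. A sketch (`timed` is my own helper; the PyTorch call appears only in the usage comment, since it needs a GPU machine):

```python
import time

def timed(fn, *args):
    """Return (result, elapsed_seconds) for a single call to fn."""
    start = time.monotonic()
    result = fn(*args)
    return result, time.monotonic() - start

# Usage on a GPU node (assumes PyTorch is installed):
#   ok, seconds = timed(lambda: __import__("torch").cuda.is_available())
#   print(ok, f"{seconds:.1f}s")
# A healthy node answers in well under a second; multi-minute times
# match the GSP-related delay described above.
```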
Did you reboot after applying the mitigation? I haven't seen any issues in pytorch since the first reboot without GSP.
Yes, a reboot was done; you can also see the output posted earlier:
cat /proc/driver/nvidia/params | grep EnableGpuFirmware
EnableGpuFirmware: 0
EnableGpuFirmwareLogs: 2
As I have already mentioned, this happens on 2 cards out of 4. The base machine has 4 cards, and each VM has 1 card. 2 cards work as expected, but the other 2 show the delay before reporting "True". I agree this is strange, but this is what I have observed.
May I know which mitigation you mean?
Hello
The new driver 525.105.17 (Linux) has the bug fixed by NVIDIA: "Fixed an issue specific to GSP-RM that could lead to GSP RPC timeout errors (Xid 119). The issue was introduced in the first 525 driver release and was not present in earlier drivers." https://docs.nvidia.com/datacenter/tesla/pdf/NVIDIA_Data_Center_GPU_Driver_Release_Notes_525_v3.0.pdf
Yes @pankajsahtech, I noticed this, but @mdrasheek said he still has the same problem (https://github.com/NVIDIA/open-gpu-kernel-modules/issues/446#issuecomment-1508127902).
@mdrasheek, @fighterhit: is it a DGX box? Please check the matrix for Xid errors; it could be that the GPU has a hardware error. https://docs.nvidia.com/deploy/xid-errors/index.html#topic_4
The same issues were raised when using Docker with GPU passthrough with the following packages on an NVIDIA A40.
Looks like you are using the 530.30.02 beta version; that version does not support the A40 card. https://www.nvidia.com/download/driverResults.aspx/199985/en-us/
Please check version 525.105.17.
No, it's not a DGX. It's an A100 80GB PCIe, and a VM is using it through passthrough.
The same Xid 119 issue occurs with the 520.105 driver.
I still encounter this problem on 1 out of 2 A100s after upgrading the driver to 530.41.03.
I want to highlight a point here: after enabling MIG and creating a full MIG slice, the GPU works without lag.
```
nvidia-smi -mig 1
nvidia-smi mig -cgi 0 -C
```
The above commands create slice 0, which is the entire 80GB MIG slice. But with MIG mode disabled, the lag comes back. I hope this may point the way to a resolution for someone.
We recently got a ThinkSystem SR670 V2 with 8 A100 80GB cards. We are running 530.30.02 from the yum CUDA repo under SLURM. We have had these Xid 119 errors on 3 different GPUs so far, over about 3 weeks of typical TensorFlow jobs.
Going to try disabling GSP as mentioned above.
@mikelou2012 maybe give 525.125.06 a try with GSP enabled.
Hi everyone, please try with the following driver:
wget https://us.download.nvidia.com/tesla/535.54.03/NVIDIA-Linux-x86_64-535.54.03.run
In our testing, it works as expected without any changes to the GSP firmware and with MIG disabled.
Unfortunately, we ended up with a similar issue on 535.54 on one of our machines:

```
Jul 11 11:42:08 e2e-85-84 kernel: [ 1412.890184] NVRM: Xid (PCI:0000:01:01): 119, pid=3395, name=python3, Timeout waiting for RPC from GSP0! Expected function 103 (GSP_RM_ALLOC) (0x80 0x38).
Jul 11 11:42:14 e2e-85-84 kernel: [ 1418.894227] NVRM: Rate limiting GSP RPC error prints for GPU at PCI:0000:01:01 (printing 1 of every 30). The GPU likely needs to be reset.
Jul 11 11:42:38 e2e-85-84 kernel: [ 1442.922569] NVRM: Xid (PCI:0000:01:02): 119, pid=3395, name=python3, Timeout waiting for RPC from GSP1! Expected function 10 (FREE) (0x5c00000c 0x0).
Jul 11 11:42:44 e2e-85-84 kernel: [ 1448.926199] NVRM: Xid (PCI:0000:01:02): 119, pid=3395, name=python3, Timeout waiting for RPC from GSP1! Expected function 10 (FREE) (0x5c00000b 0x0).
Jul 11 11:42:50 e2e-85-84 kernel: [ 1454.930190] NVRM: Rate limiting GSP RPC error prints for GPU at PCI:0000:01:02 (printing 1 of every 30). The GPU likely needs to be reset.
```
Even after running a GPU reset, I'm facing a similar error:
```
Jul 11 12:53:46 e2e-85-84 kernel: [ 5710.457563] NVRM: GPU at PCI:0000:01:02: GPU-6e4d9996-6edf-b18c-debb-269edc01a143
Jul 11 12:53:46 e2e-85-84 kernel: [ 5710.457570] NVRM: GPU Board Serial Number: 1324220003349
Jul 11 12:53:46 e2e-85-84 kernel: [ 5710.457571] NVRM: Xid (PCI:0000:01:02): 119, pid=10274, name=python3, Timeout waiting for RPC from GSP1! Expected function 76 (GSP_RM_CONTROL) (0x20803032 0x58c).
Jul 11 12:53:46 e2e-85-84 kernel: [ 5710.457585] CPU: 11 PID: 10274 Comm: python3 Tainted: P OE 5.15.0-69-generic #76~20.04.1-Ubuntu
Jul 11 12:53:46 e2e-85-84 kernel: [ 5710.457588] Hardware name: Red Hat KVM, BIOS 1.11.0-2.el7 04/01/2014
Jul 11 12:53:46 e2e-85-84 kernel: [ 5710.457590] Call Trace:
Jul 11 12:53:46 e2e-85-84 kernel: [ 5710.457592] <TASK>
Jul 11 12:53:46 e2e-85-84 kernel: [ 5710.457595] dump_stack_lvl+0x4a/0x63
Jul 11 12:53:46 e2e-85-84 kernel: [ 5710.457601] dump_stack+0x10/0x16
Jul 11 12:53:46 e2e-85-84 kernel: [ 5710.457605] os_dump_stack+0xe/0x14 [nvidia]
Jul 11 12:53:46 e2e-85-84 kernel: [ 5710.457883] _nv011486rm+0x3ff/0x480 [nvidia]
Jul 11 12:53:46 e2e-85-84 kernel: [ 5710.458325] ? _nv011409rm+0x5d/0x310 [nvidia]
Jul 11 12:53:46 e2e-85-84 kernel: [ 5710.458654] ? _nv043706rm+0x4b4/0x6e0 [nvidia]
Jul 11 12:53:46 e2e-85-84 kernel: [ 5710.458996] ? _nv033719rm+0x64/0x130 [nvidia]
Jul 11 12:53:46 e2e-85-84 kernel: [ 5710.459416] ? _nv045949rm+0x10d/0xbe0 [nvidia]
Jul 11 12:53:46 e2e-85-84 kernel: [ 5710.459841] ? _nv042963rm+0x1a9/0x1b0 [nvidia]
Jul 11 12:53:46 e2e-85-84 kernel: [ 5710.460154] ? _nv044912rm+0x1f1/0x300 [nvidia]
Jul 11 12:53:46 e2e-85-84 kernel: [ 5710.460464] ? _nv013130rm+0x335/0x630 [nvidia]
Jul 11 12:53:46 e2e-85-84 kernel: [ 5710.460745] ? _nv043107rm+0x69/0xd0 [nvidia]
Jul 11 12:53:46 e2e-85-84 kernel: [ 5710.461030] ? _nv011651rm+0x86/0xa0 [nvidia]
Jul 11 12:53:46 e2e-85-84 kernel: [ 5710.461305] ? _nv000714rm+0x9c1/0xe70 [nvidia]
Jul 11 12:53:46 e2e-85-84 kernel: [ 5710.461578] ? rm_ioctl+0x58/0xb0 [nvidia]
Jul 11 12:53:46 e2e-85-84 kernel: [ 5710.461849] ? nvidia_ioctl+0x710/0x870 [nvidia]
Jul 11 12:53:46 e2e-85-84 kernel: [ 5710.462097] ? do_syscall_64+0x69/0xc0
Jul 11 12:53:46 e2e-85-84 kernel: [ 5710.462102] ? nvidia_frontend_unlocked_ioctl+0x58/0x90 [nvidia]
Jul 11 12:53:46 e2e-85-84 kernel: [ 5710.462353] ? __x64_sys_ioctl+0x95/0xd0
Jul 11 12:53:46 e2e-85-84 kernel: [ 5710.462357] ? do_syscall_64+0x5c/0xc0
Jul 11 12:53:46 e2e-85-84 kernel: [ 5710.462359] ? entry_SYSCALL_64_after_hwframe+0x61/0xcb
Jul 11 12:53:46 e2e-85-84 kernel: [ 5710.462363] </TASK>
Jul 11 12:53:52 e2e-85-84 kernel: [ 5716.465188] NVRM: Xid (PCI:0000:01:01): 119, pid=10274, name=python3, Timeout waiting for RPC from GSP0! Expected function 103 (GSP_RM_ALLOC) (0x0 0x6c).
Jul 11 12:53:58 e2e-85-84 kernel: [ 5722.468809] NVRM: Xid (PCI:0000:01:01): 119, pid=10274, name=python3, Timeout waiting for RPC from GSP0! Expected function 103 (GSP_RM_ALLOC) (0x80 0x38).
Jul 11 12:54:04 e2e-85-84 kernel: [ 5728.472326] NVRM: Rate limiting GSP RPC error prints for GPU at PCI:0000:01:01 (printing 1 of every 30). The GPU likely needs to be reset.
```
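For anyone triaging similar reports across a fleet, a quick way to see which GPUs have logged the GSP RPC timeout is to filter the kernel log for Xid 119 lines. A minimal sketch (the script name is hypothetical; it reads log text from stdin, e.g. `dmesg | sh check-xid119.sh`):

```shell
#!/bin/sh
# check-xid119.sh (hypothetical name): print the unique PCI addresses of GPUs
# that logged Xid 119 "Timeout waiting for RPC from GSP" errors.
grep -E 'NVRM: Xid \(PCI:[^)]+\): 119' |
  sed -E 's/.*Xid \((PCI:[^)]+)\): 119.*/\1/' |
  sort -u
```

Each reported address can then be cross-checked against `nvidia-smi` to decide which boards need a reset.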
NVIDIA Open GPU Kernel Modules Version
525.85.05
Does this happen with the proprietary driver (of the same version) as well?
I cannot test this
Operating System and Version
Arch Linux
Kernel Release
Linux [HOSTNAME] 6.1.8-arch1-1 #1 SMP PREEMPT_DYNAMIC Tue, 24 Jan 2023 21:07:04 +0000 x86_64 GNU/Linux
Hardware: GPU
GPU 0: NVIDIA GeForce RTX 3050 Laptop GPU (UUID: GPU-071149ae-386e-0017-3b5b-7ea80801f725)
Describe the bug
When I run an OpenGL application, such as Yamagi Quake II, at some point the whole system freezes and runs at about 1 FPS. I usually have to REISUB when this happens.
To Reproduce
Bug Incidence
Always
nvidia-bug-report.log.gz
More Info
Related: #272