NVIDIA / open-gpu-kernel-modules

NVIDIA Linux open GPU kernel module source

Periodic stutters and “NVRM: RmCheckForGcxSupportOnCurrentState” kernel warnings on Ubuntu 22.04 RTX 4070 #659

Open edmcman opened 5 months ago

edmcman commented 5 months ago

NVIDIA Open GPU Kernel Modules Version

550.54.15

Please confirm this issue does not happen with the proprietary driver (of the same version). This issue tracker is only for bugs specific to the open kernel driver.

Operating System and Version

Ubuntu 22.04.4 LTS

Kernel Release

6.5.0-28-generic

Please confirm you are running a stable release kernel (e.g. not a -rc). We do not accept bug reports for unreleased kernels.

Hardware: GPU

GPU 0: NVIDIA GeForce RTX 4070 Laptop GPU (UUID: GPU-02187fd8-22a1-3f71-cd52-22af54f42481)

Describe the bug

I’ve been running into occasional visible “stutters” on my Ubuntu Linux 22.04 system. By stutter, I mean that for ~500ms there is no visible change to the screen. If there is a video playing, it freezes. If I am moving the mouse, the cursor will freeze.

At the same time, I get a ton of kernel messages such as:

Apr 24 14:59:16 banana kernel: NVRM: RmCheckForGcxSupportOnCurrentState: NVRM, Failed to get GCx pre-requisite, status=0xffff
Apr 24 14:59:21 banana kernel: NVRM: _threadNodeCheckTimeout: _threadNodeCheckTimeout: currentTime: 3d63e51c037800 >= 3d63e51c037800
Apr 24 14:59:21 banana kernel: NVRM: _threadNodeCheckTimeout: _threadNodeCheckTimeout: Timeout was set to: 4000 msecs!
Apr 24 14:59:21 banana kernel: NVRM: RmCheckForGcxSupportOnCurrentState: NVRM, Failed to get GCx pre-requisite, status=0xffff
Apr 24 14:59:27 banana kernel: NVRM: _threadNodeCheckTimeout: _threadNodeCheckTimeout: currentTime: 3d63e663d6cf00 >= 3d63e663d6cf00
Apr 24 14:59:27 banana kernel: NVRM: _threadNodeCheckTimeout: _threadNodeCheckTimeout: Timeout was set to: 4000 msecs!
Apr 24 14:59:27 banana kernel: NVRM: RmCheckForGcxSupportOnCurrentState: NVRM, Failed to get GCx pre-requisite, status=0xffff
Apr 24 14:59:32 banana kernel: NVRM: _threadNodeCheckTimeout: _threadNodeCheckTimeout: currentTime: 3d63e7abaa2600 >= 3d63e7abaa2600
Apr 24 14:59:32 banana kernel: NVRM: _threadNodeCheckTimeout: _threadNodeCheckTimeout: Timeout was set to: 4000 msecs!
Apr 24 14:59:32 banana kernel: NVRM: RmCheckForGcxSupportOnCurrentState: NVRM, Failed to get GCx pre-requisite, status=0xffff
Apr 24 14:59:38 banana kernel: NVRM: _threadNodeCheckTimeout: _threadNodeCheckTimeout: currentTime: 3d63e8f37d7d00 >= 3d63e8f37d7d00
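
To put a number on how often these messages repeat, the log lines above can be parsed for their timestamps. The following is a minimal sketch, not part of the original report: the function name `gcx_failure_intervals` and the `year` parameter are hypothetical, and it assumes syslog-style lines in exactly the format quoted above (which carry no year).

```python
import re
from datetime import datetime

# Matches syslog-style lines like the ones quoted in this report.
GCX_LINE = re.compile(
    r"^(?P<stamp>\w{3}\s+\d{1,2}\s+\d{2}:\d{2}:\d{2})\s+\S+\s+kernel:\s+"
    r"NVRM: RmCheckForGcxSupportOnCurrentState:.*status=(?P<status>0x[0-9a-fA-F]+)"
)


def gcx_failure_intervals(lines, year=2024):
    """Return the seconds between consecutive GCx pre-requisite failures."""
    stamps = []
    for line in lines:
        match = GCX_LINE.match(line)
        if match:
            # Syslog timestamps have no year, so the caller supplies one (assumption).
            stamps.append(
                datetime.strptime(f"{year} {match.group('stamp')}", "%Y %b %d %H:%M:%S")
            )
    return [(later - earlier).total_seconds() for earlier, later in zip(stamps, stamps[1:])]


# Lines copied from the log excerpt above; the timeout line is ignored by the pattern.
SAMPLE = [
    "Apr 24 14:59:16 banana kernel: NVRM: RmCheckForGcxSupportOnCurrentState: NVRM, Failed to get GCx pre-requisite, status=0xffff",
    "Apr 24 14:59:21 banana kernel: NVRM: _threadNodeCheckTimeout: _threadNodeCheckTimeout: Timeout was set to: 4000 msecs!",
    "Apr 24 14:59:21 banana kernel: NVRM: RmCheckForGcxSupportOnCurrentState: NVRM, Failed to get GCx pre-requisite, status=0xffff",
    "Apr 24 14:59:27 banana kernel: NVRM: RmCheckForGcxSupportOnCurrentState: NVRM, Failed to get GCx pre-requisite, status=0xffff",
    "Apr 24 14:59:32 banana kernel: NVRM: RmCheckForGcxSupportOnCurrentState: NVRM, Failed to get GCx pre-requisite, status=0xffff",
]

print(gcx_failure_intervals(SAMPLE))  # [5.0, 6.0, 5.0]
```

On the excerpt above this gives a failure roughly every 5 to 6 seconds, which may be worth comparing against when the stutters are visible.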

Another concerning log entry is:

Apr 24 13:23:29 banana kernel: NVRM: _kgspLogXid119: ********************************* GSP Timeout **********************************
Apr 24 13:23:29 banana kernel: NVRM: _kgspLogXid119: Note: Please also check logs above.
Apr 24 13:23:29 banana kernel: NVRM: nvAssertFailedNoLog: Assertion failed: expectedFunc == pHistoryEntry->function @ kernel_gsp.c:1744
Apr 24 13:23:29 banana kernel: NVRM: GPU at PCI:0000:01:00: GPU-02187fd8-22a1-3f71-cd52-22af54f42481
Apr 24 13:23:29 banana kernel: NVRM: Xid (PCI:0000:01:00): 119, pid=1671935, name=kworker/1:3, Timeout after 1149s of waiting for RPC response from GPU0 GSP! Expected function 4097 (GSP_INIT_DONE) (0x0 0x0).
Apr 24 13:23:29 banana kernel: NVRM: GPU0 GSP RPC buffer contains function 4108 (UCODE_LIBOS_PRINT) and data 0x0000000000000000 0x0000000000000000.
Apr 24 13:23:29 banana kernel: NVRM: GPU0 RPC history (CPU -> GSP):
Apr 24 13:23:29 banana kernel: NVRM:     entry function                   data0              data1              ts_start           ts_end             duration actively_polling
Apr 24 13:23:29 banana kernel: NVRM:      0    47   UNLOADING_GUEST_DRIVE 0x0000000000000000 0x0000000000000000 0x000616daa95805e8 0x000616daa95d9f17    366 s y
Apr 24 13:23:29 banana kernel: NVRM:     -1    10   FREE                  0x00000000caf010bb 0x0000000000000000 0x000616daa958044a 0x000616daa95805e5    411us  
Apr 24 13:23:29 banana kernel: NVRM:     -2    76   GSP_RM_CONTROL        0x0000000020800ac3 0x0000000000000028 0x000616daa9580260 0x000616daa9580447    487us  
Apr 24 13:23:29 banana kernel: NVRM:     -3    4    ALLOC_MEMORY          0x0000000000000000 0x0000000000000000 0x000616daa957ff81 0x000616daa958025d    732us  
Apr 24 13:23:29 banana kernel: NVRM:     -4    10   FREE                  0x00000000caf010ba 0x0000000000000000 0x000616daa957fd61 0x000616daa957ff79    536us  
Apr 24 13:23:29 banana kernel: NVRM:     -5    76   GSP_RM_CONTROL        0x0000000020800ac3 0x0000000000000028 0x000616daa957fb78 0x000616daa957fd5f    487us  
Apr 24 13:23:29 banana kernel: NVRM:     -6    4    ALLOC_MEMORY          0x0000000000000000 0x0000000000000000 0x000616daa957f982 0x000616daa957fb75    499us  
Apr 24 13:23:29 banana kernel: NVRM:     -7    10   FREE                  0x00000000caf010b9 0x0000000000000000 0x000616daa957f7d9 0x000616daa957f97b    418us  
Apr 24 13:23:29 banana kernel: NVRM: GPU0 RPC event history (CPU <- GSP):
Apr 24 13:23:29 banana kernel: NVRM:     entry function                   data0              data1              ts_start           ts_end             duration during_incomplete_rpc
Apr 24 13:23:29 banana kernel: NVRM:      0    4108 UCODE_LIBOS_PRINT     0x0000000000000000 0x0000000000000000 0x000616daed814621 0x000616daed814622      1us  
Apr 24 13:23:29 banana kernel: NVRM:     -1    4108 UCODE_LIBOS_PRINT     0x0000000000000000 0x0000000000000000 0x000616daed8144ef 0x000616daed8144f0      1us  
Apr 24 13:23:29 banana kernel: NVRM:     -2    4128 GSP_POST_NOCAT_RECORD 0x0000000000000002 0x0000000000000027 0x000616daed812e49 0x000616daed812e4b      2us  
Apr 24 13:23:29 banana kernel: NVRM:     -3    4098 GSP_RUN_CPU_SEQUENCER 0x0000000000000628 0x0000000000003fe2 0x000616daed808c11 0x000616daed809d6e   4445us  
Apr 24 13:23:29 banana kernel: NVRM:     -4    4108 UCODE_LIBOS_PRINT     0x0000000000000000 0x0000000000000000 0x000616daa958d7c0 0x000616daa958d7c1      1us  
Apr 24 13:23:29 banana kernel: NVRM:     -5    4128 GSP_POST_NOCAT_RECORD 0x0000000000000002 0x0000000000000028 0x000616daa9585c1e 0x000616daa9585c20      2us  
Apr 24 13:23:29 banana kernel: NVRM:     -6    4111 PERF_BRIDGELESS_INFO_ 0x0000000000000000 0x0000000000000000 0x000616daa9585a33 0x000616daa9585a33           
Apr 24 13:23:29 banana kernel: NVRM:     -7    4128 GSP_POST_NOCAT_RECORD 0x0000000000000002 0x0000000000000001 0x000616daa853253d 0x000616daa8532544      7us  
Apr 24 13:23:29 banana kernel: NVRM: _kgspLogXid119: ********************************************************************************
Apr 24 13:23:29 banana kernel: NVRM: nvCheckOkFailedNoLog: Check failed: Call timed out [NV_ERR_TIMEOUT] (0x00000065) returned from rpcRecvPoll(pGpu, pRpc, NV_VGPU_MSG_EVENT_GSP_INIT_DONE) @ kernel_gsp.c:4074
Apr 24 13:23:29 banana kernel: NVRM: gpuPowerManagementResume: State load at resume for riscv/gsp failed: 0x65
Apr 24 13:23:35 banana kernel: NVRM: Xid (PCI:0000:01:00): 119, pid=1671935, name=kworker/1:3, Timeout after 6s of waiting for RPC response from GPU0 GSP! Expected function 76 (GSP_RM_CONTROL) (0x2080205b 0x4).
Apr 24 13:23:35 banana kernel: NVRM: _issueRpcAndWait: rpcRecvPoll timedout for fn 76!
Apr 24 13:23:35 banana kernel: NVRM: subdeviceCtrlCmdPerfSetPowerstate_KERNEL: NV2080_CTRL_CMD_PERF_SET_POWERSTATE RPC failed
Apr 24 13:23:46 banana kernel: NVRM: Xid (PCI:0000:01:00): 119, pid=854, name=nv_queue, Timeout after 6s of waiting for RPC response from GPU0 GSP! Expected function 76 (GSP_RM_CONTROL) (0x2080a7d7 0x2).
Apr 24 13:23:46 banana kernel: NVRM: _issueRpcAndWait: rpcRecvPoll timedout for fn 76!
Apr 24 13:23:46 banana kernel: NVRM: RmCheckForGcxSupportOnCurrentState: NVRM, Failed to get GCx pre-requisite, status=0x65
Apr 24 13:23:57 banana kernel: NVRM: Rate limiting GSP RPC error prints for GPU at PCI:0000:01:00 (printing 1 of every 30).  The GPU likely needs to be reset.
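
The Xid 119 lines above share a fixed shape, so the process, wait time, and expected RPC function can be pulled out mechanically when comparing several incidents. This is a rough sketch that assumes only the message format seen in this log. The names `GspTimeout` and `parse_gsp_timeout` are hypothetical, and other driver versions or Xid types may format the line differently.

```python
import re
from typing import NamedTuple, Optional


class GspTimeout(NamedTuple):
    pci: str
    xid: int
    pid: int
    process: str
    waited_s: int
    function: int
    function_name: str


# Pattern derived from the Xid 119 lines in this report (an assumption, not a spec).
XID_TIMEOUT = re.compile(
    r"NVRM: Xid \((?P<pci>PCI:[0-9a-fA-F:.]+)\): (?P<xid>\d+), "
    r"pid=(?P<pid>\d+), name=(?P<process>[^,]+), "
    r"Timeout after (?P<waited>\d+)s of waiting for RPC response from GPU\d+ GSP! "
    r"Expected function (?P<fn>\d+) \((?P<fn_name>\w+)\)"
)


def parse_gsp_timeout(line: str) -> Optional[GspTimeout]:
    """Return the parsed fields of a GSP timeout line, or None if it does not match."""
    match = XID_TIMEOUT.search(line)
    if match is None:
        return None
    return GspTimeout(
        pci=match.group("pci"),
        xid=int(match.group("xid")),
        pid=int(match.group("pid")),
        process=match.group("process"),
        waited_s=int(match.group("waited")),
        function=int(match.group("fn")),
        function_name=match.group("fn_name"),
    )


# Lines copied from the log excerpt above; the last one is not an Xid line.
XID_SAMPLE = [
    "Apr 24 13:23:29 banana kernel: NVRM: Xid (PCI:0000:01:00): 119, pid=1671935, name=kworker/1:3, Timeout after 1149s of waiting for RPC response from GPU0 GSP! Expected function 4097 (GSP_INIT_DONE) (0x0 0x0).",
    "Apr 24 13:23:35 banana kernel: NVRM: Xid (PCI:0000:01:00): 119, pid=1671935, name=kworker/1:3, Timeout after 6s of waiting for RPC response from GPU0 GSP! Expected function 76 (GSP_RM_CONTROL) (0x2080205b 0x4).",
    "Apr 24 13:23:46 banana kernel: NVRM: Xid (PCI:0000:01:00): 119, pid=854, name=nv_queue, Timeout after 6s of waiting for RPC response from GPU0 GSP! Expected function 76 (GSP_RM_CONTROL) (0x2080a7d7 0x2).",
    "Apr 24 13:23:46 banana kernel: NVRM: _issueRpcAndWait: rpcRecvPoll timedout for fn 76!",
]

events = [e for e in map(parse_gsp_timeout, XID_SAMPLE) if e is not None]
for event in events:
    print(event.process, event.waited_s, event.function_name)
```

In this excerpt, the first timeout is the `GSP_INIT_DONE` wait, logged right before the `gpuPowerManagementResume` failure, and the later ones are `GSP_RM_CONTROL` calls.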

To Reproduce

Unknown. I just use the machine for a while and it happens periodically.

Bug Incidence

Sometimes

nvidia-bug-report.log.gz

I emailed this to linux-bugs@nvidia.com on 4/24/2024.

More Info

No response

davelima commented 2 weeks ago

I have the exact same problem. I've also noticed that it tends to happen whenever I open a Chromium/Electron app, and it only happens under Wayland. I used X11 for a few weeks and it worked normally.

I'm using Fedora with an RTX 3050, kernel 6.11.5-300.fc41.x86_64.
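
Since the behaviour seems to differ between Wayland and X11, it may help for reporters to note which session type they were running when the stutter occurred. Here is a small sketch: the function name `detect_session_type` is hypothetical, and it relies only on the `XDG_SESSION_TYPE`, `WAYLAND_DISPLAY`, and `DISPLAY` environment variables, which a given desktop setup may or may not set.

```python
import os


def detect_session_type(env=None):
    """Best-effort guess of the display session type from environment variables."""
    env = os.environ if env is None else env
    declared = env.get("XDG_SESSION_TYPE", "").strip().lower()
    if declared:
        # Usually "wayland", "x11", or "tty" when set by the login session.
        return declared
    if env.get("WAYLAND_DISPLAY"):
        return "wayland"
    if env.get("DISPLAY"):
        # DISPLAY alone suggests X11 (it is also set for XWayland clients).
        return "x11"
    return "unknown"


print(detect_session_type())
```

`WAYLAND_DISPLAY` is checked before `DISPLAY` because Wayland sessions with XWayland typically export both.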