NVIDIA / open-gpu-kernel-modules

NVIDIA Linux open GPU kernel module source

NVRM: krcWatchdogCallbackVblankRecovery_IMPL: NVRM-RC: RM has detected that 7 Seconds without a Vblank Counter Update on head:D0 #632

Open scaronni opened 5 months ago

scaronni commented 5 months ago

NVIDIA Open GPU Kernel Modules Version

550.78

Please confirm this issue does not happen with the proprietary driver (of the same version). This issue tracker is only for bugs specific to the open kernel driver.

Operating System and Version

Fedora 40

Kernel Release

6.8.7-300.fc40.x86_64

Please confirm you are running a stable release kernel (e.g. not a -rc). We do not accept bug reports for unreleased kernels.

Hardware: GPU

NVIDIA GeForce RTX 4070 SUPER

Describe the bug

The kernel log is being spammed with these lines:

[ 6614.717414] NVRM: Xid (PCI:0000:01:00): 16, pid='<unknown>', name=<unknown>, Head 00000003 Count 0000f82a
[ 6614.717420] NVRM: krcWatchdogCallbackVblankRecovery_IMPL: NVRM-RC: RM has detected that 7 Seconds without a Vblank Counter Update on head:D0

After a few iterations of the two, it keeps spamming NVRM: krcWatchdogCallbackVblankRecovery_IMPL [...].
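For anyone who wants to count or filter these messages, here is a minimal sketch in C that picks the timestamp and the head label out of one log line. It assumes the line layout matches the sample above exactly; `parse_vblank_watchdog` is a hypothetical helper name and is not part of the driver.

```c
#include <stdio.h>
#include <string.h>

/*
 * Hypothetical helper (not driver code): parse one kernel log line of the
 * form quoted in this report, e.g.
 *   "[ 6614.717420] NVRM: krcWatchdogCallbackVblankRecovery_IMPL: ... on head:D0"
 * Returns 1 and fills *ts and head (NUL-terminated) on a match, 0 otherwise.
 */
static int parse_vblank_watchdog(const char *line, double *ts,
                                 char *head, size_t head_len)
{
    static const char tag[] = "krcWatchdogCallbackVblankRecovery_IMPL";
    static const char key[] = "head:";
    const char *p;
    size_t n;

    /* Skip lines that are not the watchdog message (e.g. the Xid 16 line). */
    if (strstr(line, tag) == NULL)
        return 0;

    /* The timestamp is the bracketed seconds value at the start of the line. */
    if (sscanf(line, "[ %lf]", ts) != 1)
        return 0;

    /* The head label is the token that follows "head:". */
    p = strstr(line, key);
    if (p == NULL)
        return 0;
    p += strlen(key);
    n = strcspn(p, " \t\r\n");
    if (n == 0 || n >= head_len)
        return 0;
    memcpy(head, p, n);
    head[n] = '\0';
    return 1;
}
```

Feeding each line of `dmesg` output through this function gives one (timestamp, head) pair per watchdog message, which is enough to see which head is affected and how often the message fires.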

To Reproduce

Just boot the system with the open kernel modules installed.

Bug Incidence

Always

nvidia-bug-report.log.gz

nvidia-bug-report.log.gz

More Info

No response

luzat commented 5 months ago

I can confirm that behavior with driver 550.54.15-1 (newest from CUDA repo), Debian unstable, custom-built kernels of at least versions 6.8.9, 6.9.0 and 6.9.1, a GeForce RTX 4090 and 5 displays connected. For me, it happens on head C0.

The displays are 2 DP screens (both running), 1 Valve Index VR headset on DP (not running), 1 HDMI screen (not connected to power), and 1 HDMI TV by LG (turned off or on).

The messages repeat almost exactly every 8.192 seconds and stop at some point (after ~40 minutes this time; I am not sure whether that is consistent).
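A rough way to check that period against a captured log is to compare the gaps between consecutive message timestamps with the expected interval. The sketch below assumes the timestamps have already been extracted into an array of seconds; `is_periodic` and `mean_interval` are hypothetical helper names, and the tolerance is whatever the reader considers "close".

```c
#include <math.h>
#include <stddef.h>

/* Mean gap in seconds between consecutive timestamps.
 * Returns -1.0 when fewer than two samples are given. */
static double mean_interval(const double *ts, size_t n)
{
    if (n < 2)
        return -1.0;
    return (ts[n - 1] - ts[0]) / (double)(n - 1);
}

/* Returns 1 if every consecutive gap is within tol seconds of period,
 * 0 otherwise (including when fewer than two samples are given). */
static int is_periodic(const double *ts, size_t n, double period, double tol)
{
    size_t i;

    if (n < 2)
        return 0;
    for (i = 1; i < n; i++) {
        if (fabs((ts[i] - ts[i - 1]) - period) > tol)
            return 0;
    }
    return 1;
}
```

Calling `is_periodic(ts, n, 8.192, 0.01)` on the extracted timestamps would confirm or refute the observed spacing, and `mean_interval` gives the average gap if the spacing drifts.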

The error only occurs when the LG TV is (a) connected and (b) not enabled in X. I am not sure whether the message indicates an actual problem, but I would prefer not to have my logs flooded with it.

mtijanic commented 5 months ago

I believe this should be fixed with 555.42.02. This is the relevant change so you can apply it to 550.xx as well:

diff --git a/src/nvidia/src/kernel/gpu/disp/head/kernel_head.c b/src/nvidia/src/kernel/gpu/disp/head/kernel_head.c
index 50e14fa..5da4a43 100644
--- a/src/nvidia/src/kernel/gpu/disp/head/kernel_head.c
+++ b/src/nvidia/src/kernel/gpu/disp/head/kernel_head.c
@@ -235,7 +235,8 @@ kheadReadVblankIntrState_IMPL
 )
 {
     // Check to make sure that our SW state grooves with the HW state
-    if (kheadReadVblankIntrEnable_HAL(pGpu, pKernelHead))
+    if (kheadReadVblankIntrEnable_HAL(pGpu, pKernelHead) &&
+            kheadGetDisplayInitialized_HAL(pGpu, pKernelHead))
     {
         // HW is enabled, check if SW state is not enabled
         if (pKernelHead->Vblank.IntrState != NV_HEAD_VBLANK_INTR_ENABLED)

luzat commented 5 months ago

> I believe this should be fixed with 555.42.02. This is the relevant change so you can apply it to 550.xx as well:

Thanks! I did not try applying the patch, but upgrading to 555.42.02, which is now packaged, fixes the issue for me.