NVIDIA / open-gpu-kernel-modules

NVIDIA Linux open GPU kernel module source

Animations after idling are noticeably choppy until GPU ramps up with GSP firmware enabled #693

Open Gert-dev opened 1 month ago

Gert-dev commented 1 month ago

NVIDIA Open GPU Kernel Modules Version

555.58.02

Please confirm this issue does not happen with the proprietary driver (of the same version). This issue tracker is only for bugs specific to the open kernel driver.

Operating System and Version

Arch Linux

Kernel Release

6.10.5

Please confirm you are running a stable release kernel (e.g. not a -rc). We do not accept bug reports for unreleased kernels.

Hardware: GPU

NVIDIA GeForce RTX 4090 Laptop GPU

Describe the bug

In mutter 46.4 on Wayland on a single-GPU NVIDIA system, there is always visible jank when switching virtual desktops after idling for a few seconds. If you then start frantically switching desktops (i.e. triggering animations), after a second or two it becomes buttery smooth (feels like 144 FPS on a 144 Hz monitor), which is likely due to the GPU ramping up.

This doesn't happen on a high-refresh-rate display (240 Hz) being driven by an AMD iGPU in my case. On an Intel GPU it also happens but is fixed by applying triple buffering.

On the single-GPU NVIDIA system, using triple buffering doesn't seem to resolve the issue, even though the symptoms are similar to what it attempts to fix on Intel.

This was originally reported to mutter, but it turns out this is related to the GSP firmware used by the open kernel modules since the closed driver without GSP enabled doesn't experience this issue.

To Reproduce

  1. Start GNOME on Wayland on a high-refresh-rate monitor (e.g. 120 Hz or higher).
  2. Wait about 5 seconds.
  3. Switch to the virtual desktop on the right.
  4. Notice the animation being choppy.
  5. Switch left and right about 10 times in a couple of seconds.
  6. Notice how the animation becomes smooth.
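
To put a rough number on "choppy" versus "smooth" in the steps above, here is a minimal sketch of the frame-budget arithmetic. It assumes you have some way of capturing frame presentation timestamps in milliseconds, which is not covered here; the function names and the 1.5x tolerance are illustrative choices, not anything mutter or the driver provides.

```python
def frame_budget_ms(refresh_hz: float) -> float:
    """Time available per frame at a given refresh rate, in milliseconds."""
    return 1000.0 / refresh_hz


def count_missed_frames(timestamps_ms, refresh_hz, tolerance=1.5):
    """Count intervals between consecutive frames that exceed the budget.

    `tolerance` is an arbitrary illustrative threshold: an interval longer
    than tolerance * budget is treated as at least one missed refresh.
    """
    budget = frame_budget_ms(refresh_hz)
    intervals = [b - a for a, b in zip(timestamps_ms, timestamps_ms[1:])]
    return sum(1 for dt in intervals if dt > tolerance * budget)
```

At 144 Hz the budget is just under 7 ms per frame, so even a short delay while the GPU ramps up can span several refreshes.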

Bug Incidence

Always

nvidia-bug-report.log.gz


More Info

No response

ptr1337 commented 1 month ago

You may want to retest with the 560 driver, since it includes a number of fixes for this issue.

Also, check whether any background program is calling "nvidia-smi". Some examples here:

Gert-dev commented 1 month ago

Thanks for the response. I tested the 560 beta driver and it's noticeably better, so it's nice to hear that these improvements are coming soon!

I also noticed that the problem seems to be exacerbated when GNOME's power saving mode is active. I often have it active because I have a Lenovo Legion, where power saving mode equates to 'quiet mode' and the fans make less noise. The GPU power limit is still set to 80 W in both cases, though.

I've been observing the output of nvidia-smi when it's active or inactive ('normal' mode) on the 560 driver:

I understand power saving mode may imply making sacrifices to save power, but since it's a 4090 I still expected it to have enough 'juice' to cover at least a smooth desktop. It's also a bit strange that the GPU is not allowed to go below 7 W in normal mode, since that's apparently entirely possible in power saving mode. It feels as if normal mode papers over the problem by keeping the GPU at a higher power level permanently to circumvent the ramp-up issue.
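
For anyone who wants to log these readings rather than eyeball them, a minimal sketch follows. It assumes `nvidia-smi` is on the PATH and uses its `--query-gpu` CSV output with the `pstate`, `power.draw` and `clocks.gr` fields; the function names are illustrative, and only the first GPU line is read. Since calling nvidia-smi can itself cause jank, it is probably best to sample sparingly.

```python
import subprocess

# Fields requested from nvidia-smi, in this order.
QUERY_FIELDS = ["pstate", "power.draw", "clocks.gr"]


def _to_float(text):
    """Convert a CSV field to float; return None for values like '[N/A]'."""
    try:
        return float(text)
    except ValueError:
        return None


def parse_sample(line):
    """Parse one 'csv,noheader,nounits' line into a dict."""
    pstate, power, clock = [field.strip() for field in line.split(",")]
    return {
        "pstate": pstate,
        "power_w": _to_float(power),
        "clock_mhz": _to_float(clock),
    }


def read_sample():
    """Run nvidia-smi once and return the parsed reading for the first GPU."""
    result = subprocess.run(
        [
            "nvidia-smi",
            "--query-gpu=" + ",".join(QUERY_FIELDS),
            "--format=csv,noheader,nounits",
        ],
        capture_output=True,
        text=True,
        check=True,
    )
    return parse_sample(result.stdout.splitlines()[0])
```

Calling `read_sample()` once while idle and once right after switching desktops should be enough to compare the power state in power saving and normal mode.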

(FWIW, I also have no other applications running in the background during these tests. I use nvidia-smi to see stats sometimes, but the jank resulting from calling it is much more noticeable than and different from the 'fast stutter' of the ramp-up.)
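
One way to double-check that nothing is invoking nvidia-smi in the background is to scan `/proc` for processes with that name. The sketch below assumes Linux, where `/proc/<pid>/comm` holds the process name; `find_processes` and its `proc_root` parameter are hypothetical names used for illustration. A single scan can miss a short-lived invocation, so it would need to be repeated over a period of time to catch a program that polls.

```python
from pathlib import Path


def find_processes(name, proc_root="/proc"):
    """Return sorted PIDs of processes whose command name equals `name`."""
    pids = []
    for entry in Path(proc_root).iterdir():
        # Only numeric directory names are process entries.
        if not entry.name.isdigit():
            continue
        try:
            comm = (entry / "comm").read_text().strip()
        except OSError:
            # The process may have exited between listing and reading.
            continue
        if comm == name:
            pids.append(int(entry.name))
    return sorted(pids)
```

An empty result from `find_processes("nvidia-smi")` across repeated scans suggests no background poller is running.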

mtijanic commented 1 month ago

Hey there, thanks for the report!

This seems to be a mix of problems, but a big part of it is likely the known issue of power transitions being a bit slower, per the driver readme:

Known Issues

The following are some known limitations of the open kernel modules versus the proprietary kernel modules with GSP firmware mode disabled:

  • GPU initialization is slower. One possible mitigation is to use nvidia-persistenced to initialize the GPU(s) in advance, before running applications that use the GPU.
  • Enter and exit latencies for power-saving modes like S3, S4 and Run Time D3 (RTD3) can be longer due to additional GSP state being restored.
  • GPU power consumption can be marginally impacted in some scenarios.
  • Run Time D3 (RTD3) is only supported on Ampere and above GPUs.

The bad news is that there is no silver bullet here: you either need to keep things awake (and consume more power) or live with the latency. But the good news (I guess?) is that this is still being actively worked on and you will probably see incremental improvements with every major release.

That said, going by the description alone, it does sound like you hit a particularly egregious case here, so we'll try to repro it and see if there's anything that makes it worse than it needs to be. My guess is that the 144Hz monitor is throwing a wrench into it.