NVIDIA / open-gpu-kernel-modules

NVIDIA Linux open GPU kernel module source

[MAJOR] KDE Plasma Wayland & X11 poor performance & frame drops when opening apps #538

Open kodatarule opened 1 year ago

kodatarule commented 1 year ago

NVIDIA Open GPU Kernel Modules Version

535.86.05

Please confirm this issue does not happen with the proprietary driver (of the same version). This issue tracker is only for bugs specific to the open kernel driver.

Operating System and Version

EndeavourOS Linux

Kernel Release

6.4.6-zen

Please confirm you are running a stable release kernel (e.g. not a -rc). We do not accept bug reports for unreleased kernels.

Hardware: GPU

RTX 3090

Describe the bug

When opening apps or screen recording (in general, anything that demands more from the GPU), the desktop starts dropping frames, hitching, and lagging. This doesn't occur with the proprietary driver.

To Reproduce

Log into a KDE Plasma Wayland session and open any app (Dolphin, a browser, etc.).

Bug Incidence

Always

nvidia-bug-report.log.gz

nvidia-bug-report.log.gz

More Info

No response

kodatarule commented 12 months ago

Just a slight update: this also occurs on X11.

kodatarule commented 4 months ago

With the news of driver 560 defaulting to the open kernel modules, I decided to give this a second try, and the issue is still present on both X11 and Wayland.

Operating System: EndeavourOS
KDE Plasma Version: 6.0.4
KDE Frameworks Version: 6.1.0
Qt Version: 6.7.0
Kernel Version: 6.8.9-zen1-1-zen (64-bit)
Graphics Platform: Wayland
Processors: 16 × AMD Ryzen 7 5800X3D 8-Core Processor
Memory: 31,2 GiB of RAM
Graphics Processor: NVIDIA GeForce RTX 3090/PCIe/SSE2

kodatarule commented 3 months ago

Just to update for people that would come here:

EDIT: The solution was to add NVreg_EnableGpuFirmware=0 to the kernel load and all issues were fixed!

edisionnano commented 3 months ago

Just to update for people that would come here:

EDIT: The solution was to add NVreg_EnableGpuFirmware=0 to the kernel load and all issues were fixed!

This won't work with the open modules, only the closed ones, since the open modules require GSP. Right?

kodatarule commented 3 months ago

I believe that's the case; this seems to only work with the proprietary modules.

aritger commented 3 months ago

It would be worth retesting this case with the new 555.42.02 driver: https://www.nvidia.com/Download/driverResults.aspx/224751/en-us/

We made several improvements to graphics performance that will help both the proprietary kernel modules with NVreg_EnableGpuFirmware=1, and the open kernel modules.

mtijanic commented 3 months ago

Tracked internally as bug 4662986.

mtijanic commented 3 months ago

Hi @kodatarule , can I trouble you for two experiments? With the 555.42 driver - Proprietary*, but without NVreg_EnableGpuFirmware=0 (or set it to 1), please try:

(1) Disabling MangoHUD and any other background profiling apps you might have, and seeing whether it gets any better.
(2) Waiting until you see the issue, and then running nvidia-bug-report.sh as soon as you can.

This is so we can get the bug report snapshot soon after a bad state and we know where to look at it, timescale-wise.

Thanks in advance!

* Open is also fine, but then please run with NVreg_RmMsg=":" and also run dmesg -w > dmesg.txt on the side and attach that file too.
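The capture flow requested above (kernel log streamed on the side, bug report taken right after the stutter) could be scripted roughly as follows. This is only a sketch: `capture_logs` is a hypothetical helper name, the commands are passed in as arguments, and it assumes you run it as root with NVreg_RmMsg already set when using the open modules.

```shell
#!/bin/sh
# Sketch: stream kernel messages to a file, wait for the user to see the
# stutter, then generate the bug report and stop the log stream.
# capture_logs is a hypothetical helper name, not part of the driver.
# Usage: capture_logs OUTFILE LOG_CMD REPORT_CMD
capture_logs() {
    outfile=$1
    log_cmd=$2      # e.g. "dmesg -w"
    report_cmd=$3   # e.g. "nvidia-bug-report.sh"

    # Start the log follower in the background, writing to OUTFILE.
    # $log_cmd is left unquoted on purpose so it splits into command + args.
    $log_cmd > "$outfile" 2>&1 &
    log_pid=$!

    # Block until the user confirms that the stutter has just happened.
    printf 'Press Enter as soon as the stutter appears: ' >&2
    read -r _

    # Take the snapshot right away, while the bad state is still recent.
    $report_cmd
    status=$?

    # Stop the follower so the log file is complete and closed.
    kill "$log_pid" 2>/dev/null
    wait "$log_pid" 2>/dev/null
    return "$status"
}

# Example (as root):
#   capture_logs dmesg.txt "dmesg -w" "nvidia-bug-report.sh"
```

Passing the commands as arguments keeps the helper independent of where nvidia-bug-report.sh is installed on a given distro.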

kodatarule commented 3 months ago

Hi, I tried the proprietary 555.42 beta with the GPU firmware enabled and generated a log. I also tried with MangoHud on and off globally, which didn't make any difference.

nvidia-bug-report.log.gz

aritger commented 3 months ago

To help better isolate this, looking more carefully at your xorg.conf:

Option         "nvidiaXineramaInfoOrder" "DFP-3"
Option         "metamodes" "DP-2: 2560x1440_165 +0+0 {ForceFullCompositionPipeline=On}, DP-0: 2560x1440_165 +2560+0 {ForceFullCompositionPipeline=On}"
Option         "UseNvKmsCompositionPipeline" "false"

Do you see the same performance problems:
(a) if you remove the UseNvKmsCompositionPipeline option
(b) if you remove the {ForceFullCompositionPipeline=On} parts
(c) if you use slightly lower refresh rates? (I assume 2560x1440_165 is running at 165 Hz)

kodatarule commented 3 months ago

Hello. Removing the UseNvKmsCompositionPipeline option makes the stutters even worse. Turning ForceCompositionPipeline/ForceFullCompositionPipeline on or off didn't make any difference; as a side note, these options have no effect on Wayland, which is affected either way. Changing refresh rates had no impact either. It feels like the clocks behave oddly with GSP firmware (the 555 proprietary driver and all open-source drivers before it). I'm not sure what could be causing this, but I've also noticed many reports on the NVIDIA forums from people whose systems show the same behavior with GSP.

mtijanic commented 3 months ago

Update: We've found two possible causes of stutter. Or rather, we found two issues that definitely cause stutter on some configurations, but we still don't have a good idea of how widespread either of them is.

I have published patches that eliminate one and log the other here: https://github.com/NVIDIA/open-gpu-kernel-modules/pull/658

I'd love it if folks that are experiencing these issues would give it a try and report back. Getting a good idea of the impact would help us prioritize getting these in. Many thanks in advance!

ptr1337 commented 3 months ago

@mtijanic Thanks for the patchset! I have applied it to nvidia-open-dkms and pushed it to the testing repository on CachyOS. Users have been notified so they can test it. Sadly, I cannot reproduce this on 40xx GPUs.

ptr1337 commented 3 months ago

Actually, sometimes when taking a screenshot with Spectacle, I'm seeing some small FPS drops on the patched nvidia-open-dkms module. This was not present on the closed one, but I'm not sure if it is related.

nvidia-bug-report.log.gz

Virkkunen commented 3 months ago

Update: We've found two possible causes of stutter. Or rather, we found two issues that definitely cause stutter on some configurations, but we still don't have a good idea of how widespread either of them is.

I have published patches that eliminate one and log the other here: #658

I'd love it if folks that are experiencing these issues would give it a try and report back. Getting a good idea of the impact would help us prioritize getting these in. Many thanks in advance!

What would be the proper process of building and installing this patchset? I'm facing these issues on the open-beta-dkms and I'd like to help troubleshoot with my logs

mtijanic commented 3 months ago

What would be the proper process of building and installing this patchset? I'm facing these issues on the open-beta-dkms and I'd like to help troubleshoot with my logs

First, make sure you have the regular 555.52.04 driver installed in whatever way you normally do it (distro package, .run file, etc.). Then, clone my branch with:

 git clone --single-branch --branch 555-testing-patches https://github.com/mtijanic/open-gpu-kernel-modules.git 555-testing

Then, build it:

 cd 555-testing && make -j16

If successful, it will produce a file kernel-open/nvidia.ko (and many others not relevant here). Check if it exists. Now, you just need to switch to using this instead of your installed nvidia.ko. To find out where it is, you can run

$ modinfo nvidia | grep filename
filename:       /lib/modules/5.15.0-105-generic/kernel/drivers/video/nvidia.ko

The easiest way is to back up the original file and replace it with the newly built one:

cd /lib/modules/5.15.0-105-generic/kernel/drivers/video/
sudo mv nvidia.ko nvidia.ko.backup
sudo cp /path/to/555-testing/kernel-open/nvidia.ko .

Or use symlinks.

You'll need to reload the driver for the change to take effect. A system reboot would do it, but also killing X / your DE and then rmmod would work too. For example:

sudo service lightdm stop # or gdm, etc
sudo rmmod nvidia_uvm nvidia_vgpu_vfio nvidia_drm nvidia_modeset nvidia
sudo service lightdm start

To revert, just restore the original backed up file.
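The backup, replace, and revert steps above can be wrapped in two small helpers so the original module is never overwritten by accident. This is a sketch under assumptions: `install_test_module` and `revert_test_module` are hypothetical names, the installed module is assumed to be an uncompressed `.ko` (some distros ship compressed modules, which this does not handle), and `modinfo -n` is used only to print the installed file's path.

```shell
#!/bin/sh
# Sketch: back up the installed module and drop in the newly built one.
# install_test_module and revert_test_module are hypothetical helper names.

# Usage: install_test_module NEW_KO INSTALLED_KO
install_test_module() {
    new_ko=$1
    installed_ko=$2
    [ -f "$new_ko" ] || { echo "missing build output: $new_ko" >&2; return 1; }
    [ -f "$installed_ko" ] || { echo "missing installed module: $installed_ko" >&2; return 1; }
    # Keep the first backup; never replace it with an already patched module.
    [ -e "$installed_ko.backup" ] || mv "$installed_ko" "$installed_ko.backup" || return 1
    cp "$new_ko" "$installed_ko"
}

# Usage: revert_test_module INSTALLED_KO
revert_test_module() {
    installed_ko=$1
    [ -f "$installed_ko.backup" ] || { echo "no backup found" >&2; return 1; }
    mv -f "$installed_ko.backup" "$installed_ko"
}

# Example (as root; reload the driver or reboot afterwards):
#   install_test_module /path/to/555-testing/kernel-open/nvidia.ko "$(modinfo -n nvidia)"
#   revert_test_module "$(modinfo -n nvidia)"
```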

ptr1337 commented 3 months ago

@Virkkunen If you are on Arch Linux, you can also use the following PKGBUILD: https://github.com/CachyOS/CachyOS-PKGBUILDS/blob/master/nvidia/nvidia-open-dkms/PKGBUILD

@mtijanic I've now tested this for around one week and I'm still getting stutters here and there, mainly when taking screenshots or minimizing windows.

Virkkunen commented 2 months ago

Using @ptr1337's PKGBUILD (on EndeavourOS) I was able to install this patch. So far it seems that the stutter while opening, closing and minimising apps, and while screen recording (with Spectacle), is gone.

However, when moving the cursor I can notice some stutters. Moving it quickly in a circle makes this more apparent, with visible gaps in the circle, as if it's skipping some positions. I tried to record a slow-motion video of this, but it's quite a finicky thing to capture in a recording.

https://github.com/NVIDIA/open-gpu-kernel-modules/assets/9111925/66c77e25-5154-4655-8885-bf89157e0757

nvidia-bug-report.log.gz

xpander69 commented 2 months ago

OK, I built the open modules with the patches, and so far it seems the stutter issues have been fixed! I reported this problem on the NVIDIA forums for the closed modules before; this is my first time using the open ones. RTX 3080, 555.52.04, 6.9.3-cachyos kernel, Arch Linux, MATE Desktop, X11

Edit: OK, there's still a very minor input-related (mouse) stutter now, like a few periodic frametime spikes, which doesn't happen with the closed modules and GSP disabled.

Overall it seems to be a huge improvement, but not yet ideal.

kodatarule commented 2 months ago

After testing out the open modules with the patches, the situation has improved somewhat, but the hitches when opening apps or moving the cursor are still present. Attached is a bug report.

nvidia-bug-report.log.gz

ptr1337 commented 2 months ago

@mtijanic I have just updated to the stable 555.58 driver (the closed one) and enabled the GSP firmware, but these stutters are still present.

I've noticed that your PR got merged. Here is a video; the stutter is mainly visible when taking a screenshot with Spectacle.

https://github.com/NVIDIA/open-gpu-kernel-modules/assets/70081076/7e33f71c-4b6c-4def-b020-85644d96646b

nvidia-bug-report.log.gz

mtijanic commented 2 months ago

Follow-up on this:

Update: We've found two possible causes of stutter. Or rather, we found two issues that definitely cause stutter on some configurations, but we still don't have a good idea of how widespread either of them is.

In 555.58.02 (but not 555.58 from last week) we fixed the bigger of the two causes. Particularly those using kwin should give this a try and report back. 555.58.02 does not include https://github.com/NVIDIA/open-gpu-kernel-modules/pull/658/commits/674c009526b4a47c5dece5a7a2facc7e637bead7 which fixes a different, less frequent cause. You can still apply this commit manually if using the Open modules, and it will be included in 560.xx.

Please test and report back! :heart:

ptr1337 commented 2 months ago

@mtijanic The desktop generally runs fine. The only problem I'm still seeing (both with https://github.com/NVIDIA/open-gpu-kernel-modules/commit/674c009526b4a47c5dece5a7a2facc7e637bead7 and without it) is that Spectacle is sometimes "laggy" and just jumps, as you can see above.

I made you a fresh video and a new nvidia-bug-report.sh log; see below.

https://github.com/NVIDIA/open-gpu-kernel-modules/assets/70081076/da9d2f81-3381-4724-95eb-5a07b37b17ed

nvidia-bug-report.log.gz

Edit:

I will test further with the closed source driver + GSP enabled.

mtijanic commented 2 months ago

I will test further with the closed source driver + GSP enabled.

Please! Closed source and GSP ON vs OFF will give us the best info to triage further.

Thanks a ton, for all the reports you've sent in so far! We might not get a chance to meaningfully reply to them all, but we do really appreciate it.

ptr1337 commented 2 months ago

I will test further with the closed source driver + GSP enabled.

Please! Closed source and GSP ON vs OFF will give us the best info to triage further.

Thanks a ton, for all the reports you've sent in so far! We might not get a chance to meaningfully reply to them all, but we do really appreciate it.

Retested with the closed-source driver with GSP on and off. The issue also appears when the GSP firmware is enabled.

Here is a comparison:

GSP ON:

https://github.com/NVIDIA/open-gpu-kernel-modules/assets/70081076/a1be2c07-2b24-48c1-b8db-fa9214f68f9e

nvidia-bug-report.log.gz

GSP OFF:

https://github.com/NVIDIA/open-gpu-kernel-modules/assets/70081076/7d95163f-d2a5-41f5-8c4d-16d8bdb997a8

nvidia-bug-report.log.gz

Edit: It has definitely improved compared to without the patches, but I still see these hiccups, mainly in Spectacle.

kodatarule commented 2 months ago

With 555.58.02 it has definitely improved a lot, however I still notice a few hiccups here and there.

nvidia-bug-report.log.gz

urbenlegend commented 2 months ago

I just tested with 555.58.02 with GSP off and on and I am still seeing weird judders and hitches simply dragging KDE's Dolphin file manager around on the desktop whenever GSP is enabled. When it is off, the window motion is very smooth.

The issue seems to come and go. With GSP, the first few window moves will be smooth, but continuously moving the window around will cause hitching. Without GSP, it is smooth the entire time.

omnigenous commented 2 months ago

Where exactly do I add NVreg_EnableGpuFirmware=0 to the kernel load on arch linux?

MishaProductions commented 2 months ago

Add nvidia.NVreg_EnableGpuFirmware=0 to the variable GRUB_CMDLINE_LINUX_DEFAULT in the /etc/default/grub file, and run grub-mkconfig -o /boot/grub/grub.cfg

urbenlegend commented 2 months ago

@omnigenous In addition to what MishaProductions said, make sure to prepend the module name to that option, so like nvidia.NVreg_EnableGpuFirmware=0
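After regenerating the GRUB config and rebooting, one way to confirm that the option actually reached the kernel is to look for it in /proc/cmdline. A minimal sketch, assuming the option was added with the `nvidia.` prefix as described above; `has_kernel_param` is a hypothetical helper name.

```shell
#!/bin/sh
# Sketch: check whether a given option is on the kernel command line.
# has_kernel_param is a hypothetical helper name.
# Usage: has_kernel_param OPTION [CMDLINE_FILE]
has_kernel_param() {
    option=$1
    cmdline_file=${2:-/proc/cmdline}
    # Put each whitespace-separated option on its own line, then require an
    # exact, whole-line match so "NVreg_..." without the prefix is not accepted.
    tr -s ' \t' '\n\n' < "$cmdline_file" | grep -qxF -- "$option"
}

if has_kernel_param "nvidia.NVreg_EnableGpuFirmware=0"; then
    echo "nvidia.NVreg_EnableGpuFirmware=0 is on the kernel command line"
else
    echo "option not found on the kernel command line"
fi
```

The exact-match check is deliberate: it fails for the unprefixed spelling, which is the mistake the comment above warns about.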

zoobporsor commented 1 month ago

After I installed the NVIDIA driver on Arch, KDE was super laggy and choppy. When I added nvidia.NVreg_EnableGpuFirmware=0 to my kernel parameters via GRUB, that fixed it and it is now smooth. I use an RTX 2080 Ti.

clapbr commented 1 month ago

The 560 beta is out and claims to improve this; it's still bad on my 3090, though.

SeongGino commented 1 month ago

Oh, so this is where my problem was?

560 beta, 3060ti, Linux 6.9, Plasma 6.1.3 on Arch. On the Wayland session it starts smooth, but just a few seconds after starting it was bad enough that Plasmashell would freeze at what felt like randomly regular intervals. Disabling the GSP as a kernel param made Plasma's Wayland session perfectly smooth.

mtijanic commented 1 month ago

Hey @SeongGino can you check if you have coolercontrol program running? It's a known cause of this stutter since the way it queries the data is by starting and killing an nvidia-smi process all the time. This startup (and especially shutdown) talks to GSP and can stall out some other things.

I believe they've fixed this and switched to NVML, but there's still no release that picked up that patch. See https://gitlab.com/coolercontrol/coolercontrol/-/issues/288

In the meantime, we'll look into ways to make this shutdown less impactful so we don't depend on patching all third party tools.
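To check whether some background tool on your system is repeatedly starting nvidia-smi in this way, you could poll for nvidia-smi processes and print the name of whatever launched them. This is a rough sketch, not an NVIDIA tool: the helper names are hypothetical, it reads Linux's /proc directly, and because it polls it can miss very short-lived processes.

```shell
#!/bin/sh
# Sketch: find out which program keeps launching nvidia-smi.
# parent_command and watch_nvidia_smi are hypothetical helper names.

# Print the command name of the parent of PID (Linux /proc only).
# Usage: parent_command PID
parent_command() {
    ppid=$(awk '/^PPid:/ {print $2}' "/proc/$1/status" 2>/dev/null)
    [ -n "$ppid" ] || return 1
    cat "/proc/$ppid/comm" 2>/dev/null
}

# Poll for nvidia-smi processes and report who started each one.
# Usage: watch_nvidia_smi SAMPLES INTERVAL_SECONDS
watch_nvidia_smi() {
    samples=$1
    interval=$2
    i=0
    while [ "$i" -lt "$samples" ]; do
        for pid in $(pgrep -x nvidia-smi); do
            # The process may already have exited; skip it in that case.
            parent=$(parent_command "$pid") || continue
            echo "nvidia-smi (pid $pid) started by: $parent"
        done
        sleep "$interval"
        i=$((i + 1))
    done
}

# Example: sample 50 times, 0.2 s apart (fractional sleep needs GNU sleep):
#   watch_nvidia_smi 50 0.2
```

If the same parent name shows up over and over, that program is a likely candidate for the stutter described here.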

urbenlegend commented 1 month ago

Nvidia 560 with GSP on seems to have lessened the frame drops but it is still not on par with GSP off. If I drag a window around (say KDE Dolphin) for an extended period of time, like greater than 5 seconds, it will start stuttering again. It is much less pronounced than before, but still not the perfectly smooth action that you get with GSP off.

SeongGino commented 1 month ago

Hey @SeongGino can you check if you have coolercontrol program running? It's a known cause of this stutter since the way it queries the data is by starting and killing an nvidia-smi process all the time. This startup (and especially shutdown) talks to GSP and can stall out some other things.

I believe they've fixed this and switched to NVML, but there's still no release that picked up that patch. See https://gitlab.com/coolercontrol/coolercontrol/-/issues/288

In the meantime, we'll look into ways to make this shutdown less impactful so we don't depend on patching all third party tools.

I've never heard of or used this coolercontrol in my life.

But, I do have Plasma System Monitor applets running, one of them set to track GPU Usage stats. Enabling the GSP and removing the GPU monitor widget did seem to resolve the stutter for me in the Wayland session.

As far as I can tell, anyways--dragging a window around like Dolphin doesn't seem to be exhibiting the same hitching.

mtijanic commented 1 month ago

Thanks @SeongGino! Do you know what exact applet this is? Please keep in mind I'm not at all familiar with KDE and its family of tools, so dumb it down for me :)

I found this https://github.com/lestofante/ksysguard-gpu which already has an issue open for this.

SeongGino commented 1 month ago

@mtijanic It's not an external component like what you've linked; it looks like it's part of the stock Plasma desktop widgets--or if it is extra, it most likely comes with KSysGuard.

[Two screenshots attached: "2024_07-25 130818" and "2024_07-25 130319-CPU Usage Settings"]

mtijanic commented 1 month ago

Thanks! If I'm reading it correctly, the relevant code is at https://invent.kde.org/plasma/libksysguard/-/blob/master/processcore/plugins/nvidia/nvidia.cpp?ref_type=heads and it indeed spawns an nvidia-smi process every time to get the data. This should ideally move to using libnvidia-ml.so, but this is the first time I'm seeing this code so I can't say how easy that is to do.

SeongGino commented 1 month ago

I see! Well, I posted an issue on KDE's bugtracker linking back to this issue, so hopefully there will be some response.

Kimiblock commented 1 month ago

560 reduces the stutter, but it is nowhere near Proprietary + GSP Off.

If I scroll in Firefox after the desktop sits idle for some time, it'll lag for seconds.

Kimiblock commented 1 month ago

Also, GNOME suffers from this issue a lot, especially when opening the Overview. It stutters almost every time even if triple buffering is enabled.

Kimiblock commented 1 month ago

If I run something demanding on the GPU, the performance level jumps to P0 and GNOME is smooth again. Maybe pinning the perf level can work around this.

bugQ commented 1 month ago

I see! Well, I posted an issue on KDE's bugtracker linking back to this issue, so hopefully there will be some response.

looks like you got your response, @mtijanic:

The problem with using the suggested library is that the headers are in a proprietary SDK that cannot be freely distributed, which means that it would make the NVidia GPU integration practically unbuildable on most machines. Even if we were to include the header in ksystemstats (which its license doesn't actually allow, but I see some projects do) we'd still be stuck since the library itself is bundled in the driver and that is generally also not installed on build machines.

So ultimately, running nvidia-smi is pretty much the only way we can support this without introducing a nasty build system issue. And frankly, it seems to me that it's an upstream issue anyway? Running nvidia-smi shouldn't have such an impact in the first place?

— Arjen Hiemstra 2024-08-06 11:03:41 UTC

mtijanic commented 3 weeks ago

NVIDIA bug 4804613 filed to track stutter with ksysguard (and nvidia-smi pmon in general)