NVIDIA / open-gpu-kernel-modules

NVIDIA Linux open GPU kernel module source

CSGO running smooth for a couple seconds, then HEAVILY dropping, then going back to normal, repeat #335

Closed duckyondiscord closed 1 year ago

duckyondiscord commented 2 years ago

NVIDIA Open GPU Kernel Modules Version

515.57-9

Does this happen with the proprietary driver (of the same version) as well?

No

Operating System and Version

Arch Linux

Kernel Release

5.19.0-rc7-1-mainline

Hardware: GPU

GPU 0: NVIDIA GeForce RTX 3050 Laptop GPU (UUID: GPU-712fbdf4-63a5-5e55-3624-58bcb8b9aac3)

Describe the bug

CSGO reaches about the same peak framerate with the open-source driver as with the proprietary driver. With the open-source driver, however, the game runs smoothly (200-250 FPS) for around 3-5 seconds, drops to around 16-20 FPS for about the same length of time, then returns to normal, and the cycle repeats indefinitely.

To Reproduce

Bug Incidence

Always

nvidia-bug-report.log.gz

nvidia-bug-report.log.gz

More Info

No response

niv commented 2 years ago

Thanks for the report. This is on a clean system with no significant background processes on your end (e.g. backup running)?

duckyondiscord commented 2 years ago

By backup you mean a program that constantly backs up files? If so, there is no program manipulating files in the background that I know of. KDE Plasma's baloo file indexer is also disabled.

There's absolutely nothing impacting framerate running in the background that I know of. I usually do checks every week/day to see if there's stuff running that I don't want.

niv commented 2 years ago

Yep, I was just asking if there's something that eats significant system resources, to explain the hiccups.

duckyondiscord commented 2 years ago

With the same set of background programs (not many, and none that significantly impact resource usage), the proprietary driver does not have these hiccups.

niv commented 2 years ago

Do you perchance know if you saw this happen on a previous OpenRM release (not .57)?

duckyondiscord commented 2 years ago

This one's the first open-source driver I tried, if that's what you mean.

niv commented 2 years ago

Thanks for the report. Tracking internally in bug 3732803.

This issue will be updated when there's progress.

duckyondiscord commented 2 years ago

Alright, thanks a lot!

aritger commented 2 years ago

@duckyondiscord : One experiment that would be useful is if you could test with the proprietary driver but with GSP enabled (the open kernel modules unconditionally use GSP firmware, but the proprietary driver defaults to not yet using GSP firmware on GeForce RTX 3050). It would help to know if the performance problem reproduces with the proprietary driver + GSP.

http://us.download.nvidia.com/XFree86/Linux-x86_64/515.57/README/gsp.html

Add options nvidia NVreg_EnableGpuFirmware=1 to a modprobe.d configuration file.

duckyondiscord commented 2 years ago

Sure, I can try that

duckyondiscord commented 2 years ago

I don't really know how modprobe configs work, but I'm guessing that I just create a config file named nvidia.conf in /etc/modprobe.d/ and put those options in it, right?

aritger commented 2 years ago

Correct. See also the modprobe.d(5) man page for more details.

You can follow the same pattern you're using to set the NVreg_OpenRmEnableUnsupportedGpus kernel module parameter to enable the open kernel modules on a notebook GPU... or is it the Arch Linux package that sets that?
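
For readers following along, here is a minimal sketch of what the configuration line looks like and how to check an existing file for it. It assumes the "options <module> <key>=<value> ..." syntax described in modprobe.d(5). The helper names options_line and parse_options are hypothetical and not part of any NVIDIA tooling. The module and parameter names are the ones quoted in this thread.

```python
from typing import Dict


def options_line(module: str, params: Dict[str, str]) -> str:
    """Build one modprobe.d line of the form: options <module> key=value ..."""
    rendered = " ".join(f"{key}={value}" for key, value in params.items())
    return f"options {module} {rendered}"


def parse_options(conf_text: str) -> Dict[str, Dict[str, str]]:
    """Collect module parameters from 'options' lines, skipping blanks and # comments."""
    found: Dict[str, Dict[str, str]] = {}
    for raw in conf_text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):
            continue
        fields = line.split()
        if fields[0] != "options" or len(fields) < 3:
            continue
        module_params = found.setdefault(fields[1], {})
        for field in fields[2:]:
            key, _, value = field.partition("=")
            module_params[key] = value
    return found


# The line to place in a file such as /etc/modprobe.d/nvidia.conf (name from the thread).
print(options_line("nvidia", {"NVreg_EnableGpuFirmware": "1"}))
```

Module options are read when the module is loaded, so the setting only applies the next time the nvidia module loads.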

duckyondiscord commented 2 years ago

Okay, I tested it with the GSP enabled, and I'm actually seeing a performance BENEFIT of about 10-20FPS instead of the lag I see with the open driver. And, yes, nvidia-smi -q | grep GSP shows that it's in fact enabled.
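
The nvidia-smi check mentioned above can be scripted. The sketch below assumes the nvidia-smi -q output contains a line labelled "GSP Firmware Version" that shows a version string when GSP is in use and N/A otherwise. That label is an assumption based on the driver README linked earlier in the thread, and the function names are hypothetical.

```python
import re
import subprocess
from typing import Optional


def gsp_firmware_version(query_output: str) -> Optional[str]:
    """Return the GSP firmware version from `nvidia-smi -q` text, or None if absent or N/A."""
    match = re.search(
        r"^\s*GSP Firmware Version\s*:\s*(\S+)", query_output, re.MULTILINE
    )
    if match is None or match.group(1) == "N/A":
        return None
    return match.group(1)


def query_nvidia_smi() -> str:
    """Run `nvidia-smi -q` and return its stdout (requires an installed NVIDIA driver)."""
    result = subprocess.run(
        ["nvidia-smi", "-q"], capture_output=True, text=True, check=True
    )
    return result.stdout
```

On the affected machine, gsp_firmware_version(query_nvidia_smi()) would be expected to return a version string with GSP enabled and None otherwise.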

duckyondiscord commented 2 years ago

A note on my original report: in 515.57-9, the -9 suffix is added by the Arch package, so treat the version as plain 515.57.

aritger commented 2 years ago

Well, that's not what I expected, but good to know. To be clear: with GSP enabled, you see stable performance (10-20 FPS higher than without GSP), rather than 3-5 seconds at one performance level and then 3-5 seconds at the faster one?

duckyondiscord commented 2 years ago

It's sort of hard to measure, as CSGO's performance fluctuates a lot depending on the map and, obviously, on how many people, guns, and particles are in the scene, but I'd say that on average I get better performance with GSP enabled.

What is interesting, though, is that I see a new issue with the proprietary driver plus GSP: the game freezes completely around 15-30 minutes into playing. I have no idea whether I can report proprietary-driver issues on this GitHub page, though.

mtijanic commented 2 years ago

The unexpected part is that the proprietary module running in GSP-offload is faster than the open source module. Those two should behave mostly identically.

Is it possible to get an nvidia-bug-report.log from the proprietary driver running GSP-offload and running CS:GO?

duckyondiscord commented 2 years ago

Sure, I can manage that.

duckyondiscord commented 2 years ago

nvidia-bug-report.log.gz

duckyondiscord commented 2 years ago

Is this still not addressed? How is the GSP interface different in the open driver? BTW, is GSP faster in other applications (I am talking about the proprietary driver here, since it is faster in CSGO)?

I will try this again shortly and will test some other apps with the GSP

bno1 commented 2 years ago

I have a Lenovo Legion 7 15IMHg05 (RTX 2060) and in basically every game I get random FPS drops every ~2 minutes. I discovered that during those drops the CPU clock falls from 3-4 GHz to 800 MHz.

I found 2 different solutions to this problem:

  a) Disable CPU turbo
  b) Switch my laptop from quiet/balanced mode to performance mode

I don't know the source of this problem. I have a few more details in an Arch forum thread, in case I missed something: https://bbs.archlinux.org/viewtopic.php?id=273136
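
A CPU clock drop like the one described above can be checked by logging core frequencies while the game runs. This is only a rough sketch, assuming the standard Linux cpufreq sysfs files (scaling_cur_freq, reported in kHz). The function names and the threshold are hypothetical choices and are not taken from the forum thread.

```python
import glob
import time
from typing import List, Tuple

CPUFREQ_GLOB = "/sys/devices/system/cpu/cpu[0-9]*/cpufreq/scaling_cur_freq"


def read_cpu_freqs_khz() -> List[int]:
    """Read the current frequency of every core from cpufreq sysfs (values in kHz)."""
    freqs = []
    for path in sorted(glob.glob(CPUFREQ_GLOB)):
        with open(path) as handle:
            freqs.append(int(handle.read().strip()))
    return freqs


def sample_max_freq(duration_s: float, interval_s: float = 0.5) -> List[Tuple[float, int]]:
    """Record (elapsed seconds, fastest core frequency) pairs for duration_s seconds."""
    samples = []
    start = time.monotonic()
    while time.monotonic() - start < duration_s:
        freqs = read_cpu_freqs_khz()
        if freqs:
            samples.append((time.monotonic() - start, max(freqs)))
        time.sleep(interval_s)
    return samples


def throttled_samples(
    samples: List[Tuple[float, int]], threshold_khz: int
) -> List[Tuple[float, int]]:
    """Keep the samples whose fastest core is still below the threshold."""
    return [(stamp, freq) for stamp, freq in samples if freq < threshold_khz]
```

Tracking the fastest core means a sample is only flagged when every core is below the threshold. For example, throttled_samples(sample_max_freq(120), 1_000_000) lists the moments in a two-minute run when no core exceeded 1 GHz, which can then be compared against the observed FPS drops.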

mtijanic commented 2 years ago

Hi @duckyondiscord, we did use CS:GO in a lot of our internal testing (and some folks that dogfood this driver also play CS:GO), and while we did get our fair share of issues, we were unable to reproduce this particular instance.

Typically when we see such severe frame stutter, it is because there's a different process polling some GPU state (e.g. MangoHUD or similar overlay using NVML to get GPU stats). I assume that is not the case here, since you did mention no background processes, but just checking? That would not explain the difference between Open-GSP and Proprietary-GSP versions anyway.

Anyway, would it be possible for you to run some additional diagnostics on your machine? We need some way to correlate these stutters with what the driver is doing, and from just the log it's hard to say since we have no idea what the game is doing at the time. I can think of two ways to get this data:

  1. A bpftrace script that will instrument both csgo and the kernel driver
  2. A small lib to LD_PRELOAD when running csgo and (possibly) a patch to the kernel driver

Please let me know if you're comfortable with either of these options (obviously source available for all of it) and I can prepare it.

Either way, thanks a bunch for the report and your testing so far!

duckyondiscord commented 2 years ago

Sorry for the late response. I can do those if you tell me where to get the bpftrace script, the driver patch, and the library to load with LD_PRELOAD.

Edit: I don't have anything polling GPU state/stats running.

mtijanic commented 2 years ago

Thanks! I'll get back to you once I've prepared the scripts. Installing CS:GO now :)

duckyondiscord commented 2 years ago

I've been having another issue with render offloading recently, which may stop me from being able to debug this for a while. It's a really strange one: CS:GO seems to be using the NVIDIA GPU, according to nvidia-smi, but my frame rate is stuck around 60-75 FPS, and it feels even lower. It also occurs on the proprietary driver, so I don't know where to report this one.

duckyondiscord commented 1 year ago

Closing as it got fixed; I don't know which update fixed it, though.