ValveSoftware / steam-for-linux

Issue tracking for the Steam for Linux beta client

Precompiled shaders are pruned by nvidia driver, causing stutter from cold cache effects in games with excessive numbers of shaders #11392

Open ryao opened 1 day ago

ryao commented 1 day ago

Your system information

Please describe your issue in as much detail as possible:

A friend complained to me about Overwatch 2 suffering stutter from constant shader compilation with the nvidia driver, so I looked into it. It appears that the nvidia driver is pruning the shader cache, causing it to compile shaders ad infinitum during game sessions, which causes stutter.

If we launch Overwatch 2 with DXVK_HUD=compiler %command%, we can see when it compiles shaders, which is quite often. If we view CPU utilization on another screen (or by alt-tabbing) in htop or another tool, we can also see that CPU utilization is often pegged at 100%.

The nvidia driver historically limited the size of its shader cache to 128MB, but the 460 driver increased this to 1024MB:

https://www.nvidia.com/download/driverResults.aspx/167671/en-us/

Under the assumption that we were hitting the shader cache limit, I had my friend add __GL_SHADER_DISK_CACHE_SIZE=10737418240 to his launch options to increase his shader cache size to 10GB as per:

https://download.nvidia.com/XFree86/Linux-x86_64/550.127.05/README/openglenvvariables.html
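For anyone wanting to try a different limit, here is a minimal sketch of how the byte value above is derived and how the launch options string can be put together. The helper name `launch_options` is made up for illustration; the only things taken from this report are the two environment variables and Steam's `%command%` placeholder.

```python
from typing import Optional

GIB = 1024 ** 3  # bytes in one GiB; 10 * GIB == 10737418240


def launch_options(cache_gib: int, dxvk_hud: Optional[str] = "compiler") -> str:
    """Build a Steam launch options string (hypothetical helper).

    cache_gib is converted to bytes because the value used in this report
    for __GL_SHADER_DISK_CACHE_SIZE is a byte count.
    """
    parts = [f"__GL_SHADER_DISK_CACHE_SIZE={cache_gib * GIB}"]
    if dxvk_hud:
        # Optional: show the DXVK shader compiler activity indicator.
        parts.append(f"DXVK_HUD={dxvk_hud}")
    parts.append("%command%")
    return " ".join(parts)


print(launch_options(10))
```

The printed string can be pasted into the game's launch options field in Steam.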

After a 2 hour session, my friend reported to me that the excessive shader compilation had stopped and his shader cache was 5.7GB, which far exceeds the 1GB default limit.
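To compare cache sizes the same way across machines, something like the following sketch could be used to total the size of a cache directory. The directory path is deliberately left as an argument, since where the driver's cache lives depends on the setup; the function names are made up for illustration, and the "GB" figure here uses decimal gigabytes, which is an assumption about the units quoted above.

```python
import os


def dir_size_bytes(path: str) -> int:
    """Sum the sizes of all regular files under path (hypothetical helper)."""
    total = 0
    for root, _dirs, files in os.walk(path):
        for name in files:
            try:
                # lstat so that symlinks are not followed.
                total += os.lstat(os.path.join(root, name)).st_size
            except OSError:
                # A file may disappear mid-walk if the cache is being
                # written or pruned at the same time; skip it.
                pass
    return total


def human_gb(size_bytes: int) -> str:
    """Format a byte count as decimal gigabytes with one decimal place."""
    return f"{size_bytes / 1e9:.1f}GB"
```

Calling `human_gb(dir_size_bytes(cache_dir))` before and after a session, with `cache_dir` pointing at wherever the shader cache is stored, gives numbers comparable to the ones in this report.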

I actually joined him in the session, and confirmed excessive shader compilation. Before setting __GL_SHADER_DISK_CACHE_SIZE, I could not launch Overwatch 2 without excessive CPU utilization (all cores being pegged) while DXVK claimed shader compilation was being done. Additionally, the excessive CPU utilization was so bad on my machine that Discord had static noise, both in what I heard when my friend spoke and, reportedly, in what he heard when I spoke. After setting __GL_SHADER_DISK_CACHE_SIZE and joining my friend's session, my shader cache grew to 4.3GB by the time the session ended. Immediately after relaunching Overwatch 2, I could see that my CPU utilization was low, which did not happen until I set __GL_SHADER_DISK_CACHE_SIZE.

In order to be absolutely certain of what is happening, I would want to be able to tell steam to re-run fossilize replay and watch the size of the shader cache, but I cannot find a documented way to do that no matter how much I look. Without that, investigating this further would be a major time drain, but I think I have enough information to make a report and hopefully someone at Valve could advise on how to force fossilize_replay to run on a game so people can gather more data points efficiently.

That said, based on what I have observed, one of two things seems to be happening under the default settings:

  1. The nvidia driver is actively pruning the shader cache that fossilize_replay tries to prime as it runs.
  2. The nvidia driver prunes the primed shader cache at game launch.

Either way, the shader cache on Nvidia hardware is being actively pruned to keep it within the default limit, making it useless for games with many gigabytes of shaders under the default settings, as cold cache effects will persist due to the default cache size limit.

Steps for reproducing this issue:

  1. Get a machine with an Nvidia GPU
  2. Enable Steam shader pre-caching
  3. Install Overwatch 2 (other games may also have this issue)
  4. Set DXVK_HUD=compiler %command% as launch options to Overwatch 2
  5. Play Overwatch 2 for an hour
  6. Observe excessive CPU utilization and shader compilation, plus stutter
  7. Restart Overwatch 2 and play it for an hour
  8. Observe excessive CPU utilization and shader compilation, plus stutter
  9. Add __GL_SHADER_DISK_CACHE_SIZE=10737418240 to launch options
  10. Restart Overwatch 2 and play it for an hour
  11. Observe excessive CPU utilization and shader compilation gradually go away, along with associated stutter
  12. Restart Overwatch 2 and play it for an hour
  13. Observe there is no excessive CPU utilization or shader compilation

kisak-valve commented 1 day ago

Hello @ryao, this issue should be reported to your video driver vendor.

kisak-valve commented 1 day ago

Thinking about this a bit more, Steam sets __GL_SHADER_DISK_CACHE_SKIP_CLEANUP=1 when running games and the NVIDIA driver should respond to that environment variable by disabling the driver's disk cache limits.
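One way to confirm that a running game actually inherited this variable is to read the process's initial environment from `/proc/<pid>/environ` on Linux. The sketch below is only an illustration: the function names are made up, the PID has to be found separately, and it only shows what the process was started with, not how the driver interprets it.

```python
def parse_environ(raw: bytes) -> dict:
    """Parse the NUL-separated KEY=VALUE format used by /proc/<pid>/environ."""
    env = {}
    for entry in raw.split(b"\0"):
        if b"=" in entry:
            key, _, value = entry.partition(b"=")
            env[key.decode(errors="replace")] = value.decode(errors="replace")
    return env


def skip_cleanup_enabled(env: dict) -> bool:
    """True if __GL_SHADER_DISK_CACHE_SKIP_CLEANUP is set to 1."""
    return env.get("__GL_SHADER_DISK_CACHE_SKIP_CLEANUP") == "1"


def read_process_environ(pid: int) -> dict:
    """Read the initial environment of a running process (Linux only)."""
    with open(f"/proc/{pid}/environ", "rb") as f:
        return parse_environ(f.read())
```

For example, `skip_cleanup_enabled(read_process_environ(pid))` with the game's PID would show whether the variable reached the game process.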

ryao commented 1 day ago

@kisak-valve You are right. Interestingly, I just started Overwatch 2 again and it is compiling shaders. Thanks to DXVK_HUD=compiler, I see Compiling Shaders... (44%) at the bottom. That did not happen an hour ago.

Before I filed the issue, I had tried unchecking/rechecking "Enable Shader Pre-caching" to try to force steam to run fossilize replay, but it did not seem to work. I observed no change in the nvidia cache. I then started Overwatch 2, observed low CPU utilization and opened the issue. Now an hour later I start Overwatch 2 again and suddenly see high CPU utilization from DXVK doing shader compilation. The shader cache is also growing. This non-deterministic behavior does not really make sense to me.

It would be helpful if there was some way of running fossilize replay on demand so that the behavior could be studied. That is the reason I opened this issue despite my research being half baked. I don't have the time to investigate this for my friend unless I can prepopulate the cache to study the behavior from different states. He had switched to Linux on my recommendation and is understandably annoyed that Overwatch 2 has some stutter issues. Initial observations show that it is shader disk cache related, but more experiments are needed to understand the behavior and being able to invoke fossilize replay would make those easy to do.

ryao commented 1 day ago

Overwatch 2 somehow stopped rendering new frames from excessive alt-tabbing so I killed it during DXVK shader compilation and restarted it. It now isn't compiling shaders... this behavior is bizarre. :/

My cache is now 7.5GB in size.

ryao commented 1 day ago

I just played a game after changing some graphical settings and the on-disk shader cache shrank from 7.5GB to 4.5GB afterward. There is some kind of pruning happening, even though __GL_SHADER_DISK_CACHE_SKIP_CLEANUP=1 is set. :/
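For anyone collecting more data points, a shrink like this can be flagged automatically from a series of cache size measurements. This is a rough sketch, not a diagnosis tool: the function name and the sample labels are hypothetical, and a drop in size only suggests pruning, since it cannot tell who deleted the files.

```python
def find_shrinks(samples, min_drop_bytes=0):
    """Return (before_label, after_label, bytes_lost) for every drop in size.

    samples is a chronological list of (label, size_in_bytes) pairs, for
    example one measurement taken before and one after each game session.
    """
    events = []
    for (prev_label, prev_size), (label, size) in zip(samples, samples[1:]):
        drop = prev_size - size
        if drop > min_drop_bytes:
            events.append((prev_label, label, drop))
    return events


# Illustrative input using the sizes mentioned above (decimal GB assumed).
samples = [
    ("before changing settings", 7_500_000_000),
    ("after playing a game", 4_500_000_000),
]
for before, after, lost in find_shrinks(samples):
    print(f"cache shrank by {lost / 1e9:.1f}GB between '{before}' and '{after}'")
```

Setting `min_drop_bytes` to a non-zero value filters out small fluctuations so only large drops are reported.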

ryao commented 1 hour ago

I reported this to Nvidia. It would be helpful if Valve documented how to trigger fossilize replay on demand to prime the cache to make it easier to observe the cache pruning bug.