Open ryao opened 3 weeks ago
Hello @ryao, this issue should be reported to your video driver vendor.
Thinking about this a bit more, Steam sets __GL_SHADER_DISK_CACHE_SKIP_CLEANUP=1 when running games and the NVIDIA driver should respond to that environment variable by disabling the driver's disk cache limits.
@kisak-valve You are right. Interestingly, I just started Overwatch again and it is compiling shaders. Thanks to DXVK_HUD=compiler, I see "Compiling Shaders... (44%)" at the bottom. That did not happen an hour ago.
Before I filed the issue, I had tried unchecking and rechecking "Enable Shader Pre-caching" to try to force Steam to run fossilize replay, but it did not seem to work. I observed no change in the nvidia cache. I then started Overwatch 2, observed low CPU utilization and opened the issue. Now, an hour later, I start Overwatch 2 again and suddenly see high CPU utilization from DXVK doing shader compilation. The shader cache is also growing. This non-deterministic behavior does not really make sense to me.
It would be helpful if there was some way of running fossilize replay on demand so that the behavior could be studied. That is the reason I opened this issue despite my research being half baked. I don't have the time to investigate this for my friend unless I can prepopulate the cache to study the behavior from different states. He had switched to Linux on my recommendation and is understandably annoyed that Overwatch 2 has some stutter issues. Initial observations show that it is shader disk cache related, but more experiments are needed to understand the behavior and being able to invoke fossilize replay would make those easy to do.
Overwatch 2 somehow stopped rendering new frames from excessive alt-tabbing so I killed it during DXVK shader compilation and restarted it. It now isn't compiling shaders... this behavior is bizarre. :/
My cache is now 7.5GB in size.
I just played a game after changing some graphical settings and the on-disk shader cache shrank from 7.5GB to 4.5GB afterward. There is some kind of pruning happening, even though __GL_SHADER_DISK_CACHE_SKIP_CLEANUP=1 is set. :/
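For anyone who wants to record when the cache grows or shrinks instead of checking by hand, here is a minimal sketch in Python. The cache directory is not stated anywhere in this thread, so the path is left as a parameter; `dir_size_bytes` and `watch` are hypothetical helper names, not part of Steam, DXVK, or the driver.

```python
import os
import time

def dir_size_bytes(root):
    """Sum the sizes of all regular files under root (symlinks are skipped)."""
    total = 0
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            try:
                if not os.path.islink(path):
                    total += os.path.getsize(path)
            except OSError:
                # The file vanished between listing and stat, e.g. it was pruned.
                pass
    return total

def watch(root, interval_s=60, samples=5):
    """Print a timestamped size of root every interval_s seconds."""
    for _ in range(samples):
        size = dir_size_bytes(root)
        print(f"{time.strftime('%H:%M:%S')} {size / 2**30:.2f} GiB")
        time.sleep(interval_s)
```

Calling `watch("/path/to/shader/cache")` in a terminal before launching the game and leaving it running across a session would give a rough timeline of the cache size. The path there is a placeholder to be replaced with wherever the cache actually lives on the system under test.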
I reported this to Nvidia. It would be helpful if Valve documented how to trigger fossilize replay on demand to prime the cache to make it easier to observe the cache pruning bug.
I filed internal bug 4934720.
Could you please share which driver you are using? Ideally, also a proton.log or DXVK log.
Just to verify, are you setting both __GL_SHADER_DISK_CACHE_SKIP_CLEANUP=1 and __GL_SHADER_DISK_CACHE_SIZE=10737418240?
@peterkohaut-nv Thanks, but this was filed at Nvidia yesterday as internal bug 4932793. I did not get the bug number to post here until about an hour ago. A number of details are already there, including nvidia-bug-report.log.gz files.
This was observed on Kubuntu 24.04 running 550.107.02, and on Gentoo running 560.28.03.
The cold cache effects were observed when only __GL_SHADER_DISK_CACHE_SKIP_CLEANUP=1 was set. They disappeared soon after adding __GL_SHADER_DISK_CACHE_SIZE=10737418240. Unfortunately, I don't have DXVK or proton log files at the moment. I will try to get them for you.
Right now, from what I have been told, there seem to be two issues:
1. When __GL_SHADER_DISK_CACHE_SKIP_CLEANUP=1 is set, new shaders are not added to the disk cache if the cache size exceeds __GL_SHADER_DISK_CACHE_SIZE. This is likely the bug that prompted filing this report. This behavior is contrary to what everyone outside Nvidia thought that __GL_SHADER_DISK_CACHE_SKIP_CLEANUP=1 did. Also, coincidentally, __GL_SHADER_DISK_CACHE_SKIP_CLEANUP being set triggers the skip cleanup behavior, and the value is actually ignored, so setting it to 0 would not override Valve's setting, which complicates testing.
2. When __GL_SHADER_DISK_CACHE_SIZE is set and is bigger than the disk cache size, sometimes the disk cache will shrink across game sessions. This was observed after filing this issue and we do not know how to reproduce it reliably. Yesterday, I had thought it was related to 1, but that apparently is not the case.
@kisak-valve It would be helpful if Valve provided a way to force Steam to trigger fossilize replay to prime the disk cache between runs to help figure out how to reproduce 2. Is there any chance you could put us in touch with someone who could say how to do that?
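To make the point in issue 1 about the ignored value concrete, here is a small Python sketch. It is only a model of the behavior as described above (presence of the variable counts, the value does not); it is not the driver's actual logic, and the function names are hypothetical.

```python
SKIP_CLEANUP = "__GL_SHADER_DISK_CACHE_SKIP_CLEANUP"

def skip_cleanup_active(env):
    """Model of the reported behavior: only presence matters, the value is ignored."""
    return SKIP_CLEANUP in env

def without_skip_cleanup(env):
    """Return a copy of env with the variable removed entirely."""
    return {k: v for k, v in env.items() if k != SKIP_CLEANUP}

steam_env = {SKIP_CLEANUP: "1"}                    # what Steam is said to set
overridden = dict(steam_env, **{SKIP_CLEANUP: "0"})  # attempt to override with 0

print(skip_cleanup_active(steam_env))                        # True
print(skip_cleanup_active(overridden))                       # True, the 0 is ignored
print(skip_cleanup_active(without_skip_cleanup(steam_env)))  # False
```

Under that model, the only way to test with the behavior off is to remove the variable from the game's environment altogether, for example by passing `env=without_skip_cleanup(os.environ)` to `subprocess.run`. Whether the variable can be removed this way for a game launched through Steam has not been verified here.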
Your system information
Please describe your issue in as much detail as possible:
A friend complained to me about Overwatch 2 suffering stutter from constant shader compilation with the nvidia driver, so I looked into it. It appears that the nvidia driver is pruning the shader cache, causing it to compile shaders ad infinitum during game sessions, which causes stutter.
If we launch Overwatch 2 with DXVK_HUD=compiler %command%, we can see when it compiles shaders, which is quite often. If we view CPU utilization on another screen (or by alt-tabbing) in htop or another tool, we can also see that CPU utilization is often pegged at 100%.
The nvidia driver historically limited the size of its shader cache to 128MB, but the 460 driver increased this to 1024MB:
https://www.nvidia.com/download/driverResults.aspx/167671/en-us/
Under the assumption that we were hitting the shader cache limit, I had my friend add __GL_SHADER_DISK_CACHE_SIZE=10737418240 to his launch options to increase his shader cache size to 10GB as per:
https://download.nvidia.com/XFree86/Linux-x86_64/550.127.05/README/openglenvvariables.html
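The value 10737418240 is 10 GiB expressed in bytes. A short Python sketch of the arithmetic, in case someone wants to try a different limit (`cache_size_bytes` is just an illustrative helper name):

```python
GIB = 1024 ** 3  # bytes in one GiB

def cache_size_bytes(gib):
    """Convert a size in GiB to the byte count used for the environment variable."""
    return gib * GIB

print(cache_size_bytes(10))  # 10737418240
print(f"__GL_SHADER_DISK_CACHE_SIZE={cache_size_bytes(10)}")
```

By the same arithmetic, the 1024MB default mentioned above corresponds to `cache_size_bytes(1)`, i.e. 1073741824 bytes.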
After a 2 hour session, my friend reported to me that the excessive shader compilation had stopped and his shader cache was 5.7GB, which far exceeds the 1GB default limit.
I actually joined him in the session, and confirmed excessive shader compilation. Before setting __GL_SHADER_DISK_CACHE_SIZE, I could not launch Overwatch 2 without excessive CPU utilization (all cores being pegged) while DXVK claimed shader compilation was being done. Additionally, the excessive CPU utilization was so bad on my machine that Discord had static noise, both in what I heard when my friend spoke and, reportedly, in what he heard when I spoke. After setting __GL_SHADER_DISK_CACHE_SIZE and joining my friend's session, my shader cache grew to 4.3GB by the time the session ended. Immediately after relaunching Overwatch 2, I can see my CPU utilization is low, which did not happen until setting __GL_SHADER_DISK_CACHE_SIZE.
In order to be absolutely certain of what is happening, I would want to be able to tell steam to re-run fossilize replay and watch the size of the shader cache, but I cannot find a documented way to do that no matter how much I look. Without that, investigating this further would be a major time drain, but I think I have enough information to make a report and hopefully someone at Valve could advise on how to force fossilize_replay to run on a game so people can gather more data points efficiently.
That said, based on what I have observed, one of two things seems to be happening under the default settings:
Either way, the shader cache on Nvidia hardware is being actively pruned to keep it within the default limit, making it useless for games with many gigabytes of shaders under the default settings, as cold cache effects will persist due to the default cache size limit.
Steps for reproducing this issue:
1. Add DXVK_HUD=compiler %command% as launch options to Overwatch 2
2. Add __GL_SHADER_DISK_CACHE_SIZE=10737418240 to launch options