ValveSoftware / steam-for-linux

Issue tracking for the Steam for Linux beta client

Precompiling to unusable shader cache directories on Nvidia #9803

Open jkrhu opened 1 year ago

jkrhu commented 1 year ago

Compatibility Report

The Shader Pre-Caching System/Fossilize compiles gigabytes of shader caches in directories that are never used by the game.

Symptoms

The latest example is Deus Ex: Human Revolution. It took a long time to precompile over 2.5 GB of GLCache into a directory that the Nvidia shader compiler completely ignores: it neither writes to it nor reads from it. Instead, it creates its own folder structure, where I am basically starting from scratch, while gigabytes of dead cache files sit on my hard drive.

How does Steam deal with this? Is it smart enough to know that those gigabytes of data aren't actually being used? Can it eventually purge those files, e.g. when nothing new gets merged into them or written to them over time? Is this caused by a change in the Nvidia shader compiler in more recent drivers?

It would be great to know what happens with stale/old/unused shader caches.

With DX:HR I have a directory under GLCache named de58603f3345dbf900e391978751ae76 with over 2.5 GB of cache files that my system crunched through and never touched again. There is another directory named 8b59f9b4a84a1bd85f572e8a5406b4a1 that has only the cache that I've built and the cache that Steam uploads from me to the servers and then downloads the next day (a delta update of a few KB).

With other games, it's even worse. Final Fantasy XIII had me crunch through 5 GB of cache, which then got merged into about 3 GB that just sits there, doing nothing. Can something be done about it?

System Information

Reproduction

  1. Download DX:HR from Steam.
  2. Pre-compile all shaders
  3. Notice stuttering in game
  4. Check shadercache folder in SteamLibrary
  5. Notice two separate directories for shader cache
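To make steps 4 and 5 easier to check, here is a minimal sketch that lists every subdirectory of a GLCache folder with its file count and total size. The function name `summarize_cache_dirs` is hypothetical, and the path in the usage note is only an assumption about where a Steam library keeps the cache; adjust it to your own setup.

```python
from pathlib import Path


def summarize_cache_dirs(glcache_root):
    """Return (name, file_count, total_bytes) for each subdirectory of glcache_root."""
    rows = []
    for sub in sorted(Path(glcache_root).iterdir()):
        if not sub.is_dir():
            continue
        # Walk the whole tree below this cache directory.
        files = [f for f in sub.rglob("*") if f.is_file()]
        total = sum(f.stat().st_size for f in files)
        rows.append((sub.name, len(files), total))
    return rows


def format_rows(rows):
    """Render the rows as aligned text lines, sizes in MiB."""
    return [f"{name}  {count:>7} files  {size / 2**20:10.1f} MiB"
            for name, count, size in rows]
```

A call such as `print("\n".join(format_rows(summarize_cache_dirs(path))))`, with `path` pointing at something like `<SteamLibrary>/steamapps/shadercache/238010/nvidiav1/GLCache` (an assumed layout), should show one large directory from the pre-compile and a second, smaller one that the game fills while running if the problem is present.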

kakra commented 1 year ago

This could explain my following observation:

On my desktop with NVIDIA, running Horizon Zero Dawn pre-compiles shader caches on the main menu screen every time I start the game. OTOH, on the Steam Deck, it didn't do that and generally also loads faster.

This does seem to affect only some games, tho.

Also, I'm seeing stutter in some games lately. I've been playing AC Valhalla and Elite Dangerous a lot, and both stutter, although at least Elite Dangerous didn't stutter before. I've noticed this since going from NVIDIA 525 to 535 (Valhalla doesn't run properly on 525, so I don't know if it would have stuttered before).

HansKristian-Work commented 1 year ago

NV 535 is known to have broken shader caching at least. Apparently a fix is on the way. But overall, Fossilize does not control where shader caches go. That's Steam controlling things. Wonder if some env-var controlling shader cache locations just broke?

jkrhu commented 1 year ago

I was on 535 for some time, but had to revert to 530 due to screen tearing issues with Vsync. Overall this issue happens on both.

To me it looks like Fossilize is sending data to the shader compiler, which treats it as a separate thing from the game itself. I would honestly have to keep playing for a few days and see if Steam gathers all the shader hits and sends them back to me a few days later. Then I can delete all previous cache from the disk, run pre-caching again and see if it replays to both the old and new GLCache locations. I could also try copying those binaries over to the new location and see if that works.

I don't think there should be any distinction between replaying on Nvidia and AMD. It's supposed to be the same package of SPIR-V code for both, right? And I guess it's supposed to "trick" the compiler into thinking that I'm running that particular title? Well, for some reason that stopped working on Nvidia. Steam might need to merge absolutely everything under the GLCache folder into one thing or, like I said earlier, copy the older directory into the new directory. But then, how would it know which directory is correct if I haven't launched the game first? It's probably something Nvidia has done that is breaking things.

HansKristian-Work commented 1 year ago

I assume something has gone very wrong, but Fossilize in general has no control over where disk shader caches go. It just replays the pipelines it's given, and it's up to the caller to ensure that the drivers place caches in the right places, usually through various environment variables. Steam is also responsible for merging NV shader caches after the replay. I'll try to reproduce.

HansKristian-Work commented 1 year ago

@kisak-valve I think this belongs in steam-for-linux repo if anything.

jkrhu commented 1 year ago

According to the shader log, Steam doesn't actually request any shaders from the pool 8b59f9b4a84a1bd85f572e8a5406b4a1 that is actually being used by the game.

[2023-07-08 01:17:18] Reading 4278 hit entries from cache file /home/jakubkrych/.local/share/Steam/userdata/81766792/config/shaderhitcache/20469eb62929b8f1/8b59f9b4a84a1bd85f572e8a5406b4a1/238010.
[2023-07-08 01:17:18] 20469eb62929b8f1 / L2:8b59f9b4a84a1bd85f572e8a5406b4a1: 4933 shaders.
[2023-07-08 01:17:18] Done; 4278 already registered, 655 succeeded, 0 failed, 0 busy.
[2023-07-08 01:17:18] Server requested 0 shaders; uploading is enabled.

It only seems to care about de58603f3345dbf900e391978751ae76, and all pre-compiled shaders get stored there, even though that directory is unused by the game itself.

[2023-07-08 01:17:18] Reading 74607 hit entries from cache file /home/jakubkrych/.local/share/Steam/userdata/81766792/config/shaderhitcache/856bf594b3320fd7/de58603f3345dbf900e391978751ae76/238010.
[2023-07-08 01:17:18] 856bf594b3320fd7 / L2:de58603f3345dbf900e391978751ae76: 74607 shaders.
[2023-07-08 01:17:18] Reading 13958 hit entries from cache file /home/jakubkrych/.local/share/Steam/userdata/81766792/config/shaderhitcache/SteamSwarm/VulkanPipelinesV6_904f69d2b1b44b65/238010.
[2023-07-08 01:17:18] SteamSwarm / G7:VulkanPipelinesV6_904f69d2b1b44b65: 13985 shaders.
[2023-07-08 01:17:18] Done; 13958 already registered, 27 succeeded, 0 failed, 0 busy.
[2023-07-08 01:17:18] Server requested 27 shaders; uploading is enabled.

So I assume all new pipelines from either location captured by Fossilize get cached to the unused location.

Uploaded the shader log if you want to investigate: shader_log.previous.txt shader_log.txt
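For anyone who wants to compare their own shader log against the excerpts above, here is a rough sketch of a parser. It assumes the log keeps exactly the line shapes quoted in this comment (`<bucket> / <cache>: N shaders.` and `Server requested N shaders`); the function name `group_requests` and the grouping rule are my own, not anything Steam documents. Because one "Server requested" line can follow several cache lines, the sketch reports them as a group rather than guessing which cache the request belongs to.

```python
import re

# Matches e.g. "[ts] 20469eb62929b8f1 / L2:8b59...: 4933 shaders."
CACHE_RE = re.compile(
    r"^\[[^\]]+\] (?P<bucket>\S+) / (?P<cache>\S+): (?P<count>\d+) shaders\.$"
)
# Matches e.g. "[ts] Server requested 27 shaders; uploading is enabled."
REQUEST_RE = re.compile(r"^\[[^\]]+\] Server requested (?P<n>\d+) shaders")


def group_requests(lines):
    """Group cache lines with the next 'Server requested' line that follows them."""
    groups, caches = [], []
    for raw in lines:
        line = raw.strip()
        m = CACHE_RE.match(line)
        if m:
            caches.append((m["cache"], int(m["count"])))
            continue
        m = REQUEST_RE.match(line)
        if m:
            groups.append({"caches": caches, "requested": int(m["n"])})
            caches = []
    return groups


SAMPLE = [
    "[2023-07-08 01:17:18] 20469eb62929b8f1 / L2:8b59f9b4a84a1bd85f572e8a5406b4a1: 4933 shaders.",
    "[2023-07-08 01:17:18] Done; 4278 already registered, 655 succeeded, 0 failed, 0 busy.",
    "[2023-07-08 01:17:18] Server requested 0 shaders; uploading is enabled.",
]

for group in group_requests(SAMPLE):
    print(group)
```

Feeding it the full `shader_log.txt` (for example via `open(path).readlines()`) gives a quick overview of which cache directories the server ever requests shaders from.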

kisak-valve commented 11 months ago

When it becomes available via your distro's normal update process, please retest with the NVIDIA 535.86.05 driver release.

jkrhu commented 11 months ago

Problem still occurs on the latest 535.86.05 driver.

While playing Deus Ex: Human Revolution, Steam created a directory under GLCache named 96c16aaad04bf94d0fca459bd20dd5b6 and all pre-compiled shader cache gets stored there. Even after removing the whole shadercache folder, an empty folder with that name will be created. The driver doesn't write to or read from the directory created by Steam while the game process is running. Only pre-cached binaries get stored there. While playing, the game relies on its own, separate cache directory.

The driver stores its own cache in a different directory under GLCache, named 019af8671d44d9a56d03da874f7d27eb. Steam doesn't request any shaders from that directory.

jkrhu commented 11 months ago

Retested on 535.98 and the problem is still present. It basically means there is no point in pre-caching on Nvidia. It's not being utilized at all.

jkrhu commented 10 months ago

Issue is still present on 535.104.05. The curious thing is that some games work correctly, while some don't.

Deus Ex: Human Revolution still has two GLCache directories. Even when I remove all shader cache, Steam will create a directory in there. Which stays empty, cause the game uses a different one.

I've started playing Resident Evil 8 recently, and curiously it does make use of precached shaders. There is only one directory under GLCache and it does work there.

kakra commented 10 months ago

> Deus Ex: Human Revolution still has two GLCache directories. Even when I remove all shader cache, Steam will create a directory in there. Which stays empty, cause the game uses a different one.

Maybe this is created by launchers which have their own EXE file? And Steam just handles both? Does the other directory update file time stamps while playing the game?
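One way to answer the timestamp question is to snapshot the newest modification time in each GLCache subdirectory before launching the game and again afterwards. The sketch below does that; `newest_mtimes` and `changed_dirs` are hypothetical helper names, and the directory you pass in is whatever GLCache folder you are inspecting.

```python
from pathlib import Path


def newest_mtimes(root):
    """Map each subdirectory name to the newest file mtime below it (0.0 if empty)."""
    snapshot = {}
    for sub in Path(root).iterdir():
        if sub.is_dir():
            times = [f.stat().st_mtime for f in sub.rglob("*") if f.is_file()]
            snapshot[sub.name] = max(times, default=0.0)
    return snapshot


def changed_dirs(before, after):
    """Names whose newest mtime advanced, plus directories that are new in 'after'."""
    # A directory missing from 'before' compares against -1.0, so even a
    # newly created empty directory (mtime 0.0) is reported.
    return sorted(name for name, t in after.items() if t > before.get(name, -1.0))
```

Take `before = newest_mtimes(path)`, play for a few minutes, take `after = newest_mtimes(path)`, and `changed_dirs(before, after)` lists the directories the driver actually wrote to during the session.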

jkrhu commented 10 months ago

> > Deus Ex: Human Revolution still has two GLCache directories. Even when I remove all shader cache, Steam will create a directory in there. Which stays empty, cause the game uses a different one.
>
> Maybe this is created by launchers which have their own EXE file? And Steam just handles both? Does the other directory update file time stamps while playing the game?

I've had this problem recently in many DXVK games. FF 13, WD2, RE5, DX:HR. Some of these don't have launchers, some only launch them the first time.

From the shader log, I can tell that all game generated shaders are stored in a separate directory from the pre-cached ones.

Basically, it registers all new shaders for the shader hit cache from the new directory, and the server doesn't care about or want any of those. It only cares about the directory created by Steam, which only ever registers new shaders when I have received a new Fossilize replay patch.

I have a huge pile of shaders and then a 100-200 MB shader cache that I have created myself. I could tell immediately that RE8 ran like butter compared to the DX11 games, which were all supposed to be pre-cached. And then it stuttered heavily when I removed all cache.

I will need to try more VKD3D-Proton games like RE8 to see if they all behave correctly. It might be some issue with DXVK if that's the case.

jkrhu commented 10 months ago

I've tried Death Stranding: DC and it also behaves correctly, just like RE8. BF V also seems to work correctly, although it does have two directories, as the game runs in either DX11 or DX12. Would be interesting if it was pre-caching correctly only with DX12 on Nvidia.

I have a hunch the problem could actually be DXVK related. All the pre-caching problems seem to happen on DX9-DX11 games. Could we maybe get someone from the DXVK team to investigate? @kisak-valve

jkrhu commented 7 months ago

Hi! Issue still persists on Nvidia driver 545.29.06. Tested on Call of Duty 2 this time. This game doesn't have a launcher or anything like that. I saw there was some work on the caching system judging by changes in the shader log, but it still doesn't work correctly for Nvidia.

Two directories are still being created. The first one:

[2023-11-26 18:19:17] Creating shader local hit cache directory: /home/jkrhu/.local/share/Steam/userdata/81766792/config/shaderhitcache/d8b7e2a1b76b53ed/51878d230f208e342e5e6cc01b705a74
[2023-11-26 18:19:17] d8b7e2a1b76b53ed / L2:51878d230f208e342e5e6cc01b705a74: 5428 shaders.
[2023-11-26 18:19:23] Done; 0 already registered, 5428 succeeded, 0 failed, 0 busy.
[2023-11-26 18:19:23] Server requested 0 shaders; uploading is enabled.

51878d230f208e342e5e6cc01b705a74 - This one stores all precached shader binaries.

The second one:

[2023-11-26 18:19:23] Creating shader local hit cache directory: /home/jkrhu/.local/share/Steam/userdata/81766792/config/shaderhitcache/92758bde0dfe00a2/7cc7eebc4b8597e84ad240a5a4a56e18
[2023-11-26 18:19:23] 92758bde0dfe00a2 / L2:7cc7eebc4b8597e84ad240a5a4a56e18: 399 shaders.
[2023-11-26 18:19:23] Done; 0 already registered, 399 succeeded, 0 failed, 0 busy.
[2023-11-26 18:19:23] Server requested 0 shaders; uploading is enabled.

7cc7eebc4b8597e84ad240a5a4a56e18 - This one is created when I play the game. This is the only directory the game uses for caching shader binaries.

Pre-caching DXVK games on Nvidia does nothing to reduce in-game stuttering. VKD3D-Proton games pre-cache correctly.

If I delete all shader cache files and start the game, there will be an empty additional directory that only gets used when pre-caching.

I really hope that, after all this time, we can come to some resolution. Perhaps it would be wise to just disable pre-caching for DXVK games running on Nvidia, or at least let people disable pre-caching per title.

[Image: Two directories]

DasLeo commented 2 months ago

Had the same issue last week on a fresh Arch Linux + nvidia-dkms install.

After compiling and downloading everything in Steam, the Nvidia driver just ignores the files and creates its own GLCache in $HOME/.nv. As a result, Apex Legends stuttered like hell.

The second issue was the size limit of the GLCache in the Nvidia driver, so the driver was recreating the cache after each game start, but that's a driver thing and can be ignored here.

I was not able to figure out how to get the Nvidia driver or DXVK to use the Steam Shader cache files, so I ended up disabling all the pre-caching in Steam and set __GL_SHADER_DISK_CACHE_SKIP_CLEANUP=1 as a global ENV variable.
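As a rough illustration of that workaround applied per process instead of globally, the sketch below builds an environment with `__GL_SHADER_DISK_CACHE_SKIP_CLEANUP=1` (the variable named above) and could be passed to `subprocess.run`. The helper name `nvidia_cache_env` is hypothetical, and this is not how Steam or Proton set things up internally.

```python
import os


def nvidia_cache_env(base_env=None, skip_cleanup=True):
    """Return a copy of the environment with the NVIDIA cache cleanup disabled."""
    # Copy so the caller's mapping (or os.environ) is never modified in place.
    env = dict(os.environ if base_env is None else base_env)
    if skip_cleanup:
        env["__GL_SHADER_DISK_CACHE_SKIP_CLEANUP"] = "1"
    return env
```

For example, `subprocess.run(command, env=nvidia_cache_env())` starts `command` (any launcher of your choice) with the variable set. A truly global setting would instead go into your shell profile or session environment, and in Steam the equivalent per-game form is a launch option of the shape `__GL_SHADER_DISK_CACHE_SKIP_CLEANUP=1 %command%`.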

I would, however, really like to use the Steam shader cache as well, or even exclusively.

kakra commented 2 months ago

I think there may be a misconception:

There's the shader pre-caching: It records shaders to be compiled and stores those pipelines in a cache. Fossilize will then re-run those cached pipelines from time to time, forcing the GPU driver to write the actual cache - which defaults to ~/.nv on NVIDIA and is quite limited in size, easily pushing out compiled shaders even from the very game you're running.

~/.nv will always exist because your desktop caches shaders there (and other non-Proton software).

But here is what may be going wrong for you: Proton should relocate the NV shader cache to a game-specific folder, the same folder where DXVK stores its pipeline cache and Fossilize stores its pipeline and media cache. In that same structure, you'll find nvidiav1: if it exists and gets updated, that is where your GPU driver stores shaders from the game.

The fossilize data is crowd-sourced, so you can get shader pipelines in advance before encountering them in the game. Fossilize still needs to compile them on your driver, which is what becomes stored in the driver-specific folder. DXVK more or less does what fossilize does: Caching encountered shader pipelines, and pre-processing them while the game starts. The result also goes into the driver-specific folder.

It should look something like this, for example Elite Dangerous:

# ls -al 359320/
insgesamt 8
drwxr-xr-x   9 kakra kakra  162 22. Apr 04:00 ./
drwxr-xr-x 103 kakra kakra 4096  8. Mär 05:43 ../
drwxr-xr-x   2 kakra kakra   41 23. Sep 2023  DXVK_state_cache/
drwxr-xr-x   2 kakra kakra    6 31. Aug 2020  fozmediav1/
drwxr-xr-x   4 kakra kakra  182 22. Apr 04:00 fozpipelinesv6/
drwx------   3 kakra kakra   54 20. Apr 01:54 mesa_shader_cache_sf/
drwxr-xr-x   3 kakra kakra   21 20. Apr 01:54 nvidiav1/
drwxr-xr-x   2 kakra kakra    6 26. Jan 2019  pipeline_cache/
drwxr-xr-x   2 kakra kakra    6 23. Sep 2023  steam_shader_cache/

So, NVIDIA will always create its own cache but it should do so in the game-specific folder if running through Proton. Steam's shader pre-caching doesn't download compiled shaders, it just downloads "compile instructions" (if you want to call it that way) collected from games and crowd-sourced, so YOUR system can compile it in advance before starting the game (which the fossilize background service does).

If you just updated your drivers, you will clearly see stutter because shaders have to be recompiled in the background (this is driver and GPU specific). If you give Fossilize some time to do its job after updating the driver, the stutters should be gone (unless Proton doesn't properly set up the folders and the env variables that point to them).

The problem described in this report is rather that shader pre-compiling creates a different folder in nvidiav1 than what is used by the game - resulting in stutter.

jkrhu commented 2 months ago

Yes, this looks like a different issue to mine.

The Steam overrides for Nvidia caches work correctly for me. The problem is that the Fossilize cache from some games (mostly DX9-11, but not all of them) is compiled to a separate GLCache directory within nvidiav1. Then, when I launch the title, the NV shader compiler creates another GLCache directory tree and stores the shader binaries in there. That's why it's not always beneficial to wait until the Fossilize replay is done. Some of the caches are unnecessarily huge and take a long time to compute. The hit caching / Steam Swarm system will gather my caches and use them for the benefit of others, but even my own caches will not go to the directory used by the game.

I'm still unsure if the problem is Nvidia or DXVK or maybe Steam not overriding it correctly. The DX12 titles I've tested do replay to the correct directories at least. It's not that big of a deal, since Graphics Pipeline Libraries work great most of the time. This issue persists through different Fedora releases and fresh installs, and on Pop!_OS as well. I assume someone took a look at it at some point, but the problem could just be Nvidia shader compiler weirdness.

Going back to your issue: I've never encountered it myself. It looks like Steam doesn't have the permissions to override those directories. Also, afaik the ~/.nv directory hasn't been the default for some time. It's always empty. All of my system caches are stored in ~/.cache/nvidia/GLCache, but that could just as well be distro specific.

kakra commented 2 months ago

> I'm still unsure if the problem is Nvidia or DXVK or maybe Steam not overriding it correctly.

The directory I posted above is from Elite Dangerous, a DX11 title using DXVK. I'm seeing just one folder inside nvidiav1/GLCache, so it is building just one cache folder.

Maybe this happens if Steam itself runs inside some container like flatpak? Do you use a flatpak version or something similar? Or do you have a second GPU?

Whatever I described in my previous comment https://github.com/ValveSoftware/steam-for-linux/issues/9803#issuecomment-1623789989 is gone, except for some games still recompiling shaders on each start. But stutters are gone, no "duplicate" GLCache directories (and I think, I never had them).

So this may be a specific problem the NVIDIA driver has with some environments. Maybe we should look into what the difference is between those systems and how they (and Steam) are set up.

jkrhu commented 2 months ago

I've had this happen in FF 13, WD2, RE5, DX:HR, CoD2 at least. Haven't tried Elite Dangerous.

I know there was a problem with some Nvidia driver that didn't store caches correctly, but this isn't that unfortunately.

I'm currently on Fedora 40 and the latest 550.76 driver. I do not have a second GPU. Steam is installed as RPM from rpm-fusion. I will try to compare with flatpak Steam and will let you know.

kakra commented 2 months ago

Okay, I can try RE2, RE3 (similar engine?) and DX:HR (might be in my library), also HZD (which still compiles shaders on the loading screen each time I run it), then inspect the shader directories before and after.

I'm running Gentoo, installed in the "use Steam runtime" variant. It's not running in a container. Driver is NVIDIA 550.76 on kernel 6.6.28.

I've never run the flatpak version (and won't) but I could imagine that those interfere with the GLCache signature hash. Also, does your launcher maybe set some LD_PRELOADs to redirect libraries? I've seen some reports where a lot of those have been set and it caused problems with the graphics driver. My Gentoo Steam install doesn't do that.

jkrhu commented 2 months ago

Unfortunately, the behaviour of a system RPM installation and of flatpak is exactly the same. I've tried CoD2 and it's exactly as I've described: one GLCache subfolder for replayed binaries (~35 MB), and another created immediately after launching the game. I've noticed that both single player and multiplayer are stored together under that one directory. Binaries build up over time in that place.

Another thing is that after deleting the whole shadercache folder, both subfolders will still be created upon launching the game: one completely empty (since there was no Fossilize replay) and the other used by the game.

The only LD_PRELOAD is Steam's gameoverlayrenderer. Honestly, I'm not sure what to do about it. Maybe Steam assigns the directory based on something that is no longer correct, or it's some Nvidia shader compiler library thing. No idea.

Coldblackice commented 1 month ago

> I know there was a problem with some Nvidia driver that didn't store caches correctly, but this isn't that unfortunately.

@jkrhu Do you have a link or reference to where you saw this? Not disputing, just curious as I'm trying to get to the bottom of similar issues.

jkrhu commented 1 month ago

> > I know there was a problem with some Nvidia driver that didn't store caches correctly, but this isn't that unfortunately.
>
> @jkrhu Do you have a link or reference to where you saw this? Not disputing, just curious as I'm trying to get to the bottom of similar issues.

The 535.54.03 driver was the problematic one. Games with bundled PSO caches would recompile every time. It got fixed with 535.86.05. If you search for the driver number together with "shader cache", there are many mentions that it was completely busted.

https://github.com/ValveSoftware/steam-for-linux/issues/9748