ValveSoftware / Fossilize

A serialization format for various persistent Vulkan object types.
MIT License

fossilize_replay eats all RAM when background processing shaders #210

Open tdaven opened 1 year ago

tdaven commented 1 year ago

This seems to happen in multiple games. Latest issue occurs with "Red Dead Redemption 2" as well as "The Elder Scrolls V: Skyrim Special Edition".

In both games, shader processing cannot complete without eating all system memory.

System Details:

Steam was installed from the rpmfusion.org repo.

The issue looks very similar to #194 , #198 or #84.

Something seems to prevent it from limiting its memory use. If I watch memory use and repeatedly enable/disable shader processing, I can nurse it through, since it does make progress; it just doesn't complete before it uses all memory.

Happy to help troubleshoot with direction. I haven't found anything particularly helpful in ~/.local/share/Steam/logs/shader_log.txt. I'm also not sure how to manually run fossilize_replay to troubleshoot further.

sampie commented 1 year ago

I have the same issue. I am running Ubuntu 22.10.

marcosbitetti commented 1 year ago

Same issue here. Pop!_OS 22.04 LTS, NVIDIA GTX 1650 OC 4GB, NVIDIA driver version 525.60.11.

tdaven commented 1 year ago

This is an example of what I see happen with memory usage. It just happened again after updating Steam. There were 6 fossilize_replay processes, each consuming over 9 GB of RAM. Swap had been disabled in the hope that the fossilize processes would just get OOM-killed, but that didn't happen.

(screenshot: system memory usage graph)

The big spike drops off because the fossilize processes were killed manually.

kakra commented 1 year ago

@tdaven Does your kernel have /proc/pressure/memory? Does this happen while fossilize is background processing the shaders (Steam is idle) or foreground processing shaders (starting a game with the Vulkan shaders dialog running)?

tdaven commented 1 year ago

> @tdaven Does your kernel have /proc/pressure/memory? Does this happen while fossilize is background processing the shaders (Steam is idle) or foreground processing shaders (starting a game with the Vulkan shaders dialog running)?

@kakra Yes. Standard Fedora 37 kernel which has /proc/pressure/memory.

For example:

[tdaven@desktop ~]$ cat /proc/pressure/memory 
some avg10=0.00 avg60=0.00 avg300=0.00 total=29518146
full avg10=0.00 avg60=0.00 avg300=0.00 total=28791057
[tdaven@desktop ~]$ 

It typically happens during background processing, usually triggered after an update for the game is installed and fossilize kicks in.

kakra commented 1 year ago

If you watch cat /proc/pressure/{memory,io}, do these values rise before the problem hits? If so, fossilize should throttle itself down by putting some threads into the "stopped" state... I had a lot of similar problems before PSI support was introduced, and none since then; actually, I'm the one who inspired introducing PSI support into fossilize. So I wonder what's different about your system. Also, the high memory usage should have been fixed a long time ago by properly sharing memory between processes.
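
For reference, a minimal way to keep an eye on these values while fossilize_replay is running could look like this (the 1-second interval is just an example; the paths exist on any kernel with PSI enabled):

# Refresh memory and IO pressure once per second
watch -n 1 cat /proc/pressure/memory /proc/pressure/io
# Rising avg10/avg60 values on the "some" or "full" lines mean tasks are
# currently stalling on memory or IO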

Does your kernel have transparent hugepages force-enabled? If cat /sys/kernel/mm/transparent_hugepage/enabled says always, then try setting it to madvise or never. It could explain why memory usage kind of "explodes".
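
For anyone who wants to try that, a small sketch of checking and temporarily switching the THP policy (the sysfs path is standard; the change does not survive a reboot):

# Show the current policy; the active value is the one in brackets
cat /sys/kernel/mm/transparent_hugepage/enabled
# Temporarily switch to madvise (reverts on reboot)
echo madvise | sudo tee /sys/kernel/mm/transparent_hugepage/enabled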

tdaven commented 1 year ago

Transparent huge pages:

[tdaven@desktop ~]$ cat /sys/kernel/mm/transparent_hugepage/enabled
always [madvise] never

I see /proc/pressure/io change, but memory always seems to be zero apart from the total. This didn't happen on my older computer; I only started having this problem on this newer system, which has more cores and memory.

Noctis-Bennington commented 1 year ago

I'm having the same problem. Image here

OS: Ubuntu 22.04, CPU: AMD Ryzen 7 4000, GPU: Radeon RX 5600M

kakra commented 1 year ago

> I'm having the same problem.

The memory usage is shared between processes. You need to look at the PSS size, or consider shared memory usage, too.
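
As a rough illustration, per-process PSS can be read from /proc (smaps_rollup is assumed to be available, which it is on kernels 4.14 and newer):

# Print the proportional set size (PSS) of each fossilize_replay process;
# shared pages are divided between the processes that map them, so this is
# a fairer per-process number than plain RSS
for pid in $(pgrep -f fossilize_replay); do
    echo -n "$pid: "; grep '^Pss:' /proc/$pid/smaps_rollup
done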

Noctis-Bennington commented 1 year ago

> I'm having the same problem.

> The memory usage is shared between processes. You need to look at the PSS size, or consider shared memory usage, too.

Sorry, I'm lost with your answer. There are four processes sharing memory as far as I can see, and those processes don't stop consuming memory until they hit OOM (this happens within seconds).

cachandlerdev commented 1 year ago

I have noticed this issue appearing on Fedora in both Skyrim and No Man's Sky, where Steam rapidly starts eating my 32 GB of RAM while processing Vulkan shaders and will happily lock up my PC unless I manually kill the process in time.

CPU: AMD Ryzen 7 7700X, GPU: NVIDIA RTX 2070 Super

Edit: The memory bug also occurs while processing Battlefield 1's shaders.

Edit 2: This happens regardless of whether foreground (launching the game) or background processing is happening.

Noctis-Bennington commented 1 year ago

What I've seen is that it only happens with a few games, but big ones. Left 4 Dead 2 is one of them.

WildPenquin commented 1 year ago

This has recently started happening to me, to the point that I cannot leave Steam running. It will eat up to 30 GiB of RAM, which makes the computer nearly unusable.

It always starts many, many fossilize_replay threads (but the thread count is not as much of a problem as the RAM usage). AppID 346110 (Ark: Survival Evolved) (Proton issue https://github.com/ValveSoftware/Proton/issues/3218) seems to trigger this problem more often than others, though I've seen huge RAM usage for other AppIDs, too. Ark: SE triggers it so often that I've considered uninstalling it.

nwestervelt commented 1 year ago

This has started happening to me on Arch Linux with Deep Rock Galactic recently. It runs out of memory whether the shaders are processed in the background or in the foreground before the game launches.

If I disable shader precaching, it happens while the game is running (I assume from the shaders compiling while the game is open).

Current mesa version: 1:23.1.6-4

kisak-valve commented 1 year ago

Hello @WetWayne, mesa 23.1.6 (specifically) has a known memory leak which should hopefully be fixed in the next point release.

Reference: https://gitlab.freedesktop.org/mesa/mesa/-/issues/9599

nwestervelt commented 1 year ago

@kisak-valve In that case, disregard my comment.

Hubro commented 1 year ago

This has been happening to me for the last few weeks; it seems to have started after I installed Armored Core 6.

A bunch of fossilize_replay processes will be running and consuming a ton of RAM, and sometimes they suddenly consume everything I have (64 GB) and lock up my PC for a few seconds before the Steam process is killed. Sometimes this even happens in a loop, where every 10-20 seconds or so, my PC will freeze for a few seconds and Steam will get killed and automatically restart, until I kill Steam myself.

It happens right after I start Steam, as well as randomly in the background while I'm doing other things. Strangely, it didn't stop happening after I uninstalled Armored Core.

System Information: https://gist.github.com/Hubro/10e2be14104aecb0f0e42f4ec9fe4c82

I'm also running mesa 23.1.6, so hopefully it will resolve itself soon.

Danternas commented 11 months ago

Same issue. Restarting Steam seems to solve the issue better than killing the process. It seems to be stuck in some kind of loop, running out of memory and then restarting. Even with 32 GB of RAM, it fills up in seconds.

kakra commented 11 months ago

Lately I've found that KDE's baloo may fight with fossilize over resources. Are you running KDE with baloo? Try stopping it and see if that helps.
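
For anyone who wants to test that, a quick sketch using KDE's standard balooctl tool (on Plasma 6 the command may be called balooctl6 instead):

# Check whether baloo is currently indexing
balooctl status
# Pause indexing for this session, or turn it off entirely
balooctl suspend
balooctl disable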

CaptaiNiveau commented 7 months ago

I just ran into this. I switched my desktop to Arch Linux on zfs; before that, I didn't notice an issue like this. If I kill Rocket League and restart it without waiting long, the shader compilation eats up all my RAM and kills most of my open applications.

Not sure if this is related to zfs, but it's the only major thing that changed compared to my past installs.

kakra commented 7 months ago

I think this is because fossilize uses more threads when it works in the foreground, that is, when you start the game and Steam shows the fossilize progress dialog. If you have a spare partition with another filesystem, try moving the shader caches there: I have a spare SSD formatted with xfs, moved $HOME/.steam/steam/steamapps/shadercache there, and then created a symlink from the old location to the new one (a sketch of this is below). This also takes some filesystem lock contention away from the game library while the game reads and writes shader caches and reads game data at the same time (I'm using btrfs for my rootfs).
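
A minimal sketch of that relocation, assuming the spare filesystem is mounted at /mnt/fast-ssd and the default Steam library path (stop Steam before moving anything):

# Move the shader cache to the spare filesystem and leave a symlink behind
mv $HOME/.steam/steam/steamapps/shadercache /mnt/fast-ssd/shadercache
ln -s /mnt/fast-ssd/shadercache $HOME/.steam/steam/steamapps/shadercache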

If you don't have a spare partition, it may help to create an additional zvol dedicated to the shadercache, so IO operations are split off to a dedicated volume. In my case, for example, I split IO operations in btrfs across dedicated subvolumes, which lets the system run into fewer lock contentions during heavy IO load.
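
As one possible sketch on zfs, using a plain dataset instead of a zvol for simplicity (the pool name tank, the dataset name, and the 16K recordsize are all illustrative assumptions, not values from this thread; run with appropriate privileges):

# Create a dedicated dataset with a smaller recordsize to better match
# fossilize's small random writes
zfs create -o recordsize=16K tank/steam-shadercache
# Mount it over the shader cache location (move the existing contents first)
zfs set mountpoint=$HOME/.steam/steam/steamapps/shadercache tank/steam-shadercache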

Also, check whether /proc/pressure/io exists: if it does, fossilize will use it to reduce cache thrashing when IO latency spikes, so the system does not start to stall. If it doesn't exist, you may need to enable it at boot time; check your distribution docs for how to do that (the feature is called PSI: pressure stall information).
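
To illustrate, checking for PSI and enabling it at boot could look roughly like this (the psi=1 parameter only matters on kernels built with CONFIG_PSI_DEFAULT_DISABLED; the grub paths and command name vary by distribution):

# PSI is available if these files exist
ls /proc/pressure/
# If they are missing, add psi=1 to the kernel command line, e.g. via
# GRUB_CMDLINE_LINUX_DEFAULT in /etc/default/grub, then regenerate the config
# (grub2-mkconfig on Fedora/openSUSE)
sudo grub-mkconfig -o /boot/grub/grub.cfg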

Fossilize creates a lot of random reads and writes due to how it uses its files (memory mapped and shared memory). zfs and btrfs are particularly bad at those patterns because they are copy-on-write filesystems. fossilize at least optimizes the reads by pre-caching everything it needs sequentially, but the writes are still pretty non-linear, and that's a very bad pattern for copy-on-write extents. Because those writes are small, random, and slow, dirty data piles up in the page cache, leading to high memory usage. Accelerating those writes should help (e.g., by using a dedicated non-CoW filesystem, or other write accelerators like bcache on btrfs with allocation hint patches, or an slog on zfs). Note that zfs itself needs a lot of memory to be fast: a 32 GB desktop system may be far too small to use zfs effectively under heavy random write loads. Also note that because the writes are small and random, SSDs really can't work magic here; they are slow at those IO patterns (still faster than spinning disks, but slow, with high latency).
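
As a small illustration of the dirty-data buildup, and of the non-CoW workaround on btrfs (chattr +C only affects files created after the flag is set, so it works best on a freshly emptied cache directory):

# Watch dirty and writeback pages grow while fossilize_replay is busy
watch -n 1 grep -e Dirty -e Writeback /proc/meminfo
# On btrfs, mark the shader cache directory NOCOW so new files in it skip
# copy-on-write; existing files keep their old behaviour
chattr +C $HOME/.steam/steam/steamapps/shadercache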