ValveSoftware / Fossilize

A serialization format for various persistent Vulkan object types.
MIT License

Launching Steam immediately caused `fossilize_replay` to consume all available memory #230

Closed: rhoot closed this issue 1 year ago

rhoot commented 1 year ago

Not sure whether to treat this as a Steam or fossilize issue, but trying here first:

When I launched Steam today, my system got very laggy and unresponsive. Eventually it froze completely (video, audio, everything) for a few seconds until oom-killer kicked in. So I closed Steam and opened it again while keeping an eye on RAM.

I have 32 GiB of RAM plus 8 GiB of swap. Before opening Steam 29 GiB of RAM was available. Within 5 seconds fossilize_replay had consumed all of it, as well as the swap. oom-killer then kicked in and killed some processes, a few seconds later it was at 100% usage again, repeat for a couple of minutes.
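For anyone trying to reproduce this, one way to watch the replay workers' memory use live is a small sketch like the one below (rss/vsz are reported in KiB by procps `ps`, and the 1-second interval is arbitrary):

# Poll all fossilize_replay processes once per second, largest RSS first.
watch -n1 'ps -C fossilize_replay -o pid,rss,vsz,args --sort=-rss'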

System information

Computer Information:
    Manufacturer: EVGA Corp.
    Model: X570 FTW WIFI
    Form Factor: Desktop
    No Touch Input Detected

Processor Information:
    CPU Vendor: AuthenticAMD
    CPU Brand: AMD Ryzen 9 5950X 16-Core Processor
    CPU Family: 0x19
    CPU Model: 0x21
    CPU Stepping: 0x0
    CPU Type: 0x0
    Speed: 5083 MHz
    32 logical processors
    16 physical processors
    Hyper-threading: Supported
    FCMOV: Supported
    SSE2: Supported
    SSE3: Supported
    SSSE3: Supported
    SSE4a: Supported
    SSE41: Supported
    SSE42: Supported
    AES: Supported
    AVX: Supported
    AVX2: Supported
    AVX512F: Unsupported
    AVX512PF: Unsupported
    AVX512ER: Unsupported
    AVX512CD: Unsupported
    AVX512VNNI: Unsupported
    SHA: Supported
    CMPXCHG16B: Supported
    LAHF/SAHF: Supported
    PrefetchW: Unsupported

Operating System Version:
    "EndeavourOS Linux" (64 bit)
    Kernel Name: Linux
    Kernel Version: 6.4.12-zen1-1-zen
    X Server Vendor: The X.Org Foundation
    X Server Release: 12302000
    X Window Manager: KWin
    Steam Runtime Version: steam-runtime_0.20230606.51628

Video Card:
    Driver: AMD AMD Radeon RX 6900 XT (navi21, LLVM 15.0.7, DRM 3.52, 6.4.12-zen1-1-zen)
    Driver Version: 4.6 (Compatibility Profile) Mesa 23.1.6
    OpenGL Version: 4.6
    Desktop Color Depth: 24 bits per pixel
    Monitor Refresh Rate: 174 Hz
    VendorID: 0x10de
    DeviceID: 0x1e07
    Revision Not Detected
    Number of Monitors: 2
    Number of Logical Video Cards: 2
    Primary Display Resolution: 3440 x 1440
    Desktop Resolution: 3440 x 2520
    Primary Display Size: 31.89" x 13.78" (34.72" diag), 81.0cm x 35.0cm (88.2cm diag)
    Primary VRAM: 16384 MB

Sound card:
    Audio device: USB Mixer

Memory:
    RAM: 32018 Mb

VR Hardware:
    VR Headset: None detected

Miscellaneous:
    UI Language: English
    LANG: en_US.UTF-8
    Total Hard Disk Space Available: 255884 MB
    Largest Free Hard Disk Block: 82750 MB

Storage:
    Number of SSDs: 2
    SSD sizes: 2000G,1000G
    Number of HDDs: 0
    Number of removable drives: 0
kisak-valve commented 1 year ago

Hello @rhoot, can you check if rebuilding mesa with https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/24949 helps (or test mesa 23.1.5)?
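For reference, GitLab serves a merge request as a plain patch when `.patch` is appended to its URL, so a local test of the proposed fix might look like the sketch below (the `mesa-23.1.6` tag and a from-source build are assumptions; rebuilding the distribution's own package with the patch added is usually easier):

# Check out the affected release, apply the MR as a patch series, then build as usual.
git clone --branch mesa-23.1.6 --depth 1 https://gitlab.freedesktop.org/mesa/mesa.git
cd mesa
curl -L https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/24949.patch | git am
# ...then configure and build with meson/ninja, or fold the patch into the distro package build.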

rhoot commented 1 year ago

It's hard to say whether building with that patch applied helped or not. The issue worked itself out after a couple of minutes yesterday, so presumably the only things to compile today were the files for whatever games got updated.

So immediately after launching Steam today I didn't have this issue. But then some game updates got installed and memory use started to rise again. For the most part fossilize seems to sit at a much more reasonable 1 GiB of use, but there were a few spikes that brought it right up to the limit and then dropped again almost immediately. I think the highest I saw was around 30 GiB used out of 31.3 (system-wide, not just fossilize). Similar situation to yesterday: only about 3 GiB was in use before launching Steam.

Next time it happens I'll try to figure out from the fossilize command line which game may have triggered those spikes. I didn't realize the appid is embedded in the path of one of the arguments passed to it until after it had stopped spiking.

Edit: Actually I'll just try wiping some shader caches later.
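In case it helps with narrowing down the culprit: the appid can usually be read straight out of the running command lines, since the replay jobs point at files under Steam's per-app shadercache directory. The sketch below assumes a default `~/.local/share/Steam` library; adjust the path for other library locations.

# List appids referenced by running fossilize_replay workers.
ps -ww -C fossilize_replay -o args= | grep -o 'shadercache/[0-9]*' | sort -u

# Wipe the cached shader data for a single app.
appid=548430   # example value only
rm -rf ~/.local/share/Steam/steamapps/shadercache/"$appid"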

rhoot commented 1 year ago

Okay, so I disabled the shader cache and re-enabled it (in Steam settings). That caused it to start compiling some shaders, and memory usage did rise back up to basically everything I have (on mesa 23.1.6, with the patch from that MR applied). The game it was compiling shaders for was Deep Rock Galactic (appid 548430).

Eventually it dropped back down a bit, but then just... kind of stalled out:

image

You can see the remnants of the memory usage in that screenshot too: the swap is completely full, and 100% of the free memory is now used for cache. As best I can tell, Steam was using the CPU to download cached shaders, so that CPU usage is likely expected.

But the fossilize processes just stayed sleeping like that for several minutes, until I eventually closed Steam. Once I launched it again, I managed to snap this before my system completely froze again for a few seconds, until oom-killer kicked in and killed some processes:

image

Edit: For reference/comparison, this is after closing Steam:

image
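If the workers get stuck sleeping like that again, it may be worth capturing what they are blocked on before closing Steam. A rough sketch (the wchan column needs procps, and reading a kernel stack needs root):

# Show scheduler state and the kernel function each worker is sleeping in.
ps -C fossilize_replay -o pid,stat,wchan:32,args

# Optionally dump the kernel stack of the oldest worker (requires root).
sudo cat /proc/"$(pgrep -o fossilize_replay)"/stack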

rhoot commented 1 year ago

I just tried building/installing mesa 23.1.5. I have had shaders compiling for over an hour now, and no fossilize_replay process has ever gone above 544M in the resident set. So it definitely seems like a regression in 23.1.6.
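For anyone repeating that comparison: the kernel records each process's peak resident set in `VmHWM`, so the high-water mark can be read off directly while the workers are still running. A small sketch, nothing Fossilize-specific:

# Print the peak RSS recorded for every running fossilize_replay worker.
for pid in $(pgrep fossilize_replay); do
    printf '%s: ' "$pid"
    grep VmHWM /proc/"$pid"/status
done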

kakra commented 1 year ago

You can see the remnants of the memory usage in that screenshot too: the swap is completely full, and 100% of the free memory is now used for cache. As best I can tell, Steam was using the CPU to download cached shaders, so that CPU usage is likely expected.

This looks a lot like the memory behavior I'm seeing on 6.x kernels, not only with fossilize but with all sorts of processes. In this case especially, it looks like fossilize does not use shared memory at all, and memory usage is too high by a factor of 10.

See if your kernel has /sys/kernel/mm/transparent_hugepage, and if so, run the following as root after a fresh reboot and before starting Steam:

echo within_size >/sys/kernel/mm/transparent_hugepage/shmem_enabled
echo madvise >/sys/kernel/mm/transparent_hugepage/enabled
echo defer+madvise >/sys/kernel/mm/transparent_hugepage/defrag
echo 64 >/sys/kernel/mm/transparent_hugepage/khugepaged/max_ptes_none
echo 8 >/sys/kernel/mm/transparent_hugepage/khugepaged/max_ptes_swap
echo 32 >/sys/kernel/mm/transparent_hugepage/khugepaged/max_ptes_shared

This should reduce memory pressure, but if you're still seeing high swap pressure, try booting with the kernel cmdline cgroup_disable=memory and then run this test again. How to adjust the kernel cmdline or set the above settings permanently is specific to your distribution.
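As one example of making the sysfs settings above persistent on a systemd-based distribution (a sketch only; the file name is arbitrary, and cgroup_disable=memory still has to be added through the bootloader configuration):

# Re-apply the THP settings at every boot via systemd-tmpfiles
# ('w' entries write the given value to the target path).
sudo tee /etc/tmpfiles.d/thp.conf >/dev/null <<'EOF'
w /sys/kernel/mm/transparent_hugepage/shmem_enabled - - - - within_size
w /sys/kernel/mm/transparent_hugepage/enabled - - - - madvise
w /sys/kernel/mm/transparent_hugepage/defrag - - - - defer+madvise
w /sys/kernel/mm/transparent_hugepage/khugepaged/max_ptes_none - - - - 64
w /sys/kernel/mm/transparent_hugepage/khugepaged/max_ptes_swap - - - - 8
w /sys/kernel/mm/transparent_hugepage/khugepaged/max_ptes_shared - - - - 32
EOF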

So it definitely seems like a regression in 23.1.6.

This is very possible, too.

rhoot commented 1 year ago

This should reduce memory pressure but if you're still seeing high swap pressure, [...]

To be clear, swap didn't start filling up until my physical RAM had been fully consumed.

Downgrading to mesa 23.1.5 without changing anything else about the system also caused shared memory usage to go up. This is what it looks like after the downgrade:

image
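One way to compare shared vs. private memory between the two mesa versions without relying on the task-manager view is the per-process rollup the kernel exposes (available on reasonably recent kernels):

# Break one worker's RSS down into shared and private pages.
pid=$(pgrep -o fossilize_replay)
grep -E 'Rss|Pss|Shared|Private' /proc/"$pid"/smaps_rollup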

NextGenRyo commented 1 year ago

I am suffering from this exact same problem after a system update on Manjaro; that update included mesa 23.1.6. Any updates on this?

kisak-valve commented 1 year ago

This is a mesa/RADV regression limited to mesa 23.1.6.

Caused by https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/24579, and should be fixed by https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/24949 and https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/24896.

The practical fix is to update to mesa 23.1.7 or newer, which includes these.

There's nothing more to be done on Fossilize's side.
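A quick way to confirm which mesa version a system actually ends up running after the update (output formats differ a bit between tools and driver versions):

# Either of these should show the Mesa version string for the active driver.
vulkaninfo --summary | grep -i driverinfo
glxinfo -B | grep -i 'opengl version'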

rhoot commented 1 year ago

The practical fix is to update to mesa 23.1.7 or newer, which includes these.

There's nothing more to be done on Fossilize's side.

Yep, 23.1.7 seems to fix it. Thanks!