ValveSoftware / Fossilize

A serialization format for various persistent Vulkan object types.
MIT License

Please make fossilize_replay ramp in a performant way #248

Closed hparadiz closed 4 months ago

hparadiz commented 5 months ago

Feature Request

I confirm:

Description

Please make fossilize_replay check the load average against the number of cores and avoid launching threads at normal priority.

Justification

When another CPU-heavy task is running, the machine can lock up because of the many fossilize_replay threads running in the background.

Risks

Low

References

https://github.com/ValveSoftware/Proton/issues/7000

kakra commented 5 months ago

@hparadiz Please check with schedtool $(pidof fossilize_replay) which priority the processes are using. They should already be scheduled as batch, using a very low group nice priority (not visible in top or schedtool, but in /proc/PID/autogroup).
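For anyone who wants to check both values from a script, here is a minimal Python sketch. It assumes the /proc/PID/autogroup line has the form posted later in this thread (/autogroup-27307 nice 19). The helper names parse_autogroup and describe_process are made up for illustration and are not part of fossilize or schedtool.

```python
import os
import re

# Matches a line such as "/autogroup-27307 nice 19" (format assumed from this thread).
_AUTOGROUP_RE = re.compile(r"^/autogroup-(\d+) nice (-?\d+)$")


def parse_autogroup(line):
    """Return (group_id, nice) parsed from one /proc/PID/autogroup line."""
    match = _AUTOGROUP_RE.match(line.strip())
    if match is None:
        raise ValueError(f"unexpected autogroup format: {line!r}")
    return int(match.group(1)), int(match.group(2))


def describe_process(pid):
    """Report scheduling policy and autogroup nice of a live process (Linux only)."""
    policy = os.sched_getscheduler(pid)
    with open(f"/proc/{pid}/autogroup") as fh:
        group_id, nice = parse_autogroup(fh.readline())
    return {
        "batch": policy == os.SCHED_BATCH,  # True when the process runs as SCHED_BATCH
        "autogroup": group_id,
        "nice": nice,
    }
```

Calling describe_process on each PID returned by pidof fossilize_replay should show batch scheduling and a high group nice value if everything is set up as described.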

Additionally, if supported by the system, fossilize will throttle itself using PSI if it detects spikes in IO latency to prevent locking up the desktop for multiple seconds. IO latency will spike if fossilize starts to dominate the page cache, RAM becomes low, or your filesystem cannot write data fast enough. In that case, it will put fossilize processes into stopped state. If your kernel isn't compiled with PSI, this won't work.
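The pause-and-resume idea can be sketched roughly as follows. This is not fossilize's actual code: the thresholds, the two-threshold hysteresis, and the use of SIGSTOP/SIGCONT are assumptions chosen only to illustrate how a process could react to IO pressure.

```python
import os
import signal


def next_state(paused, some_avg10, stop_above=20.0, resume_below=5.0):
    """Decide whether workers should be paused.

    some_avg10 is the 10-second "some" IO pressure average in percent.
    stop_above and resume_below are hypothetical thresholds; using two
    different values avoids flapping between states.
    """
    if not paused and some_avg10 > stop_above:
        return True
    if paused and some_avg10 < resume_below:
        return False
    return paused


def apply_state(pids, paused):
    """Stop or continue the given worker processes (assumed mechanism)."""
    sig = signal.SIGSTOP if paused else signal.SIGCONT
    for pid in pids:
        os.kill(pid, sig)
```

A supervising loop would read the pressure value periodically, call next_state, and call apply_state only when the result changes.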

The last thing is memory usage: if it gets high and you don't allow swapping, fossilize tends to dominate the page cache. This results in high desktop latency, and the system may take many minutes to recover from the memory shortage: it will appear frozen but with disk activity. In that case, you should enable swap and not run with vm.swappiness=0.

Also, in the background, it should not use more than 4-5 processes - but I've seen reports here where this doesn't seem to work, and if fossilize runs on all cores, it will overwhelm the memory subsystem. Without process autogrouping or PSI, it will also overwhelm CPU and disk - where the latter is the biggest problem. Maybe your distribution removed the core limit for background processing from Steam to "make it magically faster"?

Fossilize uses shared memory between all processes (so memory usage is lower than it may appear), but it is not backed by a temporary disk file, it's just anonymous memory, so the kernel needs swap memory to compensate for high fossilize activity.

To summarize:

hparadiz commented 5 months ago

I'm on Gentoo.

My machine has 32GB of ram. CPU is an AMD Ryzen 5950X

Swap is /dev/nvme0n1p3 2887680 137105407 134217728 64G Linux swap

vm.swappiness = 60

Seeing 8 threads going

Looks like it is using

$ schedtool $(pidof fossilize_replay) 
PID 30811: PRIO   0, POLICY B: SCHED_BATCH   , NICE  19, AFFINITY 0xffffffff
PID  9455: PRIO   0, POLICY B: SCHED_BATCH   , NICE  19, AFFINITY 0xffffffff
PID  9315: PRIO   0, POLICY B: SCHED_BATCH   , NICE  19, AFFINITY 0xffffffff
PID  9216: PRIO   0, POLICY B: SCHED_BATCH   , NICE  19, AFFINITY 0xffffffff
PID  9189: PRIO   0, POLICY B: SCHED_BATCH   , NICE  19, AFFINITY 0xffffffff
PID  9147: PRIO   0, POLICY B: SCHED_BATCH   , NICE  19, AFFINITY 0xffffffff
PID  9113: PRIO   0, POLICY B: SCHED_BATCH   , NICE  19, AFFINITY 0xffffffff
PID  8813: PRIO   0, POLICY B: SCHED_BATCH   , NICE  19, AFFINITY 0xffffffff
PID  8806: PRIO   0, POLICY B: SCHED_BATCH   , NICE  19, AFFINITY 0xffffffff
$ cat /proc/6712/autogroup
/autogroup-27307 nice 19

I'll look into my kernel for PSI. I'm running self compiled 6.9.6

I was compiling chromium at the time.

Thank you for the lovely detailed response.

kakra commented 5 months ago

I'm on Gentoo, too, and it works great here.

So everything looks fine on your system (autogroups are there, batch scheduling works, swappiness is fine, swap is there), except that you need to check whether /proc/pressure/io exists (PSI for IO), which fossilize watches while it runs. You can watch cat /proc/pressure/io yourself while it is running. "some" means that some processes are blocking on IO; "full" means that all non-idle processes are waiting on IO at the same time. "total" is the total stall time in microseconds; the remaining numbers are averages over sliding windows of 10, 60 and 300 seconds. fossilize watches the "some" line.
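To read those numbers from a script rather than by eye, a small parser is enough. The sketch below is only an illustration: the function names are invented here, and the sample values in the usage note are made up, not measurements from any system in this thread.

```python
def parse_pressure(text):
    """Parse the contents of a /proc/pressure/* file.

    Returns a dict keyed by the first word of each line (the kernel emits
    "some" and "full"), each mapping to its avg10/avg60/avg300/total values.
    """
    result = {}
    for line in text.splitlines():
        fields = line.split()
        if not fields:
            continue
        kind, pairs = fields[0], fields[1:]
        values = {}
        for pair in pairs:
            key, _, raw = pair.partition("=")
            # "total" is an integer counter, the averages are percentages.
            values[key] = int(raw) if key == "total" else float(raw)
        result[kind] = values
    return result


def read_io_pressure(path="/proc/pressure/io"):
    """Read and parse IO pressure; raises FileNotFoundError without PSI."""
    with open(path) as fh:
        return parse_pressure(fh.read())
```

For example, parse_pressure("some avg10=1.50 avg60=0.40 avg300=0.10 total=123456") yields a "some" entry whose avg10 is 1.5. A FileNotFoundError from read_io_pressure is a quick hint that the kernel was built without PSI.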

Since you are on Gentoo, I suspect you enjoy some technical details: the PSI feature in fossilize was actually my idea, because without it my system was really struggling with fossilize. I did the initial PoC patch, but the main author greatly improved it and knew a lot better where to put the control knobs. I also initiated adding autogroup nice, because nice'ing individual processes on an autogroup kernel has absolutely zero effect (nice only works within a process group, so the group itself needs to become nice). The batch scheduler gives processes slightly longer time slices for better CPU cache hit rates (I added that patch to fossilize), at the cost of being preempted more often by other processes. That gives them a slight priority disadvantage against SCHED_OTHER (aka interactive) processes, which is what we actually want.

All these changes finally made fossilize an absolute non-issue for me: no matter what it does, there's no impact on the system (except the HDD making some more noise).

If you're using btrfs or bcache, you may want to look at my kernel patches for Gentoo. BTW: There's a Steam-centric kernel patch, too, which probably won't apply to 6.9: I'm only maintaining those patches for LTS kernels. So the next round of patches will come end of December or early January.

kakra commented 5 months ago

> I was compiling chromium at the time.

The issue here is more likely Chromium. The linker phase of Chromium is begging for RAM. Try removing -pipe from your CFLAGS and avoid compiling Chromium in tmpfs if you've set this up. You can set a different portage location per package using package.env.

hparadiz commented 4 months ago

It was PSI. I compiled it into my kernel and now things seem more performant.

I'm all ears for any other kernel things I should check for.

kakra commented 4 months ago

> It was PSI. I compiled it into my kernel and now things seem more performant.

Great to see that this fixed it - that is the intention of the PSI support in fossilize.

> I'm all ears for any other kernel things I should check for.

Then maybe look at my kernel patches I'm using for Gentoo (but I'm using only LTS kernels): https://github.com/kakra/linux/pulls

sandikata commented 1 month ago

PSI is not always the cause or a "big deal". Normally you won't get any advantage from it with that much CPU and RAM.

PSI is mostly for underpowered configurations (Steam Deck) or any laptop.

I am a Gentoo user too, but fossilize is driving me crazy. You have several choices, but none of them is optimal.

  1. To run fossilize on single core (it will take forever to recompile shaders, if you have many Vulkan games).
  2. To run fossilize on 4 - 8 cores
  3. To run fossilize on all available cores (it will be faster, but with heavy load)

So, even with all cores, it takes about 30 minutes on my PC, just for Assetto Corsa, Counter Strike, Counter Strike Source, Counter Strike 2, Dead Island, Flatout Collection, Wreckfest, Metro Collection.

I noticed that it can take much longer than usual if several Vulkan API implementations are installed, like amdvlk and amdgpu-pro-vulkan.

Please share your thoughts about this.

My configuration -> https://gist.github.com/sandikata/b5594bd79b35fc5dd556c3ff26189948

kakra commented 1 month ago

If you're using btrfs (according to your system info), the problem may actually be not having PSI in the kernel. The write patterns caused by fossilize (random reads and writes via mmap) are very aggressive towards btrfs and its kernel memory allocator, leading to IO stutters. This has been a problem with KDE's baloo index database but also with fossilize.

So your choices are probably:

This is most likely not a CPU-usage issue. fossilize uses extremely low CPU priority, and it also creates its own process group for the auto-grouping scheduler, so to the rest of the system fossilize only acts as a single fair-share CPU user (priority-wise; it will still use multiple cores and processes).

PSI is actually NOT a system for underpowered configurations. It's a system that lets processes detect whether they themselves are about to cause a bottleneck because the system is otherwise busy, or lets an admin plan for better resource sharing. Such processes and services can then take proper countermeasures, e.g. pausing IO (which is what fossilize does), flushing caches (to reduce memory pressure from the dirty cache), or reducing threads. It's similar to watching the loadavg, but PSI can look at each bottleneck individually or even per process.
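For comparison, the loadavg-based check requested at the top of this issue could look like the sketch below. It is only an illustration of the idea, not something fossilize does: the function names and the limit of 1.0 per core are arbitrary assumptions.

```python
import os


def load_per_core(load1=None, cores=None):
    """Return the 1-minute load average divided by the logical CPU count.

    Both values are read from the running system unless passed in.
    """
    if load1 is None:
        load1 = os.getloadavg()[0]
    if cores is None:
        cores = os.cpu_count() or 1
    return load1 / cores


def is_busy(load1=None, cores=None, limit=1.0):
    """True when the per-core load exceeds the (hypothetical) limit."""
    return load_per_core(load1, cores) > limit
```

Unlike PSI, this single number cannot tell whether the load comes from CPU, memory or IO, which is the distinction described above.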

sandikata commented 1 month ago

I am not sure I know a way to move specific data from Steam to a different location.