ValveSoftware / Fossilize

A serialization format for various persistent Vulkan object types.
MIT License

Please make fossilize_replay ramp in a performant way #248

Closed hparadiz closed 2 months ago

hparadiz commented 2 months ago

Feature Request

I confirm:

Description

Please make fossilize_replay check the load average against the number of cores and avoid launching threads at normal priority.
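A minimal sketch of what such a check could look like, assuming a simple policy where the worker count shrinks as the 1-minute load average approaches the core count. The name `worker_budget` and the policy itself are hypothetical illustrations, not how fossilize_replay actually behaves:

```python
import os


def worker_budget(load_1min: float, cores: int, max_workers: int) -> int:
    """Return how many workers to run (hypothetical policy).

    Uses the number of cores not already covered by the load average,
    capped at max_workers and never dropping below one worker.
    """
    idle_cores = int(cores - load_1min)
    return max(1, min(max_workers, idle_cores))


if __name__ == "__main__":
    load_1min, _, _ = os.getloadavg()  # 1-, 5-, 15-minute load averages
    cores = os.cpu_count() or 1
    print(worker_budget(load_1min, cores, max_workers=4))
```

On a 32-thread machine that is already at a load of 30, this policy would leave room for two workers instead of four.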

Justification

When another CPU-intensive task is running, the machine can lock up because of the many fossilize_replay threads running in the background.

Risks

Low

References

https://github.com/ValveSoftware/Proton/issues/7000

kakra commented 2 months ago

@hparadiz Please check with schedtool $(pidof fossilize_replay) which priority the processes are using. They should already be scheduled as batch, with a very low group nice priority (not visible in top or schedtool, but in /proc/PID/autogroup).
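The same check can be scripted. This is a rough sketch, assuming a Linux system with autogrouping enabled; the helper names `parse_autogroup` and `describe` are made up for illustration:

```python
import os


def parse_autogroup(text: str) -> tuple[str, int]:
    """Parse a line like '/autogroup-27307 nice 19' into (group, nice)."""
    group, keyword, nice = text.split()
    if keyword != "nice":
        raise ValueError(f"unexpected autogroup format: {text!r}")
    return group, int(nice)


def describe(pid: int) -> str:
    """Report scheduling policy and autogroup nice for one PID (Linux only)."""
    is_batch = os.sched_getscheduler(pid) == os.SCHED_BATCH
    with open(f"/proc/{pid}/autogroup") as f:
        group, nice = parse_autogroup(f.read())
    return f"pid={pid} batch={is_batch} group={group} nice={nice}"


if __name__ == "__main__":
    try:
        print(describe(os.getpid()))
    except (OSError, AttributeError) as exc:
        # No autogroup support in the kernel, or not running on Linux.
        print(f"could not inspect scheduling: {exc}")
```

To inspect fossilize_replay rather than the script itself, pass one of the PIDs printed by pidof to `describe`.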

Additionally, if supported by the system, fossilize will throttle itself using PSI if it detects spikes in IO latency, to prevent locking up the desktop for multiple seconds. IO latency will spike if fossilize starts to dominate the page cache, RAM becomes low, or your filesystem cannot write data fast enough. In that case, it will put the fossilize processes into the stopped state. If your kernel isn't compiled with PSI, this won't work.
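To illustrate the idea of pressure-based throttling, here is a small sketch. It assumes the throttling is done with SIGSTOP and SIGCONT and uses two invented thresholds with hysteresis; neither the thresholds nor the function names come from fossilize:

```python
import os
import signal

# Hypothetical thresholds (percent of time stalled on IO over the last 10 s).
STOP_ABOVE = 20.0
RESUME_BELOW = 5.0


def should_be_stopped(currently_stopped: bool, some_avg10: float) -> bool:
    """Decide the next state with hysteresis so workers do not flap."""
    if currently_stopped:
        # Stay stopped until pressure has clearly dropped.
        return some_avg10 >= RESUME_BELOW
    return some_avg10 > STOP_ABOVE


def apply_state(pids: list[int], stop: bool) -> None:
    """Send SIGSTOP or SIGCONT to each worker, ignoring ones that exited."""
    sig = signal.SIGSTOP if stop else signal.SIGCONT
    for pid in pids:
        try:
            os.kill(pid, sig)
        except ProcessLookupError:
            pass
```

The gap between the two thresholds is a design choice: with a single threshold, workers would be stopped and resumed in rapid succession whenever pressure hovers around it.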

The last thing is memory usage: if it gets high and you don't allow swapping, fossilize tends to dominate the page cache. This results in high desktop latency, and the system may take many minutes to recover from the memory shortage: it will appear frozen but with disk activity. In that case, you should enable swap and not run with vm.swappiness=0.

Also, in the background, it should not use more than 4-5 processes - but I've seen reports here where this doesn't seem to work, and if fossilize runs on all cores, it will overwhelm the memory subsystem. Without process autogrouping or PSI, it will also overwhelm CPU and disk - where the latter is the biggest problem. Maybe your distribution removed the core limit for background processing from Steam to "make it magically faster"?

Fossilize uses shared memory between all processes (so memory usage is lower than it may appear), but it is not backed by a temporary disk file. It is just anonymous memory, so the kernel needs swap to compensate for high fossilize activity.

To summarize:

hparadiz commented 2 months ago

I'm on Gentoo.

My machine has 32GB of RAM. The CPU is an AMD Ryzen 5950X.

Swap is:
/dev/nvme0n1p3 2887680 137105407 134217728 64G Linux swap
vm.swappiness = 60

Seeing 8 threads going [image]

Looks like it is using batch scheduling:

$ schedtool $(pidof fossilize_replay) 
PID 30811: PRIO   0, POLICY B: SCHED_BATCH   , NICE  19, AFFINITY 0xffffffff
PID  9455: PRIO   0, POLICY B: SCHED_BATCH   , NICE  19, AFFINITY 0xffffffff
PID  9315: PRIO   0, POLICY B: SCHED_BATCH   , NICE  19, AFFINITY 0xffffffff
PID  9216: PRIO   0, POLICY B: SCHED_BATCH   , NICE  19, AFFINITY 0xffffffff
PID  9189: PRIO   0, POLICY B: SCHED_BATCH   , NICE  19, AFFINITY 0xffffffff
PID  9147: PRIO   0, POLICY B: SCHED_BATCH   , NICE  19, AFFINITY 0xffffffff
PID  9113: PRIO   0, POLICY B: SCHED_BATCH   , NICE  19, AFFINITY 0xffffffff
PID  8813: PRIO   0, POLICY B: SCHED_BATCH   , NICE  19, AFFINITY 0xffffffff
PID  8806: PRIO   0, POLICY B: SCHED_BATCH   , NICE  19, AFFINITY 0xffffffff
$ cat /proc/6712/autogroup
/autogroup-27307 nice 19

I'll look into my kernel for PSI. I'm running a self-compiled 6.9.6.

I was compiling chromium at the time.

Thank you for the lovely detailed response.

kakra commented 2 months ago

I'm on Gentoo, too, and it works great here.

So everything looks fine on your system (autogroups are there, batch scheduling works, swappiness is fine, swap is there), except that you need to check whether /proc/pressure/io exists (PSI for IO), which fossilize watches while it runs. You can watch cat /proc/pressure/io yourself while it is running. "some" means that at least some processes are blocked on IO; "full" means that all non-idle processes are waiting on IO at the same time. "total" is the total stall time in microseconds; the remaining numbers are averages over sliding windows of 10, 60, and 300 seconds. fossilize watches the "some" line.
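If you would rather read the numbers from a script, a sketch like the following works on the kernel's `key=value` layout. The function name `parse_psi` and the sample values are illustrative only:

```python
from pathlib import Path


def parse_psi(text: str) -> dict[str, dict[str, float]]:
    """Parse PSI text into {line_name: {field: value}}."""
    result: dict[str, dict[str, float]] = {}
    for line in text.splitlines():
        fields = line.split()
        if not fields:
            continue
        name, pairs = fields[0], fields[1:]
        result[name] = {
            key: float(value)
            for key, value in (pair.split("=", 1) for pair in pairs)
        }
    return result


if __name__ == "__main__":
    psi_file = Path("/proc/pressure/io")
    if psi_file.exists():
        print(parse_psi(psi_file.read_text())["some"])
    else:
        print("PSI is not available on this kernel")
```

For example, a line such as `some avg10=1.50 avg60=0.75 avg300=0.10 total=123456` yields `avg10` as 1.5 under the `"some"` key.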

Since you are on Gentoo, and I get the feeling you enjoy some technical details for that reason: the PSI feature in fossilize was actually my idea, because without it my system was really struggling with fossilize. I did the initial proof-of-concept patch, but the main author greatly improved it and knew a lot better where to put the control knobs. I also initiated adding autogroup nice, because nice'ing individual processes on a kernel with autogrouping has absolutely zero effect: nice only works within a process group, so the group itself needs to become nice. The batch scheduler (I added that patch to fossilize, too) gives processes slightly longer time slices for better CPU cache hit rates, at the cost of being preempted more often by other processes. That gives them a slight priority disadvantage against SCHED_OTHER (aka interactive) processes, which is what we actually want.
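For the curious, both knobs can be tried from a script. This is only a sketch under the assumption of a Linux kernel with autogrouping; it is not taken from fossilize's source, and `demote_self` is a made-up name:

```python
import os


def demote_self(autogroup_nice: int = 19) -> tuple[bool, bool]:
    """Try to switch this process to SCHED_BATCH and make its autogroup nice.

    Returns (batch_ok, autogroup_ok); each step fails softly.
    """
    batch_ok = False
    autogroup_ok = False
    try:
        # SCHED_BATCH requires a static priority of 0.
        os.sched_setscheduler(0, os.SCHED_BATCH, os.sched_param(0))
        batch_ok = True
    except (OSError, AttributeError):
        pass  # not permitted, or not running on Linux
    try:
        # Writing a number here sets the nice level of the whole autogroup.
        with open("/proc/self/autogroup", "w") as f:
            f.write(str(autogroup_nice))
        autogroup_ok = True
    except OSError:
        pass  # autogrouping disabled, or write not permitted
    return batch_ok, autogroup_ok
```

Note that the autogroup write affects every process in the same session, which is the point: the group as a whole becomes nice relative to other groups.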

All these changes finally made fossilize an absolute non-issue for me: no matter what it does, there's no impact on the system (except the HDD making some more noise).

If you're using btrfs or bcache, you may want to look at my kernel patches for Gentoo. BTW: There's a Steam-centric kernel patch, too, which probably won't apply to 6.9: I'm only maintaining those patches for LTS kernels. So the next round of patches will come end of December or early January.

kakra commented 2 months ago

> I was compiling chromium at the time.

The issue here is more likely Chromium. The linker phase of Chromium is begging for RAM. Try removing -pipe from your CFLAGS and avoid compiling Chromium in tmpfs if you've set this up. You can set a different portage location per package using package.env.

hparadiz commented 2 months ago

It was PSI. I compiled it into my kernel and now things seem more performant.

I'm all ears for any other kernel things I should check for.

kakra commented 2 months ago

> It was PSI. I compiled it into my kernel and now things seem more performant.

Great to see that this fixed it - that is the intention of the PSI support in fossilize.

> I'm all ears for any other kernel things I should check for.

Then maybe look at the kernel patches I'm using for Gentoo (but I'm only using LTS kernels): https://github.com/kakra/linux/pulls