fossilize_replay should use batch scheduling if possible and share resources better

ValveSoftware / Fossilize

A serialization format for various persistent Vulkan object types.

MIT License

585 stars 47 forks source link

fossilize_replay should use batch scheduling if possible and share resources better #99

Closed kakra closed 3 years ago

kakra commented 4 years ago

When using schedutil -B $(pgrep fossilize) to put the replayer into batch scheduling mode, it utilizes my CPU better (99.9 to 100% per core instead of 91-98%), and it results in better IO rates writing the data back. This is probably because batch scheduling gives processes bigger time slices at the cost of latency but with better CPU cache hit rate in turn. This also results in better IO coalescing. Additionally, batch jobs should have a slight scheduling penalty over interactive processes, resulting in better system responsiveness while those are running.

But my biggest concern is with it's CPU weight and memory usage: Because those jobs use every core in my system, their CPU share is quite high above other processes. This even continues to work with a high CPU share while in games, and it seems to draw a lot of processing power from CPU and GPU even tho the processes run with nice=19.

So let's inspect this a little further: Modern distro kernels usually use auto-group CPU scheduling, especially when tailored for desktops. This makes sense, as CPU shares are distributed evenly among groups of processes instead of single processes: This essentially makes the scheduler see a complete application packages with all its sub-processes as a single process when it comes to distributing CPU shares. But this also means that nice=19 has absolutely no value because it is considered only among its own group of processes, the group itself will still receive its fair share - so a game running concurrently with the group of replay processes will only get 50% of the CPU. This has become very apparent to me since KDE switched to using slices and scopes for managing apps started from Plasma: I now see an app.slice which shares its total CPU bandwidth with other slices in the system, resulting in the share left to it evenly distributed between each scope running inside, with Steam and all its sub-processes being one such scope. I think Gnome deployed something similar. Maybe Steam just needs to take care that fossilize_replay doesn't create its own auto-group for the scheduler? I'm not if it does.

Also, when looking at memory accounting, I see the process group accounting for over 21 GB of memory on my system. The processes RSS itself only adds up to maybe roughly 8 GB. This is because memory accounting also accounts the page cache occupied by these processes. Or in other words: The replay jobs thrash the page cache, dominating it above other processes of the system. This slows loading times in games a lot.

A way around this would be to put this background job into its own cgroup, and giving that cgroup a much lower CPU share than default. Also, the cgroup could limit the amount of memory used, tho this has to be selected very very carefully or you negate its effects (as in "priority inversion"). A good compromise would be if you could use systemd API on supported distros to create a slice by a defined name for the fossilize jobs. Then, users could create a systemd slice unit to adjust the settings to their preference, limiting CPU and IO shares that way, maybe limiting its share to the bandwidth of one core (which applies only if other processes actually claim the rest). This will open an opportunity for users to experiment with some settings and see what works best, then maybe share their results here. On non-systemd systems, this could either be a no-op, or you could programmatically create a cgroup. I'd opt-in for the no-op option because most distros switched to systemd anyways. To me, it looks like the group of fossilize_replay processes always get a total of 50% CPU share when running concurrently with a game, and this is probably due to auto-grouping or desktop environments putting each app into a systemd scope.

Another question is: Why doesn't fossilize_replay stop when I start playing a game?

BTW: My system probably doesn't use auto-group scheduling because I'm using the CK patchset. Still, nice=19 doesn't have a very big impact because fossilize_replay still seems to use GPU (according to nvidia-smi), and that doesn't know of niceness. So maybe fossilize_replay should not use GPU at all, or it should stop using GPU as soon as a game is being played?

IOW, nice doesn't do what one would expect on modern kernels. See man 7 sched, section "The autogroup feature". Maybe use the autogroup nice feature described there? However, that may not work when apps are already running within a slice or scope.

Plagman commented 4 years ago

It really shouldn't be using GPU, what is being shown in nvidia-smi exactly?

kakra commented 4 years ago

This is a snapshot from this night, apparently the process is done now so I can no longer get more details. But I think I've seen fossilize_replay take more GPU memory before (which would be bad because DXVK/NVIDIA driver fall back to sysmem more early then). But it looks like it hardly uses GPU % but I didn't closely watch this.

Sun Oct 18 02:16:46 2020
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 455.26.01    Driver Version: 455.26.01    CUDA Version: 11.1     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  GeForce GTX 166...  Off  | 00000000:01:00.0  On |                  N/A |
| 33%   43C    P0    26W / 130W |   1379MiB /  5910MiB |      3%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|    0   N/A  N/A      1214      G   /usr/libexec/Xorg                 976MiB |
|    0   N/A  N/A      1793      G   /usr/bin/kwin_x11                  89MiB |
|    0   N/A  N/A      1841      G   /usr/bin/plasmashell               98MiB |
|    0   N/A  N/A    144748      G   ...e/Steam/ubuntu12_32/steam       34MiB |
|    0   N/A  N/A    144766      G   ./steamwebhelper                    1MiB |
|    0   N/A  N/A    181074      G   ...AAAAAAAAA= --shared-files       94MiB |
|    0   N/A  N/A    224158    C+G   ...ntu12_64/fossilize_replay       12MiB |
|    0   N/A  N/A    232668    C+G   ...ntu12_64/fossilize_replay       12MiB |
|    0   N/A  N/A    236700    C+G   ...ntu12_64/fossilize_replay       12MiB |
|    0   N/A  N/A    239334    C+G   ...ntu12_64/fossilize_replay       12MiB |
|    0   N/A  N/A    240240    C+G   ...ntu12_64/fossilize_replay       12MiB |
|    0   N/A  N/A    240750    C+G   ...ntu12_64/fossilize_replay       12MiB |
+-----------------------------------------------------------------------------+

kakra commented 4 years ago

Also, for the Path of Exile shaders, this process takes multiple hours on my system, with a lot of IO. The shader caches are around 26GB for this game, with around 50% sitting in the foz directories, and another 50% in the nvidia directories, with hundreds of small files. After it finished, the shader caches shrink to 13GB total (with the 2/3 in the nvidia directories). So there's a lot of IO overhead which hogs the Linux page cache during that period. Maybe fossilize should use IO hinting when opening files to tell the Linux cache that the data is used only once and can be discarded first.

I'm on btrfs, and creating lots of small files stalls the file system for small periods of time. This itself is probably something that fossilize cannot do much about, and I could symlink it to another file system. But it's also dominating the cache as cgroup memory accounting clearly shows with accounting for over 21 GB of memory (although the processes itself only used around 8 GB).

Here's a snapshot of atop during that time:

And another one for psize memory at the same time while cgroups accounted around 21 GB of memory (which includes cache memory):

HansKristian-Work commented 4 years ago

Another question is: Why doesn't fossilize_replay stop when I start playing a game?

That sounds like a Steam bug. At least that's not something Fossilize has control over on its own.

Modern distro kernels usually use auto-group CPU scheduling, especially when tailored for desktops. This makes sense, as CPU shares are distributed evenly among groups of processes instead of single processes

Is there some Linux mechanism we can use to make the entire Fossilize process tree have lowest priority then? For reference, fossilize_replay top process is its own pgroup.

When using schedutil -B $(pgrep fossilize) to put the replayer into batch scheduling mode, it utilizes my CPU better (99.9 to 100% per core instead of 91-98%), and it results in better IO rates writing the data back.

That sounds like a good general change if it's shown to help. Fossilize is not an interactive application after all.

This is a snapshot from this night, apparently the process is done now so I can no longer get more details.

I think some GPU memory usage is expected since we are creating pipelines, and I assume the driver copies pipelines to GPU VRAM where they could hypothetically be executed. The driver doesn't know that we're only replaying for the purpose of filling the driver cache.

kakra commented 4 years ago

Another question is: Why doesn't fossilize_replay stop when I start playing a game?

That sounds like a Steam bug. At least that's not something Fossilize has control over on its own.

Seems to work sometimes but then again sometimes it doesn't... I'll watch that, let's not bother with this here, I guess.

Modern distro kernels usually use auto-group CPU scheduling, especially when tailored for desktops. This makes sense, as CPU shares are distributed evenly among groups of processes instead of single processes

Is there some Linux mechanism we can use to make the entire Fossilize process tree have lowest priority then? For reference, fossilize_replay top process is its own pgroup.

Yes, it seems so: man sched #The autogroup feature says there's /proc/[pid]/autogroup if auto-grouping is active. One can simply write to the file to change the CPU bandwidth. I don't have the feature enabled here, so I don't see it. And auto-groups are usually only created on setsid() calls, so running a bash subshell probably does that, thus if you're calling fossilize through a bash helper, it may create a new autogroup.

But reading further, it also says that auto-grouping does not affect SCHED_BATCH processes - and when you look at my commits in #101, I've already implemented that. But I cannot confirm that currently as my current kernel does not support auto-grouping. Whatever it is, SCHED_BATCH is already a win because such processes won't preemt interactive tasks.

When using schedutil -B $(pgrep fossilize) to put the replayer into batch scheduling mode, it utilizes my CPU better (99.9 to 100% per core instead of 91-98%), and it results in better IO rates writing the data back.

That sounds like a good general change if it's shown to help. Fossilize is not an interactive application after all.

See commits in #101 where I added some details in the commit messages.

This is a snapshot from this night, apparently the process is done now so I can no longer get more details.

I think some GPU memory usage is expected since we are creating pipelines, and I assume the driver copies pipelines to GPU VRAM where they could hypothetically be executed. The driver doesn't know that we're only replaying for the purpose of filling the driver cache.

Yeah, I think that's expected. I'll watch this more closely next time. If it doesn't use GPU% and VRAM stays at a few MB, then I'm fine with it. Later NVIDIA drivers also added a feature that frees idle VRAM allocations after 5 minutes (if they have been mirrored from sysmem) which probably explains why I only see 12 MB now while it has been much more in the past. According to an NVIDIA tech, this feature was silently introduced some versions ago, it should be deployed in all current drivers. It has actually been done to reduce VRAM pressure of web browsers which just leave their huge pile garbage rendering artifacts in VRAM, so it's some sort of garbage collection.

HansKristian-Work commented 4 years ago

Maybe fossilize should use IO hinting when opening files to tell the Linux cache that the data is used only once and can be discarded first.

Is this the posix_fadvise stuff I was seeing in the PR?

HansKristian-Work commented 4 years ago

As for memory usage, I figured that the Linux kernel would share pages for the database files across all processes. Another option is perhaps to use mmap() once in the parent process and just share the read-only mappings with worker processes, but that doesn't seem like an ideal solution either on 32-bit. Perhaps stdio is not doing the right thing with read-only files so that we're actually getting lots of redundant copies of the same data in the disk cache?

kakra commented 4 years ago

Perhaps stdio is not doing the right thing with read-only files so that we're actually getting lots of redundant copies of the same data in the disk cache?

The page cache in Linux shares pages among processes based on inodes and devid. So it should be okay. But how's stdio involved here? I didn't quite get a complete picture of fossilize design yet.

Another option is perhaps to use mmap() once in the parent process...

I'd rather not use mmap as that introduces a hard to tame beast of memory/cache pressure and becomes very slow very fast on memory constrained systems. Instead, I'm going with telling the kernel to give up cache early for the database in my PR. As long as there's no memory pressure, it should do nothing. But if there is, it will discard the page cache of the database before swapping out other processes memory or flushing their cache. But it currently only tackles the reading code path, but I think I figured out a way we can also become more cache-friendly in the write path.

Also, mmap of very large files may have a negative effect on the TLB which will probably increase latency of memory allocations and virtual memory access of concurrent processes. At least this is what it looked like when I've looked at KDEs Baloo which created a 256GB mmap on a sparse file. It can become quite toxic to foreground performance and latency, and we don't want that when fossilize would become quite a heavy user of the maps.

Using mmaps would just turn the page cache over to swap-like behavior. While that has lower latency compared to standard reads and writes, it steals control from us over what is cached. And we really do not care about latency of reads and writes in fossilize, do we?

Is this the posix_fadvise stuff I was seeing in the PR?

Yes, it is.

I'm pretty sure there's something similar for Windows. Do we have Windows users complaining about this? IOW, is it worth the effort to also improve the Windows behavior?

kakra commented 4 years ago

@HansKristian-Work Okay, when running with --progress, it uses the code path I've patched and correctly sets batch scheduling and IO nice. Also, the cache hinting seems to work: htop no longer shows swap usage increasing while the processes are running and my system does not seem to stall. But I fear I need to set the NVIDIA shader cache directory for a proper reproducer.

I noticed that your setsid() patch doesn't seem to quite do what it is supposed to do - or this doesn't work as expected when running the CLI with --progress:

I'd expect the SID to match the PID of the main instance or the first child, but it matches the PID of my shell. Any clues?

Next steps: Let's add some heuristics to limit the write back cache of the processes, then maybe add a load limiter which keeps the loadavg below the amount of available cores so we do not overload the scheduler.

kakra commented 4 years ago

@HansKristian-Work After letting the patched version run its job on the Path of Exile share data over night, I'm seen around 3 GB of memory pushed to swap this morning. This may partly be due to the backup which was running in parallel. But that's evidence we still need to do something about the write-back behavior which is introduced by third-party apps (probably mesa, nvidia driver, etc). But since that will take a very different approach, I'll leave that for another PR.

Due to that, my running desktop were a little laggy this morning when I first clicked them until all relevant pages have been swapped in again. But that took only a few seconds and now it's fine again.

Final progress report:

Fossilize INFO: =================
Fossilize INFO:  Progress report:
Fossilize INFO:    Overall 3025044 / 3025044
Fossilize INFO:    Parsed graphics 965762 / 969317, failed 3555, cached 0
Fossilize INFO:    Parsed compute 1 / 1, failed 0, cached 0
Fossilize INFO:    Decompress modules 1064043 / 1086418, skipped 0, failed validation 10638, missing 11737
Fossilize INFO:    Compile graphics 948007 / 969317, skipped 21310, cached 0
Fossilize INFO:    Compile compute 1 / 1, skipped 0, cached 0
Fossilize INFO:    Clean crashes 0
Fossilize INFO:    Dirty crashes 0
Fossilize INFO: =================

I wonder if there's anything suspicious about these numbers... Is it okay to show those skipped, failed, and missing numbers? And why does "cached" show 0?

HansKristian-Work commented 4 years ago

Those numbers look fine to me.

HansKristian-Work commented 4 years ago

I noticed that your setsid() patch doesn't seem to quite do what it is supposed to do - or this doesn't work as expected when running the CLI with --progress:

Did you disable the inherit_process_group flag I pointed you to? If you didn't then the CLI will not mess with process/session groups.

It worked fine on my end at least.

kakra commented 4 years ago

Ah okay didn't do that... Then it's fine I think.

kakra commented 4 years ago

@HansKristian-Work

As for memory usage, I figured that the Linux kernel would share pages for the database files across all processes.

I've looked at the code and I think I know what you mean: For the threaded replayer, that is true: The database is loaded at process init, allocates memory for it, and then spawns the threads. All is good: Allocated pages are shared, filesystem reads share the cache between each thread.

But for the master/slave mode, this is different: The parent spawns the children (one per HW thread, 8 in my case), then each child allocates memory and loads the database. This is not optimal: While reads come from a shared cache (thus, only one process actually physically loads pages from disk, the others just get their copy from the cache), they still act on separate memory that receives the reads. This could be solved with mmap() but care needs to be taken then to control the cache: We'd need to use madvise() at several places to selectively tell the page cache that it's okay to discard that memory from cache again, otherwise we'd dominate the cache even more than we already do. Also, mmap() seems to increase TLB pressure but I'm not a kernel engineer and cannot prove that. Also, mmap() will probably ignore any IO priorities as data is paged into our address space via faults, and faults are highest priority for obvious reasons, leaving only the benefit of multiple processes sharing the same mapping to not fault again for the same data - which is quite useless as each child operates on a different slice, right?

I think it could be solved but I'm not sure if this could work with the current design (partly because it uses execv() instead of fork()):

Instead of reading the database after the fork of children, you could allocate the memory and read the data before. Fork will then duplicate the address space but share the memory pages. First write to a memory page would fault and duplicate it, so each child will have its own virtual copy, so the outcome is still the same but should use much less memory. But it looks like the design how the processes are initialized doesn't really allow breaking database load out to before the fork. It seems the order of operation is:

  +------+   /-------+ n +---------+   /----------+ n +----------------+
->| init |---| fork? |-->| load db |-->| threads? |-->| run processing |
  +------+   +-------/   +---------+   +----------/   +----------------+
                 | y          ^             | y           ^      |
                 v            |             v             |      |
             +----------+     |        +---------+        |      | exit
             |  spawn   | 8x  |        |  spawn  |   8x   |      | worker
             | children |-----+        | threads |--------+      | thread
             +----------+              +---------+               |
                                                                 |
                            exit                                 |
             +----------+   process    +---------+               |
      +------| wait for |<-------------|  join   |<--------------+
      |      | children |              | threads |
      |      +----------+              +---------+
      v
   return

"load db" should move in front of the fork decision (and use fork() instead of execv() but that's not a simple replacement).

kakra commented 4 years ago

I'd rather not use mmap as that introduces a hard to tame beast of memory/cache pressure and becomes very slow very fast on memory constrained systems. Instead, I'm going with telling the kernel to give up cache early for the database in my PR. As long as there's no memory pressure, it should do nothing.

After some more research, it looks like fadvise will discard cache memory immediately if the kernel can, or ignore the request. So "it should do nothing" may be wrong. But as long as the database isn't re-reading the same data over and over again, that's not a problem at all. I'd rather see some duplicate reads then fossilize dominating the cache and forcing other apps into swap.

kakra commented 4 years ago

Another question is: Why doesn't fossilize_replay stop when I start playing a game?

That sounds like a Steam bug. At least that's not something Fossilize has control over on its own.

It looks like Steam just wouldn't spawn another fossilize job if I run a game. But it will finish the current one which can take hours for some shader databases.

kakra commented 3 years ago

@HansKristian-Work Some observations:

I disabled shader pre-caching completely because it no longer seemed to do anything for some games when I start them: The progress dialog would appear and sit at 0% but I was seeing no fossilize process launching or running.
I later enabled the shader pre-caching again because it led to very stuttering gameplay in some games with very visible lag spikes. It looks like disabling the pre-caching completely wiped the cache because Steam re-downloaded pre-cache data for all games.
The pre-caching data downloaded now adds up to 14 GB instead of 21 GB (according to Steam settings).
Fossilize now regenerating all caches, and it has almost no system impact this time and silently runs in the background.

What does it tell us? It seems like the shader pre-caches were broken before? Because (a) they were bigger and (b) processing had a very high impact on the system.

Since the process is not done yet, I don't have any data on how this behaves after restarts or updates now. But just disabling the shader pre-caching and enabling it again seems to have done something.

kakra commented 3 years ago

Another observation that may be linked to the problem:

When the Steam client crashes, it doesn't stop the fossilize process and it stays running. If you start Steam again, you get two fossilize worker groups working on the same shader files. I don't think that is supposed to work very well. If Steam crashes again, I get a third copy of fossilize processes running on the same shader data again. It's running thrice now.

Fossilize should somehow detect if its parent process is still running. After all, it reports progress somewhere, it should see the pipe going down. I don't think that it's possible for Steam to catch the crash signal to manually kill children processes. I don't even know why Steam crashed multiple times in a row for me. It was probably caused by system overload due to misbehaving fossilize somehow, and having two or more copies of fossilize running didn't make it better. That's why I disabled shader pre-caching in the first place. And now everything seems to be fine again, see above.

kakra commented 3 years ago

Another question is: Why doesn't fossilize_replay stop when I start playing a game?

That sounds like a Steam bug. At least that's not something Fossilize has control over on its own.

Another weird observation: This seems to happen only when running Borderlands 3: It always wants to rebuild the shader pre-cache in foreground which can take hours. If I press the skip button, Steam starts some heavy IO, and then either it crashes or it starts the game. However, in both cases fossilize keeps processing in the background on 8 cores. Usually, if I click "skip", it should stop processing shaders and just start the game - and that seems to work for other games. Also interesting: If I look at htop later, I see background processing with fossilize has started with 2 threads despite the game still running. Killing those two threads will eventually only spawn another round of background processes for the next game. It looks like the Steam client completely ignores that this particular game is running. Whatever that is, it makes a really bad game experience as it hovers around 8-15 fps instead of 40-60. I think having the scheduling changes deployed could really help here (except that this behavior is probably a bug in Steam anyways).

... Trying a little more this looks like some issue with out-of-order execution or async spawning: Launching Borderlands 3 shows the "Processing shaders" dialog, and launching the fossilize processes lags around 1 minute behind sometimes. If I click "skip" while fossilize is not yet running, it will close the dialog after a few seconds and launch the game but fossilize will still be launched about 1 minute later in foreground processing mode (using all cores). If I wait until fossilize runs and then click "skip", it will successfully stop the fossilize processes, then launch the game.

Trying a few different games, I still see fossilize background processes spawning even though I'm currently running a game. And Steam is aware of it as it shows the "Force close" button in the library (I don't know the English text because I'm using the German version, the button that usually starts and quits games in the library). Still, Steam happily launches fossilize background processing - running only with 2 processes but still it makes fps very choppy in some games. I'm not sure if this is related: I only tried games using an intermediate launcher yet and I'm pretty sure this doesn't happen for games that launch the main exe directly.

It's probably worth to report a Steam bug about this? Does this go to steam-for-linux?

kakra commented 3 years ago

@HansKristian-Work Is there a way to make fossilize a library component similar to how Proton is such a component? That would enable users to switch to a development branch of fossilize.

kakra commented 3 years ago

@HansKristian-Work The Steam update news seems to suggest that the latest version of fossilize has been deployed. This also means: No more background processing on Nvidia currently. Can I force processing anyways somehow to test the changes?

kakra commented 3 years ago

As of https://github.com/ValveSoftware/Fossilize/commit/200b19c319e2872415d74b5d3479e1624d748bc6 I think this can be closed. I'll leave it open until I see it deployed to the Steam client distribution.

kakra commented 3 years ago

Resource sharing and fairness works excellent with latest Steam beta with updated fossilize, closing.