CachyOS / CachyOS-Settings

Settings used for CachyOS
GNU General Public License v3.0

sysctl: Make the value of dirty pages optimal and for most configurations #22

Closed ventureoo closed 1 year ago

ventureoo commented 1 year ago

Do not touch this until it has been tested.

The main motivation: I noticed on my laptop with 16 GB of RAM and an NVMe drive that dirty pages never exceeded about one and a half gigabytes, even while downloading in Steam and qBittorrent simultaneously. The ratio-based values are therefore often excessive, and for desktop tasks you simply never end up with more than about 2 gigabytes of dirty pages. Because of this, and to avoid having to pick a suitable ratio for each individual configuration, I suggest replacing the ratios with byte values. Either way, this needs testing.
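For context, the change boils down to swapping the percentage knobs for fixed byte limits, roughly like the sketch below (the byte values are illustrative only, not the ones settled on in this PR):

# Sketch only: cap the dirty page cache at a fixed size instead of a
# percentage of available memory. Writing the *_bytes variants
# automatically zeroes the corresponding *_ratio settings.
sysctl -w vm.dirty_bytes=536870912              # e.g. 512 MiB hard limit before writers are throttled
sysctl -w vm.dirty_background_bytes=134217728   # e.g. 128 MiB before background writeback starts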

ventureoo commented 1 year ago

The value of vm.dirty_bytes is reduced to a gigabyte

Boria138 commented 1 year ago

You have a typo: 1048576 bytes is one megabyte, not a gigabyte. And yet, do you think the value of 4194304 bytes indicated on the Arch Wiki is not optimal?

ventureoo commented 1 year ago

You have a typo: 1048576 bytes is one megabyte, not a gigabyte.

Thanks for pointing that out. Sometimes my inattentiveness drives me crazy. Anyway, this PR looks like it will never be merged, because these parameters are extremely hardware specific, and picking the optimal values can be quite a difficult thing to do.

And yet, do you think the value of 4194304 bytes indicated on the Arch Wiki is not optimal?

I have not found any information confirming their positive effect on performance. There are no references confirming that they fix the 'freezes', and the section itself was added back in 2011, which makes me doubt that such values are still relevant.

https://wiki.archlinux.org/index.php?title=Sysctl&diff=prev&oldid=170524

Boria138 commented 1 year ago

I'll go off topic a bit and point out a small error: since kernel version 5.6, the default value of the net.core.somaxconn parameter is 4096 instead of 128.

ventureoo commented 1 year ago

I'll go off topic a bit and point out a small error: since kernel version 5.6, the default value of the net.core.somaxconn parameter is 4096 instead of 128.

Fixed, thanks.
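For reference, the effective value on a running kernel is easy to check (any sysctl.d override will show up here too):

# Print the kernel's listen-backlog limit; recent kernels default to 4096.
sysctl net.core.somaxconn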

Boria138 commented 1 year ago

Listen, how about the idea of writing a script that would set these values depending on the amount of RAM (for example 4, 8 and 16 GB), and of course a daemon for it? The optimal values for these settings probably do not exist, so it is probably more correct to set them for each machine individually with the help of a script and a daemon or timer.

ptr1337 commented 1 year ago

Listen, how about the idea of writing a script that would set these values depending on the amount of RAM (for example 4, 8 and 16 GB), and of course a daemon for it? The optimal values for these settings probably do not exist, so it is probably more correct to set them for each machine individually with the help of a script and a daemon or timer.

Generally a good idea. How about requesting this first at bpftune, and if they reject the request, implementing it on our side with a bash script?

Boria138 commented 1 year ago

Listen, how about the idea of writing a script that would set these values depending on the amount of RAM (for example 4, 8 and 16 GB), and of course a daemon for it? The optimal values for these settings probably do not exist, so it is probably more correct to set them for each machine individually with the help of a script and a daemon or timer.

Generally a good idea. How about requesting this first at bpftune, and if they reject the request, implementing it on our side with a bash script?

First we need @ventureoo to find the optimal values. I have an example script that I used before but gave up on, because I couldn't find the right values:


#!/bin/bash
# Detect installed RAM in whole gigabytes (free -g rounds down, so a
# 4 GB machine typically reports 3).
ram_size=$(free -g | awk '/^Mem/ {print $2}')

# If 4G of installed RAM
if [ "$ram_size" -le 3 ]; then
    sysctl -w vm.dirty_background_ratio=5
    sysctl -w vm.dirty_ratio=10

# If 8G of installed RAM
elif [ "$ram_size" -le 7 ]; then
    sysctl -w vm.dirty_background_ratio=4
    sysctl -w vm.dirty_ratio=8

# If 16G of installed RAM
elif [ "$ram_size" -le 15 ]; then
    sysctl -w vm.dirty_background_ratio=2
    sysctl -w vm.dirty_ratio=4
fi

Boria138 commented 1 year ago

This is dirty_ratio, not dirty_bytes, but it's just a sample implementation.
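Purely as an illustration, a dirty_bytes variant of the same sketch might look like this (same free/awk detection as above; the byte values are placeholders, not tuned recommendations):

# Hypothetical dirty_bytes version of the sample above; the thresholds
# are placeholders and would still need profiling.
ram_size=$(free -g | awk '/^Mem/ {print $2}')

if [ "$ram_size" -le 7 ]; then
    sysctl -w vm.dirty_bytes=268435456              # 256 MiB
    sysctl -w vm.dirty_background_bytes=67108864    # 64 MiB
else
    sysctl -w vm.dirty_bytes=536870912              # 512 MiB
    sysctl -w vm.dirty_background_bytes=134217728   # 128 MiB
fi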

ventureoo commented 1 year ago

I set vm.dirty_bytes to 256 megabytes, as this seems to be the optimal minimum that doesn't cause issues with CoW-based file systems like Btrfs (https://github.com/pop-os/default-settings/issues/111).
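In sysctl terms (taking 256 MB as 268435456 bytes):

# Fixed 256 MiB dirty limit; writing vm.dirty_bytes resets vm.dirty_ratio to 0.
sysctl -w vm.dirty_bytes=268435456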

ahydronous commented 2 weeks ago

Here is a script that does it intelligently: https://gitlab.com/cscs/maxperfwiz/-/blob/master/maxperfwiz?ref_type=heads

ptr1337 commented 2 weeks ago

@ventureoo Can you PTAL?

ahydronous commented 2 weeks ago

I mentioned it to the PopOS people as well (https://github.com/pop-os/default-settings/issues/111), so perhaps it would be interesting to work together on it.

I assume the MaxPerfWizard people did some basic profiling given that they have ideal values noted down, but ultimately, yeah profiling is just needed for all the Virtual Memory subsystem stuff.

For example, here is a person who did some profiling and realized that, contrary to what is considered common knowledge, for gaming workloads you want vm.swappiness at 10-40 even on zram systems, due to the way Transparent Huge Pages and memory working sets function: https://www.reddit.com/r/linux_gaming/comments/vla9gd/comment/ie1cnrh/

Ultimately you'd want the vm settings to be dynamic (think https://github.com/VR-25/zram-swap-manager but for other settings too), but this is much broader than CachyOS and would have to ideally be somewhere upstream, either at the kernel or perhaps systemd.

ptr1337 commented 2 weeks ago

Cool, thank you for bringing in additional notes. Currently we use quite a high swappiness value if zram is used, since swappiness works differently with zram (basically you want stuff to be moved to zram, so that it acts more like swap).

When ventureoo is available, he will look into those and also check whether it's possible to integrate this into CachyOS in a nice way. Feel free to join the Discord too, for easier discussion of this topic.

ventureoo commented 2 weeks ago

I mentioned it to the PopOS people as well (pop-os/default-settings#111), so perhaps it would be interesting to work together on it.

I assume the MaxPerfWizard people did some basic profiling given that they have ideal values noted down, but ultimately, yeah profiling is just needed for all the Virtual Memory subsystem stuff.

For example, here is a person who did some profiling and realized that, contrary to what is considered common knowledge, for gaming workloads you want vm.swappiness at 10-40 even on zram systems, due to the way Transparent Huge Pages and memory working sets function: https://www.reddit.com/r/linux_gaming/comments/vla9gd/comment/ie1cnrh/

Ultimately you'd want the vm settings to be dynamic (think https://github.com/VR-25/zram-swap-manager but for other settings too), but this is much broader than CachyOS and would have to ideally be somewhere upstream, either at the kernel or perhaps systemd.

The issue is that you shouldn't use dirty_ratio in general. The percentage is not taken from total memory, but from the memory currently available (free and reclaimable pages). Because of this you always get massive thrashing when memory utilization is high, since the allowed amount of dirty pages is not fixed. Even if you set vm.dirty_ratio = 1 or 2 on configurations with a lot of memory, it becomes a problem when memory is nearly full, because the permitted amount of dirty pages then becomes very small => lots of blocked I/O. That's why I've always favored a fixed amount of dirty pages.

I believe the current situation can be improved not through a dependency on memory size, but by deriving the dirty page limit from a disk-specific total, which can already be done via max_bytes and min_bytes for individual devices (https://github.com/torvalds/linux/blob/master/Documentation/ABI/testing/sysfs-class-bdi). This matters because the larger the amount of dirty pages, the larger the amount of data that actually has to be written to disk, and until that writeback completes the I/O block is not released. If the media is very slow and we have 256 MB in vm.dirty_bytes, this can be a problem, as the block may not be released for a long time. I'm thinking of adding this to the current udev rules for setting I/O schedulers.
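A rough sketch of what such a per-device cap could look like, assuming a kernel new enough to expose min_bytes/max_bytes under /sys/class/bdi; the device (major:minor 8:16) and the byte values are placeholders:

# Placeholder example: limit writeback cache for one slow block device
# instead of relying only on the global vm.dirty_* limits.
bdi=/sys/class/bdi/8:16
echo $((64 * 1024 * 1024)) > "$bdi/max_bytes"   # upper bound on dirty memory for this device
echo $((16 * 1024 * 1024)) > "$bdi/min_bytes"   # share of the global limit reserved for it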

ventureoo commented 2 weeks ago

The value simply represents the kernel's tendency to swap out anonymous memory pages relative to other pages, such as file ones. Since we've established that most of your working set (memory needed for running applications) is comprised of anonymous memory pages, it's counterproductive for gaming performance to tell the kernel to prioritize swapping those out in favor of keeping your file pages untouched. Not to mention that you use THP, which means that in order to maximize gaming performance, there needs to be an abundance of hugepages which will reduce TLB misses and therefore boost the performance of your game. So you don't want those to be swapped out as they will hurt the performance of your games as said before. Because of this, it's best to actually reduce the swappiness, even while using ZRAM/zswap.

This statement is wrong for one reason. That reason is MGLRU. The kernel does not try to push everything out of memory into swap until memory is at least 90-95% full. Your game can “sleep well” as long as you don't completely fill memory. Even then, the working set will be preserved, because as I said, MGLRU has page thrashing protection (/sys/kernel/mm/lru_gen/min_ttl_ms) and we enable it by default in our kernel.
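For reference, that protection is a single sysfs knob (the 1000 ms below is just an example threshold, not necessarily the CachyOS default):

# MGLRU working-set protection: pages accessed within the last N milliseconds
# are kept from being evicted; 0 disables the protection.
echo 1000 > /sys/kernel/mm/lru_gen/min_ttl_ms
cat /sys/kernel/mm/lru_gen/enabled   # bitmask of enabled MGLRU features; 0 means disabled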

ahydronous commented 1 week ago

But wouldn't you still rather swap out file pages over anonymous pages? That is what vm.swappiness ultimately controls, not how likely the kernel is to swap in general. Although I guess vfs_cache_pressure also gets into the mix.

ventureoo commented 1 week ago

But wouldn't you still rather swap out file pages over anonymous pages? That is what vm.swappiness ultimately controls, not how likely the kernel is to swap in general.

This is what is explained in my PR: https://github.com/CachyOS/CachyOS-Settings/pull/19

ventureoo commented 1 week ago

My point is that if repeated reads from disk can be avoided, then they should be, because reading from RAM will always be faster than reading from disk. When we talk about preferring to evict page (file) cache instead of anonymous pages, we're not talking about putting those pages in swap, but simply dropping them from RAM. This is fine because we can always read them again from disk, but it also becomes a bottleneck, because under low-memory conditions it potentially increases page cache misses and the resulting I/O latency. File pages are not just regular files; they are also your browser's various caches, Mesa's shader cache, executables in the end - not something you want to flush. In the case of ZRAM, if we flush anonymous pages instead of file pages, we're just compressing them in memory, and it will cost us a lot less to decompress them inside RAM (especially if we're using lz4) than it would to re-read them from disk. At the same time, file pages are flushed less often and hence have fewer misses => fewer re-reads from disk.

ahydronous commented 1 week ago

In the case of ZRAM, if we flush anonymous pages instead of file pages, we're just compressing them in memory, and it will cost us a lot less to decompress them inside RAM (especially if we're using lz4) than it would to re-read them from disk. At the same time, file pages are flushed less often and hence have fewer misses => fewer re-reads from disk.

But like the Reddit comment points out

Because most of your active working set is anonymous memory mappings (if you check /proc/meminfo it's often 5 or 6:1 relative to file mappings, it can go higher if you have a game running), and those are the ones having huge pages, since you do have swap enabled, what will happen is that the hugepages will literally not be reduced to normal size during swapping. This conflicts with ZRAM/zswap, because it means more CPU time will be needed to compress the page when it's swapped, which ruins your game process.

So, with anonymous huge pages being a poor fit for compression, wouldn't you want to keep those in memory and out of swap, vs file pages, which work better (?) with compression, especially with lz4 speeds guaranteeing quick access times?

Basically, the way I understand the priority: huge anonymous pages are a good fit for working memory but do poorly in swap, while file cache is a good fit for both working memory and swap, but you'd rather flush a huge anonymous page than flush file cache.

Ultimately this is a balancing act probably, and gaming workloads are often somewhat unique compared to ordinary workloads.

ventureoo commented 1 week ago

vs file pages, which work better (?) with compression, especially with lz4 speeds guaranteeing quick access times.

I'm not quite sure what you mean by file page compression. As far as I know, that's not possible. Only anonymous pages go into swap; file pages are simply removed from memory - this is what we call flushing.

It would be nice to see benchmarks, because at the moment I think the THP and swap issue is a bit overblown. ZRAM's current compression/decompression speeds are pretty great (even going by the outdated benchmarks: https://libreddit.kavin.rocks/r/Fedora/comments/mzun99/new_zram_tuning_benchmarks/), and I think they will hold up even for huge page sizes, not to mention that we can use an auxiliary algorithm to recompress huge pages on top of the lz4 we already use.
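If it helps, the recompression being referred to is set up roughly like this on kernels built with zram multi-compression support (device name and secondary algorithm are just examples):

# Register a stronger secondary algorithm and recompress idle "huge"
# (poorly compressible) pages with it; requires CONFIG_ZRAM_MULTI_COMP.
echo "algo=zstd priority=1" > /sys/block/zram0/recomp_algorithm
echo "type=huge_idle" > /sys/block/zram0/recompress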

ventureoo commented 1 week ago

I'd also like to point out that THP itself is a problem for games if your memory is clogged, as when THP is active it's always trying to compact memory and merge small pages into huge ones, which can lead to latency spikes. But then again, if your memory is not at least 90-95% full, then neither THP nor ZRAM is a problem, because in this case swap is not actively used and there is no page thrashing.
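For completeness, the compaction behaviour being described is governed by the THP defrag knob; "defer" is one common way to avoid synchronous compaction stalls, shown here only as an example:

# Inspect how THP is allocated and whether page faults may stall on compaction.
cat /sys/kernel/mm/transparent_hugepage/enabled
cat /sys/kernel/mm/transparent_hugepage/defrag
# Example: defer compaction to background threads instead of stalling the faulting process.
echo defer > /sys/kernel/mm/transparent_hugepage/defrag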