Closed by ventureoo 1 year ago
The value of vm.dirty_bytes is reduced to a gigabyte
You have a typo: 1048576 bytes is one megabyte, not a gigabyte. And yet, do you think the value of 4194304 bytes indicated on the Arch Wiki is not optimal?
You have a typo: 1048576 bytes is one megabyte, not a gigabyte.
Thanks for pointing that out. Sometimes my inattentiveness drives me crazy. Anyway, this PR looks like it will never be merged, because these parameters are extremely hardware-specific, and picking optimal values can be quite difficult.
And yet, do you think the value of 4194304 bytes indicated on the Arch Wiki is not optimal?
I have not found any information confirming their positive effect on performance. There are no references confirming that they fix 'freezes', and the section itself was added as early as 2011, which makes me doubt that such values are still relevant now.
https://wiki.archlinux.org/index.php?title=Sysctl&diff=prev&oldid=170524
I'll go off topic a bit and point out a small error: since kernel 5.4, the default value of the net.core.somaxconn parameter is 4096 instead of 128.
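A quick way to confirm the running default (assuming no sysctl.d drop-in overrides it):

sysctl net.core.somaxconn   # should print 4096 on kernels that ship the new default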
Fixed, thanks.
Listen, how about the idea of writing a script that sets these values depending on the amount of RAM (for example 4, 8, and 16 GB), plus of course a daemon for it? Since truly optimal values for these settings probably do not exist, it is probably more correct to set them individually for each machine with the help of a script and a daemon or timer.
Generally a good idea. How about requesting this first at bpftune, and if they reject the request, implementing it on our side with a bash script?
First we need @ventureoo to find the optimal values. I have an example script that I used before but gave up on because I couldn't find the right values:
ram_size=$(LC_ALL=C free -g | awk '/^Mem:/ {print $2}')  # "Mem:" is capitalized in free's output
# If 4G of installed RAM
if [ "$ram_size" -le 3 ]; then
sysctl -w vm.dirty_background_ratio=5
sysctl -w vm.dirty_ratio=10
# If 8G of installed RAM
elif [ "$ram_size" -le 7 ]; then
sysctl -w vm.dirty_background_ratio=4
sysctl -w vm.dirty_ratio=8
# If 16G of installed RAM
elif [ "$ram_size" -le 15 ]; then
sysctl -w vm.dirty_background_ratio=2
sysctl -w vm.dirty_ratio=4
fi
This is dirty_ratio, not dirty_bytes, but it's just a sample implementation.
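A dirty_bytes-based variant might look something like this; the byte values below are placeholders for illustration, not tested recommendations:

#!/bin/bash
# Pick a fixed dirty-page budget based on installed RAM (placeholder values)
ram_mb=$(LC_ALL=C free -m | awk '/^Mem:/ {print $2}')

if [ "$ram_mb" -le 4096 ]; then
    # ~4G of installed RAM
    sysctl -w vm.dirty_bytes=134217728            # 128 MiB
    sysctl -w vm.dirty_background_bytes=33554432  # 32 MiB
elif [ "$ram_mb" -le 8192 ]; then
    # ~8G of installed RAM
    sysctl -w vm.dirty_bytes=268435456            # 256 MiB
    sysctl -w vm.dirty_background_bytes=67108864  # 64 MiB
else
    # 16G of installed RAM or more
    sysctl -w vm.dirty_bytes=536870912            # 512 MiB
    sysctl -w vm.dirty_background_bytes=134217728 # 128 MiB
fi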
I set the value of vm.dirty_bytes to 256 megabytes, as this seems to be the optimal minimum that doesn't cause issues with CoW-based file systems like Btrfs (https://github.com/pop-os/default-settings/issues/111).
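As a sysctl.d drop-in that would be roughly the following; the background value here is my own assumption for illustration, not something taken from the linked issue:

# sketch of a drop-in with a fixed dirty-page budget
vm.dirty_bytes = 268435456              # 256 MiB
vm.dirty_background_bytes = 67108864    # 64 MiB (placeholder)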
Here is a script that does it intelligently: https://gitlab.com/cscs/maxperfwiz/-/blob/master/maxperfwiz?ref_type=heads
@ventureoo Can you PTAL?
I mentioned it to the Pop!_OS people too (https://github.com/pop-os/default-settings/issues/111), so perhaps it would be interesting to work together on it.
I assume the MaxPerfWizard people did some basic profiling, given that they have ideal values noted down, but ultimately, yeah, profiling is just needed for all the Virtual Memory subsystem stuff.
For example, here is a person who did some profiling and realized that, unlike what is considered common knowledge, for gaming workloads you want vm.swappiness at 10-40 even on zram systems, due to the way Transparent Huge Pages and memory working sets function: https://www.reddit.com/r/linux_gaming/comments/vla9gd/comment/ie1cnrh/
Ultimately you'd want the vm settings to be dynamic (think https://github.com/VR-25/zram-swap-manager but for other settings too), but this is much broader than CachyOS and would ideally have to live somewhere upstream, either in the kernel or perhaps in systemd.
Cool, thank you for bringing additional notes. Currently we use quite a high swappiness value when zram is used, since swappiness works differently with zram (basically you want stuff to be moved to zram, so that it acts more like swap).
When ventureoo is available, he will look into those and also into whether it's possible to integrate this into CachyOS in a nice way. Feel free to join the Discord too, for easier discussion of this topic.
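As an illustration only (the numbers below are placeholders, not the values we actually ship), a zram-oriented drop-in could look like:

# sketch of a zram-oriented sysctl.d drop-in; values are placeholders
vm.swappiness = 150     # values above 100 are allowed since kernel 5.8 and mainly make sense with zram
vm.page-cluster = 0     # swap readahead gains little on zram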
The issue is that you shouldn't use dirty_ratio in general. The percentage is not taken from total memory but from available (free and reclaimable) memory, so the number of allowed dirty pages is never fixed. Because of this, you always get massive thrashing when memory utilization is high. Even if you set vm.dirty_ratio = 1 or 2 for configurations with a lot of memory, this will be a problem when memory is nearly full, because the amount of dirty pages allowed at that point will be very small => lots of I/O blocking. That's why I've always favored a fixed amount of dirty pages. I believe the current situation can be improved not through a dependency on memory size, but by redefining the amount of dirty pages per device, which can already be achieved via max_bytes and min_bytes for individual devices (https://github.com/torvalds/linux/blob/master/Documentation/ABI/testing/sysfs-class-bdi). This is important because the larger the amount of dirty pages, the larger the amount of data that actually has to be written to disk, and until that writeback completes the I/O lock will not be released. If the media is very slow and we have 256 MB as vm.dirty_bytes, this can be a problem, as the lock may not be released for a long time. I'm thinking of adding this to the current udev rules for setting I/O schedulers.
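Roughly, such a rule could look like the following; this is only a sketch, the byte values are placeholders, and min_bytes/max_bytes require a kernel recent enough to expose them in sysfs-class-bdi:

# sketch of a udev rule capping per-device dirty limits (untested, placeholder values)
ACTION=="add|change", SUBSYSTEM=="bdi", ATTR{max_bytes}="268435456", ATTR{min_bytes}="16777216"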
The value simply represents the kernel's tendency to swap out anonymous memory pages relative to other pages, such as file ones. Since we've established that most of your working set (the memory needed for running applications) is comprised of anonymous memory pages, it's counterproductive for gaming performance to tell the kernel to prioritize swapping those out in favor of keeping your file pages untouched. Not to mention that you use THP, which means that in order to maximize gaming performance there needs to be an abundance of hugepages, which will reduce TLB misses and therefore boost the performance of your game. So you don't want those to be swapped out, as that will hurt the performance of your games, as said before. Because of this, it's best to actually reduce the swappiness, even while using ZRAM/zswap.
This statement is wrong for one reason, and that reason is MGLRU. The kernel does not try to push everything out of memory into swap until memory is at least 90-95% full. Your game can “sleep well” as long as you don't completely fill memory. Even then, the working set will be preserved, because, as I said, MGLRU has page-thrashing protection (/sys/kernel/mm/lru_gen/min_ttl_ms) and we enable it by default in our kernel.
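For anyone who wants to check this on their own system, the relevant knobs are below (the paths come from the MGLRU documentation; the value in the last line is only an example):

cat /sys/kernel/mm/lru_gen/enabled      # non-zero means MGLRU is active
cat /sys/kernel/mm/lru_gen/min_ttl_ms   # working-set protection window in milliseconds
echo 1000 | sudo tee /sys/kernel/mm/lru_gen/min_ttl_ms   # example: protect the last 1000 ms of working set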
But wouldn't you still rather swap out file pages over anonymous pages? That is what vm.swappiness ultimately controls, not how likely the kernel is to swap in general.
Although I guess vfs_cache_pressure also gets into the mix.
But wouldn't you still rather swap out file pages over anonymous pages? That is what vm.swappiness ultimately controls, not how likely the kernel is to swap in general.
This is what is explained in my PR: https://github.com/CachyOS/CachyOS-Settings/pull/19
My point is that if repeated reads from disk can be avoided, then they should be, because reading from RAM will always be faster than reading from disk. When we talk about preferring to evict page (file) cache instead of anonymous pages, we're not talking about putting it in swap, but simply flushing it from RAM. This is fine because we can always read it again from disk, but it can also become a bottleneck, because it potentially increases page cache misses, and therefore I/O latency, in low-memory conditions. File pages are not just regular files; they also include your browser's various caches, Mesa's shader cache, and the executables themselves - this is not something you want to flush. In the case of ZRAM, if we flush anonymous pages instead of file pages, we're just compressing them in memory, and it will cost us a lot less to decompress them inside RAM (especially if we're using lz4) than it would to re-read file pages from disk. At the same time file pages are flushed less often, hence there are fewer misses => fewer re-reads from disk.
In the case of ZRAM, if we flush anonymous pages instead of file pages, we're just compressing them in memory, and it will cost us a lot less to decompress them inside RAM (especially if we're using lz4) than it would to re-read file pages from disk. At the same time file pages are flushed less often, hence there are fewer misses => fewer re-reads from disk.
But like the Reddit comment points out
Because most of your active working set is anonymous memory mappings (if you check /proc/meminfo it's often 5 or 6:1 relative to file mappings, it can go higher if you have a game running), and those are the ones having huge pages, since you do have swap enabled, what will happen is that the hugepages will literally not be reduced to normal size during swapping. This conflicts with ZRAM/zswap, because it means more CPU time will be needed to compress the page when it's swapped, which ruins your game process.
So, with anonymous huge pages being a poor fit for compression, wouldn't you want to keep those in memory and out of swap, vs file pages, which work better (?) with compression, especially with lz4 speeds guaranteeing quick access times.
Basically, the way I understand the priority: huge anonymous pages are a good fit for working memory but do poorly in swap, while file cache is a good fit for both working memory and swap, but you'd rather flush a huge anonymous page than flush file cache.
Ultimately this is a balancing act probably, and gaming workloads are often somewhat unique compared to ordinary workloads.
vs file pages, which work better (?) with compression, especially with lz4 speeds guaranteeing quick access times.
I'm not quite sure what you mean by file page compression. As far as I know, that's impossible. Only anonymous pages go into swap; file pages are simply removed from memory - this is what we call flushing.
It would be nice to see benchmarks, because at the moment I think the THP-and-swap issue is a bit overblown. ZRAM's current compression/decompression speeds are pretty great, even going by outdated benchmarks (https://libreddit.kavin.rocks/r/Fedora/comments/mzun99/new_zram_tuning_benchmarks/), and I think they will hold up even for huge page sizes, not to mention that we can use an auxiliary algorithm to recompress huge pages, which we already use alongside lz4.
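For reference, the recompression interface on recent kernels looks roughly like this (device name and the secondary algorithm are just examples):

# register zstd as a secondary (recompression) algorithm for zram0
echo "algo=zstd priority=1" > /sys/block/zram0/recomp_algorithm
# recompress pages that did not compress well ("huge" pages in zram terminology)
echo "type=huge" > /sys/block/zram0/recompress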
I'd also like to point out that THP itself is a problem for games if your memory is clogged: when THP is active it is always trying to compact memory and merge small pages into huge ones, which can lead to latency spikes. But then again, if your memory is not at least 90-95% full, then neither THP nor ZRAM is a problem, because in that case swap is not actively used and there is no page thrashing.
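If someone wants to see what their system is doing here, the relevant THP knobs are below (a sketch only; whether deferring compaction is the right trade-off depends on the workload):

cat /sys/kernel/mm/transparent_hugepage/enabled   # always / madvise / never
cat /sys/kernel/mm/transparent_hugepage/defrag    # controls when compaction is attempted
echo defer | sudo tee /sys/kernel/mm/transparent_hugepage/defrag   # example: hand compaction to kcompactd instead of stalling page faults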
Do not touch this until it has been tested.
The main motivation is that I noticed, on my laptop with 16 GB of RAM and an NVMe drive, that dirty pages never exceeded about one and a half gigabytes even when downloading in Steam and qBittorrent simultaneously. That is, the values the ratio settings produce are often excessive, and for desktop tasks you simply never end up with more than about 2 gigabytes of dirty pages. Because of this, and to avoid having to pick a ratio for each individual configuration, I suggest replacing the ratios with bytes. Otherwise, it still requires testing.
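For reference, the observation can be reproduced with a simple watch over /proc/meminfo (values are in kB) while a large download is running:

watch -n1 "grep -E '^(Dirty|Writeback):' /proc/meminfo"   # dirty and under-writeback memory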