Zygo / bees

Best-Effort Extent-Same, a btrfs dedupe agent
GNU General Public License v3.0

munmap when not actively scanning #253

Open KyleSanderson opened 1 year ago

KyleSanderson commented 1 year ago

hi - thank you for this tool. It seems way more accurate than duperemove.

When using the load average feature[^1], it looks like memory is never released, despite the agents sleeping[^2] for an extended period of time.

It would be nice to release this memory back to the system when it's not needed, as bees appears to be completely idle, sitting in rt_sigtimedwait. Page cache aside, it's just helpful to have more memory available for the applications that are actually running.

[^1]: Before setting it, my box became a bit hilarious at a load of 490 on an 8-thread system.
[^2]: I have 10 of them with the default thread count.

Zygo commented 1 year ago

Releasing the memory is usually a very IO-intensive operation, as the entire hash table has to be flushed out to disk in order to release the memory. It's also memory-intensive: it essentially dumps the entire hash table into the page cache as dirty pages, which may have to be flushed (depending on vm.dirty_bytes and other kernel vm parameters) before processes can allocate memory again. Adding a big write-intensive and memory-intensive workload seems like the opposite of what we want at the very moment the threads have been forced to idle because the rest of the system is already under pressure.
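
Roughly, the release path looks like the sketch below (illustrative Python, not the actual bees code; the file path is an assumption). The point is that for a file-backed shared mapping, every dirty page has to go through msync before munmap can usefully give anything back:

```python
import mmap
import os

# Illustration only: the hash table lives in a file-backed shared mapping,
# so "releasing the memory" means flushing every dirty page to disk and
# then unmapping. The path below is made up for the example.
HASH_TABLE_PATH = "/path/to/beeshome/hash.dat"  # assumed location

fd = os.open(HASH_TABLE_PATH, os.O_RDWR)
size = os.fstat(fd).st_size
table = mmap.mmap(fd, size, mmap.MAP_SHARED,
                  mmap.PROT_READ | mmap.PROT_WRITE)

# ... worker threads read and write hash table entries through `table` ...

# The expensive part: flush() is msync(), which turns the whole table into
# a burst of dirty page-cache writeback; close() then munmaps the range.
# The pages only become reclaimable once that writeback completes.
table.flush()   # IO-intensive: the entire table is written out
table.close()   # munmap: the mapping is released
os.close(fd)
```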

You can get a preview of what this feature might look like by running a script to terminate and start the bees processes based on load, instead of the built-in loadavg feature. Unmapping the hash table is 99.9% of the work of a full bees shutdown, so the extra overhead of a complete shutdown and restart is negligible. If you get something concrete working, and it somehow doesn't suck, we might turn it into a built-in feature.
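
A rough sketch of such a supervisor (illustrative Python; the unit name, thresholds, and poll interval are all assumptions, not bees defaults):

```python
#!/usr/bin/env python3
"""Stop/start bees based on load, approximating the effect of releasing
memory while idle. Everything configurable here is a guess."""
import os
import subprocess
import time

UNIT = "beesd@YOUR-FS-UUID.service"  # hypothetical systemd unit; use your own
HIGH_LOAD = 6.0    # stop bees above this 1-minute loadavg (assumed)
LOW_LOAD = 2.0     # restart below this (assumed)
POLL_SECONDS = 30

def systemctl(verb: str) -> None:
    subprocess.run(["systemctl", verb, UNIT], check=False)

running = True
while True:
    load1 = os.getloadavg()[0]
    if running and load1 > HIGH_LOAD:
        # Full shutdown: unmapping the hash table is ~99.9% of the work,
        # so stopping the whole process costs little extra.
        systemctl("stop")
        running = False
    elif not running and load1 < LOW_LOAD:
        systemctl("start")
        running = True
    time.sleep(POLL_SECONDS)
```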

There might also be a middle ground, where the hash table writeback thread notices there are no active worker threads and unmaps clean pages after writing them out, whenever they add up to a hugepage. When the workers wake up again, they can read the hash table back in on demand. That spreads the IO load out over the time the worker threads are idle (since the writeback thread isn't limited by loadavg, this doesn't change the IO load profile), and the big read to reload the hash table happens at a point where the rest of the system is idle. Even this might cause undesirable load spikes, because of the need to defragment memory to get hugepages back.
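
That middle ground might look something like this sketch (illustrative Python, assuming a file-backed mmap, a 2 MiB hugepage size, and a hypothetical `workers_idle` callback):

```python
import mmap

HUGEPAGE = 2 * 1024 * 1024  # assume 2 MiB hugepages

def release_while_idle(table: mmap.mmap, workers_idle) -> None:
    """Write out and drop the hash table in hugepage-sized chunks while the
    worker threads are idle. Dropped pages fault back in from the file on
    the next access, which is the on-demand reload described above."""
    for offset in range(0, len(table), HUGEPAGE):
        if not workers_idle():
            return  # workers woke up; leave the rest mapped
        length = min(HUGEPAGE, len(table) - offset)
        table.flush(offset, length)  # msync: make this chunk clean on disk
        # Drop the now-clean pages; MADV_DONTNEED on a shared file mapping
        # discards them and re-reads from the file on the next touch.
        table.madvise(mmap.MADV_DONTNEED, offset, length)
```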

KyleSanderson commented 1 year ago

On my box[^1], the system load rises because there is memory pressure (random writes), where the page cache can't be used as much as one would like (eventually the entire file is committed, which results in a full sequential write). Each volume gives me around 200MB/s sequential writes, so rewriting an entire 1G file would take a total of 5 seconds. Even on a DM-SMR drive I would still expect this to complete in 10s or less. When bees sits idle for minutes to hours because nothing is happening on the disks, it is very detrimental to the entire box.

With that being said, I've unfortunately hit a multitude of btrfs bugs and design flaws (not from bees), so I'm in the process of moving them all back to xfs and chalking this up to another rotten experiment. bees is excellent and will be missed on my returning xfs box. It's a thankless job, and I do genuinely appreciate this tool greatly.

[^1]: 12 disks, 18T volumes, 1G dedup files, with 32G of total memory.