Zygo / bees

Best-Effort Extent-Same, a btrfs dedupe agent
GNU General Public License v3.0

bees causes memory fragmentation and low cached memory #260

Open lilydjwg opened 1 year ago

lilydjwg commented 1 year ago

When bees is running, I see a lot of free memory but low cached memory, and increased swap usage, i.e.

[screenshot]

You can see the change here: [screenshot]

Also, kcompactd is constantly running.

This lowers the page cache hit rate. There are many more disk reads, making processes that access the disks slow or unresponsive. I have three disks: nvme0n1 is my system disk, sda is a data disk receiving some constant writes, and sdb is the one bees is running on (it started at 13:00; before that it was deleting unneeded snapshots). Both sda and sdb are spinning disks.

[screenshot]

There are a lot of snapshots on sdb and I'm using --scan-mode=0. I've also observed the same issue on another machine running bees, which has no spinning disks.

kakra commented 1 year ago

This is probably the same as #257. I can observe this behavior since switching from LTS 5.19 to 6.1. It's not that bees occupies the free memory; rather, btrfs seems to create heavy memory fragmentation, as can be observed in /proc/buddyinfo.

lilydjwg commented 1 year ago

I see those vmalloc errors on 6.3 too, but this machine is running 6.1. I'll check /proc/buddyinfo (I didn't know about it).

Zygo commented 1 year ago

The vmalloc bug from #257 was backported to 6.1. The fix hasn't been backported yet, so 6.1 is currently broken (and so is 5.15).

One thing to try is reducing the size of the buffers for LOGICAL_INO by changing this line in include/crucible/fs.h:

            BtrfsIoctlLogicalInoArgs(uint64_t logical, size_t buf_size = 16 * 1024 * 1024);

Try lower values for buf_size, e.g. 64 * 1024. This will drastically reduce the number of reflinks to a single extent that bees can handle, but that size is still more than enough for most filesystems. Making the buffer smaller may also reduce the size of the vmallocs, which might be aggravating the kernel's memory manager.
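For example, the same declaration with the suggested 64K default (just an illustration of the change; the exact value is a tuning knob) would read:

            BtrfsIoctlLogicalInoArgs(uint64_t logical, size_t buf_size = 64 * 1024);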

If this does work, then maybe a workaround is possible: use the small buffer size at first, but if the buffer fills up, then switch to the larger buffer.

Off the top of my head, I don't know why the kernel would allocate a buffer here, since userspace already provides one; but I haven't actually looked at the kernel code for this issue.

lilydjwg commented 1 year ago

This is the first time memory fragmentation has caused issues for me. My memory is indeed very fragmented: I can see the counts of lower-order blocks change significantly.

I've applied the settings in https://github.com/Zygo/bees/issues/257#issuecomment-1564213680 and it seems better now (far fewer cache misses; I no longer see my postgres processes frequently stuck in D state and causing timeouts everywhere).

I'm not sure what the lower buf_size has achieved. It seems the cached memory size no longer changes much.

[screenshot]
10:40: bees starts
10:55: applied those kernel settings mentioned above
11:40: bees with 64 * 1024 buf_size starts

How many reflinks to a single extent does that allow? I should have a few dozen reflinks for most files on this filesystem.

I'll skip newer 6.1 kernels and update to the latest 6.3 to avoid those vmalloc messages. They don't seem to cause visible harm, but they waste journald's disk space.

kakra commented 1 year ago

If this does work, then maybe a workaround is possible: use the small buffer size at first, but if the buffer fills up, then switch to the larger buffer.

How do we know if the buffer is too small?

Zygo commented 1 year ago

If the number of references returned is equal to the maximum number for that buffer, then the buffer is probably too small (it might be exactly the right size, but that's unlikely enough not to matter). Then we can use a larger buffer and repeat the entire LOGICAL_INO operation. With ideal kernel behavior, we would just use the larger buffer all the time, as there's no way to restart or extend a LOGICAL_INO result.

A smaller buffer limits the total number of references that bees can create to a common block of data. Once it hits 10,000 or so, other parts of btrfs start getting slower, so it's arguably not worth creating that many references in any case. A 64K buffer holds 2730 references, a 16 MB buffer holds a little less than 700,000.
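Those capacity figures work out if each returned reference is a triple of u64 values (inode, offset, root), i.e. 24 bytes per reference; a quick sanity check (the record layout is my assumption here, not taken from the bees source):

    #include <cstddef>
    #include <cstdint>

    // Rough LOGICAL_INO capacity math, assuming each returned reference is
    // three u64 values (inum, offset, root) = 24 bytes, and ignoring the
    // small btrfs_data_container header.
    constexpr std::size_t refs_per_buffer(std::size_t buf_size)
    {
            return buf_size / (3 * sizeof(std::uint64_t));
    }

    static_assert(refs_per_buffer(64 * 1024) == 2730, "64K buffer");
    static_assert(refs_per_buffer(16 * 1024 * 1024) == 699050, "16 MB buffer");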

lilydjwg commented 1 year ago

A 64K buffer holds 2730 references

Thanks for the explanation. That's quite a lot! It's still possible that my filesystem could reach this limit but it's good enough for me.

kakra commented 1 year ago

So bees could dynamically increase that size by 64k, up to 3 times, until we pass around 10k references. So if it got exactly 2730 refs, it would use a 128k buffer for future requests? Or maybe even increase in steps of 4k?

OTOH, the kernel should handle the allocations better.

Does the kernel write to that buffer and read data back in? Then I could imagine why it does not write directly to the user-space buffer: it could be modified while the kernel is still working with it, and thus open attack vectors.

For now, I'll be shipping the Gentoo version of bees with 64k buffer.

Zygo commented 1 year ago

It would have to skip directly from the minimum size to the maximum size (which would initially be 64K and 16MB) when the minimum size fills up. It can keep track of the distribution of ref counts and adjust the minimum size until it is large enough for 99% of all requests. We don't want to do more than two requests for any given block, since all of the prior work is wasted when we repeat the request, and the large requests aren't that expensive if they are rare. We can also do things like limit the number of threads that can use a large buffer, while allowing all threads to use a small one at the same time.
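A rough sketch of that two-tier strategy might look like the following (the names and the wrapper are hypothetical, not actual bees code, and the overflow test assumes 24-byte reference records):

    #include <cstddef>
    #include <cstdint>
    #include <vector>

    // Hypothetical reference record: three u64 values per LOGICAL_INO entry.
    struct LogicalInoRef { std::uint64_t inum, offset, root; };

    // Placeholder for the real LOGICAL_INO ioctl wrapper.
    std::vector<LogicalInoRef> run_logical_ino(std::uint64_t logical, std::size_t buf_size);

    std::vector<LogicalInoRef> lookup_refs(std::uint64_t logical)
    {
            constexpr std::size_t small_buf = 64 * 1024;
            constexpr std::size_t large_buf = 16 * 1024 * 1024;
            auto refs = run_logical_ino(logical, small_buf);
            // A full small buffer almost certainly means truncation, so repeat
            // the whole lookup with the large buffer (a LOGICAL_INO result
            // cannot be resumed or extended).
            if (refs.size() >= small_buf / sizeof(LogicalInoRef)) {
                    refs = run_logical_ino(logical, large_buf);
            }
            return refs;
    }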

The kernel's risks are the same whether it writes the buffer as it goes, or does one giant memcpy at the end; however, there might be some issues with memory fault errors interrupting the kernel thread. Either way, there is a lot of room for improvement of LOGICAL_INO, starting from making it work incrementally, or providing a way for userspace to follow metadata tree page backrefs.

kakra commented 1 year ago

We don't want to do more than two requests for any given block, since all of the prior work is wasted when we repeat the request, and the large requests aren't that expensive if they are rare.

Would it be possible to first dedupe the refs found by the first lookup, and then resubmit that extent later to see if there are more refs? As far as I understand, that could provide some level of incremental work but care would need to be taken to prevent loops if no more progress is made and the same refs show up over and over again.

Zygo commented 1 year ago

Would it be possible to first dedupe the refs found by the first lookup, and then resubmit that extent later to see if there are more refs?

It's possible, but it would only be productive on large extents. For those 4K blocks that appear a million times on a filesystem, it's better to simply ignore them when they have too many references, unless the dedupe is combined with a defrag operation to make the extents larger.

kakra commented 1 year ago

Okay, but in theory bigger extents tend to have fewer refs, statistically. So in the end such a "resubmit" due to a small buffer would not happen anyway, and we tend to ignore the huge ref counts for small common extents anyway because of their bad timing behavior. So yes, there's probably no point in resubmitting such extents, but there's also no point in having a huge 16M buffer; 64k or 128k should work just fine. Unless the defrag thing happens...

So I conclude going with my 64k patch is just fine for the moment.

This leaves the patch problematic for filesystems with thousands of snapshots, but under that condition btrfs doesn't behave very well anyway.

Zygo commented 1 year ago

I've been collecting some data on that. 64K is a little too small (it's very common to blow past 2730 refs with a year's worth of snapshots, which is why I made the patch to increase it in the kernel), but the ideal size seems to be smaller than 256K (once an extent has 10920 refs it takes a millisecond of CPU to process each one). Maybe go with a 128K buffer for subvol scans?

My extent tree scanner prototype does much better with snapshots, but it has its own LOGICAL_INO code (because IGNORE_OFFSETS is mandatory), so it can easily construct the object with a different buffer size. I suspect defrag will be able to do the same thing.

kakra commented 1 year ago

with a year's worth of snapshots

You mean: daily snapshots? hourly?

Zygo commented 1 year ago

Daily (365 snapshots) would leave only 7.49 references available for each extent in a 64K buffer, so there can be only 7 duplicates within the subvol before we hit the limit and can't dedupe further. Containers, build workspaces, even rootfs upgrade snapshots can all easily generate more than 7 duplicate refs after dedupe.

On build servers it can be the other way around: they average 350 or so references to each file, so only 7 or 8 snapshots can be deduplicated.
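Spelled out, the back-of-the-envelope math (using the 2730-reference figure for a 64K buffer from above) is roughly:

    2730 refs / 365 snapshots      ≈ 7.5 refs per extent within each subvol
    2730 refs / 350 refs per file  ≈ 7-8 snapshots that can be deduplicated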

There is a log message for this:

 addr 0x46ce881fc000 refs 3112 beats previous record 2730

which will start appearing once the 64K limit has been exceeded.

kakra commented 1 year ago

Yeah, okay, I think I get how the numbers work out. Thanks for the explanation.

Zygo commented 1 year ago

Note to self: LOGICAL_INO does report the total number of references when the buffer overflows, in btrfs_data_container::elem_missed (that plus elem_cnt = total number of refs). Unfortunately the kernel still does all the work to find each reference just to fill in that count, so it's not exactly efficient to e.g. call with a zero buffer size, then call again with a buffer exactly the right size for the number of refs found.
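A minimal sketch of that overflow check (a hypothetical helper, not bees code) using those btrfs_data_container fields:

    #include <cstdint>
    #include <linux/btrfs.h>        // struct btrfs_data_container

    // Detect a truncated LOGICAL_INO result. Per the note above, elem_cnt
    // (what fit in the buffer) plus elem_missed (what didn't) gives the
    // total number of refs.
    bool logical_ino_truncated(const struct btrfs_data_container *dc,
                               std::uint64_t *total_refs)
    {
            *total_refs = dc->elem_cnt + dc->elem_missed;
            return dc->elem_missed != 0;
    }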

Zygo commented 1 year ago

Getting back to the actual issue for a moment: for the benefit of those who haven't clicked on it, the graph at https://github.com/Zygo/bees/issues/260#issuecomment-1627598058 does show a dramatic improvement in kernel behavior right after setting a smaller buf_size.

bees has used 16 MiB buf_size since 2018, with a more recent change at 3430f169983db0f3577c198e9ef8d99e66cac28e which kept the buffers between uses instead of allocating and deallocating 16 MiB for each ioctl call.

Maybe the second change (in v0.9.3) accidentally triggered some unfortunate kernel behavior, or maybe bees and the kernel have always behaved this way and nobody reported it before. v0.9.3 fixed a number of bees memory management problems which could easily have concealed an issue like this by making the cache page evictions look like "normal" bees process memory usage.

kakra commented 1 year ago

Okay, so I thought I didn't see that behavior with last year's LTS kernel, and I observed bad memory pressure after going to 6.1. But coincidentally, your change to bees in February falls into the same time period in which I started using kernel 6.1. So maybe I should try again with that change reverted?

OTOH, even when I stop bees, the kernel behaves badly when memory cgroups are enabled. So I believe there has been some impactful change in the kernel between last year's LTS and the current LTS, and it is in memory management. I tried with the new multi-gen LRU and without. Only turning off both memory cgroups AND transparent hugepages gets this under control for me. We are still facing issues with memory pressure when bees is running on our servers, but it's mostly okay now with hugepages in madvise-only mode.

Zygo commented 1 year ago

So maybe I should try again with that change reverted?

Sounds like a good idea. If we get an effect that can be used for bisection, then we might be able to find where things went wrong.

I believe there has been some impactful change in the kernel between last years LTS and current LTS - and it is in memory management.

There's no rule that says kernel bugs have to appear one at a time. ;)

kakra commented 5 months ago

Following up from https://github.com/Zygo/bees/issues/268#issuecomment-2068025791:

... find out why btrfs is causing that high memory fragmentation. I feel like this has become worse and worse with each LTS kernel version. It's probably less an issue within bees but rather a side effect of btrfs doing some very unfavorable memory management in the kernel.

After my repeated issue in #268, it becomes very apparent that btrfs has some really bad memory fragmentation patterns.

As posted before, with cgroups enabled the kernel has almost no chance of using physical memory anymore, and swap usage increases a lot. Most of the memory stays completely free. cgroups steal a lot of cache memory from each other and fight over a very tight memory resource. This happens within the first few hours of system operation.

With transparent huge pages, a very similar effect kicks in. This is similar to the previous issue, but cache memory fluctuates a lot more. This happens within the first few hours of system operation.

Without cgroups and transparent huge pages, the effect takes longer to emerge, usually within a few days of system operation. But this is largely affected by bees activity: if bees activity is high, the effect kicks in within the first 1 or 2 hours of operation. If activity is low, the system can work for days before the effect kicks in and causes high desktop latency. This is different from the previous two observations, which were largely based on low bees activity.

Zygo commented 5 months ago

Lately I've been running with vm.swappiness=0 and BEES_MAX_EXTENT_REF_COUNT set to 9999. With those settings there's no swapping at all.

kakra commented 5 months ago

Is BEES_MAX_EXTENT_REF_COUNT an env var, or is it a defined constant in the source code?

kakra commented 5 months ago

Lately I've been running with vm.swappiness=0 and BEES_MAX_EXTENT_REF_COUNT set to 9999. With those settings there's no swapping at all.

I've created a patch (9999 max ref count) and reverted my previous patch (64kB max buffer, as suggested back in Jul 2023). Running that for around 48 hours now shows really great results. I'll let it run for some more time to see how it behaves with different workloads applied to the system.