Zygo / bees

Best-Effort Extent-Same, a btrfs dedupe agent
GNU General Public License v3.0

How best to run on a system with multiple btrfs pools #153

Open ubenmackin opened 3 years ago

ubenmackin commented 3 years ago

On my current NAS, I have four different btrfs filesystems (pools? I'm not sure of the nomenclature), each with a unique UUID.

I'm running all of this on a Ryzen 3400G with 48 GB of RAM. I am currently running bees on the Single drive to see how it works, and because that data is expendable (as it is a secondary backup).

All of that said, I am wondering how best to configure any options to run bees across my whole setup.

I think I will want to start four systemd services (one per UUID). But my question is: will bees be smart enough not to run too many CPU threads at once and stomp on each other? Should I run only one at a time during the initial "bootstrap", and then let all three run together during normal duty?

If there is a better forum/place to ask questions like this (as it isn't really an issue), please let me know.

kakra commented 3 years ago

I'd define a pool as a bunch of disks/partitions belonging to the same btrfs - so that nomenclature seems correct to me. ;-)

Bees once had support for running on multiple btrfs in parallel, probably even with one hash file - I'm not sure. I think it's still more or less baked into the code, but that code path is probably unmaintained and untested, and the feature is deprecated. You should run multiple instances of bees, one per FS. You can use the load limiter on bees so it auto-tunes itself. I usually go with a value of 5 for desktops, and a little higher on servers, depending on the expected load during peak usage time; just keep the limiter a little below that value so bees won't run during peak usage. Bees is very sensitive to load changes thanks to its built-in load detector: it stops activity within seconds (so it reacts to high load before it even shows up as load in top).
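
For reference, the load limiter I mean is (if I remember the option name correctly) the -g / --loadavg-target switch; with the value 5 from above it would look roughly like this (the mount point is just a placeholder):

# bees throttles its own worker threads to keep the system load average below 5
bees --loadavg-target 5 /mnt/pool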

ubenmackin commented 3 years ago

Ok, awesome.

I'm going to let this first pass finish on my single disk setup. Mainly I was curious to see how much benefit I'd get out of dedupe. Once done, I'll set up the systemd services, one for each pool. Then I'll let it fly and see how things go!

Zygo commented 3 years ago

There were some problems with sharing a hash table, which is why the idea got dropped. The fatal flaw is the way that new data evicts old data from the hash table. The hash table implements (roughly) a sliding window over the last N bytes of data read, so new data simply pushes old data out (with some adjustments to preserve data that is referenced many times). If two different filesystems share the hash table, the larger or more active one floods the hash table space, and the smaller or less active filesystems get much poorer dedupe efficiency than they would if smaller, separate hash tables were used--even if the data:hash table ratio is higher in the separate tables.

Currently the best approach for multiple filesystems is to run multiple bees processes. Each filesystem should have a hash table sized proportionally to the filesystem, e.g. if you use a 5000:1 data:hash table ratio, you would put a 60MB hash table on the 300GB filesystem, a 4.2GB hash table on the 21TB filesystem, etc. If none of the filesystems share a rotating disk then there's no reason not to run bees on all of them at the same time.
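
As a back-of-the-envelope sketch of that sizing (the 21TB figure is just the example above; the rounding to a multiple of 128K anticipates the hash file requirement mentioned further down):

# hypothetical sizing: 21TB of data at a 5000:1 data:hash ratio,
# rounded down to a multiple of 128K (the hash file granularity)
fs_bytes=$(( 21 * 1000 * 1000 * 1000 * 1000 ))
raw=$(( fs_bytes / 5000 ))
echo $(( raw / (128 * 1024) * (128 * 1024) ))   # ~4.2GB, in bytes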

There is a dynamic load governor feature built into bees, but the algorithm assumes no other process is running the same algorithm on the host, and it does not coordinate the activities of separate bees processes. If you run too many instances of bees on the same host, it can end up in a feedback loop where changes in the number of threads are multiplied by the number of bees processes. This can lead to undesirable load swings as the processes all add too many worker threads when load is low, and remove too many worker threads when load is high. (This interaction also happens when any other dynamic load governor is present on the system, e.g. the --max-load option in GNU make.)

With this mix of filesystem sizes it's probably OK to use the automatic load manager and default thread limit settings. The smaller filesystems will be scanned quickly and will not contribute much to load (and therefore won't confuse the load management algorithm) once the initial scan is done. Ideally, bees is idle much of the time.

If load problems do occur, or you just want bees to use less than the maximum CPU and IO bandwidth, you can reduce the maximum number of threads used in each bees process with -c or -C. -C takes a ratio; you can give it 0.25 if you intend to run 4 bees processes (0.25 = 1.0 / 4). -c takes a simple number, and bees runs that many threads in that process. You might allocate those proportionally as well: smaller filesystems with fewer disks get fewer threads, larger filesystems with more disks get more threads.
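
To make that concrete (the mount points are placeholders, not from this thread), the two styles would look something like:

# give each of 4 bees processes a quarter of the CPU threads
bees -C 0.25 /mnt/pool1

# or hand out fixed thread counts in proportion to each filesystem
bees -c 4 /mnt/big-pool
bees -c 1 /mnt/small-pool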

If we ever had a reason to restore support for multiple filesystems in a single bees process some day, it would be to put them all under the control of a single load governor instance. The filesystems would all have separate BeesContext objects, so everything would be the same as it is now (separate hash tables, separate crawl pointers), the difference would be that all contexts would share a single worker thread pool dynamically sized according to load. There are other--possibly better--ways to implement that feature, e.g. put the load governor into a separate process and make all bees processes clients of that process.

ubenmackin commented 3 years ago

There were some problems with sharing a hash table, which is why the idea got dropped.

Maybe not a valid reason, but I'd assume you can't dedupe across filesystems. So having a shared hash table doesn't seem ideal, as potentially you'd have items hash to the same value across file systems, and not really be able to do anything with it. I think keeping them per filesystem makes the most sense.

Currently the best approach for multiple filesystems is to run multiple bees processes. Each filesystem should have a hash table sized proportionally to the filesystem, e.g. if you use a 5000:1 data:hash table ratio, you would put a 60MB hash table on the 300GB filesystem, a 4.2GB hash table on the 21TB filesystem, etc. If none of the filesystems share a rotating disk then there's no reason not to run bees on all of them at the same time.

Yup, that's my situation. All disks are assigned to a pool; no disk is assigned to more than one.

In terms of specifying the hash table size, where is the best place to do that? If I make use of the systemd service, is there something I should add/edit in the unit file to specify the hash table size? I saw a sample conf file in /etc/bees. What is the mechanism by which beesd, when called from the systemd unit, knows which conf file to use for which filesystem?

With this mix of filesystem sizes it's probably OK to use the automatic load manager and default thread limit settings. The smaller filesystems will be scanned quickly and will not contribute much to load (and therefore won't confuse the load management algorithm) once the initial scan is done. Ideally, bees is idle much of the time.

I think what I might do, to get through the initial load, is run bees on one filesystem at a time, starting with my smallest. Then when the first finishes, I'll start the next service, and so on until they are all running.

Most of the pools see very little activity/growth. The SSDs, which are the datastore for VMs, would probably see the most turnover. Next would be the backups, which run nightly. But the largest pool is relatively static, as it is just my media: mostly reads, not a lot of writes.

If we ever had a reason to restore support for multiple filesystems in a single bees process some day, it would be to put them all under the control of a single load governor instance. The filesystems would all have separate BeesContext objects, so everything would be the same as it is now (separate hash tables, separate crawl pointers), the difference would be that all contexts would share a single worker thread pool dynamically sized according to load. There are other--possibly better--ways to implement that feature, e.g. put the load governor into a separate process and make all bees processes clients of that process.

That sounds like a good idea.

Zygo commented 3 years ago

I'd assume you can't dedupe across filesystems. So having a shared hash table doesn't seem ideal, as potentially you'd have items hash to the same value across file systems, and not really be able to do anything with it.

Hash collisions almost never happen, but copies of identical data on separate filesystems do. That can be solved by combining the filesystem UUID with the hashed data so that different filesystems hash the same data to different values. The idea was that hash space would be shared as a single pool and dynamically allocated to each filesystem as needed; however, that didn't play well with the sliding window, and all the solutions to that problem look like either "allocate multiple non-overlapping hash tables" or "spend more than the minimum number of bits on a hash table entry." I'm still looking for ways to make hash table entries smaller.

In terms of specifying the hash table size, where is the best place to do that?

If you are using the beesd script, it takes a DB_SIZE parameter in /etc/bees/<uuid>.conf for each filesystem. You can also use the bare bees binary, but bees will not create the hash table itself, so you must create it in .beeshome/beeshash.dat and set the desired size with truncate -s. It must be a multiple of 128K.
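
Something along these lines (the UUID and size below are placeholders; check the sample conf shipped with bees for the exact variable set your version expects):

# /etc/bees/<uuid>.conf -- read by the beesd wrapper
UUID=<filesystem-uuid>
DB_SIZE=$((1024*1024*1024))   # 1GB hash table, in bytes; a multiple of 128K

# with the bare bees binary, create the hash file yourself instead:
truncate -s 1G /mnt/pool/.beeshome/beeshash.dat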

ubenmackin commented 3 years ago

@Zygo sorry to ping you, but one more question:

if you use a 5000:1 data:hash table ratio, you would put a 60MB hash table on the 300GB filesystem, a 4.2GB hash table on the 21TB filesystem, etc.

Quick question: is this based on the uncompressed data size or the compressed data size? I ask because the compressed data size (using the btrfs compression setting zstd:3) is ~1.6 TB, while uncompressed it is 5.7 TB. So for the hash file size, I'm not sure which I should use for the estimate.

Zygo commented 3 years ago

It comes down to the number of decompressed blocks per extent. compsize provides a count of extents. Compressed filesystems typically have smaller extents so they tend to need larger hash tables.

bees can only dedupe extents when it finds at least one matching block in each extent pair, so it needs enough entries in the hash table to record at least one block from every extent in the sliding window. Assuming that data is randomly and equally distributed and that all extents are either fully unique or fully duplicate, then with one block per extent in the hash table, we get a 50% chance of finding completely duplicate extents. The probability of dedupe increases with N, the number of hash table entries per extent:

N = (hash_table_size_in_bytes / 16) / extents_in_filesystem
P(dedupe) = N / (N + 1)

At N = 1 (one hash table entry per extent), there's a 50% chance of deduping each duplicate extent. At N = 9, it's 90%. At N = 32, it's 97%. You can increase the hash table size to 32768 entries per extent (one for every 4K block of a maximal 128MB extent), but unless all of your extents are that large, a lot of space will be wasted for a tiny improvement.
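
If you want to plug your own numbers into that formula, here's a quick sketch (the extent count is made up; in practice you'd take the "regular extents" figure from compsize):

# hypothetical: 2GB hash table, 50 million extents
awk 'BEGIN {
    entries = 2 * 1024 * 1024 * 1024 / 16;   # 16 bytes per hash table entry
    extents = 50000000;                      # from compsize "regular extents"
    N = entries / extents;
    printf "N = %.2f  P(dedupe) = %.2f\n", N, N / (N + 1);
}'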

The actual results vary from this quite a bit. Hashes are not a perfectly uniform distribution, so the probability of deduping some extents is higher than others. Hashes are not uniformly distributed to buckets, so the probability at N = 1 is closer to 25%. Extents have different lengths, so longer extents will have higher match probabilities than shorter extents. Duplicate files at the more recent end of the scan window have a higher probability of matching than files at the least recent end. Not every block in an extent is a duplicate, and the ratio of duplicate blocks to unique blocks changes the probability math.

If dedupe hit rate is more important than RAM usage, you'll set N a little higher than needed (i.e. use a larger hash table); if minimal RAM usage is important, you'll set N lower. Once N > 10 there's usually very little gain from making it larger, unless you have very small extent sizes.

The method in https://github.com/Zygo/bees/blob/master/docs/config.md for the hash table size doesn't require running compsize. It's based on test runs with some typical user data sets where we just ran bees on a filesystem test image with different hash table sizes.

ubenmackin commented 3 years ago

So, looking at compsize for one of my filesystems I see:

Processed 14202 files, 56924630 regular extents (57621443 refs), 7366 inline.
Type       Perc     Disk Usage   Uncompressed Referenced  
TOTAL       50%      3.9T         7.8T         7.9T       
none       100%      2.0T         2.0T         2.0T       
zstd        33%      1.9T         5.8T         5.8T 

Doing the math, with a 2 GB hashfile size:

N = 2147483648 / 16 / 56924630 = 2.3578
P(dedupe) = 2.3578 / 3.3578 = 0.702

And with a 1 GB hashfile (based on recommendation at https://github.com/Zygo/bees/blob/master/docs/config.md)

N = 1073741824 / 16 / 56924630 = 1.1789
P(dedupe) = 1.1789 / 2.1789 = 0.541

I've got 32 GB of RAM on this system, and this is just a NAS, so that memory is pretty much all free. So dedicating 2 GB to a hashfile is fine with me. But I just want to make sure that a jump from 54% to 70% is worth it, and that bees won't spend extra time searching for diminishing returns.

Zygo commented 3 years ago

Anything from 0.5 to 0.9 is fine, and within that range more is usually better. Getting all the way to 0.99 is not usually worth the extra RAM cost, but that depends on how scarce RAM is.

Lookup times are constant--making the hash table larger increases the number of hash buckets, not their size. bees will run slower with a bigger hash table due to doing more dedupe operations, but this levels off at P=1.0 dedupe and any additional hash table size has no further effect.

If you're building a NAS appliance then it's reasonable to use extra memory for hash table, as long as the memory isn't better used for other things like page cache.

kakra commented 3 years ago

I wonder if it would be possible to use bees on Synology boxes... In theory one could cross-compile it for the platform and put it in the community packages. But I'm not sure how stable it would be; Synology probably uses a kernel with a lot of btrfs patches, and I'm not sure whether those are backports or custom patches.

Zygo commented 3 years ago

Synology runs on kernel 3.10? Unless they have been backporting a lot, 3.10 would be missing the tree search ioctl and the dedupe ioctl. Not sure if there's much point in a bees port after that.

ubenmackin commented 3 years ago

Is there a point where bees is "done", or some way to check on how far along it is? Or maybe there isn't a way to see something like XXX of YYY complete, since the filesystem is changing and there is no end state?

I ask because I've been basically running 24/7 for over a month, and bees still seems to use quite a bit of CPU... over 160% generally, and sometimes up to 750%. This is on a 4c8t CPU. The filesystem it is running on is backup storage, which only gets changes once a day at night: roughly 610 GB (uncompressed) being added and removed. It is largely VM image snapshots and dd images of some Raspberry Pis. In total, per compsize:

Processed 14478 files, 73020716 regular extents (163253903 refs), 7608 inline.
Type       Perc     Disk Usage   Uncompressed Referenced
TOTAL       34%      3.3T         9.6T          11T
none       100%      1.5T         1.5T         1.8T
zstd        22%      1.7T         8.0T         9.2T

Maybe it is the kind of thing not to worry about? It doesn't seem to cause any performance issues, so maybe I just leave it alone. I do stop bees while the backups are running, but maybe that is even unnecessary.

kakra commented 3 years ago

Bees does not stop; it waits for new generation numbers to show up on any subvolume, then scans the new extents that appeared between the old and the new generation number. If you add new subvolumes/snapshots, it will probably scan from 0 for those.
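
Incidentally, you can poke at those generation numbers yourself with plain btrfs-progs; this is the stock find-new command, not something bees itself invokes (path and generation number are made up):

# list files in a subvolume that changed since generation 12345
btrfs subvolume find-new /mnt/pool/subvol 12345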

@Zygo Does bees actually optimize for that? If you'd clone a subvolume from an existing one, bees should start from the scan position it was on for the parent volume instead of 0, right?

Zygo commented 3 years ago

The current bees architecture doesn't have any idea how much work remains. bees asks btrfs if there's new data, and if btrfs says no, then bees is done; otherwise, there is an unknown and unpredictable amount of work to do. This happens separately for every subvol - one subvol could be 90% done pass 1, while another subvol could be 20% done pass 2, and another is 100% done pass 17 and waiting for new data. It doesn't map to a single progress bar very well--if bees is not completely idle, it will be busy somewhere in the filesystem, until it becomes completely idle again at an unknown time in the future.

Once every subvol has been completely scanned, the minimum transid for new subvols can be advanced to the lowest existing subvol min_transid, because we know we've seen everything on the filesystem that could be older than min_transid. I guess we could look at the parent uuid of each new subvol and clone the parent uuid's crawl parameters while constructing its BeesCrawl instance, but after that point, we still have to read all the data from both subvols. If the subvol scan had only reached 10%, then we're still doing 80% more work than we could be doing with a non-subvol-based scanner.

Both of these problems are related to the way bees scans by subvol, which turns out to be the wrong way to scan a btrfs (btrfs sub find-new is a clever hack, but it doesn't scale). I'm focusing my limited time on eliminating subvol-based scans rather than micro-optimizing them. An extent- or csum-based scan does a single linear pass over the entire filesystem, which makes percentage-progress indicators easy, and avoids getting bogged down by snapshots because it can guarantee it reads every unique data block exactly once (even less than once, with csum-based scanning).

Also... the current upper limit of bees's mostly unoptimized scanning performance is about 1TB/day on a 4-core machine. If bees is running less than a full day, it might not be able to keep up with new data at 25GB/hour. In that case it will dedupe as much as it can, but will never enter the "waiting for data" state. The scanning performance is the 'block_bytes' rate in beesstats (under the RATES: heading)--if that's below 7.1e+6 (7 MB/s, or 600 GB per day if that rate is sustained over 24 hours) then bees isn't keeping up.
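
If I have the path right, the stats land in beesstats.txt under the filesystem's .beeshome, so a quick check looks something like this (the mount point is a placeholder, and the file layout may differ between versions):

# pull the block_bytes figures out of the stats file
grep -o 'block_bytes=[0-9.e+]*' /mnt/pool/.beeshome/beesstats.txt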

ubenmackin commented 3 years ago

Thanks for the detailed answer! Filesystems is not my wheelhouse, but I like learning as much as I can. So getting good information like this is interesting to me.

Also... the current upper limit of bees's mostly unoptimized scanning performance is about 1TB/day on a 4-core machine. If bees is running less than a full day, it might not be able to keep up with new data at 25GB/hour. In that case it will dedupe as much as it can, but will never enter the "waiting for data" state. The scanning performance is the 'block_bytes' rate in beesstats (under the RATES: heading)--if that's below 7.1e+6 (7 MB/s, or 600 GB per day if that rate is sustained over 24 hours) then bees isn't keeping up.

It looks like right now my block_bytes is 1.02917e+07, which when extrapolated is about 889 GB/day. So with currently 11 terabytes, it would take about 13 days just to process through that data if it were static. And then with the adding/removing of 600 GB/day, it'll take some time to get it all processed.

The fact that it is only doing about 10MB/s leads me to believe it is not IO constrained. I am running on spinning disks, so IOPS is lower than, say, an SSD pool. If I were all SSD, would we expect things to be any faster? Is this a situation where throwing more RAM at it could help? Or is it just really compute-expensive to calculate the dedupe?

Massimo-B commented 1 year ago

I'm also thinking about a big btrfs-based NAS, and I opened https://github.com/Zygo/bees/issues/262 to ask whether bees is available on Synology NAS, as @kakra already mentioned in https://github.com/Zygo/bees/issues/153#issuecomment-703674688.

(Late) question to @ubenmackin: If you are running all those drives on the Ryzen machine, why don't you combine all the drives into one big btrfs pool?

Big btrfs should always scale better in these points:

Only reasons to separate the btrfs could be:

kakra commented 1 year ago

Big btrfs should always scale better in these points:

Also: more spindles for data access, especially when not using a RAID profile (extents spread out over multiple disks, reads and writes can happen in parallel, less lock contention). I've found that single data mode across different spindles works better than raid-0 for workloads where multiple processes access different data in parallel. For a single process, raid-0 may be faster, but it would also occupy all spindles at almost the same time (per stripe, depending on request block size).