Zygo / bees

Best-Effort Extent-Same, a btrfs dedupe agent
GNU General Public License v3.0

Just a few things I'm trying to figure out about bees. #176

Closed bhansonfs closed 3 years ago

bhansonfs commented 3 years ago

Hello,

After trying every deduplication tool for btrfs thus far, I am thoroughly impressed with bees. I just have a few questions for my primary personal use case:

I have a 20 TB RAID running on spinners, containing a LUKS partition with btrfs inside it, on Ubuntu 20.04. I ran bees for about a month while also running snapper, generating snapshots hourly. I don't think I ever hit the threshold where bees was 'done' priming and had finished its initial dedupe across all snapshots. I recently wiped my system and am starting from scratch. I'm allocating about 16 GB of bees hash table to the 20 TB RAID.

I've read the documentation and a lot of the discussion on bees, and I'm unclear on a couple things I was hoping could be cleared up.

1) Mixed purpose files

About 2/3 of my storage is large movies, between 1 and 20 GB each. These files are almost never duplicated, and when they are, the files themselves are static and don't partially change the way a virtual machine image would. The other 1/3 of my storage is images and program data, and quite a bit of it is redundant, most notably Docker temporary container builds and sometimes an entire copy of my documents/home folder.

In a perfect world, I would like to be able to copy a ton of small files and know that they will all be deduplicated shortly. I am concerned that the hash table for finding duplicates would be almost entirely filled with hashes of parts of movies, and this would push out the hashes of smaller files, making it less likely that a large collection of small files would be deduplicated. Is this concern valid? Will many small files still be deduplicated even when they live on the same disk as a ton of much larger files?

2) Snapshots

So far I have enjoyed using snapper to create hourly snapshots that expire after a month. Is this compatible with bees? If the 'live' version of the filesystem has been deduplicated and a snapshot gets made, does bees have to scan the entire snapshot, looking in vain for duplicate data again? I realize that if bees were running against SSD media this might not be a major concern, but on spinning media I worry that my snapshots are being created faster than bees can meaningfully process them. What are the community's thoughts on this?

For a decade now, I've been imagining a file system in which every piece of data is hashed and then stored by its hash, making all data deduplicated by the nature of the file system's design. I am hoping that I have found a reasonable substitute for this in BTRFS + bees; bees in particular seems paradigm-changing. Thank you very much in advance for the answers to these questions!

Zygo commented 3 years ago
  1. The hash table is effectively an LRU cache of hashes. Once the table is filled, each new block scanned evicts the hash of some older block, as you suggested. If a block is found to be a duplicate, its hash is pushed to the front of the LRU list, delaying its eviction as long as possible. If your write pattern alternates between writing batches of movies and batches of small duplicate files, then the blocks with the higher dedupe hit rate (i.e. the small files) will be preferred in the hash table. Some duplicate blocks will still be lost, but hopefully only a few percent. As long as a batch of movies is not large enough to flush out the entire hash table, there will still be a good dedupe hit rate. (A short code sketch of this eviction behavior follows the list in point 2 below.)

  2. The scan algorithm works like this (a rough code sketch follows the explanation below):

    • for each subvol on the filesystem, in parallel:
        • subvol.max_transid = filesystem max transid, i.e. the highest transid of any subvol
        • subvol.min_transid = filesystem min transid, i.e. the lowest min_transid of any subvol (or 0 if we are running for the first time)
        • scan the subvol's extent references between min_transid and max_transid until the end of the subvol
        • subvol.min_transid = subvol.max_transid
        • subvol.max_transid = filesystem max transid
        • repeat
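
To make point 1 above concrete, here is a minimal sketch of that LRU behavior, assuming an ordinary in-memory LRU is a fair stand-in. bees itself uses a fixed-size hash table file rather than these STL containers, and HashTableSketch, scan_block and block_addr are names made up for illustration only.

```cpp
#include <cstddef>
#include <cstdint>
#include <list>
#include <unordered_map>
#include <utility>

// Illustrative only: bees keeps a fixed-size on-disk hash table, not these
// STL containers, but the eviction order is roughly the same.
class HashTableSketch {
    std::size_t capacity_;                              // number of (hash, address) entries that fit
    std::list<std::pair<uint64_t, uint64_t>> lru_;      // front = most recently hit
    std::unordered_map<uint64_t,
        std::list<std::pair<uint64_t, uint64_t>>::iterator> index_;

public:
    explicit HashTableSketch(std::size_t capacity) : capacity_(capacity) {}

    // Called with the hash of each newly scanned block. Returns true when the
    // hash is already present (a dedupe candidate): the entry is promoted to
    // the front, so frequently duplicated blocks survive longest.
    bool scan_block(uint64_t hash, uint64_t block_addr) {
        auto it = index_.find(hash);
        if (it != index_.end()) {
            lru_.splice(lru_.begin(), lru_, it->second);  // promote on hit
            return true;
        }
        // Miss: insert at the front; once the table is full, every new block
        // evicts the least recently hit older block.
        lru_.emplace_front(hash, block_addr);
        index_[hash] = lru_.begin();
        if (lru_.size() > capacity_) {
            index_.erase(lru_.back().first);
            lru_.pop_back();
        }
        return false;
    }
};
```

The only point the sketch tries to capture is eviction order: a hash that keeps getting hit stays near the front, so hashes of duplicated small files outlive hashes of once-seen movie blocks, as long as a single batch of movies does not flush the whole table.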

Initially, all subvol scan passes must read all data that currently exists on the filesystem, so we set min_transid = 0. In order to avoid reading anything twice, we set subvol.max_transid to the current transid, so we will not scan anything that was added to the filesystem after we started this pass of the subvol scan.

When we reach the end of one pass over a subvol, we move subvol.max_transid to subvol.min_transid and set subvol.max_transid to the current filesystem transid. In the next pass, we will scan everything that was added after we started the previous pass but before the start of the current pass. If min_transid == max_transid we don't scan anything this pass, and wait for a timer thread to wake us up with new data.
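
Here is the rough code sketch promised above for a single subvol's scan loop. The Subvol struct and the helper functions (filesystem_max_transid, filesystem_min_transid, scan_extent_refs, wait_for_new_transactions) are placeholders invented for illustration; bees' real crawl state, btrfs queries and worker threading look different.

```cpp
#include <cstdint>

struct Subvol {
    uint64_t min_transid = 0;
    uint64_t max_transid = 0;
};

// Placeholder queries and actions; in bees the real values come from btrfs
// and from saved crawl state, and the scanning work is shared across threads.
uint64_t filesystem_max_transid() { return 0; }              // highest transid of any subvol
uint64_t filesystem_min_transid() { return 0; }              // lowest subvol.min_transid (0 on first run)
void scan_extent_refs(Subvol &, uint64_t lo, uint64_t hi) {} // scan refs between lo and hi
void wait_for_new_transactions() {}                          // timer-thread wakeup

void scan_subvol_forever(Subvol &s) {
    // First pass: everything from the filesystem minimum (0 on a fresh start)
    // up to the transid at the moment the pass begins, so data written after
    // the pass starts is not read twice.
    s.min_transid = filesystem_min_transid();
    s.max_transid = filesystem_max_transid();
    for (;;) {
        scan_extent_refs(s, s.min_transid, s.max_transid);

        // Advance the window: the next pass covers only what arrived while
        // this pass was running.
        s.min_transid = s.max_transid;
        s.max_transid = filesystem_max_transid();
        if (s.min_transid == s.max_transid)
            wait_for_new_transactions();  // nothing new; wait for more data
    }
}
```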

Once every subvol has raised its subvol.min_transid, we know we have seen everything on the filesystem older than min(all subvol.min_transid) at least once[1]. That means we don't have to look at anything with a transid below filesystem.min_transid any more, on any subvol.
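
Continuing the same illustrative sketch, the filesystem-wide floor described above is just the minimum over the per-subvol positions. This fills in the filesystem_min_transid placeholder from the previous sketch, here written as a variant that takes the hypothetical Subvol list:

```cpp
#include <algorithm>
#include <cstdint>
#include <vector>

// Data older than this transid has been seen at least once on every subvol,
// so no subvol ever needs to look below it again.
uint64_t filesystem_min_transid(const std::vector<Subvol> &subvols) {
    uint64_t floor = UINT64_MAX;
    for (const auto &s : subvols)
        floor = std::min(floor, s.min_transid);
    return subvols.empty() ? 0 : floor;
}
```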

If a snapshot is created, it will contain data that is included in some other subvol and possibly already scanned; however, we don't know exactly which data this is without knowing precisely which subvol was snapshotted (which we can find out from btrfs) and where the bees scanner was at the time the snapshot was created (which we can't, unless we keep a log and search it). So we start all new subvol scans from the beginning using the filesystem's min_transid.

The problem with rapid snapshot creation is this: if we keep creating new subvols before we completely scan the old ones, the new subvols start with the lowest min_transid of any existing subvol scan, and they prevent min_transid for the entire filesystem from moving forward. Each new subvol starts from the beginning (or at least a long time ago). Each new subvol also divides the available scanning resources by the number of active subvol scans, which makes it harder to complete subvol scans. If bees is not able to complete a subvol scan in an hour with its resources divided by 1, it certainly can't complete a subvol scan in an hour with its resources divided by 2 or larger numbers. After a month of snapshots created every hour, there are hundreds of active subvol scanners, each running hundreds of times slower than a single subvol scan would (thousands of times slower on spinning media). bees never catches up, and snapshots end up being deleted before bees can finish scanning them.
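
A back-of-the-envelope illustration of that arithmetic, with assumed numbers (the 10-hour solo scan time is invented for the example, not a measurement):

```cpp
#include <iostream>

int main() {
    // Assumed numbers, purely for illustration.
    const double snapshot_count  = 30 * 24;  // hourly snapshots kept for a month
    const double solo_scan_hours = 10;       // time to scan one subvol with all resources

    // With N active subvol scans sharing the same bandwidth, each one runs
    // roughly N times slower (worse on spinning media because of seeks).
    const double shared_scan_hours = solo_scan_hours * snapshot_count;

    std::cout << "active subvol scans : " << snapshot_count   << "\n"
              << "hours per scan now  : " << shared_scan_hours << "\n"
              << "snapshot lifetime   : " << snapshot_count    << " hours\n";
    // 7200 scan-hours against a 720-hour snapshot lifetime: snapshots are
    // deleted long before bees finishes scanning them, so bees never catches up.
    return 0;
}
```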

One workaround for this is to use -a (--workaround-btrfs-send). If snapper makes read-only snapshots then we can avoid scanning them. Dedupe will occur in the writable subvols, and the snapshots will capture the dedupe state at the time the snapshot is made. Not ideal, but better than the alternatives.

[1] It is possible to hide data from the scanner by reflinking it from a high-numbered inode (one that will be scanned in the future) to a low-numbered inode (one that was already scanned in the past) in the same subvol, or by reflinking from an unscanned subvol into a scanned one, then deleting the original reflink before the scanner reaches it. Also, btrfs balance will move data around and invalidate the saved locations from the bees scan. Most of the time these don't matter for typical dedupe use cases, but if you were trying to build a secure index of all the file contents on a btrfs filesystem, the approach bees currently uses doesn't always work.

Zygo commented 3 years ago

"which we can find out from btrfs"

...if there is an existing, unbroken chain of snapshots between the new subvol and the one bees was previously scanning. So if something like this happens:

sub snap a b; sub snap b c; sub del b

then bees won't be able to tell that c is related to a; or worse, it will think c is related to some other subvol, and miss a lot of data.

bhansonfs commented 3 years ago

Zygo, thank you for the insight!