Zygo / bees

Best-Effort Extent-Same, a btrfs dedupe agent
GNU General Public License v3.0

Tips needed to run bees a few hours a day and monitor progress #206

Open gc-ss opened 2 years ago

gc-ss commented 2 years ago

I intend to run bees, for the first time, against a 7TB btrfs FS with 3 subvolumes. No snapshots exist. Ubuntu 20.04.3 LTS (5.10). There's 2TB free space on the FS.

i. The FS has 3 subvolumes: sv1, sv2, sv3. No snapshots exist of them.
ii. Subvolumes sv1 and sv2 are almost clones of each other, like 99%. They are 3TB in size each. However, I've run zstd compression on sv2, and I don't know how to check what this compression has gained me.

Here's my plan and questions. Looking for any tips, gotchas, etc:

  1. How do I check what the zstd compression on sv2 actually got me?
  2. I would like to immediately snapshot sv1, sv2, sv3. sv1, sv3 are not compressed. Is this going to affect bees in any way?
  3. I don't believe bees will be able to dedup anything between sv1, sv2, although they are almost clones of each other because I've run zstd compression on sv2, so it's likely the blocks themselves are entirely different between sv1, sv2 - is this accurate?
  4. I dual-boot the system, such that the Ubuntu setup that bees will run in will be on only 8 - 10 hours/day - typically at night when I'm not using it. I assume bees will start up and run automatically when I boot into Ubuntu: do I just shut this system down while bees is running, or should I send it a signal before I proceed to shut this system down - especially because I will be cycling it so much (I don't want to risk corruption)?
  5. Reading around a bit, it looks like the effective R/W rate of bees will be ~10MB/s. If so, 7TB will likely take 200 hours, or about 3 weeks at 8 - 10 hours/day. Is there a way to look at the progress/status of bees?
  6. I plan to use btrfs send receive soon on these subvolumes/snapshots - is --workaround-btrfs-send needed?
kakra commented 2 years ago
  1. How do I check what the zstd compression on sv2 actually got me?

You cannot actually mount the same btrfs with different compression options, even via different mount points. If, say, you dual-boot and use one subvolume exclusively in one boot environment, it mostly works, but the option still applies to the whole filesystem: if you modify a file in another subvolume, it will use the global compression setting from however the filesystem was mounted.

  2. I would like to immediately snapshot sv1, sv2, sv3. sv1, sv3 are not compressed. Is this going to affect bees in any way?

It will extend the time needed to scan the data.

  3. I don't believe bees will be able to dedup anything between sv1, sv2, although they are almost clones of each other because I've run zstd compression on sv2, so it's likely the blocks themselves are entirely different between sv1, sv2 - is this accurate?

It will: current bees looks at the contents of the extents, not the physical blocks, so it can dedupe compressed extents with uncompressed extents - and it will probably rewrite some of that data in your preferred compression. Compressed extents are at most 128k, though, so you'll get a large metadata and hash table overhead from them. It will still dedupe: if both subvolumes are mostly clones (i.e. contain mostly identical data), that will probably outweigh the overhead. But since this is a lot of data, you may need to reset bees to start from scratch after the first pass so it can pick up additional dedupe opportunities it previously pushed out of the hash table too early.
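
If the hash table does turn out to be too small for a pass like this, the usual knob is the DB_SIZE setting read by the beesd wrapper script. A minimal sketch of a per-filesystem config, assuming the upstream beesd layout (the file name, the UUID and the device are placeholders, and the exact variables may differ on your distribution - compare with the beesd.conf.sample shipped with bees):

    # /etc/bees/myfs.conf (hypothetical name) -- one config file per filesystem UUID
    UUID=xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx     # from: blkid -s UUID -o value /dev/sdX1
    DB_SIZE=$((4*1024*1024*1024))                 # 4 GiB hash table; size it per the bees docs for ~7TB of data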

  4. I dual-boot the system, such that the Ubuntu setup that bees will run in will be on only 8 - 10 hours/day - typically at night when I'm not using it. I assume bees will start up and run automatically when I boot into Ubuntu: do I just shut this system down while bees is running, or should I send it a signal before I proceed to shut this system down - especially because I will be cycling it so much (I don't want to risk corruption)?

There should be no problem if you run bees EXCLUSIVELY in only one environment. You can probably share the same hash table and status file on both installations IF you want to run it in both environments. If you run two different instances of bees, they will mostly just do duplicate work, but they should still co-exist just fine. There will be no corruption.

Also, bees stores its progress every 15 minutes. If you just kill it, the worst that will happen is that it repeats the last 15 minutes of work. There will be no corruption. If you shut it down cleanly, it will try to flush its last state to disk, then exit. But it's designed with being shut down uncleanly in mind.
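
If you run bees through the systemd template unit shipped with it, stopping the service is exactly that kind of clean shutdown. A sketch, assuming the upstream beesd@.service unit keyed by filesystem UUID (adjust to however your package installs it; /dev/sdX1 is a placeholder):

    UUID=$(blkid -s UUID -o value /dev/sdX1)      # the 7TB filesystem
    systemctl stop "beesd@$UUID.service"          # SIGTERM: bees flushes its state, then exits
    systemctl start "beesd@$UUID.service"         # next boot/window: resumes from the saved state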

  5. Reading around a bit, it looks like the effective R/W rate of bees will be ~10MB/s. If so, 7TB will likely take 200 hours, or about 3 weeks at 8 - 10 hours/day. Is there a way to look at the progress/status of bees?

You can look at the status files it writes in text format: take note of the transaction ids per subvolume (beescrawl.dat) when you first start bees; once it finishes the first pass, it will have reached those transaction ids (and continue from there). I think there's an indicator in the crawler worker status (the other text file with lots of statistical numbers), but I don't remember what you should look for.
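
For example, to compare the filesystem's current transid against what bees has reached so far. The paths below are assumptions based on the beesd wrapper, which keeps BEESHOME in a .beeshome directory on the filesystem itself and a status file under /run/bees/; the fields in beescrawl.dat can vary between bees versions:

    btrfs subvolume find-new /mnt/fs/sv1 9999999 | tail -1     # prints the filesystem's current transid marker
    cat /mnt/fs/.beeshome/beescrawl.dat                         # min_transid/max_transid per subvol = crawl progress
    cat /run/bees/<UUID>.status                                 # per-thread activity and the statistics counters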

While bees works through each subvolume, actual disk usage may increase until all stale extent data is cleaned up. bees is not designed as a one-time deduplicator; you should just leave it running and let it do its thing, permanently, in the background, probably running it as a service, and not bother with any details.
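
If that temporary growth worries you, keeping an eye on overall allocation while it runs is enough (the mount point is a placeholder):

    sudo btrfs filesystem usage /mnt/fs     # overall allocated vs. free space while bees churns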

bees works by finding common 4k blocks with identical data. When it finds a dedupe candidate, it reads forward and backward (in 4k steps) in the candidates to find the maximum range of identical data. It then lets the kernel atomically share those ranges (the kernel verifies that the data really is identical, so no corruption is possible even if you write the files at the same time). So even for 100 MB of consecutive duplicate data, it only needs to keep the hash of one 4k block. Its performance depends a lot on recently written data still being present in the cache - except for the first run, for obvious reasons. You should size your hash table to still allow for a lot of cached data.
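
To spot-check the result of that range sharing on a pair of files you expect to be duplicates, btrfs can report exclusive vs. shared usage directly (the file paths below are placeholders):

    sudo btrfs filesystem du /mnt/fs/sv1/big.img /mnt/fs/sv2/big.img
    # Total / Exclusive / Set shared columns: deduped ranges show up under "Set shared", not "Exclusive"
    filefrag -v /mnt/fs/sv1/big.img | grep -c shared      # count of extents carrying the "shared" flag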

  6. I plan to use btrfs send receive soon on these subvolumes/snapshots - is --workaround-btrfs-send needed?

I cannot help with that.

Zygo commented 2 years ago
  • How do I check what the zstd compression on sv2 actually got me?

compsize sv2 will report usage broken down by compression type.
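
For example (on Ubuntu compsize is typically in the btrfs-compsize package; the numbers below are invented purely to show the shape of the output):

    sudo compsize /mnt/fs/sv2
    # Processed 812345 files, 2654321 regular extents (2654321 refs), 1234 inline.
    # Type       Perc     Disk Usage   Uncompressed Referenced
    # TOTAL       63%      1.9T         3.0T         3.0T
    # none       100%      850G         850G         850G
    # zstd        48%      1.0T         2.1T         2.1T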

  • I would like to immediately snapshot sv1, sv2, sv3. sv1, sv3 are not compressed. Is this going to affect bees in any way?

bees will treat the snapshots like any other reflink copies. Creating a snapshot before dedupe will increase the dedupe time, as each reflink must be deduped separately. If the snapshot is created after dedupe then bees will have nothing to do with the new snapshots since they are already deduped.

  • I don't believe bees will be able to dedup anything between sv1, sv2, although they are almost clones of each other because I've run zstd compression on sv2, so it's likely the blocks themselves are entirely different between sv1, sv2 - is this accurate?

Not at all. bees deduplicates compressed data, as do some other dedupers.

bees cannot dedupe inline extents or extents in nodatasum files (which includes all nodatacow files). Anything else is fair game.

bees will not guarantee which extent is chosen when deduping a compressed extent with an uncompressed extent, or an extent with a different compression method. Usually the first extent encountered is kept, since that extent will be listed in the hash table. In this case, it will be a race to see which subvol reader hits the data first.

If bees must copy data in order to deduplicate, then the copy will be compressed with the compression method specified in the compress mount option.

If sv1 and sv3 are not yet compressed or deduped, then you might want to run btrfs fi defrag -rczstd sv1 sv3 first, to ensure all copies of the data are compressed before dedupe or snapshots are created.
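
A sketch of that ordering, with the mount point and snapshot names as placeholders (compress first, because defragmenting after snapshots exist would unshare the snapshot copies):

    sudo btrfs filesystem defragment -r -czstd /mnt/fs/sv1 /mnt/fs/sv3   # rewrite + compress the existing data
    sudo btrfs subvolume snapshot -r /mnt/fs/sv1 /mnt/fs/sv1-snap        # snapshots taken afterwards share the
    sudo btrfs subvolume snapshot -r /mnt/fs/sv3 /mnt/fs/sv3-snap        # already-compressed extents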

  • Reading around a bit, it looks like the effective R/W rate of bees will be ~10MB/s. If so, 7TB will likely take 200 hours, or about 3 weeks at 8 - 10 hours/day. Is there a way to look at the progress/status of bees?

See issue #175. The method bees uses to traverse the filesystem is very fast for getting the next new data to read, but very slow at figuring out how much data remains if that amount isn't zero. bees can tell you which inode it is processing, but not how many inodes there are or how many remain.

bees will stop and start automatically as needed to process new data in the filesystem. If you want bees to only run within a constrained time window, then start bees at the beginning of that window and stop it (kill with SIGTERM or stop the systemd service) at the end.
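
One way to hold bees to such a window, assuming the beesd@.service unit and with the UUID as a placeholder, is a pair of cron entries (a systemd timer pair works the same way):

    # /etc/cron.d/bees-window (hypothetical file) -- run bees only overnight
    0 22 * * *   root   systemctl start beesd@xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx.service
    0 6  * * *   root   systemctl stop  beesd@xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx.service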

  • I plan to use btrfs send receive soon on these subvolumes/snapshots - is --workaround-btrfs-send needed?

Not needed since kernel 5.2. --workaround-btrfs-send doesn't scan or dedupe read-only snapshots. Some users find that a useful feature in and of itself, i.e. you may use the workaround even if you don't need it.
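
If you do want the workaround with the systemd unit, one way is a drop-in override that appends the flag. This is only a sketch: copy the actual ExecStart line from "systemctl cat beesd@.service" on your system rather than trusting the path shown here.

    # /etc/systemd/system/beesd@.service.d/override.conf (hypothetical drop-in)
    [Service]
    ExecStart=
    ExecStart=/usr/sbin/beesd --workaround-btrfs-send %i    # your original ExecStart plus the extra option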

smurfix commented 2 years ago

--workaround-btrfs-send doesn't scan or dedupe read-only snapshots. Some users find that a useful feature in and of itself, i.e. you may use the workaround even if you don't need it.

Maybe rename (or alias) the option to --skip-readonly-subvolumes then?