Zygo / bees

Best-Effort Extent-Same, a btrfs dedupe agent

How can I replace a deduplicated disk with a new one? Can I use zram? #203

Open aventrax opened 2 years ago

aventrax commented 2 years ago

Hello everybody and many thanks @Zygo for this excellent software.

I'm trying to deduplicate my backup on a Raspberry Pi 3 with a slow USB2 disk. I have 760GB of data, so this is extremely slow: bees removed 120GB of data in 2 days and is still running the first pass. I hope it finishes within a week; after that, the daily incremental snapshots should not be a problem to dedupe in a couple of hours.

My questions:

  1. Once finished, I want to replace the backup disk. How can I do that without having to re-deduplicate everything? Should I use dd? The new drive will be smaller (the current one is 2TB, the new one 1TB, but I only have 760GB of data). Can I shrink the btrfs partition and then use dd to copy it to the new disk without having to re-dedupe?

  2. The Pi 3 has just 1GB of RAM. I'm currently using 512MB for the hash table, but it's full and bees has not even finished the first run. I'd like to double the hash table size, but there is not enough memory. Can I use zram-swap to increase the available memory? I already have it configured: about 760MB of RAM plus 690MB of swap, but the swap is compressed RAM, so it should be fast enough. Can bees use 1GB of memory in this situation, where 3/4 of it is real RAM and 1/4 is compressed RAM presented as swap?

Many thanks.

kakra commented 2 years ago

Q1: You can connect both disks to the same system, then either run btrfs replace (not sure if it likes that with just one device), or run btrfs device add to add the new disk to the pool, followed by btrfs device remove to remove the old one. Along the way, make sure the profiles don't get converted to RAID, otherwise it will refuse to remove the old disk.

This way, the chunk storage is transferred over to the new disk as-is: no copying of files is needed, no re-duplication of shared extents will happen, and only data that is actually allocated gets copied.

After the process finishes, you should probably wipefs the old partition so you don't accidentally re-use the same UUID. That is one pitfall dd would give you: you should avoid using dd for btrfs partitions.
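
A rough sketch of the add/remove path; the device names and mountpoint below are examples only, not taken from your setup:

# Hypothetical devices: /dev/sda = old 2TB disk, /dev/sdb = new 1TB disk,
# filesystem mounted at /mnt/backup. Adjust to your layout.
btrfs device add /dev/sdb /mnt/backup        # add the new disk to the pool
btrfs device remove /dev/sda /mnt/backup     # migrates all chunks, then drops the old disk
wipefs -a /dev/sda                           # afterwards: clear the old signature so the UUID can't be re-used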

Q2: You cannot use zram to increase memory for bees, because bees locks the hash table in memory: no swap will occur, so there is no benefit from zram. I don't think the table compresses that well anyway, so it wouldn't help; it would just massively increase the time needed. Depending on how much unique data you have stored on your FS, you may not need a big hash table.

The RAM locking is actually important here: the hash table won't swap out, cannot be compressed in memory, and even if it is only half full, that RAM cannot be used for anything else. So if you give it 3/4 of your RAM, you essentially shrink your usable RAM to only 25% for kernel, cache, and apps, because you dedicated 75% to bees (and only bees).
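
If you want to see this for yourself, a quick check (assuming the daemon process is literally named bees; adjust if you start it through the beesd wrapper):

grep VmLck /proc/"$(pidof bees)"/status   # locked memory should roughly match the hash table size; locked pages never hit swap/zram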

aventrax commented 2 years ago

Many thanks. So, having 960MB available, what I'm thinking is to give 768MB to bees, like this:

960 - 768 = 192 MB
192 / 2 = 96 MB for zram
96 * 3 = 288 MB of "zram-swap"
288 + 96 = 384 MB of RAM for kernel/system/apps

Does this make any sense? My Pi is there just for backup purposes and nothing else.

kakra commented 2 years ago

btrfs is a filesystem that quite desperately needs RAM sometimes. Having just 192 MB available may make the system unstable. Since many btrfs allocations are unpageable, they won't go to swap, and the kernel may OOM. YMMV, but I'd suggest using only 512M for bees, or upgrading to a Pi with 4GB of RAM.

Your final conclusion of 384M for kernel/system/apps probably does not work that way. It's more like up to 96 MB for kernel/caching, meaning 0-288 MB for system/apps. 96M is really low; it's barely above the absolute minimum the kernel should always keep free. For btrfs, you actually want as much RAM as possible for caching, otherwise it will slow down to a crawl.

Here are some numbers from the docs:

It says that for 1 TB of unique data with an avg extent size of 128k, you'd need a hash table of 128M. Upping that to 256M would cover 1 TB of unique data down to an avg extent size of 64k. I suppose you are using compression on the target disk, so your extents tend to be around 128k on average. OTOH, compressed extents are 128k at maximum, which creates a lot more extents and thus needs a bigger hash table. I'd use compress without compress-force, and a hash table between 384-512M, to keep your system at somewhat sane performance.
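
The back-of-the-envelope math behind those numbers, assuming the roughly 16 bytes per hash table entry implied by the docs table:

# 1 TiB of unique data / 128 KiB average extent size = 8,388,608 extents
# 8,388,608 entries * 16 bytes per entry = 128 MiB hash table
echo $(( 1024**4 / (128 * 1024) * 16 / 1024**2 ))   # prints 128 (MiB)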

But in the end, only @Zygo knows best. I'm sure he'd also recommend not leaving the system with only 96M for caching.

Zygo commented 2 years ago

On a machine this size, I wouldn't ever put more than 256M in the bees hash table, and I'd probably even lower it to 128M for performance. Don't forget bees needs ~100M for assorted data structures (IO buffers and extent maps), and btrfs can use up to 512M of non-swappable kernel RAM for a single transaction on a 1TB disk. These sizes are fixed, while the bees hash table is variable, so the bees hash table must become smaller to fit the machine's RAM size.
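
For illustration, a rough worst-case RAM budget using the numbers above (a sketch, not exact accounting):

# 1024 MB total on the Pi 3
# - ~100 MB for bees IO buffers and extent maps
# - up to 512 MB of btrfs kernel memory for one transaction
echo $(( 1024 - 100 - 512 ))   # ~412 MB left, which a 256 MB hash table would mostly consume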

On a disk this size, 128M is the optimum size for a hash table. It overcommits the hash table 30x, but you only need 3% of the hashes to find most of the duplicates if your average extent size is 128K. The hash table will fill up early on, then bees will select a sample of the hashes and evict the rest. If you reset beescrawl.dat and start bees from the beginning, a new random sample may find different duplicate data each time (but less each time). If your average extent size is smaller than 128K then you might need more RAM or you'll have a slightly lower dedupe hit rate.
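
One way to do that reset, as a hedged sketch (the $BEESHOME path and systemd unit name are assumptions about how your bees is set up):

systemctl stop beesd@<filesystem-UUID>.service   # or however you run bees
rm /mnt/backup/.beeshome/beescrawl.dat           # resets only the crawl position; the hash table is kept
systemctl start beesd@<filesystem-UUID>.service  # bees rescans the filesystem from the beginning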

btrfs replace is the best way to move btrfs data from one disk to another (assuming you don't have something like LVM or mdadm underneath). If the new disk is smaller than the old one, you'll have to use btrfs fi resize to shrink the filesystem first, and when using btrfs replace give the numeric devid instead of the device name. replace avoids the issues with profiles switching to raid1 because replace does not add a disk to the filesystem. Replace also performs 20-40x better than the equivalent add/remove commands.
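
A sketch of that sequence with example names (devid 1, /dev/sdb, and /mnt/backup are placeholders):

btrfs filesystem resize 1:900G /mnt/backup   # shrink devid 1 below the new disk's capacity first
btrfs replace start 1 /dev/sdb /mnt/backup   # 1 = numeric devid of the old disk, /dev/sdb = new disk
btrfs replace status /mnt/backup             # monitor progress
btrfs filesystem resize 1:max /mnt/backup    # afterwards, grow back to fill the new disk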

The hash table does compress a little. About 61% of the bits in the hash table are zero, and half of what's left has low entropy. Very early in bees development I experimented with the idea of compressing the hash table in RAM, and while it does allow packing about 25% more hash table entries in the same amount of memory, the problem is that it requires thousands of times more CPU to do hash lookups. Every hash lookup requires decompressing a page, since there is no RAM left over to cache uncompressed pages. Pages are used in random order, so an uncompressed page cache doesn't start to be efficient until it is much larger than 25% of the RAM--but compression only saves 25%, so an effective cache uses more RAM than simply storing all of the data in RAM uncompressed. Hash insertion requires compressing pages after scanning every block, in addition to all of the above costs. zram/zswap has worse problems because it's the same algorithm, but less efficiently implemented.

Hash tables are stored compressed on disk if you've enabled compression in the btrfs mount options, or set the compress property on $BEESHOME or beeshash.dat. btrfs itself keeps setting the NOCOMPRESS flag on hash tables, so if you haven't set the property then your hash table is probably not being compressed.
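
For example, something along these lines (the .beeshome path is a placeholder):

btrfs property set /mnt/backup/.beeshome compression zstd   # new writes under .beeshome get compressed
# or, per file:
chattr +c /mnt/backup/.beeshome/beeshash.dat                # sets the compress flag, using the filesystem's default algorithm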

kakra commented 2 years ago

Doesn't seem to compress too well (I just converted it over with chattr +c on .beeshome):

 /m/b/.beeshome > sudo lsattr
--------c------------- ./beeshash.dat
--------c------------- ./beescrawl.dat
--------c------------- ./beesstats.txt

 /m/b/.beeshome > ls -al
total 1048612
drwxr-xr-x 1 root root         76 16. Nov 17:45 ./
drwxr-xr-x 1 root root         78 16. Nov 17:44 ../
-rw------- 1 root root      12201 16. Nov 17:44 beescrawl.dat
-rw------- 1 root root 1073741824 16. Nov 17:45 beeshash.dat
-rw------- 1 root root       6858 16. Nov 17:45 beesstats.txt

 /m/b/.beeshome > sudo compsize .
Processed 3 files, 790 regular extents (790 refs), 0 inline.
Type       Perc     Disk Usage   Uncompressed Referenced
TOTAL       99%     1023M         1.0G         1.0G
none       100%     1021M        1021M        1021M
zstd        74%      1.7M         2.3M         2.3M

kakra commented 2 years ago

Just after a few minutes, compressed extents go down while usage increases:

 /m/b/.beeshome > sudo compsize .
Processed 3 files, 983 regular extents (1152 refs), 0 inline.
Type       Perc     Disk Usage   Uncompressed Referenced
TOTAL       99%      1.1G         1.1G         1.0G
none       100%      1.1G         1.1G        1021M
zstd        74%      1.6M         2.2M         2.2M

Before the conversion, the hash table used 2.2G on disk with exactly 1G referenced. So it probably creates a lot of slack space over time?

Zygo commented 2 years ago

Try `btrfs fi defrag -czstd` if it wasn't compressed before:

# compsize beeshash.dat
Processed 1 file, 65537 regular extents (65537 refs), 0 inline.
Type       Perc     Disk Usage   Uncompressed Referenced
TOTAL       74%      5.9G         8.0G         8.0G
none       100%      4.0K         4.0K         4.0K
zstd        74%      5.9G         7.9G         7.9G

but note that not every extent will compress, so after defrag there will be some garbage blocks lying around until a full rewrite is completed.
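
Spelled out with a placeholder path, that would be something like:

btrfs filesystem defragment -czstd /mnt/backup/.beeshome/beeshash.dat   # one-off rewrite of the existing extents with zstd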

Also sometimes btrfs just does weird stuff:

# btrfs-search-metadata file beeshash.dat | grep ' ram_bytes 4096 '
extent data at 4155375616 generation 7932872 ram_bytes 4096 compression none type regular disk_bytenr 48239637434368 disk_num_bytes 4096 offset 0 num_bytes 4096

bees never writes a 4K block to the hash table; it is always 128K at a time. Maybe the kernel forced writeback at that point?

kakra commented 2 years ago

Okay, looks better now:

# sudo compsize beeshash.dat
Processed 1 file, 8192 regular extents (8192 refs), 0 inline.
Type       Perc     Disk Usage   Uncompressed Referenced
TOTAL       74%      767M         1.0G         1.0G
zstd        74%      767M         1.0G         1.0G

Also sometimes btrfs just does weird stuff

I'll leave that secret to you for uncovering... ;-)

aventrax commented 2 years ago

I think the crawl finished: 781GB of data was reduced to 633GB with the 512M hash table; it took 3 days.

Thanks to your support, I tried to "replace" the disk, but unfortunately the first step (the shrink) failed twice, crashing after 1-2 hours of work. I did not lose anything, but I saw weird errors in the kernel log, so I changed disks and am now trying your settings (128M hash table) with less data (I cleaned up a lot of stuff). The crawl is working now.