Zygo / bees

Best-Effort Extent-Same, a btrfs dedupe agent
GNU General Public License v3.0

bees breaks existing reflinks? #270

Open jaens opened 7 months ago

jaens commented 7 months ago

While running bees (version 0.9.3), I noticed space usage slightly increasing on the disk. From the log, I think I saw it trying to "deduplicate" files that were already "whole-file" reflinked (via e.g. cp --reflink).

Does bees preserve existing reflinks?

kakra commented 7 months ago

Bees will break existing reflinks if it sees other duplicate chains of blocks matching the ones just found. But it will eventually clean up the unreachable extents after some time, so just leave it running for long enough. This behavior is probably already described in the documentation somewhere and is expected, because bees works very differently from other deduplicators.

Also, in your situation, bees will probably add more metadata and thus also increase allocation somewhat.

jaens commented 7 months ago

Thank you for your reply. There is some documentation regarding snapshot gotchas, which I guess might also apply to reflinks?

So, based on what you are saying:

  1. If the reflink content is not duplicated elsewhere, bees will leave it alone? (although this, unfortunately, seems unlikely for my dataset of medium-to-large executable files etc. – it will generally find a page or two in common with some other random file...)
  2. bees, I think, is not perfect at finding duplicated extents (due to e.g. hash table failures or "too many duplicates"?), so technically, if it does try to deduplicate such a reflinked file, the final result might be worse, since it does not dedupe 100% of the file?

kakra commented 7 months ago

If your dataset in snapshots is already highly deduplicated, you can try stopping bees, opening beescrawl.dat, and setting min_transid to the value of max_transid on each line. Then it should leave the already existing extents alone and not scan them. But then you lose an opportunity to deduplicate new content against existing data in the snapshots. See also here, where I saw a similar effect due to loss of beescrawl.dat: https://github.com/Zygo/bees/issues/268
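
If you want to script that rather than edit the file by hand, something like this rough sketch should do it. It assumes beescrawl.dat holds whitespace-separated "key value" pairs on each line with min_transid and max_transid fields; the path is only an example, so point it at your own $BEESHOME, keep a backup, and stop bees first:

```python
# Rough sketch, not an official tool: bump min_transid up to max_transid
# on every line of beescrawl.dat so bees skips already-scanned transids.
# Assumes the file is whitespace-separated "key value" pairs per line.
from pathlib import Path

path = Path("/mnt/data/.beeshome/beescrawl.dat")  # example path; use your $BEESHOME

lines_out = []
for line in path.read_text().splitlines():
    tokens = line.split()
    fields = dict(zip(tokens[0::2], tokens[1::2]))  # pair up "key value" tokens
    if "min_transid" in fields and "max_transid" in fields:
        fields["min_transid"] = fields["max_transid"]
    lines_out.append(" ".join(f"{k} {v}" for k, v in fields.items()))

path.write_text("\n".join(lines_out) + "\n")
```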

Also, if the hash table fills up, bees should automatically start ignoring very small duplicate blocks, so it should avoid creating 4k or 8k reflinks as long as you don't size the hash table too big. The docs have a table of typical average extent size vs. hash table size per unique data size. Trying to dedupe every single small extent is bad for performance, so you should not over-size your hash table.
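
To make that concrete: as far as I can tell from the docs, each hash table entry is 16 bytes and you want roughly one entry per average-sized extent of unique data, so the sizing table boils down to something like this sketch (verify the numbers against the actual table before relying on them):

```python
# Back-of-the-envelope hash table sizing, based on my reading of the bees
# docs: ~16 bytes of hash table per average-sized dedupe extent of unique
# data. Treat the result as a starting point, not a guarantee.

def hash_table_bytes(unique_data_bytes: int, avg_extent_bytes: int) -> int:
    ENTRY_BYTES = 16  # assumed size of one hash table entry
    return unique_data_bytes * ENTRY_BYTES // avg_extent_bytes

TiB, GiB, MiB, KiB = 2**40, 2**30, 2**20, 2**10

print(hash_table_bytes(1 * TiB, 4 * KiB) // GiB, "GiB for 4K average extents")      # 4 GiB
print(hash_table_bytes(1 * TiB, 64 * KiB) // MiB, "MiB for 64K average extents")    # 256 MiB
print(hash_table_bytes(1 * TiB, 1024 * KiB) // MiB, "MiB for 1M average extents")   # 16 MiB
```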

Zygo commented 1 week ago

bees processes one reflink at a time, except in the case where it has to split an extent into duplicate and non-duplicate parts. In that case, the non-duplicate portion is moved to a new extent, and all reflinks that refer to the non-duplicate blocks are replaced at once, but the reflinks to the duplicate blocks are handled one reflink at a time using hash table matching. This means that the non-duplicate portion of the data occupies additional space until the last duplicate blocks are removed from all reflinks.

If the hash table evicts all hashes of the duplicate portion of the data before the last reflink is removed, then both the original extent and a temporary copy of its unique portion will persist in the filesystem. That consumes additional data space.
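
To put hypothetical numbers on that (the sizes below are invented purely to illustrate the accounting, not measurements from bees):

```python
# Invented example sizes: one 128 MiB extent, of which 96 MiB duplicates
# data elsewhere and 32 MiB is unique, referenced by several reflinks.
MiB = 2**20
extent = 128 * MiB        # original extent, pinned until its last reflink is gone
dup    =  96 * MiB        # blocks that duplicate data already on disk
unique = extent - dup     # blocks with no duplicate (the part that gets copied)

before        = extent           # 128 MiB allocated before bees touches it
during        = extent + unique  # 160 MiB while duplicate reflinks are rewritten one by one
after_done    = unique           #  32 MiB once the last duplicate reflink is replaced
after_evicted = extent + unique  # 160 MiB if the hashes are evicted before that happens

for label, size in [
    ("before dedupe", before),
    ("while reflinks are being rewritten", during),
    ("after the last duplicate reflink is replaced", after_done),
    ("if hash entries are evicted too early", after_evicted),
]:
    print(f"{label}: {size // MiB} MiB allocated for this data")
```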

bees also tends to collect unreachable blocks in extents. Extents with unreachable blocks tend to be older than extents with all reachable blocks, and bees always keeps the first extent it encountered when it finds a duplicate. Technically this doesn't allocate any new data space, but it can make it harder to release blocks from deleted files that contain "popular" data.