Zygo / bees

Best-Effort Extent-Same, a btrfs dedupe agent
GNU General Public License v3.0

Toxic hash matches, bees DB size #245

Open petaramesh opened 1 year ago

petaramesh commented 1 year ago

Hi Zygo, and thanks for the best deduplication software ever :)

This is a couple of questions rather than an “issue”.

I have a backup NAS that's been running BTRFS RAID-5 with 3x4TB disks and bees for a couple of years. It was getting quite full, so I dropped in a 4th disk and rebalanced.

Following the bees docs' advice, I stopped bees before rebalancing and destroyed most of my (numerous) snapshots so bees wouldn't have to rescan dozens of mostly identical subvols when doing its new “first pass”.

I erased the .beeshome contents and upped the hash table size to 4GB in the beesd conf (it used to be 2GB), now that with 4 disks I have about 12 TB of usable storage space.
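Concretely, the change amounts to a single variable in the beesd config; a rough sketch of the relevant part, assuming the DB_SIZE variable from the sample beesd.conf (the UUID is a placeholder):

```sh
# /etc/bees/beesd.conf.<UUID> -- sketch, only the relevant bits
UUID=xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx   # placeholder: UUID of the NAS filesystem

# Hash table size in bytes.
# DB_SIZE=$((2 * 1024 * 1024 * 1024))       # the old 2GB value
DB_SIZE=$((4 * 1024 * 1024 * 1024))         # 4GB, now that usable space is ~12TB
```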

Once the balance finished, I restarted bees from a fresh start, so it's doing its “first pass”.

I noticed the following:

So I'm not surprised that all of that compressed data, and in most cases encrypted data, results in mostly unique hashes, thus filling the hash table quickly.

But I wonder whether my hash table size is big enough to provide good deduplication performance once everything is done?

For context: I keep uploading big encrypted machine clones every couple of months, but most of the NAS use is regular daily file backups with a whole lot of snapper snapshots.
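For what it's worth, my back-of-the-envelope reasoning, assuming the sizing guidance from the bees docs (16-byte hash table entries, each covering at least one 4K block), looked roughly like this:

```sh
# Rough coverage estimate for a 4GB hash table (assumption: 16 bytes per
# entry, one entry per dedupe block, per the sizing table in the bees docs).
TABLE_BYTES=$((4 * 1024 * 1024 * 1024))
ENTRIES=$((TABLE_BYTES / 16))                                               # ~268M entries
echo "unique data covered @ 4K blocks:  $((ENTRIES * 4 / 1024 / 1024)) GiB"   # ~1 TiB
echo "unique data covered @ 16K blocks: $((ENTRIES * 16 / 1024 / 1024)) GiB"  # ~4 TiB
```

So with ~12 TB of mostly unique data the table would, if I read the docs right, simply settle on a coarser average dedupe block size rather than stop working.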

Now, about “toxic” extents:

Per beesstats.txt, I have 75 “toxic” hashes in the table, which is a fairly low number, so that seems OK to me.

But, checking the system log while bees processes, I see that just about every jpeg or pdf file processed, and a few compressed files as well, triggers a “WORKAROUND: abandoned toxic match for hash” exception.

So I'm wondering whether bees will give up on deduplicating jpeg and pdf files altogether (which would be a pity, since I know there are a huge number of duplicates in there), or whether it will just skip the matching toxic part (maybe containing some standard jpeg or pdf header or whatever)?
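In case it's useful, this is roughly how I'm counting those messages; the unit name is a placeholder for the beesd@&lt;UUID&gt; template unit I'm running (adjust to wherever your logs go):

```sh
# Count the toxic-match workarounds logged so far.
journalctl -u 'beesd@<UUID>' | grep -c 'WORKAROUND: abandoned toxic match'

# Look at the most recent ones, to see whether the same handful of hashes
# (e.g. a common jpeg/pdf header block) keeps coming back.
journalctl -u 'beesd@<UUID>' | grep 'abandoned toxic match' | tail -n 20
```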

Some insight on this would help me a lot!

Thanks in advance, Best regards.

kakra commented 1 year ago

I think a toxic hash applies to a single block or extent, and it tends to match very common file headers. I had a similar effect with game data files in Steam, which contained lots of toxic hashes, and Zygo explained that for very common headers this is expected: it's a tiny data block with lots of duplicates, in other words high reference counts badly affecting performance for only a little benefit in storage efficiency. Other parts of such files are unaffected. I think there was a discussion that very short duplicate blocks should probably be skipped completely, but I don't know whether that has been implemented since then.
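If you want a feel for how heavily shared one of those header blocks is, something like the following sketch should work; the file path and mount point are placeholders, PHYSICAL_OFFSET is the number you read from filefrag's output, and iirc on btrfs the "physical" offsets reported by filefrag are really btrfs logical addresses, which is what logical-resolve expects:

```sh
# Find the btrfs logical address of the first extent of a suspect file
# (offsets are printed in filesystem blocks, usually 4K, so multiply).
filefrag -v /mnt/nas/some-duplicated.jpg | head

# List every file referencing that block; a toxic header block will
# typically resolve to a very long list.
btrfs inspect-internal logical-resolve $((PHYSICAL_OFFSET * 4096)) /mnt/nas
```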