Toxic hashes matches, bees DB size

Hi Zygo, and thanks for the best deduplication software ever :)

This is a couple questions rather than an “issue”.

I have a backup NAS that's been running BTRFS RAID-5 with 3x4TB disks ans bees for a couple years, it was getting quite full so I dropped in a 4th disks and rebalanced.

Following bees doc advice, I stopped bees before rebalancing and distroyed most of my (numerous) snapshots so bees wouldn't have to rescan dozens of mostly identical subvols when doing its new “first pass”.

I erased .beeshome contents and upped the hash table size to 4GB in beesd conf (used to be 2GB) now that with 4 disks I have about 12 TB of usable storage space.

Once balance finished, I restarted bees, fresh start, so it's doing it's “first pass”.

I noticed the following :

Until beesd had processed about half of my (~220) subvols, the hash table occupancy displayed in beesstats.txt was increasing with a nice Gauss curve that was centered at about 50% when about 50% of the subvols were processed.
Then it started filling very quickly until eventually only leaving a vertical bar at 100%
By that time, bess was processing subvols with a number of compressed machine clones, totalling about 3 TB of compressed data.

So I'm not surprised that all of that compressed data, and in most cases encrypted data, results in mostly unique hashes, thus filling the hash table quiclkly.

But I wonder if my hash table size is big enough and will provide with good deduplication performance, once everything is over ?

Given that I keep uploading big encrypted machine clones some couple months apart, but most of the NAS use is regular daily file backups with a whole lot of snapper snapshots.

Now about “toxic” extents :

Per beesstats.txt, I have 75 “toxic” hashes in the table, which is a fairly low number, so that seems OK to me.

But, checking the system log while bees processess, I see that about every jpeg or pdf file processed, and a few compressed files also, trigger a “WORKAROUND: abandoned toxic match for hash” exception.

So I'm wondering if bees will just drop attempting to dedup all jpeg and pdf files (which would be a pity for I know there is a huge lot of duplicates in there), or if it will just skip deduplicating the matching toxic part (maybe containg some standard jpg or pdf header or whatever) ?

Some insight on this would help me much !

Thanks in advance, Best regards.

Zygo / bees

Toxic hashes matches, bees DB size #245