petaramesh opened 1 year ago
I think the toxic hash is for a single block or extent, and it tends to match very common file headers. I had a similar effect with game data files in Steam which contained lots of toxic hashes, and Zygo explained that for very common headers, this is expected: it's a tiny data block with lots of duplicates, in other words, high reference counts badly affecting performance for only a little benefit in storage efficiency. Other parts of such files are unaffected. I think there was a discussion that very short duplicate blocks should probably be skipped completely, but I don't know if that has been implemented since then.
Hi Zygo, and thanks for the best deduplication software ever :)
This is a couple of questions rather than an “issue”.
I have a backup NAS that's been running BTRFS RAID-5 with 3x4TB disks and bees for a couple of years. It was getting quite full, so I dropped in a 4th disk and rebalanced.
Following the bees doc advice, I stopped bees before rebalancing and destroyed most of my (numerous) snapshots so bees wouldn't have to rescan dozens of mostly identical subvols when doing its new “first pass”.
I erased .beeshome contents and upped the hash table size to 4GB in beesd conf (used to be 2GB) now that with 4 disks I have about 12 TB of usable storage space.
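For reference, the change described above would look roughly like the following in the beesd config. This is a sketch, not my exact file: the config path varies by distribution, and the UUID placeholder is hypothetical; `DB_SIZE` is the hash table size in bytes per the beesd sample config.

```shell
# /etc/bees/<filesystem-UUID>.conf  (path is distribution-dependent)
# Hash table size in bytes; raised from 2GiB to 4GiB after adding the 4th disk.
DB_SIZE=$((4 * 1024 * 1024 * 1024))
```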
Once the balance finished, I restarted bees from a fresh start, so it's doing its “first pass”.
I noticed the following:
So I'm not surprised that all of that compressed (and in most cases encrypted) data results in mostly unique hashes, thus filling the hash table quickly.
But I wonder if my hash table size is big enough to provide good deduplication performance once everything is done?
Given that I keep uploading big encrypted machine clones every couple of months, while most of the NAS use is regular daily file backups with a whole lot of snapper snapshots.
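As a sanity check on the 4GB figure: the bees docs suggest sizing the hash table at roughly one 16-byte entry per average-sized extent of unique data. A back-of-envelope calculation, assuming a 64KiB average dedupe extent size (an assumption, not a measurement; smaller extents would need a proportionally larger table):

```python
# Rough hash-table sizing per the bees docs' rule of thumb:
# one 16-byte entry per average-sized extent of unique data.
ENTRY_BYTES = 16
unique_data = 12 * 2**40   # ~12 TiB usable space, worst case all unique
avg_extent = 64 * 2**10    # assumed 64 KiB average dedupe extent size

table_bytes = unique_data // avg_extent * ENTRY_BYTES
print(table_bytes / 2**30)  # → 3.0 (GiB), so a 4 GiB table has some headroom
```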
Now about “toxic” extents:
Per beesstats.txt, I have 75 “toxic” hashes in the table, which is a fairly low number, so that seems OK to me.
But, checking the system log while bees processes, I see that nearly every JPEG or PDF file processed, and a few compressed files as well, triggers a “WORKAROUND: abandoned toxic match for hash” exception.
So I'm wondering if bees will just give up deduplicating JPEG and PDF files entirely (which would be a pity, for I know there is a huge lot of duplicates in there), or if it will just skip the matching toxic part (maybe containing some standard JPEG or PDF header or whatever)?
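One way to see whether the many log lines actually come from only a handful of toxic hashes (matching the 75 entries in beesstats.txt) is to count distinct hashes in the workaround messages. A sketch over invented sample lines, since the exact bees log format may differ; the hash values and paths below are hypothetical:

```shell
# Hypothetical log lines in the style of the bees workaround message
# (hashes and file paths are invented for illustration):
log='bees: WORKAROUND: abandoned toxic match for hash 0xdeadbeef in a.jpg
bees: WORKAROUND: abandoned toxic match for hash 0xdeadbeef in b.jpg
bees: WORKAROUND: abandoned toxic match for hash 0xcafef00d in c.pdf'

# Total workaround events vs. distinct toxic hashes involved:
events=$(printf '%s\n' "$log" | grep -c 'abandoned toxic match')
distinct=$(printf '%s\n' "$log" | grep -o 'hash 0x[0-9a-f]*' | sort -u | wc -l)
echo "$events events from $distinct distinct toxic hashes"
```

Many events mapping to few distinct hashes would suggest only the common header blocks are being skipped, not whole files.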
Some insight on this would help me a lot!
Thanks in advance, Best regards.