Zygo / bees

Best-Effort Extent-Same, a btrfs dedupe agent
GNU General Public License v3.0

What is too little memory for hash table? #192

Closed: MarkoPaasila closed this issue 3 years ago

MarkoPaasila commented 3 years ago

Is there a lower threshold below which the performance of bees significantly drops? What would be an indicator of having too little memory allocated for the hash table? I know it would be hard to know up front, but after running bees for a few days, would it be obvious from the output?

Zygo commented 3 years ago

It's not obvious from the output, because the output only reports the matches bees found. There is no information about matches that were missed because the hash table is too small. bees is designed to provide best-effort results, so there will still be a lot of deduplication happening even when the hash table is absurdly small.

The most reliable (but not very practical) way to find the optimum hash table size is to reset the filesystem to its initial state (e.g. roll back a snapshot of a VM disk image containing a random sample of the data), then run bees again with double or half the hash table size and see whether it finds more duplicates (or enough more duplicates to justify the extra RAM cost). The results of such a test only apply to the filesystem that was tested: data content and write order affect the hash table size required, but the ratio of data size to hash table size falls between 1000:1 and 10000:1 in all but the most extreme cases.
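
To give a rough sense of what that ratio means in practice, here is a small back-of-the-envelope sketch (purely illustrative, not part of bees) that converts an amount of unique data into the hash table size range implied by the 1000:1 and 10000:1 ratios:

```python
# Illustrative helper only: translate the 1000:1 to 10000:1
# data-to-hash-table ratio into concrete sizes.

def hash_table_range_bytes(unique_data_bytes):
    """Return (low, high) hash table sizes in bytes implied by 10000:1 and 1000:1."""
    return unique_data_bytes // 10000, unique_data_bytes // 1000

tib = 1 << 40
low, high = hash_table_range_bytes(1 * tib)
# For 1 TiB of unique data this prints roughly "105 MiB .. 1049 MiB".
print(f"{low / (1 << 20):.0f} MiB .. {high / (1 << 20):.0f} MiB")
```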

There are stats for the hash table eviction rate (hash_evict) vs. the insertion rate (hash_insert), but these will not tell us very much. bees normally evicts hash table entries either because they are out of date, or because they are up to date but not matching any new data. The stats don't distinguish between those cases, because bees would have to read the blocks from disk to tell whether the data is up to date, and there can be many thousands of hash table evictions every second.

It might be possible to implement a diagnostic mode where bees samples some of the evicted hashes (say one in ten thousand, or one per second, whichever is less) to see if they are still up to date:

- If a large fraction of the evicted hashes are still up to date (90% or more), that would indicate the hash table is too small and potential duplicates are being lost.
- If the evicted hashes are mostly out of date (90% or more), the hash table is larger than necessary, and memory is wasted storing hashes of unique data that won't match anything.
- If both fractions fall somewhere between 10% and 90%, the hash table is about the right size: unique data is being evicted and we aren't losing too many duplicate blocks.
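
As a rough sketch of what such a diagnostic mode might look like: the class below is not bees code, and `read_block`, `hash_block`, and the `entry.address` / `entry.hash` fields are hypothetical stand-ins for bees internals, but it shows the sampling and classification logic described above.

```python
# Sketch of the proposed eviction-sampling diagnostic (illustration only).
import random
import time

SAMPLE_PROBABILITY = 1 / 10000   # sample at most one in ten thousand evictions...
MIN_SAMPLE_INTERVAL = 1.0        # ...and at most one eviction per second

class EvictionSampler:
    def __init__(self):
        self.last_sample_time = 0.0
        self.sampled = 0
        self.still_up_to_date = 0

    def on_evict(self, entry, read_block, hash_block):
        """Call on each hash table eviction; checks a small sample of them."""
        if random.random() > SAMPLE_PROBABILITY:
            return
        now = time.monotonic()
        if now - self.last_sample_time < MIN_SAMPLE_INTERVAL:
            return
        self.last_sample_time = now
        self.sampled += 1
        # An evicted entry is "still up to date" if the block at its stored
        # address still hashes to the stored hash, i.e. a potential duplicate
        # match was lost by evicting it.
        if hash_block(read_block(entry.address)) == entry.hash:
            self.still_up_to_date += 1

    def verdict(self):
        if self.sampled == 0:
            return "no samples yet"
        rate = self.still_up_to_date / self.sampled
        if rate >= 0.9:
            return "hash table probably too small (losing potential duplicates)"
        if rate <= 0.1:
            return "hash table probably larger than necessary"
        return "hash table size looks about right"
```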

MarkoPaasila commented 3 years ago

I had missed the part in the docs where you advise that less than 128 MB of hash table per TB of data might significantly lower performance, and that compressed filesystems would likely need 2-4x more. That knowledge suffices for me, and your explanation above makes it clear that anything more accurate would be quite hard to achieve.
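
For anyone landing here later, that rule of thumb (128 MB per TB is roughly an 8000:1 ratio, inside the range quoted above) can be applied with a tiny helper like this; the function name and parameters are made up for illustration and are not from the bees docs:

```python
# Hypothetical sizing helper applying the guideline quoted above:
# about 128 MB of hash table per TB of data, 2-4x more for compressed data.

def suggested_hash_table_mb(data_tb, compressed=False, compression_factor=4):
    """Suggest a hash table size in MB for data_tb terabytes of data."""
    base_mb = 128 * data_tb
    return base_mb * compression_factor if compressed else base_mb

print(suggested_hash_table_mb(4))                   # 512 MB for 4 TB, uncompressed
print(suggested_hash_table_mb(4, compressed=True))  # 2048 MB with the 4x factor
```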