Zygo / bees

Best-Effort Extent-Same, a btrfs dedupe agent
GNU General Public License v3.0

How to Force Rescan after increasing Hash Table Size? #288

Closed KeinNiemand closed 1 month ago

KeinNiemand commented 3 months ago

Currently I'm running bees with a hash table that's much smaller than recommended, since I'm on a Raspberry Pi 4 with only 2 GB of RAM (I have about 14 TB of usable data but only a 512 MB hash table, so roughly 37 MB per TB). When I upgrade my hardware (keeping the same drives and all the data) to something with more RAM, I want to increase the hash table size to the recommended 128-256 MB per TB. However, I'm unsure what will happen to my existing data once I increase the size: will bees automatically rescan everything so I can get more space savings from the larger hash table, or do I need to do something to force a full rescan and get those extra savings?

I know this is not really a bug or an issue with bees itself, and that GitHub issues aren't really meant for support, but I guess you could call this a documentation issue: it might be a good idea to document how to force the rescan that's required to gain additional savings after a hash table size increase.

kakra commented 3 months ago

I'd say the best way is clearing the beeshome directory and starting fresh. The second-best way is deleting only the hash table and recreating it at the new size; that option forgets all the hash data but won't trigger a rescan of all your snapshots. If you don't have snapshots, starting completely fresh may be a good option.

In theory, you could resize the hash table to the new size but that didn't work well for me in the past.

If you have a high dedup rate, starting completely fresh may temporarily increase the allocated space of the filesystem.
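
For reference, a minimal sketch of the second option in Python. It assumes the usual layout where the hash table lives at BEESHOME/beeshash.dat (the beesd wrapper normally manages this file for you), that bees is stopped while you do it, and that the path and size below are placeholders; check the bees documentation for which table sizes are valid.

```python
#!/usr/bin/env python3
# Sketch: recreate the bees hash table at a new size (stop bees first).
# Assumptions: the BEESHOME path and the beeshash.dat file name; the new
# size should follow the sizing guidance in the bees documentation.
import os

BEESHOME = "/mnt/data/.beeshome"   # hypothetical path, adjust to your setup
NEW_SIZE = 2 * 1024**3             # e.g. 2 GiB for ~14 TB at 128-256 MB/TB

hash_table = os.path.join(BEESHOME, "beeshash.dat")

# Remove the old table so stale, wrongly-positioned entries are not reused...
if os.path.exists(hash_table):
    os.remove(hash_table)

# ...then create an empty sparse file of the new size; bees repopulates it
# as it scans data.
with open(hash_table, "wb") as f:
    f.truncate(NEW_SIZE)
```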

Zygo commented 3 months ago

To save space and store more hashes in a given table size, some information about each entry is coded in its location within the hash table, and there is no index--each table entry is located at a position determined by the hash value modulo the hash table file size. If the hash table size changes, the existing hashes are no longer at the right locations for a search to find them. If the table size is multiplied or divided by a power of 2, some hashes will still be in the right locations, but it's about a 50% loss for every power of 2 step.
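
A toy model of that placement scheme, as a sketch only: it ignores the real bees table layout and the information encoded in each entry's position, and simply places every entry at its hash modulo the table size.

```python
# Toy model: entries are addressed purely by position, with no index,
# so an entry's only possible slot is a function of its hash and the
# table size.

def slot(hash_value: int, table_size: int) -> int:
    return hash_value % table_size

old_size, new_size = 1024, 2048          # one power-of-2 step up

hashes = range(100_000)                  # stand-in for uniformly distributed hashes
survivors = sum(1 for h in hashes if slot(h, old_size) == slot(h, new_size))
print(f"entries still where a lookup expects them: {survivors / len(hashes):.0%}")
# Prints roughly 50%: each power-of-2 step loses about half of the existing
# entries, and arbitrary size changes generally lose even more.
```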

It's possible to build a tool which reads one hash table and inserts the hashes into a new table at the right positions, but it hasn't been done yet.

If you have one subvol and make a lot of snapshots of that subvol, you can set the min_transid field for that subvol to 0 and keep all the other lines of beescrawl.dat. This will rescan that subvol so dedupe matches can be done, without forcing a rescan of every snapshot in the filesystem. That avoids much of the temporary space growth.
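
A sketch of that edit in Python, with the assumptions spelled out: bees is stopped while the file is modified, each beescrawl.dat line consists of whitespace-separated key/value pairs per subvol (including the min_transid field mentioned above), and the path and root IDs below are placeholders for your own.

```python
# Sketch: set min_transid to 0 for chosen subvol root IDs in beescrawl.dat,
# leaving every other line (e.g. the snapshots) at its existing values.
# Assumption: each line is whitespace-separated "key value" pairs, one of
# which is root and one of which is min_transid. Stop bees before editing.
import shutil

BEESCRAWL = "/mnt/data/.beeshome/beescrawl.dat"  # hypothetical path
RESCAN_ROOTS = {"257", "258"}                    # placeholder subvol root IDs

shutil.copy2(BEESCRAWL, BEESCRAWL + ".bak")      # keep a backup copy

rewritten = []
with open(BEESCRAWL) as f:
    for line in f:
        fields = line.split()
        pairs = dict(zip(fields[0::2], fields[1::2]))
        if pairs.get("root") in RESCAN_ROOTS and "min_transid" in pairs:
            pairs["min_transid"] = "0"
            line = " ".join(f"{k} {v}" for k, v in pairs.items()) + "\n"
        rewritten.append(line)

with open(BEESCRAWL, "w") as f:
    f.writelines(rewritten)
```

Only the listed subvols get walked again; their snapshots keep their existing crawl positions, which is what avoids most of the temporary space growth.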

KeinNiemand commented 3 months ago

I have a few snapshots, not a huge amount (I used to make them manually every few months, now I make them twice a month). I also have multiple subvolumes. So which of these is the best solution?

  1. Clearing the entire beeshome directory
  2. Deleting only the hash table
  3. Editing beescrawl.dat to set min_transid to 0 on all subvolumes (except maybe the snapshots?). You mentioned doing that for one subvolume, so does that also work for multiple?

Zygo commented 3 months ago

you mentioned doing that for one subvolume so does that work for multiple?

Yes: if you have subvol A with snapshots A1, A2, A3, etc., and an unrelated subvol B with snapshots B1, B2, B3, etc., you would set min_transid to 0 on subvols A and B, and leave A1, A2, A3, B1, B2, and B3 at their existing values.

KeinNiemand commented 1 month ago

I think that worked fine