Zygo / bees

Best-Effort Extent-Same, a btrfs dedupe agent
GNU General Public License v3.0

size of data dedupped? #101

Open dim-geo opened 5 years ago

dim-geo commented 5 years ago

Hi, how can I tell how much data has been deduped so far? Which field in BEESSTATUS describes that? I would propose documenting BEESSTATUS...

kakra commented 5 years ago

It's the "dedup_bytes" counter under "TOTAL", as far as I can tell. But it may not reflect the real number of bytes saved on disk, both because of how btrfs internally handles shared reflinks and because you may have broken the reflinks again by writing to such extents. So you should also compare "df" before and after. Also, the status file may not be preserved across restarts of the bees service.
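A rough sketch of that comparison in Python, for anyone who wants to script it. It assumes the status file lists counters as whitespace-separated `name=value` tokens under section headers such as `TOTAL:`; I have not checked that against the bees source, so treat the format as an assumption. `read_counter` and `used_bytes` are hypothetical helper names, and the path to the status file is left to the caller.

```python
import shutil


def read_counter(status_text, name, section="TOTAL"):
    """Return an integer counter from one section of the status text.

    ASSUMPTION: sections start with a line like "TOTAL:" and counters
    appear as whitespace-separated name=value tokens below it.
    Returns None if the counter is missing or is not an integer.
    """
    current = None
    for line in status_text.splitlines():
        stripped = line.strip()
        # A line such as "TOTAL:" (no "=" in it) starts a new section.
        if stripped.endswith(":") and "=" not in stripped:
            current = stripped[:-1]
            continue
        if current != section:
            continue
        for token in stripped.split():
            key, sep, value = token.partition("=")
            if sep and key == name:
                try:
                    return int(value)
                except ValueError:
                    return None
    return None


def used_bytes(mount_point):
    """Used space on the filesystem, roughly what "df" reports."""
    return shutil.disk_usage(mount_point).used
```

Record `used_bytes()` for the mount point before and after a bees run and compare the difference with the change in `dedup_bytes`. Neither number is exact: other writes also move the "used" figure, and `dedup_bytes` is not a count of bytes freed.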

Zygo commented 5 years ago

In order to dedupe an extent, bees must present all references to the extent (e.g. from snapshots, clones, and previous dedupes by other tools) to the dedupe ioctl, where they will be deleted one at a time. btrfs will automatically delete the data extent when the last reference to the data is deleted. dedup_bytes counts the number of bytes presented to the dedupe ioctl, which can be quite different from the number of bytes removed. dedup_bytes also counts non-space-freeing operations, like temporary data copies to split mixed extents into all-unique and all-duplicate extents, and dedupe attempts that fail due to mismatched data caused by concurrent data modification.

bees presents each individual extent reference to the dedupe ioctl as it is detected, and does not detect when the last reference to extent data is deleted. btrfs already does that, so bees doesn't have to. This means bees currently has no reliable way to know when it has freed any space.
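To make the gap between the two numbers concrete, here is a toy model in Python. It is only an illustration of the reference counting described above, not how bees or btrfs are implemented: `ToyFilesystem`, `add_extent`, and `dedupe_ref` are made-up names, and the model ignores the failed dedupes and temporary copies that also inflate dedup_bytes.

```python
from dataclasses import dataclass, field


@dataclass
class ToyFilesystem:
    """Toy model of extents with reference counts (not real btrfs)."""
    extents: dict = field(default_factory=dict)  # id -> [size, refcount]
    dedup_bytes: int = 0  # bytes presented to the simulated dedupe call
    freed_bytes: int = 0  # bytes released when an extent loses its last ref

    def add_extent(self, extent_id, size, refs):
        self.extents[extent_id] = [size, refs]

    def dedupe_ref(self, src_id, dst_id):
        """Repoint one reference from the src extent to the dst extent."""
        size, refs = self.extents[src_id]
        self.dedup_bytes += size       # counted on every call
        self.extents[dst_id][1] += 1   # dst gains the reference
        refs -= 1
        if refs == 0:
            # Only the last reference going away frees the data.
            del self.extents[src_id]
            self.freed_bytes += size
        else:
            self.extents[src_id][1] = refs


if __name__ == "__main__":
    fs = ToyFilesystem()
    fs.add_extent("A", 1 << 20, refs=3)  # e.g. one file plus two snapshots
    fs.add_extent("B", 1 << 20, refs=1)  # identical data elsewhere
    for _ in range(3):
        fs.dedupe_ref("A", "B")
        print(fs.dedup_bytes, fs.freed_bytes)
```

In this model, three references to a 1 MiB extent take three dedupe calls, so dedup_bytes reaches 3 MiB while only 1 MiB is freed, and nothing at all is freed until the third call.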

If you've been running bees for a while, compsize may give you a better idea than df of how much dedupe has occurred, especially if you are constantly filling the free space with more data. On the other hand, compsize works by brute-force mapping the entire filesystem, so it can take a while.