Zygo / bees

Best-Effort Extent-Same, a btrfs dedupe agent
GNU General Public License v3.0

Bees scanning file over and over #157

Closed mischaelschill closed 3 years ago

mischaelschill commented 3 years ago

I have fairly large VM images in snapshots on the drive running bees. I constantly grep the .status file for "Scanning " to see where the crawler currently is. I discovered that the crawler scans the same image file (in the same snapshot) over and over, even though it is not in use and therefore not changing. It does so even in the (read-only) snapshots. Is this expected behavior? It seems it never gets to the point where it looks at another file, even though a full scan of one file takes hours and I have been watching for days.

I use bees v0.6.3 on a btrfs volume with 6 (SSD) disks in raid1, on kernel 5.9.9.

mischaelschill commented 3 years ago

I just found out that they were actually different files, named so similarly that I erroneously thought they were one and the same.

kakra commented 3 years ago

Also, bees doesn't scan files but extents. As long as changes to the same file add new extents (which is normal in btrfs because it is copy-on-write), bees will scan that data: the data is actually new, so it isn't scanning the same file over and over again.
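To illustrate the point (a toy model, not btrfs internals): under copy-on-write, rewriting a logical block never modifies data in place; it creates a new extent at a new physical location, which is genuinely new data for bees to scan.

```python
# Toy CoW model: overwriting a block appends a new extent instead of
# modifying the old one in place, so the scanner sees new data.
class CowFile:
    def __init__(self):
        self.extents = []      # (logical_block, physical_id)
        self.next_phys = 0     # next free "physical" location

    def write(self, logical_block):
        # CoW: always allocate a fresh physical location.
        self.extents.append((logical_block, self.next_phys))
        self.next_phys += 1

f = CowFile()
f.write(0)
f.write(0)            # rewrite the same logical block
print(f.extents)      # -> [(0, 0), (0, 1)]: same logical offset, new physical data
```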

mischaelschill commented 3 years ago

It does seem to have to "scan" the file for the new extents, even if it does not read the data. These files might well have more than a hundred thousand extents each.

Zygo commented 3 years ago

btrfs provides a mechanism for scanning only "new" metadata in subvols; however, the resolution of "new" is a metadata page, each of which can hold references to up to 300 extents. So if you have a big file, and you're modifying random blocks, and you modify more than 0.3% of the file between scans, then all of the metadata pages are "new" and show up in the next scan's metadata stream. (edit: the overhead for this is pretty low, since we can eliminate the 99.7% of old extents very quickly and scan only the 0.3% new ones; however, it will increase the crawl time).
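The arithmetic above can be sketched as follows (worst case, assuming ~300 extent references per metadata page as stated, and that each randomly placed modification dirties a distinct page):

```python
import math

# ~300 extent references fit in one btrfs metadata page; any page that
# holds a reference to a modified extent shows up as "new" next scan.
REFS_PER_PAGE = 300

def dirty_fraction(total_extents, modified_extents):
    """Worst-case fraction of metadata pages marked 'new' when
    modified_extents randomly placed extents each dirty a distinct page."""
    pages = math.ceil(total_extents / REFS_PER_PAGE)
    return min(pages, modified_extents) / pages

# A 100,000-extent file spans ~334 metadata pages; modifying just 334
# randomly placed extents (~0.33% of the file) can dirty every page.
print(dirty_fraction(100_000, 334))  # -> 1.0
```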

The "Scanning" status usually implies reading the blocks at the same time. By the time we get to "scanning" we already know the location of a new data block to be read. The metadata search where btrfs tells us where to find the new data is called "crawling" in bees, and it usually runs far too fast to see (milliseconds per hour).

If the dedupe splits extents into unique and non-unique extents, then the next pass over the subvol will scan the new unique extents. These will have the same locations within the file, but different physical locations on disk.
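As an illustration of that split (a toy model, not bees' actual data structures): deduping a duplicate range out of the middle of an extent leaves the unique head and tail at the same logical offsets, but referencing different physical locations, so the next pass sees them as new extents.

```python
def split_extent(extent, dup_off, dup_len, shared_phys, copy_phys):
    """Toy split of extent = (logical, length, physical): the duplicate
    middle now references the shared copy (shared_phys); the unique
    head/tail are written out to a new physical location (copy_phys)."""
    lo, ln, phys = extent
    parts = [
        (lo, dup_off, copy_phys),                  # unique head, relocated
        (lo + dup_off, dup_len, shared_phys),      # deduped middle
        (lo + dup_off + dup_len,
         ln - dup_off - dup_len,
         copy_phys + dup_off),                     # unique tail, relocated
    ]
    return [p for p in parts if p[1] > 0]          # drop empty pieces

print(split_extent((0, 12, 100), 4, 4, 900, 200))
# -> [(0, 4, 200), (4, 4, 900), (8, 4, 204)]: logical layout unchanged,
#    physical references changed
```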

Zygo commented 3 years ago

oops, wrong button ;)