Zygo / bees

Best-Effort Extent-Same, a btrfs dedupe agent
GNU General Public License v3.0

Defragment then dedup #121

Open HaleTom opened 4 years ago

HaleTom commented 4 years ago

I notice that BTRFS dedupe calls the defrag ioctl before dedupe.

Does defragmentation currently occur with the way bees is written?

Would such a feature be possible? Would it work with RO snapshots?
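For reference, the two kernel interfaces involved are the btrfs defrag ioctl (BTRFS_IOC_DEFRAG_RANGE) and the generic dedupe ioctl (FIDEDUPERANGE), which only shares ranges whose contents already match. A minimal standalone sketch, not code from bees or any particular dedupe tool; the paths, offsets, and lengths are placeholders:

```cpp
// Sketch only: defragment one file, then share 1 MiB of identical data with another.
// Error handling is minimal; file paths and offsets are placeholders.
#include <fcntl.h>
#include <sys/ioctl.h>
#include <unistd.h>
#include <linux/btrfs.h>   // BTRFS_IOC_DEFRAG_RANGE, btrfs_ioctl_defrag_range_args
#include <linux/fs.h>      // FIDEDUPERANGE, file_dedupe_range
#include <cstdint>
#include <cstdlib>
#include <cstdio>

int main() {
    int src = open("/mnt/data/a.bin", O_RDWR);   // defrag needs a writable fd
    int dst = open("/mnt/data/b.bin", O_RDWR);
    if (src < 0 || dst < 0) { perror("open"); return 1; }

    // Step 1: ask btrfs to defragment the whole source file.
    btrfs_ioctl_defrag_range_args defrag = {};
    defrag.len = (uint64_t)-1;                   // entire file
    if (ioctl(src, BTRFS_IOC_DEFRAG_RANGE, &defrag) < 0)
        perror("BTRFS_IOC_DEFRAG_RANGE");

    // Step 2: dedupe 1 MiB that is already identical in both files.
    size_t sz = sizeof(file_dedupe_range) + sizeof(file_dedupe_range_info);
    auto *dd = (file_dedupe_range *)calloc(1, sz);
    dd->src_offset = 0;
    dd->src_length = 1 << 20;                    // the kernel verifies the data matches
    dd->dest_count = 1;
    dd->info[0].dest_fd = dst;
    dd->info[0].dest_offset = 0;
    if (ioctl(src, FIDEDUPERANGE, dd) < 0)
        perror("FIDEDUPERANGE");
    else if (dd->info[0].status == FILE_DEDUPE_RANGE_SAME)
        printf("deduped %llu bytes\n", (unsigned long long)dd->info[0].bytes_deduped);

    free(dd);
    close(src);
    close(dst);
    return 0;
}
```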

Zygo commented 4 years ago

Currently bees does not do any defragmentation--in fact, bees adds fragmentation where required for better dedupe rates.

None of the current btrfs dedupe agents choose extents to dedupe based on fragmentation: duperemove and deduper don't consider extent structure, and the other deduplicators are file-oriented. bees does consider extent structure, but it uses that information to determine where extra fragmentation is needed to remove duplicate data blocks (the duplicate blocks are deleted, and an extra fragment is created for the leftover unique data).
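To make the "extra fragment" point concrete, here is a small conceptual illustration (not bees code; the names and numbers are invented): when only part of an extent matches duplicate data elsewhere, the matching part becomes a shared reference, and the unique head and/or tail must land in new, smaller extents.

```cpp
// Conceptual illustration of why dedupe adds fragments (not bees code).
#include <cstdint>
#include <cstdio>
#include <vector>

struct Range { uint64_t start, len; };

// Given one extent and the duplicate sub-range found inside it, return the
// unique leftover ranges that need fresh (smaller) extents after dedupe.
std::vector<Range> leftover_unique(Range extent, Range dup) {
    std::vector<Range> out;
    if (dup.start > extent.start)
        out.push_back({extent.start, dup.start - extent.start});
    uint64_t dup_end = dup.start + dup.len;
    uint64_t ext_end = extent.start + extent.len;
    if (dup_end < ext_end)
        out.push_back({dup_end, ext_end - dup_end});
    return out;
}

int main() {
    // A 1 MiB extent where only the middle 512 KiB is duplicate data:
    // dedupe removes the middle, and two new fragments hold the unique head and tail.
    for (const Range &r : leftover_unique({0, 1 << 20}, {256 << 10, 512 << 10}))
        printf("new fragment at %llu, %llu bytes\n",
               (unsigned long long)r.start, (unsigned long long)r.len);
    return 0;
}
```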

If you have a RO snapshot of some data (e.g. subvol A and snapshot A_ro), and you defragment subvol A, then all known btrfs deduplicators may replace extents in subvol A with references to the non-defragmented snapshot A_ro, which effectively undoes the defrag operation. In some deduplicators it's random which extents persist after dedupe; in bees, it's more likely the older extents survive (i.e. the fragmented ones).

In the future I intend to have bees handle defrag, dedupe, and btrfs garbage collection all at the same time, deciding on an individual extent basis whether it is better to combine the extent with its neighbors (defrag) or replace the extent with a reference to identical data (dedupe). That requires a rewrite of the existing code--it's different enough that I might even give the project a different name when it's done.

"defrag" and "dedupe" operations are opposites of each other. Running both over a large collection of files means each operation undoes some or all of the other's work. That said, you can certainly run defrag before running bees, and bees will put the required fragments back in while it removes the duplicates, but it will take much longer because everything is being read and written twice.

As of kernel 5.2, RO snapshots can be deduped in all cases; however, it is not possible to run btrfs send at the same time as dedupe. Current kernels reject dedupe requests while send is running; older kernels will throw errors in the send operation or may even crash.

Zygo commented 4 years ago

Sorry, that should be "as of kernel 5.3..."

https://github.com/Zygo/bees/issues/115#issuecomment-515485921

HaleTom commented 4 years ago

Thanks a bundle for this detailed explanation! ❤️

Zygo commented 4 years ago

OK so it's 5.2.7 (the last patch has been backported to stable/linux-5.2.y)....

kakra commented 4 years ago

I wondered whether it would be possible to watch the log (or some specialized socket pipeline yet to be created for that purpose) for the files that bees has handled, then let a sibling process walk those files, check their extents, and defragment/recombine all extents that are not shared (shared extent flag not set).

This would still not recombine extents deduped by bees, but it would apply a delayed defrag after a file is written or modified: such a delayed defrag process could analyze the extent structure and rewrite all blocks that are either located far away from each other or split, recombining them into fewer extents as long as those blocks don't belong to shared extents.
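One way such a sibling process could detect which extents are safe to rewrite is the FIEMAP ioctl, which reports a per-extent shared flag. A rough sketch under that assumption (none of this is bees code; the helper name, the path, and the 512-extent batch are invented, and larger files would need to loop). As noted in the reply below, computing sharing information is itself expensive on btrfs:

```cpp
// Sketch: count the non-shared extents of a file via FIEMAP. A delayed-defrag
// helper could rewrite the file only if it has many small, unshared extents.
#include <fcntl.h>
#include <sys/ioctl.h>
#include <unistd.h>
#include <linux/fs.h>       // FS_IOC_FIEMAP
#include <linux/fiemap.h>   // fiemap, fiemap_extent, FIEMAP_EXTENT_SHARED
#include <cstdlib>
#include <cstdio>

static int count_unshared_extents(int fd) {
    const unsigned max_extents = 512;               // one batch; loop for bigger files
    size_t sz = sizeof(fiemap) + max_extents * sizeof(fiemap_extent);
    auto *fm = (fiemap *)calloc(1, sz);
    fm->fm_start = 0;
    fm->fm_length = ~0ULL;                          // whole file
    fm->fm_extent_count = max_extents;
    if (ioctl(fd, FS_IOC_FIEMAP, fm) < 0) { perror("FIEMAP"); free(fm); return -1; }

    int unshared = 0;
    for (unsigned i = 0; i < fm->fm_mapped_extents; ++i)
        if (!(fm->fm_extents[i].fe_flags & FIEMAP_EXTENT_SHARED))
            ++unshared;                             // candidate for recombination
    free(fm);
    return unshared;
}

int main(int argc, char **argv) {
    int fd = open(argc > 1 ? argv[1] : "/mnt/data/a.bin", O_RDONLY);
    if (fd < 0) { perror("open"); return 1; }
    printf("unshared extents: %d\n", count_unshared_extents(fd));
    close(fd);
    return 0;
}
```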

BTW: We could make it part of the bees project and call it "hive", as the bees come back and combine their efforts here. :-)

Zygo commented 4 years ago

It would be better to put that into a thread in bees, since it already has the extent structure cached, and bees knows the extents are unique because it created them. It's expensive to figure out if an extent is shared in btrfs--the kernel doesn't cache that information, and the cost is proportional to file size (among other multipliers).

Ideally bees would defer creation of temporary extents until we have enough of them to make large ones. Dedupe triggers a data flush on the src file, so it is better to cycle through collecting a few hundred MB of extent data that needs to be relocated, writing temporary files, and then replacing original data with references to the temporary files (as opposed to doing all three steps one at a time as bees does now). Once that component is in place, any data copied to temporary extents will be effectively defragged, a priority queue will sort the temporary extents in file order for output, and a helper thread can fill gaps between them.
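A toy sketch of that batching idea, purely to illustrate its shape (none of this is bees code; the class, the file-order heap, and the 256 MB threshold are all invented): relocation work is queued until a byte budget is reached, then flushed in file order, so the temporary extents could be written in one large burst before the originals are replaced with dedupe references.

```cpp
// Toy sketch of batching temporary-extent work (invented names, not bees code).
#include <cstdint>
#include <cstdio>
#include <queue>
#include <string>
#include <tuple>
#include <utility>
#include <vector>

struct RelocateItem {
    std::string path;     // file containing the unique data to copy out
    uint64_t offset;      // byte offset of the leftover unique range
    uint64_t length;      // bytes to copy into a temporary extent
};

struct ByFileOrder {
    bool operator()(const RelocateItem &a, const RelocateItem &b) const {
        // Invert the comparison so the priority_queue pops lowest path/offset first.
        return std::tie(a.path, a.offset) > std::tie(b.path, b.offset);
    }
};

class RelocateBatch {
    std::priority_queue<RelocateItem, std::vector<RelocateItem>, ByFileOrder> queue_;
    uint64_t queued_bytes_ = 0;
    static constexpr uint64_t kFlushThreshold = 256ULL << 20;  // "a few hundred MB"

public:
    void add(RelocateItem item) {
        queued_bytes_ += item.length;
        queue_.push(std::move(item));
        if (queued_bytes_ >= kFlushThreshold) flush();
    }

    void flush() {
        // A real implementation would (1) copy each queued range into large
        // temporary files, then (2) replace the originals with dedupe references.
        // Here we only print the planned work in file order.
        while (!queue_.empty()) {
            const RelocateItem &it = queue_.top();
            printf("copy %s @%llu len %llu\n", it.path.c_str(),
                   (unsigned long long)it.offset, (unsigned long long)it.length);
            queue_.pop();
        }
        queued_bytes_ = 0;
    }
};
```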

kakra commented 4 years ago

so it is better to cycle through collecting a few hundred MB of extent data that needs to be relocated

I think the max extent size is 128M, isn't it? So that amount should do. With today's HDDs this isn't really a huge amount of data. According to htop, bees reaches read throughputs of 200-300 MB/s on my system (on bcache with 4 HDDs draid0), sometimes constantly for multiple minutes, without any impact on desktop performance. BTW @Zygo Good work :-)

Zygo commented 4 years ago

I think the max extent size is 128M isn't it?

That's the absolute maximum; the useful/practical maximum for an average output extent is much smaller. However, on rotating media you probably want to batch up more than one temporary extent before you trigger allocation and move the heads...especially for compressed extents, which max out at 128K.

kakra commented 4 years ago

on rotating media you probably want to batch up more than one temporary extent

Ah okay that's the reasoning. Got it.