Zygo / bees

Best-Effort Extent-Same, a btrfs dedupe agent
GNU General Public License v3.0

Deduplicating small files #117

Open ghost opened 5 years ago

ghost commented 5 years ago

Since normal deduplication works at the block level, files smaller than 4 KiB aren't considered (at least I think so). Could small files also be checked and reflinked? Source trees and mail servers would benefit from this.

Zygo commented 5 years ago

By default, btrfs stores small files (under 2K) as inline extents in metadata, i.e. the file's data appears on disk next to its inode instead of in a separate data block. Inline extents cannot be reflinked in btrfs, so they cannot be deduped, and bees ignores them. If you use the btrfs mount option max_inline=0 [1], small files will be stored in data blocks and then bees can deduplicate them.
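The inline rule described above can be sketched in a few lines of Python. This is a simplified model, not btrfs or bees code: the function name `storage_for` is hypothetical, the threshold is assumed to be inclusive, and the real kernel decision depends on further conditions that are not covered in this thread.

```python
DEFAULT_MAX_INLINE = 2048  # the "2K" default mentioned above


def storage_for(size, max_inline=DEFAULT_MAX_INLINE):
    """Classify where a file of `size` bytes would be stored in this toy model."""
    if size == 0:
        return "none"       # an empty file has no data to store
    if size <= max_inline:  # assumption: the threshold is inclusive
        return "inline"     # kept in metadata; cannot be reflinked or deduped
    return "data"           # kept in a data block; a candidate for dedupe


# Compare the default setting with max_inline=0 for a few file sizes.
for size in (100, 1500, 3000):
    print(size, storage_for(size), storage_for(size, max_inline=0))
```

With `max_inline=0`, every non-empty file in this model lands in a data block, which is the case where bees can dedupe it.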

btrfs requires that every byte of an extent be removed from all subvol trees before the extent can be removed from the filesystem. The size of a file's EOF block might not be a multiple of 4K, but that block still counts as a reference to the entire last extent. If any bytes remain referenced, then none of the space occupied by the extent will be freed. bees will try to dedupe the entire extent when possible, but when this is not possible, bees will perform a separate dedupe operation on the final <4K block. With max_inline=0, small files are simply EOF blocks without any previous data block in the same extent.
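The all-or-nothing rule for freeing an extent can be illustrated with a toy model. This is only a sketch under stated assumptions: references are tracked as plain `(start, end)` byte ranges, the helper names `drop_range` and `freed_bytes` are hypothetical, and nothing here reflects how btrfs or bees actually represent extents.

```python
BLOCK = 4096


def disk_size(length):
    """Round an extent length up to whole 4K blocks."""
    return -(-length // BLOCK) * BLOCK


def drop_range(refs, start, end):
    """Remove [start, end) from a list of (start, end) references to an extent."""
    remaining = []
    for s, e in refs:
        if s < start:
            remaining.append((s, min(e, start)))  # part before the dropped range
        if e > end:
            remaining.append((max(s, end), e))    # part after the dropped range
    return remaining


def freed_bytes(extent_len, refs):
    """Space is freed only when no byte of the extent is still referenced."""
    return 0 if refs else disk_size(extent_len)


extent_len = 3 * BLOCK + 1000            # three full blocks plus a short EOF block
refs = [(0, extent_len)]                 # one file references the whole extent

refs = drop_range(refs, 0, 3 * BLOCK)    # dedupe only the full blocks
print(refs, freed_bytes(extent_len, refs))

refs = drop_range(refs, 3 * BLOCK, extent_len)  # separate dedupe of the EOF block
print(refs, freed_bytes(extent_len, refs))
```

In this model the first step frees nothing, because the 1000-byte tail still pins the whole extent. Only after the tail is deduped as well does the extent's space become reclaimable.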

[1] I can't recommend doing this on btrfs in general because it will add a lot of seek overhead to accessing small files and require more storage for them...but it is a thing you can do on btrfs if you want to watch tiny file deduplication in action.

ghost commented 5 years ago

Thank you for the detailed explanation. I guess there simply aren't any space savings to be had with small files. Would a small file with reflinks still take up a block?

Zygo commented 5 years ago

Yes, the minimum non-inline extent size is 4K (or PAGE_SIZE on CPUs where that is larger than 4K).
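To put rough numbers on that, here is a minimal sketch of the data-block arithmetic for identical small files stored with max_inline=0. The function names are hypothetical, and the model counts data blocks only: it ignores the metadata that each file still needs for its own inode and extent reference.

```python
def min_extent_size(page_size=4096):
    """Minimum non-inline extent size: 4K, or the page size if that is larger."""
    return max(4096, page_size)


def data_usage(n_files, file_size, page_size=4096, deduped=False):
    """Data-block bytes used by n identical non-inline files in this model."""
    unit = min_extent_size(page_size)
    per_extent = -(-file_size // unit) * unit  # round up to whole extents
    extents = 1 if deduped else n_files        # dedupe leaves one shared extent
    return extents * per_extent


# 1000 identical 200-byte files, before and after dedupe.
print(data_usage(1000, 200))
print(data_usage(1000, 200, deduped=True))
```

Even when fully deduped, the shared copy still occupies one whole minimum-size extent, so the saving comes from not repeating that block for every file.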