kakra opened this issue 6 years ago
If we read the checksums directly out of the checksum tree, we can use them instead of reading and hashing blocks. 32 bits is too short for a hash: the collision rate after 16TB of unique data is too high, and about 48-54 bits are required for filesystems up to ~100TB. However, two consecutive 4K-block CRC32s can be concatenated to make a 64-bit hash, and that is long enough. If done correctly, bees won't need to read files at all while scanning them; only the reads during compare and dedup will be required. We'd lose some efficiency on compressed files, but I think the current design, which tries to chop up compressed extents into smaller dedupable pieces, isn't saving as much space as I'd like for the time it takes.
This stops us from deduping nodatasum extents at all, doesn't it? So we'd still need to read such extents manually and generate hashes for them.
@Zygo wrote:
Yes, there would have to be some kind of fallback to internal sum generation to dedup when there are no-sum extents.
I think if we use TREE_SEARCH_V2 to test for sums, we have to read the sums into bees memory anyway, so we might as well do something useful with them. If the order was read-subvol-tree then read-csum-tree and we get csums, we can skip the block reads; otherwise, we proceed to read the blocks and generate sums.
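As a sketch of what keying such a csum-tree query could look like: the constants below are mirrored from linux/btrfs_tree.h, and the struct is a simplified stand-in for the key portion of btrfs_ioctl_search_args_v2 (the helper name is illustrative, not bees code).

```c
#include <stdint.h>

/* Constants mirrored from linux/btrfs_tree.h. */
#define BTRFS_CSUM_TREE_OBJECTID    7ULL
#define BTRFS_EXTENT_CSUM_OBJECTID  ((uint64_t)-10)
#define BTRFS_EXTENT_CSUM_KEY       128

/* Simplified stand-in for the key fields that real code would set in
 * btrfs_ioctl_search_args_v2 before calling BTRFS_IOC_TREE_SEARCH_V2. */
struct csum_search_key {
    uint64_t tree_id;   /* which tree to search */
    uint64_t objectid;  /* csum items all share one objectid */
    uint64_t offset;    /* logical bytenr the sums should cover */
    uint32_t type;      /* item type */
};

/* Fill in the search key for csums covering logical address 'bytenr'.
 * Hypothetical helper: real code would also set max_* bounds and a
 * transid range, then loop over the returned item buffer. */
static void make_csum_key(struct csum_search_key *k, uint64_t bytenr)
{
    k->tree_id  = BTRFS_CSUM_TREE_OBJECTID;
    k->objectid = BTRFS_EXTENT_CSUM_OBJECTID;
    k->type     = BTRFS_EXTENT_CSUM_KEY;
    k->offset   = bytenr;
}
```

If the csum tree returns no items for the extent's range, that is the "no-sum extent" case, and the crawler falls back to reading the blocks and hashing them itself.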
[...]
The block size could still be 4K, just overlap the blocks (i.e. the DDT entries are (sum block 1, sum block 2), (sum block 2, sum block 3), (sum block 3, sum block 4), ...). We don't want to increase the alignment constraint or we lose the ability to dedup well inside VM images (we find only the half of the blocks that happen to match on 8K boundaries). Once we've found a pair of matching 4K-aligned 8K-length blocks, the alignment offset and extent boundaries are known, and we can use larger IO sizes for the final dedup.
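The overlapping-window scheme above can be sketched as follows; the function names are illustrative, not bees internals. Given the per-4K CRC32s for an extent, each DDT key is the concatenation of two adjacent sums, so an extent with N 4K blocks yields N-1 overlapping 64-bit hashes and matching still works at any 4K-aligned offset.

```c
#include <stdint.h>
#include <stddef.h>

/* Concatenate two consecutive 4K-block CRC32s into one 64-bit hash. */
static uint64_t csum_pair_hash(uint32_t sum_a, uint32_t sum_b)
{
    return ((uint64_t)sum_a << 32) | sum_b;
}

/* Build the overlapping DDT entries (sum1,sum2), (sum2,sum3), ...
 * so the 8K hash window still slides on 4K boundaries.
 * Returns the number of hashes written: n_sums - 1 (0 if n_sums < 2). */
static size_t overlapping_hashes(const uint32_t *sums, size_t n_sums,
                                 uint64_t *out)
{
    if (n_sums < 2)
        return 0;
    for (size_t i = 0; i + 1 < n_sums; i++)
        out[i] = csum_pair_hash(sums[i], sums[i + 1]);
    return n_sums - 1;
}
```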
Speaking of alignment, another drawback (and the killer reason why I haven't pursued reading the csum tree directly so far) is that we lose the ability to dedup compressed extents well. This is because the csums apply to the compressed data, not the uncompressed data. Uncompressed data and compressed data would have different csums. That's OK if your filesystem has a lot of e.g. unpacked source trees with lots of single-extent compressed files, but that's pretty much the only case that still works. One particular case that doesn't work is "log file vs. copy of the same log file" because the different writing patterns change the compressed extent boundaries, so the sums don't match when the data is identical.
On the other hand, if you don't use compression then the drawback doesn't apply. All you lose then is the ability to dedup 4K extents, and that's usually less than 1% of the total filesystem size.
I have a very different plan to deal with the performance of small (<12K) extents: batch them up into bigger extents in the extent rewriter so that they're not small any more. This effectively turns bees into a unified defrag+dedup tool, and counteracts the tendency for average extent size to get smaller as bees runs.
[...]
Update after a read of the kernel sources and some tests:
btrfs decides whether to read checksums or not by looking at the (hidden) NODATASUM inode flag. The 'nodatasum' mount option controls whether newly created files get this flag (and nodatacow determines the NODATASUM/NOCOW_FL flag state). That means once a file has been created, it's nodatasum or datasum (determined at creation time by the mount options) forever, regardless of what mount options are used in the future. There is one good reason for this: when we are reading files via the inode (the obviously common case, i.e. the POSIX read() function), we want to know if we have csums to verify without having to do the csum tree lookup every time. It's a bit messier for the relocation code (which starts from the extent and has to look up an inode much like bees does) but that's a less common case so it makes sense to be more awkward.
That means we should not patch the kernel, i.e. the restriction serves a useful purpose. bees needs to deal with this case.
Bees can read the NODATASUM flag directly from the btrfs inode, bypassing the GETFLAGS ioctl. bees already gets the raw inode data: it's the INODE_ITEM in the TREE_SEARCH_V2 results. We can't avoid reading those in the crawler; bees currently just throws the data away.
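A minimal sketch of that check, assuming the flags word has already been pulled out of the INODE_ITEM returned by TREE_SEARCH_V2; the bit values are mirrored from linux/btrfs_tree.h, and the helper name is hypothetical.

```c
#include <stdint.h>

/* Inode flag bits mirrored from linux/btrfs_tree.h. */
#define BTRFS_INODE_NODATASUM  (1ULL << 0)
#define BTRFS_INODE_NODATACOW  (1ULL << 1)

/* Given the flags field of a btrfs_inode_item that TREE_SEARCH_V2
 * already handed us, decide whether a csum-tree lookup can succeed.
 * Returns nonzero if the file's extents carry checksums. */
static int inode_has_csums(uint64_t inode_item_flags)
{
    return (inode_item_flags & BTRFS_INODE_NODATASUM) == 0;
}
```

This is the cheap early-exit: if the flag is set, the crawler skips the csum-tree query and goes straight to the read-and-hash fallback.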
A NODATACOW file seems to be usable as a source file for a NODATASUM target in the extent rewriter. This would enable us to generate the two kinds of temp files required. The nodatasum flag can be turned off on empty files by clearing the NOCOW_FL flag with IOC_SETFLAGS, or turned on by setting the flag.
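A minimal sketch of the flag flip described above, assuming the standard FS_IOC_GETFLAGS/FS_IOC_SETFLAGS interface from linux/fs.h; the helper name is illustrative, and only the pure attribute computation is shown (the ioctl sequence is in the comment).

```c
#include <stdint.h>

/* FS_NOCOW_FL as defined in linux/fs.h. */
#define FS_NOCOW_FL 0x00800000u

/* Compute the new attribute word for FS_IOC_SETFLAGS. Real code would do:
 *
 *     ioctl(fd, FS_IOC_GETFLAGS, &attrs);
 *     attrs = toggle_nocow(attrs, on);
 *     ioctl(fd, FS_IOC_SETFLAGS, &attrs);
 *
 * and btrfs only accepts the change while the file is still empty. */
static unsigned int toggle_nocow(unsigned int attrs, int on)
{
    return on ? (attrs | FS_NOCOW_FL) : (attrs & ~FS_NOCOW_FL);
}
```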
I merged the earlier patch for now, until a more complete solution can be developed. Are you up for making this change?
@Zygo wrote: