Zygo / bees

Best-Effort Extent-Same, a btrfs dedupe agent

RFE: Improve checksum lookup and support nodatasum #31

Open · kakra opened this issue 6 years ago

kakra commented 6 years ago

@Zygo wrote:

Some more thoughts...

We can identify nodatasum extents fairly easily--just look up the extent bytenr in the csum tree, and the sums should be there or not. This would allow the extents to be marked differently to avoid attempting to dedup them with the other kind of extent. It's a little more complicated than that (e.g. when bees grows matching pairs of extents, the src might spill over onto differently-summed extents) but still doable. This case comes up for toxic extents too, although in the toxic extent case we just throw the exception and move on, while in the nodatasum case we would need to stop growing and continue with the dedup.
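For concreteness, a minimal sketch of that csum-tree probe, assuming root and a TREE_SEARCH_V2-capable kernel; the function name and the coarse error handling are illustrative, not bees code:

```cpp
#include <linux/btrfs.h>
#include <linux/btrfs_tree.h>
#include <sys/ioctl.h>
#include <cstdint>
#include <cstring>
#include <vector>

// Returns true if any checksum items exist for the byte range
// [bytenr, bytenr + len) of a data extent.  fs_fd is any open fd on the
// filesystem.  Caveat: a csum item that starts before bytenr but still
// covers it is missed here; a real implementation would back up one item.
bool extent_has_csums(int fs_fd, uint64_t bytenr, uint64_t len)
{
    std::vector<uint8_t> buf(sizeof(btrfs_ioctl_search_args_v2) + 65536);
    auto *args = reinterpret_cast<btrfs_ioctl_search_args_v2 *>(buf.data());
    std::memset(args, 0, sizeof(*args));
    args->buf_size = 65536;

    btrfs_ioctl_search_key &k = args->key;
    k.tree_id      = BTRFS_CSUM_TREE_OBJECTID;          // tree 7, the csum tree
    k.min_objectid = k.max_objectid = BTRFS_EXTENT_CSUM_OBJECTID;
    k.min_type     = k.max_type     = BTRFS_EXTENT_CSUM_KEY;
    k.min_offset   = bytenr;                            // csum items are keyed by extent bytenr
    k.max_offset   = bytenr + len - 1;
    k.max_transid  = UINT64_MAX;
    k.nr_items     = 1;                                 // one hit is enough for a yes/no answer

    if (ioctl(fs_fd, BTRFS_IOC_TREE_SEARCH_V2, args) < 0)
        return false;                                   // sketch: treat errors as "no sums"
    return k.nr_items > 0;                              // kernel rewrites nr_items to the hit count
}
```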

The temporary file class could just have two instances, one for datasum extents, one for non-datasum extents, and the extent rewriter could use the appropriate one for a match.
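A tiny sketch of that split, with placeholder names (TempFile and temp_file_for are not the bees API):

```cpp
// Hedged sketch: one rewrite destination per checksum class, so a dedup
// never pairs a datasum extent with a nodatasum one.
enum class CsumClass { DATASUM, NODATASUM };

struct TempFile { /* open-on-demand temporary file, as in the extent rewriter */ };

TempFile &temp_file_for(CsumClass c)
{
    static TempFile with_sums;      // NOCOW_FL clear: kernel writes csums
    static TempFile without_sums;   // NOCOW_FL set: no csums, matches nodatasum src
    return c == CsumClass::DATASUM ? with_sums : without_sums;
}
```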

Meh, or we just patch the kernel to allow these dedups. But then again it might be the right thing to do to support the two kinds of extent as separate duplicate classes. I keep going back and forth on this.

Bees reserves the bottom 12 bits of physical addresses in the hash table for various flags: 6 bits for compressed offsets, 1 for compression, 1 for unaligned EOF, 1 for toxic (although in practice there are so few toxic extents it makes more sense to just save them in a small text file instead of using a whole bit per DDT entry for them). That leaves 2 (3) more. Lots of bits are 0 in btrfs physical addresses because they're byte offsets on a filesystem with at least a 4K block size (so the bottom 12 bits are effectively always 0, as well as several at the top). There's no need to steal bits from the hash yet.
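A sketch of that packing with illustrative bit positions (the authoritative layout is whatever bees actually uses for its address type); the point is that a 4K-aligned byte address always has 12 zero low bits to spare:

```cpp
#include <cstdint>

// Illustrative packing of the low 12 bits of a hash-table physical address.
using Addr = uint64_t;

constexpr Addr ADDR_MASK            = ~Addr(0xfff);  // the real physical byte address
constexpr Addr COMPRESSED_OFS_MASK  = 0x3f;          // 6 bits: block offset within a compressed extent
constexpr Addr COMPRESSED_FLAG      = Addr(1) << 6;  // extent is compressed
constexpr Addr UNALIGNED_EOF_FLAG   = Addr(1) << 7;  // short block at end of file
constexpr Addr TOXIC_FLAG           = Addr(1) << 8;  // extent known to be toxic
// higher bits of the low 12 remain free for future flags

constexpr Addr physical(Addr a)      { return a & ADDR_MASK; }
constexpr bool is_compressed(Addr a) { return (a & COMPRESSED_FLAG) != 0; }
constexpr bool is_toxic(Addr a)      { return (a & TOXIC_FLAG) != 0; }
```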

If we read the checksums directly out of the checksum tree, we can use them instead of reading and hashing blocks. 32 bits is too short for a hash--the collision rate after 16TB of unique data is too high; about 48-54 bits are required for filesystems up to ~100TB. However, two consecutive 4K block CRC32s can be catenated to make a 64-bit hash, and that is long enough. If done correctly, bees won't need to read files at all while scanning them; only the reads during compare and dedup will be required. We'd lose some efficiency on compressed files, but I think the current design, which tries to chop up compressed extents into smaller dedupable pieces, isn't saving as much space as I'd like for the time it takes.
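A rough birthday-bound sanity check on those hash sizes, assuming the CRC32s behave like uniform random hashes (real data only approximates this):

```latex
\[
\mathbb{E}[\text{colliding pairs}] \;\approx\; \binom{n}{2}\,2^{-b} \;\approx\; \frac{n^2}{2^{b+1}}
\]
\[
\text{16 TB unique data: } n = 2^{32}\ \text{4K blocks},\quad b = 32 \;\Rightarrow\; \mathbb{E} \approx 2^{31}
\]
\[
\text{100 TB: } n \approx 2^{34.5},\quad
b = 48 \Rightarrow \mathbb{E} \approx 2^{20},\quad
b = 54 \Rightarrow \mathbb{E} \approx 2^{14},\quad
b = 64 \Rightarrow \mathbb{E} \approx 2^{4}
\]
```

At 2^31 expected collisions a 32-bit hash is useless, while 48 bits and up leaves only a trickle of false candidates, which are cheap because every candidate match is verified by comparing data before dedup.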

while the extent rewriter achieves much, much higher I/O rates

Note that the extent rewriter almost always re-reads something that was very recently read by the scanner, so the data will still be in the VFS page cache (unless you are really low on memory). Extent rewrite typically has near-zero read overhead because the scanner does all the heavy lifting on reads, and the writes from dedup tend to get paid for by LOGICAL_INO and TREE_SEARCH_V2 requests that follow (or any other ioctl on the filesystem that joins a transaction and thereby waits for a commit and/or flush).

I am planning to do a rewrite that caches more stuff on reads and defers more writes to dedicated threads, partly to improve IO performance, but mostly to avoid bisecting extents almost all the time.

kakra commented 6 years ago

If we read the checksums directly out of the checksum tree, we can use them instead of reading and hashing blocks. 32 bits is too short for a hash--the collision rate after 16TB of unique data is too high; about 48-54 bits are required for filesystems up to ~100TB. However, two consecutive 4K block CRC32s can be catenated to make a 64-bit hash, and that is long enough. If done correctly, bees won't need to read files at all while scanning them; only the reads during compare and dedup will be required. We'd lose some efficiency on compressed files, but I think the current design, which tries to chop up compressed extents into smaller dedupable pieces, isn't saving as much space as I'd like for the time it takes.

This stops us from deduping nodatasum extents at all, doesn't it? So we'd still need to read such extents manually and generate hashes then.

kakra commented 6 years ago

@Zygo wrote:

Yes, there would have to be some kind of fallback to internal sum generation in order to dedup extents that have no sums.

I think if we use TREE_SEARCH_V2 to test for sums, we have to read the sums into bees memory anyway, so we might as well do something useful with them. If the scan order is read-subvol-tree, then read-csum-tree: when we get csums, we can skip the block reads; otherwise we proceed to read the blocks and generate sums ourselves.
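A rough sketch of that ordering; lookup_csums, csums_to_hashes, read_and_hash_blocks and insert_hashes are placeholders, not bees functions:

```cpp
#include <cstdint>
#include <optional>
#include <vector>

// Declarations only -- placeholders for this sketch.
std::optional<std::vector<uint32_t>> lookup_csums(int fs_fd, uint64_t bytenr, uint64_t len);
std::vector<uint64_t> csums_to_hashes(const std::vector<uint32_t> &csums);
std::vector<uint64_t> read_and_hash_blocks(int file_fd, uint64_t offset, uint64_t len);
void insert_hashes(const std::vector<uint64_t> &hashes, uint64_t bytenr);

void scan_extent(int fs_fd, int file_fd, uint64_t bytenr, uint64_t offset, uint64_t len)
{
    // The subvol tree has already been read to find this extent; try the csum tree next.
    if (auto csums = lookup_csums(fs_fd, bytenr, len)) {
        insert_hashes(csums_to_hashes(*csums), bytenr);                  // no data read during scan
    } else {
        // nodatasum extent: fall back to reading the blocks and hashing them
        insert_hashes(read_and_hash_blocks(file_fd, offset, len), bytenr);
    }
}
```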

[...]

The block size could still be 4K, just overlap the blocks (i.e. the DDT entries are (sum block 1, sum block 2), (sum block 2, sum block 3), (sum block 3, sum block 4), ...). We don't want to increase the alignment constraint or we lose the ability to dedup well inside VM images (we find only the half of the blocks that happen to match on 8K boundaries). Once we've found a pair of matching 4K-aligned 8K-length blocks, the alignment offset and extent boundaries are known, and we can use larger IO sizes for the final dedup.
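One possible shape for that overlapping-window construction (filling in the csums_to_hashes placeholder from the sketch above): one 64-bit hash per 4K step, each covering 8K of data:

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// One DDT hash per 4K block, each covering that block and the next one:
// (csum[0],csum[1]), (csum[1],csum[2]), ... so an 8K match is found at any
// 4K-aligned offset, not just on 8K boundaries.
std::vector<uint64_t> csums_to_hashes(const std::vector<uint32_t> &csums)
{
    std::vector<uint64_t> hashes;
    if (csums.size() < 2)
        return hashes;                    // a lone 4K block gets no 8K-window hash
    hashes.reserve(csums.size() - 1);
    for (size_t i = 0; i + 1 < csums.size(); ++i)
        hashes.push_back((uint64_t(csums[i]) << 32) | csums[i + 1]);
    return hashes;
}
```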

Speaking of alignment, another drawback (and the killer reason why I haven't pursued reading the csum tree directly so far) is that we lose the ability to dedup compressed extents well. This is because the csums apply to the compressed data, not the uncompressed data. Uncompressed data and compressed data would have different csums. That's OK if your filesystem has a lot of e.g. unpacked source trees with lots of single-extent compressed files, but that's pretty much the only case that still works. One particular case that doesn't work is "log file vs. copy of the same log file" because the different writing patterns change the compressed extent boundaries, so the sums don't match when the data is identical.

On the other hand, if you don't use compression then the drawback doesn't apply. All you lose then is the ability to dedup 4K extents, and that's usually less than 1% of the total filesystem size.

I have a very different plan to deal with the performance of small (<12K) extents: batch them up into bigger extents in the extent rewriter so that they're not small any more. This effectively turns bees into a unified defrag+dedup tool, and counteracts the tendency for average extent size to get smaller as bees runs.

[...]

Update after a read of the kernel sources and some tests:

btrfs decides whether to read checksums or not by looking at the (hidden) NODATASUM inode flag. The 'nodatasum' mount option controls whether newly created files get this flag (and nodatacow determines the NODATASUM/NOCOW_FL flag state). That means once a file has been created, it's nodatasum or datasum (determined at creation time by the mount options) forever, regardless of what mount options are used in the future. There is one good reason for this: when we are reading files via the inode (the obviously common case, i.e. the POSIX read() function), we want to know if we have csums to verify without having to do the csum tree lookup every time. It's a bit messier for the relocation code (which starts from the extent and has to look up an inode much like bees does), but that's a less common case, so it makes sense for that to be the more awkward path.

That means we should not patch the kernel, i.e. the restriction serves a useful purpose. bees needs to deal with this case.

Bees can read the NODATASUM flag directly from the btrfs inode, bypassing the GETFLAGS ioctl. bees already gets the raw inode data--it's the INODE_ITEM in the TREE_SEARCH_V2 results. We can't avoid reading those in the crawler; bees currently just throws the data away.
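A sketch of pulling the flag out of those TREE_SEARCH_V2 results; the helper name is illustrative, and the NODATASUM define is guarded in case the installed headers don't export it:

```cpp
#include <linux/btrfs_tree.h>   // struct btrfs_inode_item
#include <endian.h>
#include <cstdint>
#include <cstring>

#ifndef BTRFS_INODE_NODATASUM
#define BTRFS_INODE_NODATASUM (1ULL << 0)   // on-disk inode flag, bit 0
#endif

// item_data points at an INODE_ITEM payload returned by TREE_SEARCH_V2.
bool inode_item_is_nodatasum(const void *item_data, size_t item_len)
{
    btrfs_inode_item ii;
    if (item_len < sizeof(ii))
        return false;                        // short/malformed item: assume datasum
    std::memcpy(&ii, item_data, sizeof(ii)); // on-disk data is little-endian
    return le64toh(ii.flags) & BTRFS_INODE_NODATASUM;
}
```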

A NODATACOW file seems to be usable as a source file for a NODATASUM target in the extent rewriter. This would enable us to generate the two kinds of temp files required. The nodatasum flag can be turned off on empty files by clearing the NOCOW_FL flag with IOC_SETFLAGS, or turned on by setting the flag.
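And a sketch of flipping a still-empty temp file between the two classes with the flag mentioned above (FS_NOCOW_FL from linux/fs.h; btrfs only accepts the change while the file has no data):

```cpp
#include <linux/fs.h>    // FS_IOC_GETFLAGS / FS_IOC_SETFLAGS, FS_NOCOW_FL
#include <sys/ioctl.h>

// Toggle NOCOW (and with it NODATASUM on btrfs) on an empty file.
// Returns false if the kernel refuses, e.g. because the file already has data.
bool set_nodatasum(int fd, bool nodatasum)
{
    int flags = 0;
    if (ioctl(fd, FS_IOC_GETFLAGS, &flags) < 0)
        return false;
    if (nodatasum)
        flags |= FS_NOCOW_FL;
    else
        flags &= ~FS_NOCOW_FL;
    return ioctl(fd, FS_IOC_SETFLAGS, &flags) == 0;
}
```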

I merged the earlier patch for now, until a more complete solution can be developed. Are you up for making this change?