Zygo / bees

Best-Effort Extent-Same, a btrfs dedupe agent
GNU General Public License v3.0

Doubts about the relationship between compression, deduplication, sending/receiving incremental snapshots. #185

Open · rankaiyx opened this issue 3 years ago

rankaiyx commented 3 years ago

This is a beautiful tool! Thanks to the developers! I have some questions about the relationship between compression, deduplication, sending/receiving incremental snapshots.

1. I deduplicated on computer A, then created a read-only snapshot and sent it to computer B. Has the snapshot received by computer B been deduplicated? Do I need to deduplicate data again on computer B?

2. I did not deduplicate on computer A, then created a read-only snapshot and sent it to computer B. Then I deduplicated on computer A, created a new snapshot, and sent an incremental snapshot to computer B. When computer B receives this new snapshot, is it equivalent to deduplication?

3. When compression is enabled, does deduplication work on compressed or decompressed data? Does compression weaken deduplication? Can their combination maximize disk space savings? Should I use compression when I use deduplication?

I tried to find the answer on btrfs's official wiki, but failed. Is there a hierarchical abstract diagram of btrfs?

Thank you again for reading through this rather long list of questions.

kakra commented 3 years ago

1. I deduplicated on computer A, then created a read-only snapshot and sent it to computer B. Has the snapshot received by computer B been deduplicated? Do I need to deduplicate data again on computer B?

Most probably you don't need to do that again but @Zygo may know better details.

2. I did not deduplicate on computer A, then created a read-only snapshot and sent it to computer B. Then I deduplicated on computer A, created a new snapshot, and sent an incremental snapshot to computer B. When computer B receives this new snapshot, is it equivalent to deduplication?

Both computers A and B still have the non-deduplicated first read-only snapshot. You'd need to remove that old snapshot on both. The result should be a deduplicated second snapshot on both computers, which is similar to Q1 so the same may apply there: I'm not sure if it is an exact copy with the same shared extents or if it may deviate from the source original.

3. When compression is enabled, does deduplication work on compressed or decompressed data? Does compression weaken deduplication? Can their combination maximize disk space savings? Should I use compression when I use deduplication?

Bees may re-compress files, and I'm pretty sure it compares the contents of the uncompressed files, so a mixed environment of compressed and non-compressed files won't worsen your dedup hit rate. But there are other factors at play here that affect how well bees can work with compressed extents (because those are limited to 128k in size).

Depending on the source data, deduplication tends to reach a much higher reduction in storage space than compression. If your data is highly deduplicatable, you may not even care about compression at all.

rankaiyx commented 3 years ago

1. I deduplicated on computer A, then created a read-only snapshot and sent it to computer B. Has the snapshot received by computer B been deduplicated? Do I need to deduplicate data again on computer B?

Most probably you don't need to do that again but @Zygo may know better details.

2. I did not deduplicate on computer A, then created a read-only snapshot and sent it to computer B. Then I deduplicated on computer A, created a new snapshot, and sent an incremental snapshot to computer B. When computer B receives this new snapshot, is it equivalent to deduplication?

Both computers A and B still have the non-deduplicated first read-only snapshot. You'd need to remove that old snapshot on both. The result should be a deduplicated second snapshot on both computers, which is similar to Q1 so the same may apply there: I'm not sure if it is an exact copy with the same shared extents or if it may deviate from the source original.

3. When compression is enabled, does deduplication work on compressed or decompressed data? Does compression weaken deduplication? Can their combination maximize disk space savings? Should I use compression when I use deduplication?

Bees may re-compress files, and I'm pretty sure it compares the contents of the uncompressed files, so a mixed environment of compressed and non-compressed files won't worsen your dedup hit rate. But there are other factors at play here that affect how well bees can work with compressed extents (because those are limited to 128k in size).

Depending on the source data, deduplication tends to reach a much higher reduction in storage space than compression. If your data is highly deduplicatable, you may not even care about compression at all.

Thanks for your reply, it has increased my understanding, even though it is not a very definite answer. If I can't get definite details in the end, maybe I will need to experiment.

Zygo commented 3 years ago

1. I deduplicated on computer A, then created a read-only snapshot and sent it to computer B. Has the snapshot received by computer B been deduplicated?

Some of the data will be deduplicated.

btrfs send can replicate cloned extents; however, to keep kernel usage at sane levels, send has restrictions about how many references it will track and replicate. If the restrictions are exceeded, a simple copy command is emitted instead, and the receiver will have duplicate copies of extents where the sender has deduplicated references to a single extent. The result at the receiver is some point between the maximum and minimum possible deduplication. It is not trivial to estimate what that point will be for a given data set.
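
As a concrete illustration, replicating a deduplicated snapshot is just the usual send/receive workflow; the mount points, snapshot names, and hostname below are hypothetical:

    # On computer A: create a read-only snapshot of the deduplicated subvolume
    btrfs subvolume snapshot -r /mnt/data /mnt/data/snap1
    # Stream it to computer B; clone commands in the stream preserve some of the sharing
    btrfs send /mnt/data/snap1 | ssh computerB 'btrfs receive /mnt/backup'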

2. I did not deduplicate on computer A, then created a read-only snapshot and sent it to computer B. Then I deduplicated on computer A, created a new snapshot, and sent an incremental snapshot to computer B. When computer B receives this new snapshot, is it equivalent to deduplication?

The second snapshot is equivalent to the result from the previous question: on B it will be somewhere between not deduplicated and fully deduplicated. The first snapshot is not deduplicated or modified in any way on computer B.
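
A sketch of the incremental case, using the same hypothetical paths as above; only the difference against the parent snapshot is transferred, and the first snapshot on B stays exactly as it was received:

    # On computer A: after running bees, take a second read-only snapshot and send the delta
    btrfs subvolume snapshot -r /mnt/data /mnt/data/snap2
    btrfs send -p /mnt/data/snap1 /mnt/data/snap2 | ssh computerB 'btrfs receive /mnt/backup'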

3. When compression is enabled, does deduplication work on compressed or decompressed data?

Deduplication works on compressed and uncompressed data interchangeably, i.e. duplicate uncompressed data blocks can be replaced by a reference to a compressed copy. Each extent in btrfs has a separate compression status, so files can contain a mix of compressed and uncompressed extents.
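
If you want to see this on your own data, the third-party compsize tool (packaged as compsize or btrfs-compsize on most distributions) reports disk usage per compression method; the path is hypothetical:

    # Summarize compression and extent sharing for a subtree
    compsize -x /mnt/data

Roughly speaking, the "Uncompressed" column counts each extent once while "Referenced" counts every file reference to it, so shared (reflinked or deduplicated) extents show up as Referenced being larger than Uncompressed.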

Does compression weaken deduplication?

Compressed data requires about 4x more hash table space on various test data sets. You can choose whether to increase the hash table size, or keep the hash table size and accept a lower dedupe hit rate.
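
If you run bees through the beesd wrapper, the hash table size is set in the per-filesystem config file; a minimal sketch, assuming the DB_SIZE variable from beesd.conf.sample and a hypothetical UUID:

    # /etc/bees/beesd.conf (one file per filesystem UUID)
    UUID=<filesystem-uuid>
    # A larger hash table tracks more blocks, which helps the hit rate on compressed data
    DB_SIZE=$((4*1024*1024*1024))   # e.g. 4 GiB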

Can their combination maximize disk space savings? Should I use compression when I use deduplication?

Yes and yes.
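
A minimal setup combining both, assuming a hypothetical filesystem UUID and mount point and that bees was installed with its beesd@ systemd unit:

    # Mount with transparent compression, then let bees deduplicate in the background
    mount -o compress=zstd:3 UUID=<filesystem-uuid> /mnt/data
    systemctl enable --now beesd@<filesystem-uuid>.service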

I tried to find the answer on btrfs's official wiki, but failed.

Note that the above answers about compression apply to bees. Other dedupers on btrfs handle compressed data very poorly or not at all.

rankaiyx commented 3 years ago

1. I deduplicated on computer A, then created a read-only snapshot and sent it to computer B. Has the snapshot received by computer B been deduplicated?

Some of the data will be deduplicated.

btrfs send can replicate cloned extents; however, to keep kernel usage at sane levels, send has restrictions about how many references it will track and replicate. If the restrictions are exceeded, a simple copy command is emitted instead, and the receiver will have duplicate copies of extents where the sender has deduplicated references to a single extent. The result at the receiver is some point between the maximum and minimum possible deduplication. It is not trivial to estimate what that point will be for a given data set.

2. I did not deduplicate on computer A, then created a read-only snapshot and sent it to computer B. Then I deduplicated on computer A, created a new snapshot, and sent an incremental snapshot to computer B. When computer B receives this new snapshot, is it equivalent to deduplication?

The second snapshot is equivalent to the result from the previous question: on B it will be somewhere between not deduplicated and fully deduplicated. The first snapshot is not deduplicated or modified in any way on computer B.

3. When compression is enabled, does deduplication work on compressed or decompressed data?

Deduplication works on compressed and uncompressed data interchangeably, i.e. duplicate uncompressed data blocks can be replaced by a reference to a compressed copy. Each extent in btrfs has a separate compression status, so files can contain a mix of compressed and uncompressed extents.

Does compression weaken deduplication?

Compressed data requires about 4x more hash table space on various test data sets. You can choose whether to increase the hash table size, or keep the hash table size and accept a lower dedupe hit rate.

Can their combination maximize disk space savings? Should I use compression when I use deduplication?

Yes and yes.

I tried to find the answer on btrfs's official wiki, but failed.

Note that the above answers about compression apply to bees. Other dedupers on btrfs handle compressed data very poorly or not at all.

Thank you for your reply! It clears up most of my doubts.

File data checksums are stored in a dedicated btree in a struct btrfs_csum_item. The offset of the key corresponds to the byte number of the extent. The data is checksummed after any compression or encryption is done and it reflects the bytes sent to the disk.

I read this sentence on this page: https://btrfs.wiki.kernel.org/index.php/Btrfs_design. It shows that the block checksums of btrfs are generated after compression, is that so? Is that why other deduplication tools that rely on btrfs block checksums can't handle mixed compressed and uncompressed files? This may be one of the benefits of bees's own hash table. I am curious how deduplication works between uncompressed and compressed files. Does btrfs allow part of a file to be compressed while the other part is not compressed?

kakra commented 3 years ago

I am curious how deduplication works between uncompressed and compressed files. Does btrfs allow part of a file to be compressed while the other part is not compressed?

Compression is per extent. Btrfs does not compress files like a zip file, as it would be impossible to seek into the file then. That's why compressed extents are always 128k maximum size: it allows btrfs to seek near the position you'd like to read, decompress only a tiny part of the file to avoid too much decompression overhead, and then finally seek to the correct uncompressed position.

Actually, if btrfs starts out writing a file with compression, it will stop compressing if it finds the compression ratio too low for a certain amount of written data. It also has fast statistical heuristics to check whether an extent could reach a useful compression ratio at all, and will then just skip the compression step. This way, a file in btrfs is naturally a mixture of compressed and uncompressed extents.
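
You can see this mixture with filefrag, which lists every extent of a file; on btrfs, compressed extents are typically reported with an "encoded" flag and hold at most 128 KiB of data each (the file path is hypothetical):

    # List extents; compressed ones carry the "encoded" flag
    filefrag -v /mnt/data/somefile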

rankaiyx commented 3 years ago

I am curious how deduplication works between uncompressed and compressed files. Does btrfs allow part of a file to be compressed while the other part is not compressed?

Compression is per extent. Btrfs does not compress files like a zip file, as it would be impossible to seek into the file then. That's why compressed extents are always 128k maximum size: it allows btrfs to seek near the position you'd like to read, decompress only a tiny part of the file to avoid too much decompression overhead, and then finally seek to the correct uncompressed position.

Actually, if btrfs starts out writing a file with compression, it will stop compressing if it finds the compression ratio too low for a certain amount of written data. It also has fast statistical heuristics to check whether an extent could reach a useful compression ratio at all, and will then just skip the compression step. This way, a file in btrfs is naturally a mixture of compressed and uncompressed extents.

Suppose I have file A, enable compression, and write it to a btrfs disk. Then I append part B to the original file A to get file AB, disable compression, and write that to the btrfs disk. Now if I run bees, can it reference the data blocks of the compressed file A to replace the A part of the uncompressed file AB?

kakra commented 3 years ago

Now if I run bees, can it reference the data blocks of the compressed file A to replace the A part of the uncompressed file AB?

Bees doesn't look at files at all, it only cares about extents - so it replaces A-extents with A-extents, and it may not even prefer a compressed version over an uncompressed version. The reason why you see file names logged by bees is just that it needs to find a file referencing such an extent to get a file handle to actually read the contents. But it is really only about extents, not files.

Actually, bees may sometimes rewrite extent A to a new temporary file, breaking it up into shareable contents, and thus enable orphaned extent parts to be released from the file system. This may reset how and if the extent is compressed. Also, compressed extents do not work the way you imagine here: a compressed part of a file is made up of 128k extent chunks, which is why Zygo wrote that those will occupy a lot more metadata and hash table space.

To conclude: yes, it will combine the A part into shared storage, but it does not guarantee the direction of the operation; it may actually replace file A with the A part of the AB file. And if it decides to rewrite the A part to a temporary file, you may end up with a changed compression mode - depending on whether or how you disabled compression, and on the btrfs kernel heuristics for compression.

NobodyXu commented 3 years ago

Actually, bees may sometimes rewrite extent A to a new temporary file, breaking it up into shareable contents, and thus enable orphaned extent parts to be released from the file system. This may reset how and if the extent is compressed. Also, compressed extents do not work the way you imagine here: a compressed part of a file is made up of 128k extent chunks, which is why Zygo wrote that those will occupy a lot more metadata and hash table space.

Does this rewrite respect the current compression mode set by mount options and btrfs?

I also found this in the list of missing features:

When bees fragments an extent, the copied data is compressed. There is currently no way (other than by modifying the source) to select a compression method or not compress the data (patches welcome!).

This makes me wonder whether bees respects my choice of compression method or not.

NobodyXu commented 3 years ago

@Zygo I read that btrfs send currently uses the v1 format, where every inode is simply read from disk and uncompressed before sending; the format seems to be similar to a tar.

And btrfs receive is entirely in userspace, using only system calls provided by btrfs.

This makes me wonder, is the snapshot received on the remote really deduplicated, or rather still contains duplicated data?

Zygo commented 3 years ago

"system calls provided by btrfs" includes the clone range system call, which creates a reflink extent. send will emit these instead of copies where it can, and when it does, the data is not duplicated on the receiver. v2 send format introduces more cases where clones are possible, but v1 send streams still include a lot of clone commands.

receive does exactly what send tells it to do. All of the intelligence is on the sending side.

send streams are serialized system calls and data packets. You can decode one with btrfs receive --dump and see what it contains. They bear little to no resemblance to tar files. They are more like a shell script creating files, setting attributes, and renaming them into place.
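
For example, decoding an incremental stream (hypothetical snapshot paths) shows the mkfile/write/clone/rename commands and lets you see where clone is emitted instead of a plain write:

    btrfs send -p /mnt/data/snap1 /mnt/data/snap2 | btrfs receive --dump | less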

Compression method and level is determined by the mount option. If no mount option is provided, the btrfs default method is zlib and level is 3. bees will always compress when it rewrites an extent. Future versions of bees might try to match the original extent's compression method, or have a configurable compression method to use for extent rewrite.
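
So, to control which method bees uses when it rewrites extents, set the compress mount option on the filesystem bees works on; a sketch with a hypothetical mount point:

    # Takes effect for new writes, including extents rewritten by bees
    mount -o remount,compress=zstd:1 /mnt/data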

NobodyXu commented 3 years ago

@Zygo Would adding a configuration option for mount options in bees/beesd.in#L117 ensure that the deduplicated data is compressed using the method I want?

NobodyXu commented 3 years ago

It seems that the current v1 send still doesn’t handle deduplication optimally.

On my local machine with bees, two snapshots and the original, modifiable dir, it takes 27 GB.

Using the same compression method and level, but without bees, the two snapshots take 31 GB.
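
One way to make such comparisons is btrfs filesystem du, which separates exclusive from shared data; the paths are hypothetical:

    btrfs filesystem du -s /mnt/data /mnt/data/snap1 /mnt/data/snap2

Roughly, "Exclusive" is data referenced only by that path, while "Set shared" counts data shared among the listed paths once.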

Zygo commented 3 years ago

send definitely does not handle dedupe optimally. It will be somewhere between no deduplication at all (i.e. the total size of both snapshots) and full deduplication (i.e. the size of the original).