bcpierce00 / unison

Unison file synchronizer
GNU General Public License v3.0

BTRFS: Use its checksums when available #486

Closed (L3P3 closed 2 years ago)

L3P3 commented 3 years ago

BTRFS already includes checksums in its metadata. I have used it on all my Linux devices for years, and it is very reliable and fast. It is becoming more and more popular, and I think getting zero-cost checksums would be a great advantage, especially for huge files. If at least one side uses btrfs, btrfs checksums should be used, with btrfs's hash algorithm applied on both sides.

To obtain btrfs' checksums, this approach should work perfectly: https://stackoverflow.com/questions/32761299/btrfs-ioctl-get-file-checksums-from-userspace

I have no experience with OCaml whatsoever, so it would surely be much easier for someone else to implement this. Maybe a nice weekend project. :-)

tleedjarv commented 3 years ago

The idea is interesting and there are other checksumming filesystems out there.

I do see a few issues that may or may not be easily overcome.

Generalizing the idea, Unison could delegate checksumming to an external process (similar to fsmonitor). There could be a configurable file size threshold that determines whether internal or external checksumming is used. It wouldn't matter how the external process produces the checksum, whether it leverages zero-cost checksums from the filesystem or not. Of course, the Unison archive format would still have to change to record the algorithm, but this way Unison would not have to care about filesystem specifics or algorithm implementations.
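To make the delegation idea concrete, here is a minimal Python sketch (not Unison code, and not OCaml). Everything in it is an assumption for illustration: the helper protocol (the helper prints `<algorithm> <hex digest>` on stdout), the names `internal_checksum`, `external_checksum`, `checksum` and `HELPER`, and the use of MD5 and SHA-256 as stand-in algorithms. It only shows the size threshold deciding which path is taken, and the algorithm name being returned alongside the digest so that it could be recorded.

```python
import hashlib
import os
import subprocess
import sys

# Hypothetical helper: any program that prints "<algorithm> <hex digest>".
# Here it is a Python one-liner computing SHA-256 as a stand-in for a
# filesystem-backed checksum provider.
HELPER = [
    sys.executable, "-c",
    "import hashlib, sys; "
    "print('sha256', hashlib.sha256(open(sys.argv[1], 'rb').read()).hexdigest())",
]

def internal_checksum(path):
    """Hash the file in-process (MD5 used purely as a stand-in)."""
    h = hashlib.md5()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            h.update(chunk)
    return ("md5", h.hexdigest())

def external_checksum(path, helper_cmd):
    """Run the helper and parse its '<algorithm> <hex digest>' output."""
    result = subprocess.run(
        helper_cmd + [path], check=True, capture_output=True, text=True
    )
    algorithm, digest = result.stdout.split()
    return (algorithm, digest)

def checksum(path, threshold, helper_cmd=HELPER):
    """Use the external helper for files of at least `threshold` bytes."""
    if os.path.getsize(path) >= threshold:
        return external_checksum(path, helper_cmd)
    return internal_checksum(path)
```

Because the result carries the algorithm name, two replicas could only compare digests when the names match, which is why the archive format would need to store it.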

Further discussions would normally be taken to the developers mailing list.

gdt commented 3 years ago

My view is that there are a vast number of things to do in Unison, and this one, while interesting from a theoretical viewpoint, has a lot of things to figure out to make it really work. As I see things, it's not that important, in that it seeks to solve a problem no one is reporting on the users' list. So, absent someone who cares writing up a full design rationale that addresses all the hard questions, I don't see anything happening, and I'm just barely ok with leaving speculative things like this in the tracker. Even with a rationale, it seems likely that it's too complicated for the benefit -- right now we're not even getting code reviews from people about OCaml 4.12 compat changes.

L3P3 commented 3 years ago

@tleedjarv my idea is indeed that the block-level hashes are hashed again into file hashes. The fs checksums should only be used to determine which blocks have changed; the traditional rsync algorithm should not change at all, besides being pointed towards the proper files and blocks to look at. @gdt I agree that there are many more important things to do before making deep-cutting architectural changes like the one I propose here. Please keep this issue open, but it is not really urgent.
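A rough Python sketch of that idea, under loud assumptions: per-block checksums are simulated here with `zlib.crc32` over 4096-byte blocks (a stand-in only; it is not claimed to match what btrfs stores or how it would be read), and the names `block_checksums`, `file_hash` and `changed_blocks` are made up for illustration. The sketch derives a file-level hash from the block checksums and lists which block indices differ between two versions, which is the set a transfer algorithm would then need to look at.

```python
import hashlib
import zlib

BLOCK_SIZE = 4096  # assumed block size for this illustration

def block_checksums(data, block_size=BLOCK_SIZE):
    """Stand-in for filesystem-provided checksums: one CRC32 per block."""
    return [
        zlib.crc32(data[i:i + block_size])
        for i in range(0, len(data), block_size)
    ]

def file_hash(checksums):
    """Hash the sequence of block checksums into one file-level hash."""
    h = hashlib.sha256()
    for c in checksums:
        h.update(c.to_bytes(4, "big"))
    return h.hexdigest()

def changed_blocks(old, new):
    """Indices whose checksum differs, or that exist on only one side."""
    longest = max(len(old), len(new))
    return [
        i for i in range(longest)
        if i >= len(old) or i >= len(new) or old[i] != new[i]
    ]
```

Comparing `file_hash` values answers "did anything change?", while `changed_blocks` narrows down where; neither step reads file contents once the block checksums are available.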

gdt commented 2 years ago

No PR or design has arrived. :-(

The point about using a debug tool is a serious one. I am not willing to consider relying on interfaces that are not documented as stable and usable without a very good reason.

Having reread, I think the real issue is that using BTRFS checksums if available would require:

For now I do not believe that a design with a good cost-benefit ratio is possible. Feel free to post on unison-hackers a complete design that analyzes operations in mixed environments, including older Unison versions. But please understand that I am very skeptical; I believe that complexity has a cost that must be outweighed by benefits. I see a lot of complexity and very small benefits.