borgbackup / borg

Deduplicating archiver with compression and authenticated encryption.
https://www.borgbackup.org/
Other
11.19k stars 742 forks source link

add a repository object header checksum #1704

Closed ThomasWaldmann closed 2 years ago

ThomasWaldmann commented 8 years ago

currently we only have a header+data checksum (crc32) in the stored repository object.

format: crc:32 + size:32 + tag:8 (+ chunkid:256)

this is a bit unfortunate if one wants to iterate over the objects without reading their data (but rather seek over it) - then we can't even be sure if the header data is valid.

ThomasWaldmann commented 2 years ago

as the repo format is dealt with by server-side code and the rpc protocol does not need to change for this, i guess this can be done quite easily:

ThomasWaldmann commented 2 years ago
old format: crc:32 + size:32 + tag:8 (+ chunkid:256)
new format: header_hash:64   + tag:8 + size:32 (+ chunkid:256 + content_hash:256)

header_hash = H1(tag, size[, chunkid[, content_hash]])
content_hash = H2(content)

Options / Questions:

ThomasWaldmann commented 2 years ago

https://eklitzke.org/crcs-vs-hash-functions https://crypto.stackexchange.com/questions/32988/are-checksums-essentially-non-secure-versions-of-cryptographic-hashes

ThomasWaldmann commented 2 years ago

Guess we'll keep it simple, especially considering the code has to support both old and new for one borg version:

old format: crc:32 + size:32 + tag:8 + chunkid:256  # old PUT tag
crc = CRC32(size, tag, chunkid, content_data)  # problematic: content_data goes in here

new format: crc:32 + size:32 + tag:8 + chunkid:256 + content_hash:256  # new PUT2 tag
crc = CRC32(size, tag, chunkid, content_hash)  # good: crc can be quickly computed only for the header
content_hash = H(content_data)  # good: we can use something better here than crc32, e.g. sha256 or blake3.

DELETE and COMMIT tags stay as before.
ThomasWaldmann commented 2 years ago

While reviewing #6514, another idea came up:

new format: crc:32 + size:32 + tag:8 + chunkid:256 + content_hash:256  # new PUT2 tag
crc = CRC32(size, tag, chunkid, content_hash)  # good: crc can be quickly computed only for the header
hash = H(size, tag, chunkid, content_data)  # good: we can use something better here than crc32, e.g. sha256 or blake3.

For all use cases that actually read the content_data and check the hash, that would give us cryptographic hash strength (practically 100%) confidence against accidental corruption including the header values.

That would be good against corruption in the header not found by the crc32 check alone (e.g. multiple bit flips in the chunkid). It doesn't help if the content_data is checked very well, but we got the chunkid wrong...

ThomasWaldmann commented 2 years ago

@enkore @textshell @jdchristensen ^ can you review / comment, please?

ThomasWaldmann commented 2 years ago

Candidates for H:

Update: the PR code now uses xxh64.

ThomasWaldmann commented 2 years ago

about alignment:

hmm, if one assumed the whole buffer / bytes object was somehow aligned (not sure if python is doing that), that single-byte tag field would make it mis-aligned for all data after it, especially for the bigger amount of content data.

we compute a hash over that content data when reading.

when writing, the hash input is not contiguous in memory as the hash result needs to go in between header and content data (that could be fixed by appending the hash result to the content data instead of prepending it, but maybe we want to avoid creating that big python bytes object just for the sake of having all in one piece and rather do multiple writes / multiple hash updates).

so, is it worth to insert some padding in there after the tag, like 7 bytes of "reserved"?

ThomasWaldmann commented 2 years ago

will merge the PR soon: