add a repository object header checksum - Githubissues

borgbackup / borg

Deduplicating archiver with compression and authenticated encryption.

https://www.borgbackup.org/


add a repository object header checksum #1704

Closed ThomasWaldmann closed 2 years ago

ThomasWaldmann commented 8 years ago

currently we only have a header+data checksum (crc32) in the stored repository object.

format: crc:32 + size:32 + tag:8 (+ chunkid:256)

this is a bit unfortunate if one wants to iterate over the objects without reading their data (but rather seek over it) - then we can't even be sure if the header data is valid.
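
To illustrate the problem, here is a minimal sketch of the old entry layout. It assumes little-endian fields, a size that covers the whole entry, and a made-up tag value; `pack_old_put` and `check_old_entry` are hypothetical helpers, not borg's actual code. Because the crc covers the data, no header field can be trusted before the whole entry has been read.

```python
import struct
import zlib

# Assumed layout: crc:32 + size:32 + tag:8, little-endian, size = whole entry.
HEADER = struct.Struct("<IIB")
TAG_PUT = 0  # hypothetical tag value, for illustration only


def pack_old_put(chunkid: bytes, data: bytes) -> bytes:
    """Build an old-format PUT entry: the crc covers size, tag, chunkid and data."""
    size = HEADER.size + len(chunkid) + len(data)
    body = struct.pack("<IB", size, TAG_PUT) + chunkid + data
    return struct.pack("<I", zlib.crc32(body)) + body


def check_old_entry(entry: bytes) -> bool:
    """Validate an entry; this needs every byte of it, including the data."""
    crc, size, _tag = HEADER.unpack_from(entry)
    return len(entry) == size and zlib.crc32(entry[4:]) == crc


entry = pack_old_put(b"\x11" * 32, b"some chunk data")
print(check_old_entry(entry))
```

A reader that wants to seek over the data cannot call `check_old_entry`, so it has to trust `size` unverified.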

ThomasWaldmann commented 2 years ago

as the repo format is dealt with by server-side code and the rpc protocol does not need to change for this, i guess this can be done quite easily:

the TAGs (PUT, DEL, COMMIT) for new format entries could be different than before (just use other type byte values)
newly written entries would be in new format
existing entries could be dealt with by borg check or borg upgrade (read old format from existing segments, write new format to new segments)
keep the code to read old and new format entries for one minor release (1.x), expect all users to have upgraded their repos by x+1. deprecate in x, remove in x+1.
borg check of an older borg version must not run on such a repo, as it might kill all the new entries. to prevent that: Repository.version = 2
old clients (in a c/s setup) are acceptable as the repository part of borg check runs server side.

ThomasWaldmann commented 2 years ago

old format: crc:32 + size:32 + tag:8 (+ chunkid:256)
new format: header_hash:64 + tag:8 + size:32 (+ chunkid:256 + content_hash:256)

header_hash = H1(tag, size[, chunkid[, content_hash]])
content_hash = H2(content)
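
A rough sketch of the two hashes, assuming blake2b for both H1 and H2 (one of the options discussed below) and little-endian packing. The function names and the field encoding are assumptions for illustration only.

```python
import hashlib
import struct

ID_LEN = HASH_LEN = 32  # chunkid:256, content_hash:256

# Largest possible input to H1: tag + size + chunkid + content_hash.
MAX_HEADER_INPUT = struct.calcsize("<BI") + ID_LEN + HASH_LEN


def content_hash(content: bytes) -> bytes:
    """H2: 256-bit hash over the content only."""
    return hashlib.blake2b(content, digest_size=32).digest()


def header_hash(tag: int, size: int, chunkid: bytes = b"", chash: bytes = b"") -> bytes:
    """H1: 64-bit hash over the header fields; chunkid and content_hash are optional."""
    fields = struct.pack("<BI", tag, size) + chunkid + chash
    return hashlib.blake2b(fields, digest_size=8).digest()


chash = content_hash(b"chunk payload")
print(MAX_HEADER_INPUT, header_hash(1, 13, b"\x22" * 32, chash).hex())
```

`MAX_HEADER_INPUT` works out to the 69 bytes mentioned in the first question below.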

Options / Questions:

Hmm, do we need 64 bits for the header_hash? if 32 bits were enough (considering that it is now only for the header, which is max. 69 bytes), we could stay closer to the old format and have less overhead.
Can we just use blake2b or blake3 for H1 and H2? For the content_hash we want something very good and super fast, so guess that should be blake2b for now and blake3 after we adopted it.
Keep crc32 for the header? Performance does not matter, it is only over a few bytes.
Use ECC for the header? Would be cool if we could not only detect, but also correct header errors. E.g. a correct size value determines where the next entry starts in the segment file. Also it is interesting to know the correct chunkid. Guess ECC is only useful if it is significantly better than the ECC done already in the hardware (e.g. HDD/SSD controllers). Also, we would need an ECC lib that is suitable and maintained.

ThomasWaldmann commented 2 years ago

https://eklitzke.org/crcs-vs-hash-functions
https://crypto.stackexchange.com/questions/32988/are-checksums-essentially-non-secure-versions-of-cryptographic-hashes

ThomasWaldmann commented 2 years ago

Guess we'll keep it simple, especially considering the code has to support both old and new for one borg version:

old format: crc:32 + size:32 + tag:8 + chunkid:256  # old PUT tag
crc = CRC32(size, tag, chunkid, content_data)  # problematic: content_data goes in here

new format: crc:32 + size:32 + tag:8 + chunkid:256 + content_hash:256  # new PUT2 tag
crc = CRC32(size, tag, chunkid, content_hash)  # good: crc can be quickly computed only for the header
content_hash = H(content_data)  # good: we can use something better here than crc32, e.g. sha256 or blake3.

DELETE and COMMIT tags stay as before.
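
A minimal sketch of what the PUT2 layout buys us: headers can be validated and iterated while seeking over the content data. It assumes little-endian fields, a size covering the whole entry, sha256 as H, and a made-up tag value; `write_put2` and `iter_put2_headers` are hypothetical helpers, and the reader only handles PUT2 entries.

```python
import hashlib
import io
import struct
import zlib

HEADER = struct.Struct("<IIB")  # crc:32, size:32, tag:8; little-endian is an assumption
ID_LEN = HASH_LEN = 32          # chunkid:256, content_hash:256
FULL_HEADER_LEN = HEADER.size + ID_LEN + HASH_LEN
TAG_PUT2 = 3                    # hypothetical tag value for the new PUT2 entries


def write_put2(fd, chunkid, data):
    """Append one PUT2 entry; the crc covers the header fields only."""
    content_hash = hashlib.sha256(data).digest()  # H = sha256, one of the candidates
    size = FULL_HEADER_LEN + len(data)
    covered = struct.pack("<IB", size, TAG_PUT2) + chunkid + content_hash
    fd.write(struct.pack("<I", zlib.crc32(covered)) + covered + data)


def iter_put2_headers(fd):
    """Yield (chunkid, size) per entry, checking each header without reading data."""
    while True:
        start = fd.tell()
        head = fd.read(FULL_HEADER_LEN)
        if not head:
            return
        if len(head) < FULL_HEADER_LEN:
            raise ValueError("truncated header")
        crc, size, tag = HEADER.unpack_from(head)
        if tag != TAG_PUT2 or zlib.crc32(head[4:]) != crc:
            raise ValueError("corrupt header at offset %d" % start)
        yield head[HEADER.size:HEADER.size + ID_LEN], size
        fd.seek(start + size)  # skip over content_data


segment = io.BytesIO()
write_put2(segment, b"\x01" * 32, b"first chunk")
write_put2(segment, b"\x02" * 32, b"second, longer chunk")
segment.seek(0)
entries = list(iter_put2_headers(segment))
print(entries)
```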

ThomasWaldmann commented 2 years ago

While reviewing #6514, another idea came up:

new format: crc:32 + size:32 + tag:8 + chunkid:256 + content_hash:256  # new PUT2 tag
crc = CRC32(size, tag, chunkid, content_hash)  # good: crc can be quickly computed only for the header
hash = H(size, tag, chunkid, content_data)  # good: we can use something better here than crc32, e.g. sha256 or blake3.

For all use cases that actually read the content_data and check the hash, that would give us cryptographic-hash-strength confidence (practically 100%) that there is no accidental corruption, including in the header values.

That would be good against corruption in the header not found by the crc32 check alone (e.g. multiple bit flips in the chunkid). After all, it doesn't help if the content_data is checked very well, but we got the chunkid wrong...
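
As a sketch of this variant, the hash can be fed the header fields first and the content afterwards. sha256 stands in for H here, and the field encoding and the `entry_hash` name are assumptions for illustration.

```python
import hashlib
import struct


def entry_hash(size: int, tag: int, chunkid: bytes, content_data: bytes) -> bytes:
    """H(size, tag, chunkid, content_data), computed with incremental updates."""
    h = hashlib.sha256()  # stand-in for H
    h.update(struct.pack("<IB", size, tag))
    h.update(chunkid)
    h.update(content_data)
    return h.digest()


data = b"chunk payload"
good = entry_hash(len(data), 3, b"\xaa" * 32, data)

# Two flipped bits in the first chunkid byte; the content is untouched.
flipped_id = bytes([0xAA ^ 0x03]) + b"\xaa" * 31
bad = entry_hash(len(data), 3, flipped_id, data)

print(good != bad)
```

Since the chunkid is part of the hash input, a reader that checks the hash notices the damaged chunkid even though the content itself is intact.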

ThomasWaldmann commented 2 years ago

@enkore @textshell @jdchristensen ^ can you review / comment, please?

ThomasWaldmann commented 2 years ago

Candidates for H:

256bit sha256 (only nice if hw accelerated, otherwise slow, old and proven quality)
256bit blake2b (good speed, even with pure sw implementation, we have been using it for a long time, so no surprises expected)
256bit blake3 (better speed, even with pure sw implementation, rather new)
128bit xxh3 (very high speed, even with pure sw implementation, we already use xxh64 at other places, not a cryptographic hash, but good collision resistance) - see also #6535
128bit xxh128 (very high speed, even with pure sw implementation, we already use xxh64 at other places, not a cryptographic hash, but good collision resistance) - see also #6535
64bit xxh64 (very high speed, even with pure sw implementation, we already use xxh64 at other places, not a cryptographic hash, but good collision resistance - as expected by birthday paradox)

Update: the PR code now uses xxh64.
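
For a feeling of what the different hash widths mean, here is a small sketch of the usual birthday-paradox approximation, plus the chance that a single corrupted entry still matches its stored hash. Both assume the hash output behaves like a uniformly random value, which is an idealization; the function names are made up for illustration.

```python
import math


def birthday_collision_probability(n_items: int, hash_bits: int) -> float:
    """Approximate chance that any two of n_items share a hash value."""
    x = n_items * (n_items - 1) / 2 ** (hash_bits + 1)
    return -math.expm1(-x)  # 1 - exp(-x), accurate also for tiny x


def undetected_corruption_probability(hash_bits: int) -> float:
    """Chance that one randomly corrupted entry still matches its stored hash."""
    return 2.0 ** -hash_bits


for bits in (64, 128, 256):
    print(bits,
          birthday_collision_probability(2 ** 32, bits),
          undetected_corruption_probability(bits))
```

For detecting accidental corruption of a single entry, the second number is the relevant one; the birthday bound only matters if hash values are compared across many entries.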

ThomasWaldmann commented 2 years ago

about alignment:

hmm, if one assumed the whole buffer / bytes object was somehow aligned (not sure if python is doing that), that single-byte tag field would make it mis-aligned for all data after it, especially for the bigger amount of content data.

we compute a hash over that content data when reading.

when writing, the hash input is not contiguous in memory as the hash result needs to go in between header and content data (that could be fixed by appending the hash result to the content data instead of prepending it, but maybe we want to avoid creating that big python bytes object just for the sake of having all in one piece and rather do multiple writes / multiple hash updates).

so, is it worth inserting some padding in there after the tag, like 7 bytes of "reserved"?
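
A small sketch of the offsets involved, assuming a 64-bit hash (such as xxh64) stored between chunkid and content data, and an entry that starts at an 8-byte aligned address (which nothing above guarantees). The `7x` in the struct format is 7 pad bytes standing in for the "reserved" field.

```python
import struct

ID_LEN = 32   # chunkid:256
HASH_LEN = 8  # assuming a 64-bit hash such as xxh64

unpadded = struct.Struct("<IIB")   # crc:32 + size:32 + tag:8
padded = struct.Struct("<IIB7x")   # same, followed by 7 reserved pad bytes

for name, header in (("unpadded", unpadded), ("padded", padded)):
    data_offset = header.size + ID_LEN + HASH_LEN
    # header size, offset of content data, is that offset a multiple of 8?
    print(name, header.size, data_offset, data_offset % 8 == 0)
```

Without padding the content data starts at offset 49, with padding at offset 56, so the padding costs 7 bytes per entry in exchange for 8-byte alignment relative to the start of the entry.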

ThomasWaldmann commented 2 years ago

will merge the PR soon:

no alignment
xxh64 as overall hash ("fast and good enough")