Closed ThomasWaldmann closed 2 years ago
as the repo format is dealt with by server-side code and the rpc protocol does not need to change for this, i guess this can be done quite easily:
borg check
or borg upgrade
(read old format from existing segments, write new format to new segments)Repository.version = 2
borg check
runs server side.old format: crc:32 + size:32 + tag:8 (+ chunkid:256)
new format: header_hash:64 + tag:8 + size:32 (+ chunkid:256 + content_hash:256)
header_hash = H1(tag, size[, chunkid[, content_hash]])
content_hash = H2(content)
Options / Questions:
Hmm, do we need 64 bits for the header_hash? if 32 bits were enough (considering that it is now only for the header, which is max. 69 bytes), we could stay closer to the old format and have less overhead.
Can we just use blake2b or blake3 for H1 and H2? For the content_hash we want something very good and super fast, so guess that should be blake2b for now and blake3 after we adopted it.
Keep crc32 for the header? Performance does not matter, it is only over a few bytes.
Use ECC for the header? Would be cool if we could not only detect, but also correct header errors. E.g. a correct size value determines where the next entry starts in the segment file. Also it is interesting to know the correct chunkid. Guess ECC is only useful if it is significantly better than the ECC done already in the hardware (e.g. HDD/SSD controllers). Also, we would need a ECC lib that is suitable and maintained.
Guess we'll keep it simple, especially considering the code has to support both old and new for one borg version:
old format: crc:32 + size:32 + tag:8 + chunkid:256 # old PUT tag
crc = CRC32(size, tag, chunkid, content_data) # problematic: content_data goes in here
new format: crc:32 + size:32 + tag:8 + chunkid:256 + content_hash:256 # new PUT2 tag
crc = CRC32(size, tag, chunkid, content_hash) # good: crc can be quickly computed only for the header
content_hash = H(content_data) # good: we can use something better here than crc32, e.g. sha256 or blake3.
DELETE and COMMIT tags stay as before.
While reviewing #6514, another idea came up:
new format: crc:32 + size:32 + tag:8 + chunkid:256 + content_hash:256 # new PUT2 tag
crc = CRC32(size, tag, chunkid, content_hash) # good: crc can be quickly computed only for the header
hash = H(size, tag, chunkid, content_data) # good: we can use something better here than crc32, e.g. sha256 or blake3.
For all use cases that actually read the content_data and check the hash, that would give us cryptographic hash strength (practically 100%) confidence against accidental corruption including the header values.
That would be good against corruption in the header not found by the crc32 check alone (e.g. multiple bit flips in the chunkid). It doesn't help if the content_data is checked very well, but we got the chunkid wrong...
@enkore @textshell @jdchristensen ^ can you review / comment, please?
Candidates for H:
Update: the PR code now uses xxh64.
about alignment:
hmm, if one assumed the whole buffer / bytes object was somehow aligned (not sure if python is doing that), that single-byte tag
field would make it mis-aligned for all data after it, especially for the bigger amount of content data.
we compute a hash over that content data when reading.
when writing, the hash input is not contiguous in memory as the hash result needs to go in between header and content data (that could be fixed by appending the hash result to the content data instead of prepending it, but maybe we want to avoid creating that big python bytes object just for the sake of having all in one piece and rather do multiple writes / multiple hash updates).
so, is it worth to insert some padding in there after the tag, like 7 bytes of "reserved"?
will merge the PR soon:
currently we only have a header+data checksum (crc32) in the stored repository object.
format: crc:32 + size:32 + tag:8 (+ chunkid:256)
this is a bit unfortunate if one wants to iterate over the objects without reading their data (but rather seek over it) - then we can't even be sure if the header data is valid.