fwessels / HashCompare

Compare various different Hashing Algorithms
Apache License 2.0
21 stars 4 forks source link

mathemtical aspects #1

Open aead opened 6 years ago

aead commented 6 years ago

Requirements

Since we use a hash-based bitrot protection approach we need a (hash) function f mapping value of arbitrary length to a value of a fixed length. The function f must satisfy the following properties:

The 1. requirement is kind of obvious - if f is not deterministic it is not possible to reliable verify the integrity of data. The 2. requirement is derived from the strong collision requirement. However in contrast to the general birthday problem we are not interested in "constructed" collisions. Therefore we define a requirement violation as a collisions of randomly chosen inputs. A constructed collision would imply and intelligent adversary exploiting a mathematical weakness of a given mapping f to find collisions with a probability P' > P. However bitrot protection is not designed to withstand an attacker because the adversary can simple modify the data and replace the stored hash value with the hash value of the modified data.

Random Oracle and PRF

A cryptographic secure hash function - like BLAKE2b-512 - always satisfies our requirements since it provides strong collision resistance which does not restrict the input values to be chosen randomly. A random oracle and following a PRF (like SipHash / HighwayHash) also satisfies our requirements since a PRF can be constructed from our function f` using the following scheme:

  1. Generate a set of uniformly at random chosen keys. |ki| = |xi|
  2. yi = f(ki ⊕ xi)

This implies that SipHash, HighwayHash and BLAKE2b can be used for bitrot protection because of their mathematical properties. (Since we discarded Poly1305 anyway I don't discuss it here further)

However SipHash and HigwayHash are PRFs and require a secret key. The secret key can be securely omitted if the PRF provides strong collision resistance for a fixed (but unkown) secret key which is the case for SipHash and HighwayHash.

Choice of hash size

The following list shows the number of data modifications that can happened before the probability for an undetected bitrot exceeds 50%:

Notice that this numbers refer to all minio instances. So for example if we talk about 5 billion objects with bitrot than we mean 5 billion distributed over all minio instances all over the world. Basically 50% chance of undetected bitrot on one minio server somewhere on the planet earth. (Except aliens are using minio :wink:)

fwessels commented 6 years ago

Couple of remarks

aead commented 6 years ago

since we only compare checksums within a single part (and not across parts, objects or let alone minio servers), if a bit rot would result in an equivalent checksum to the checksum of another part, this would not matter.

The checksum (in case of minio) is computed per 64 MB object part in case of PUT and per object part in case of multipart PUT. However for the computed probabilities this does not matter. This is not part of the mathematical considerations - See comment to 2. point.

if two different bit rots for the same part would generated the same checksum, but as long as it is different from the original checksum, we would have a 'collision' but it would not matter in minio's case.

That's why the Choice of hash size paragraph says:

number of data modifications that can happened before the probability for an undetected bitrot exceeds 50%

I don't not compute the probability for a collision of object hashes - I compute the probability for a collision of modified (bitrot occurred) object hashes. All numbers refer to objects which are already modified by bitrot. For example: "That would mean that the probability for one undetected bitrot is about 50% after 5100 million computations - roughly speaking 5 billion objects with bitrot in our case."

Whether bitrot affects every second object or only 1/100000000 does not affect the probability. It just affects the total number of objects that can be stored before a bitrot is not detect.

if encryption is used on top, then an undetected bit rot (if it were to happen) would fail the decryption.

That is correct. Of course there is also a probability that this could happen (that would be a message forgery) but as long as the chosen cryptographic primitives satisfies the claimed security properties this probability is negligible.

To show an example: If we would use SipHash-64 and the probability for bitrot would be 0.5 (so every second object would be affected) than the probability for an undetected bitrot is still: P = e-((k (k-1)) / (2 264))

So the probability for an undetected bitrot would exceed 50% after 232+229+228 corrupted objects. However since only every 2. object is affected we would be able to store 2 (232+229+228) before he probability for an undetected bitrot would exceed 50%. If instead of every 2nd object only every 1000nd object would be affected by bitrot we would be able to store 1000 (232+229+228) objects before the probability would exceed 50%.

fwessels commented 6 years ago

I see what you are saying, you are correct.

I went back to my Excel sheet and did some extrapolations as to when a certain bit length would get a collision with the following results in terms of number of permutations:

So this seems in line with the results mentioned above, let's discuss it with the team and make a decision.

BTW You don't think they use minio in space? :-)

aead commented 6 years ago

I went back to my Excel sheet and did some extrapolations as to when a certain bit length would get a collision with the following results in terms of number of permutations:

64 bit: 9.6E09 (9600 million) 128 bit: 3.3E20 256 bit: 2.5E40

Yes, this numbers sounds reasonable. I don't know the probability for bitrot (I guess it should be relatively small). So I think 64 bit is too short, 128 bit would be acceptable and 256 would be the conservative choice. So 128bit would probably satisfy most users expectations but I would recommend 256 bit to be sure.

BTW You don't think they use minio in space? :-)

Oh - your right I completely forgot the ISS and all the satellites, my bad 😉