And "well enough" meant "well enough for our purposes" and the purpose is storing way less chunks than 2^256.
If my idea sounds valid / doable, I would open a ticket for "solving storage of detected hash collisions", meaning the EDCCs.
There is a minimum and maximum chunk size though, for technical reasons (storage efficiency, buffer size).
This does not really matter, because chunks of the maximum size alone already make up half of all possible blocks ever (5e108 out of 10e108).
> and the purpose is storing far fewer chunks than 2^256.
But as I already pointed out above, the maximum chunk size, and therefore the number of possible chunks (big K), as well as the quality of the dataset also matter, especially considering the constantly growing number of stored chunks (small k), which significantly increases the impact of K on the chance of collision with every iteration over the probability tree:
```
id-hash-colprob 32000 30 800
k = 30 cards from K = 32000 max, by sieve N = 1..800, repeated n = 40 times:
Calc-P(collision)=0.41547846266991934

id-hash-colprob 32000 40 800
k = 40 cards from K = 32000 max, by sieve N = 1..800, repeated n = 40 times:
Calc-P(collision)=0.6198046446107234

id-hash-colprob 320000 30 800
k = 30 cards from K = 320000 max, by sieve N = 1..800, repeated n = 400 times:
Calc-P(collision)=0.42258853962267906

id-hash-colprob 320000 40 800
k = 40 cards from K = 320000 max, by sieve N = 1..800, repeated n = 400 times:
Calc-P(collision)=0.6280580556352908

id-hash-colprob 32000000 30 800
k = 30 cards from K = 32000000 max, by sieve N = 1..800, repeated n = 40000 times:
Calc-P(collision)=0.42336511057744053

id-hash-colprob 32000000 40 800
k = 40 cards from K = 32000000 max, by sieve N = 1..800, repeated n = 40000 times:
Calc-P(collision)=0.6289545531811702
```
Accumulated into the table below, the dependency on increasing the number of ALL possible chunks of a certain size (K) together with the growing number of input chunks (k), at a constant hash size (big N = 800; small n varies as K/N), is pretty obvious:
k | K | n | P(collision) |
---|---|---|---|
30 | 32000 | 40 | 0.41548 |
40 | 32000 | 40 | 0.61980 |
30 | 320000 | 400 | 0.42259 |
40 | 320000 | 400 | 0.62806 |
30 | 32000000 | 40000 | 0.42336 |
40 | 32000000 | 40000 | 0.62895 |
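These values can be reproduced with a small calculation; the following is a minimal sketch of what I assume id-hash-colprob computes (k distinct chunks drawn uniformly from K equally likely possible chunks, mapped onto N hash values with n = K/N preimages each; the function and variable names are mine, not the tool's):

```python
from fractions import Fraction

def collision_prob(K: int, k: int, N: int) -> float:
    """P(at least two of k distinct chunks share a hash), assuming K possible
    chunks are spread evenly over N hash values (n = K/N preimages each)."""
    n = K // N
    p_no_collision = Fraction(1)
    for i in range(k):
        # The i chunks drawn so far occupy i hash values, i.e. i*n possible
        # preimages, of which i are already taken; the next chunk must avoid
        # the remaining i*(n-1) colliding candidates out of the K-i left.
        p_no_collision *= Fraction(K - i * n, K - i)
    return 1 - float(p_no_collision)

for K, k in [(32000, 30), (32000, 40),
             (320000, 30), (320000, 40),
             (32000000, 30), (32000000, 40)]:
    print(f"K={K:>8} k={k} P(collision)={collision_prob(K, k, N=800):.5f}")
```

Under that assumption about the model, the printed values agree with the table above.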
Note that with max chunk size 2^21 the number big K is enormous (256^2097152, i.e. roughly 10^5,000,000), and therefore n = K/N is extremely huge too: it is about 256^2097152 / 2^256 == 2^(16777216 - 256), which still corresponds to a number with about 5 million decimal digits.
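For what it's worth, that order of magnitude is quick to check without building the number itself (assuming 256 possible values per byte and a 2^21-byte max chunk):

```python
import math

# Decimal digits of 256**(2**21), the count of all possible max-size chunks.
digits = 2**21 * 8 * math.log10(2)
print(digits)  # ~5.05e6, i.e. a number with about 5 million digits
```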
So the "well enough" means either "well enough yet".
It sounds like you assume that the chunker cuts equal-sized chunks if the file size is > 2^21.
More interesting is the question of what happens with two large files that contain different content but produce the same hash Ha for the first 2^21-byte part of the file, while the hashes for the second part(s), Hb and Hc, are different. So for the first file we have Ha1|Hb and for the second file Ha2|Hc. Would borg assume that, since Ha1 == Ha2, this first part is the same chunk (and as a result store it only once)?
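For illustration only, here is a toy sketch of a content-addressed store keyed purely by the id-hash (not borg's actual code): if the id alone decides, the second file's first part would indeed never be written again.

```python
import hashlib

class ChunkStore:
    """Toy content-addressed store: chunks are identified by id-hash only."""
    def __init__(self):
        self.chunks = {}    # id -> chunk data
        self.refcount = {}  # id -> number of references

    def put(self, data: bytes) -> bytes:
        chunk_id = hashlib.sha256(data).digest()
        if chunk_id in self.chunks:
            # Same id => assumed to be the same chunk: only the refcount grows.
            # If Ha1 == Ha2 for *different* content, the second file's first
            # part would silently be represented by the first file's bytes.
            self.refcount[chunk_id] += 1
        else:
            self.chunks[chunk_id] = data
            self.refcount[chunk_id] = 1
        return chunk_id
```

File 1 would then be recorded as the id sequence Ha|Hb and file 2 as Ha|Hc, with the Ha chunk stored only once.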
Hash collisions are unlikely when using SHA256, but as the number of files/chunks increases, so does the probability, and most de-duplication systems do some sort of sanity check. File-based systems check file size and possibly type; block-based systems either generate a fingerprint in addition to the hash, or read the block already in archive storage (e.g. the paranoid mode option in Symantec PureDisk) to verify that the archived data really matches the new data before incrementing the link counter and discarding the new data. The functionality is easy to test if you are able to change the hash function to md4.
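The "read and compare before incrementing the link counter" idea could look roughly like this, continuing the toy store from the previous sketch (an illustration only, not any product's actual implementation):

```python
import hashlib

def put_verified(chunks: dict, refcount: dict, data: bytes) -> bytes:
    """Store data keyed by its SHA-256 id, but compare the actual bytes
    before counting an id match as a duplicate ('paranoid' verification)."""
    chunk_id = hashlib.sha256(data).digest()
    if chunk_id in chunks:
        if chunks[chunk_id] != data:
            # Genuine collision: same id, different content.
            raise RuntimeError("id-hash collision: content differs")
        refcount[chunk_id] += 1   # verified duplicate, safe to de-duplicate
    else:
        chunks[chunk_id] = data
        refcount[chunk_id] = 1
    return chunk_id
```

Swapping hashlib.sha256 for a deliberately weak or truncated hash makes the collision branch actually reachable in a test, which is the point of the md4 suggestion above.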