And "well enough" meant "well enough for our purposes" and the purpose is storing way less chunks than 2^256.
If my idea sounds valid / doable, I would open a ticket for "solving storage of detected hash collisions", meaning the EDCCs.
There is a minimum and maximum chunk size though, for technical reasons (storage efficiency, buffer size).
This does not really matter, because chunks of the maximum size alone already make up half of all possible blocks ever (5e108 out of 10e108).
> and the purpose is storing far fewer chunks than 2^256.
But as I already pointed out above, the maximum chunk size, and therefore the number of possible chunks (big K), as well as the quality of the dataset also matter, especially considering the constantly growing number of stored chunks (small k), which significantly increases the impact of K on the chance of collision with every iteration over the probability tree:
```
id-hash-colprob 32000 30 800
k = 30 cards from K = 32000 max, by sieve N = 1..800, repeated n = 40 times:
Calc-P(collision)=0.41547846266991934

id-hash-colprob 32000 40 800
k = 40 cards from K = 32000 max, by sieve N = 1..800, repeated n = 40 times:
Calc-P(collision)=0.6198046446107234

id-hash-colprob 320000 30 800
k = 30 cards from K = 320000 max, by sieve N = 1..800, repeated n = 400 times:
Calc-P(collision)=0.42258853962267906

id-hash-colprob 320000 40 800
k = 40 cards from K = 320000 max, by sieve N = 1..800, repeated n = 400 times:
Calc-P(collision)=0.6280580556352908

id-hash-colprob 32000000 30 800
k = 30 cards from K = 32000000 max, by sieve N = 1..800, repeated n = 40000 times:
Calc-P(collision)=0.42336511057744053

id-hash-colprob 32000000 40 800
k = 40 cards from K = 32000000 max, by sieve N = 1..800, repeated n = 40000 times:
Calc-P(collision)=0.6289545531811702
```
Accumulated into the table below, the dependency on increasing the number of ALL possible chunks of a certain size (K) together with the growing number of input chunks (k), at a constant hash size (big N = 800; small n varies as K/N), is pretty obvious:
k | K | n | P(collision) |
---|---|---|---|
30 | 32000 | 40 | 0.41548 |
40 | 32000 | 40 | 0.61980 |
30 | 320000 | 400 | 0.42259 |
40 | 320000 | 400 | 0.62806 |
30 | 32000000 | 40000 | 0.42336 |
40 | 32000000 | 40000 | 0.62895 |
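These values can be reproduced with a small calculation; the following is a minimal sketch of what I assume id-hash-colprob computes (k distinct chunks drawn uniformly from K equally likely possible chunks, mapped onto N hash values with n = K/N preimages each; the function and variable names are mine, not the tool's):

```python
from fractions import Fraction

def collision_prob(K: int, k: int, N: int) -> float:
    """P(at least two of k distinct chunks share a hash), assuming K possible
    chunks are spread evenly over N hash values (n = K/N preimages each)."""
    n = K // N
    p_no_collision = Fraction(1)
    for i in range(k):
        # The i chunks drawn so far occupy i hash values, i.e. i*n possible
        # preimages, of which i are already taken; the next chunk must avoid
        # the remaining i*(n-1) colliding candidates out of the K-i left.
        p_no_collision *= Fraction(K - i * n, K - i)
    return 1 - float(p_no_collision)

for K, k in [(32000, 30), (32000, 40),
             (320000, 30), (320000, 40),
             (32000000, 30), (32000000, 40)]:
    print(f"K={K:>8} k={k} P(collision)={collision_prob(K, k, N=800):.5f}")
```

Under that assumption about the model, the printed values agree with the table above.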
Note that with max chunk size 2^21 the number big K is enormous (256^2097152, i.e. roughly 10^5,000,000), and therefore n = K/N is extremely huge too: it is about 256^2097152 / 2^256 == 2^(16777216 - 256), which still corresponds to a number with about 5 million decimal digits.
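For what it's worth, that order of magnitude is quick to check without building the number itself (assuming 256 possible values per byte and a 2^21-byte max chunk):

```python
import math

# Decimal digits of 256**(2**21), the count of all possible max-size chunks.
digits = 2**21 * 8 * math.log10(2)
print(digits)  # ~5.05e6, i.e. a number with about 5 million digits
```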
So the "well enough" means either "well enough yet".
It sounds like you assume that the chunker cuts equal-sized chunks if the file size is > 2^21.
More interesting is the question of what happens with two large files that contain different content but produce the same hash Ha for the first 2^21-byte part of the file, while the hashes for the second part(s), Hb and Hc, are different. So for the first file we have Ha1|Hb and for the second file Ha2|Hc. Would borg assume that, since Ha1 == Ha2, this first part is the same chunk (and as a result store it only once)?
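For illustration only, here is a toy sketch of a content-addressed store keyed purely by the id-hash (not borg's actual code): if the id alone decides, the second file's first part would indeed never be written again.

```python
import hashlib

class ChunkStore:
    """Toy content-addressed store: chunks are identified by id-hash only."""
    def __init__(self):
        self.chunks = {}    # id -> chunk data
        self.refcount = {}  # id -> number of references

    def put(self, data: bytes) -> bytes:
        chunk_id = hashlib.sha256(data).digest()
        if chunk_id in self.chunks:
            # Same id => assumed to be the same chunk: only the refcount grows.
            # If Ha1 == Ha2 for *different* content, the second file's first
            # part would silently be represented by the first file's bytes.
            self.refcount[chunk_id] += 1
        else:
            self.chunks[chunk_id] = data
            self.refcount[chunk_id] = 1
        return chunk_id
```

File 1 would then be recorded as the id sequence Ha|Hb and file 2 as Ha|Hc, with the Ha chunk stored only once.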
Hash collisions are unlikely when using SHA256, but as the number of files/chunks increases, so does the probability, and most de-duplication systems do some sort of sanity check. File-based systems check file size and possibly type; block-based systems either generate a fingerprint in addition to the hash, or read the block already in archive storage (e.g. the paranoid mode option in Symantec PureDisk) to verify that the archived data really matches the new data before incrementing the link counter and discarding the new data. The functionality is easy to test if you are able to change the hash function to md4.
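The "read and compare before incrementing the link counter" idea could look roughly like this, continuing the toy store from the previous sketch (an illustration only, not any product's actual implementation):

```python
import hashlib

def put_verified(chunks: dict, refcount: dict, data: bytes) -> bytes:
    """Store data keyed by its SHA-256 id, but compare the actual bytes
    before counting an id match as a duplicate ('paranoid' verification)."""
    chunk_id = hashlib.sha256(data).digest()
    if chunk_id in chunks:
        if chunks[chunk_id] != data:
            # Genuine collision: same id, different content.
            raise RuntimeError("id-hash collision: content differs")
        refcount[chunk_id] += 1   # verified duplicate, safe to de-duplicate
    else:
        chunks[chunk_id] = data
        refcount[chunk_id] = 1
    return chunk_id
```

Swapping hashlib.sha256 for a deliberately weak or truncated hash makes the collision branch actually reachable in a test, which is the point of the md4 suggestion above.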