borgbackup / borg

Deduplicating archiver with compression and authenticated encryption.
https://www.borgbackup.org/

Hash collisions (again) - universal solution for everyone? #7765

Closed. git70 closed this issue 1 year ago

git70 commented 1 year ago

Let me share some thoughts.

There has been much discussion (for example #170 #4884) about eliminating the non-zero probability of a SHA hash collision.

The developers here (https://github.com/borgbackup/borg/issues/4884#issuecomment-566706071) felt that completely resolving the issue would have too much of an impact on Borg's performance, and instead implemented a mitigation (#181) that reduces the risk without definitively addressing it. Additionally (as far as I understand), comparing chunk sizes is useless when using --chunker-params fixed,4194304, since all chunks then have the same size.

On the other hand, there are other options in Borg that also affect overall performance, yet they exist anyway (for example, borg check --verify-data).

The key to satisfying all parties (proponents and opponents of collision awareness, see https://github.com/borgbackup/borg/issues/4884#issuecomment-568159478) is OPTIONALITY.

Additional benefits are here: https://github.com/restic/restic/issues/1620#issuecomment-545924018

Why not add this functionality as an OPTIONAL choice (for example --verify-hash-collision), with just a warning that it affects (perhaps even heavily affects) performance? Such warnings already exist in the documentation and are sufficient for users.

It would even be enough to add a note here along the lines of: "Attention! You can completely eliminate this problem by enabling --verify-hash-collision, but at the cost of performance degradation."

What do you think?

ThomasWaldmann commented 1 year ago

Before I'd feel motivated to continue the hash collision discussion, I'd like the following answers:

git70 commented 1 year ago

assuming that data1 and data2 are not identical, but have the same hash, how would you store data1 and data2? the hash is the key to a key/value store and of constant size, so that is not technically possible IMHO.

I'll be honest: I don't know (maybe additional checking of a checksum of the entire file, or a binary comparison). I was only referring to your statement here: "Maybe we could implement that after multithreading is implemented and we have more cores to make use of. There is already another ticket about multithreading."

If the conclusion from all discussions (#170 and #4884, for example) is that there is no solution, then don't read any further and we can remove this entire ticket


Perhaps you misunderstood my intentions, or my English is too weak, for which I apologize :( My goal is not to stir things up and open a new war between opponents and supporters of "collision awareness"; I'm not going to make that mistake in the FOSS world. The goal is only to try to diplomatically reconcile opponents and supporters by giving them the choice to enable or ignore the option themselves (assuming the problem is solvable, per your statement).

I'm not a mathematician or a programmer, but it seems to me that a low probability of "something" happening doesn't mean "it will NOT happen tomorrow, only at the EARLIEST in a million years".

It's not really a "math problem"; it's a "psychological problem".

I respect and appreciate your work and that of all Borg contributors. Cheers!

RonnyPfannschmidt commented 1 year ago

@git70 turning hash collisions from improbable to impossible is commonly not a worthwhile endeavor, as the risks and complications the solution requires may outweigh the gain.

Modern cryptographic hashes tend to be good enough for current content-addressed storage needs at a wide scale.

enkore commented 1 year ago

Borg should also check for fixed points in the encryption and add padding to avoid them.

ThomasWaldmann commented 1 year ago

@enkore not sure what you mean, but sounds like a different topic - can you make a separate ticket about it?

someone-somenet-org commented 1 year ago

@ThomasWaldmann: IMHO the only useful thing would be to allow users to select something other than sha256, like sha512, sha3 or something completely different, during repo creation.

@git70 - re: the BTC wallet: the chance of a colliding borg hash for the wallet backup is the same as the chance that somebody randomly generates the private key for your BTC - it's AFAIK both 1 in 2^256. (Good luck sleeping now ;)
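
As a back-of-the-envelope illustration (my own arithmetic, not from the thread): guessing one specific 256-bit value succeeds with probability $2^{-256}$, while the chance of *any* two of $n$ distinct chunks colliding follows the birthday bound:

$$
P_{\text{guess}} = 2^{-256},
\qquad
P_{\text{collision}}(n) \approx \binom{n}{2}\,2^{-256} \approx \frac{n^{2}}{2^{257}}
$$

Even for $n = 2^{50}$ chunks (several zettabytes at 4 MiB per chunk), that is roughly $2^{-157}$.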

ThomasWaldmann commented 1 year ago

@someone-somenet-org The borg hashindex works with a fixed key width (256 bits), so it is not easy to make that dynamic.
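
A minimal sketch of why that is hard to change (illustrative only; the field layout below is my assumption, not borg's real hashindex format): every bucket is packed with a fixed 32-byte key plus fixed-size values, so a longer digest such as SHA-512 simply does not fit without defining a new index layout.

```python
import struct

KEY_BYTES = 32                      # 256-bit chunk id, fixed width
BUCKET_FMT = f"<{KEY_BYTES}sII"     # key + two illustrative 32-bit value fields
BUCKET_SIZE = struct.calcsize(BUCKET_FMT)   # every bucket is exactly this many bytes

def pack_bucket(key: bytes, refcount: int, size: int) -> bytes:
    # A 64-byte SHA-512 digest would fail this check; the key width is baked into the format.
    assert len(key) == KEY_BYTES
    return struct.pack(BUCKET_FMT, key, refcount, size)
```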

git70 commented 1 year ago

BTC-Wallet: the chance of colliding borg hashes for the wallet backup is the same as the chance that somebody randomly generates your private key for your btc - its afaik both 1 in 2^256. (Good luck sleeping now ;)

@someone-somenet-org So in this case my chance of getting hit increases by 100% ;) (a wallet hit plus a Borg hit is more than just a wallet hit)

ThomasWaldmann commented 1 year ago

Closing this, see https://github.com/borgbackup/borg/issues/7765#issuecomment-1685031045 .

mrichtarsky commented 1 week ago

Sorry for commenting in this long-ago closed issue. I was just comparing backup tools and also wondering how they handle the potential hash collision when deduplicating. FYI, your FAQ has me convinced and I'm glad that it is explained there.

However, you wrote:

Before I'd feel motivated to continue the hash collision discussion, I'd like the following answers:

assuming that data1 and data2 are not identical, but have the same hash, how would you store data1 and data2? the hash is the key to a key/value store and of constant size, so that is not technically possible IMHO.

I think the point never was how to solve this technically and store the data. From my (certainly limited) understanding, the collision should be detected and the backup process should then immediately stop (printing out the two lottery tickets).

why would we even deal with trying to solve this issue if users have much bigger and much more probable problems?

I'm not sure what you meant by "solving" there. In my mind, if it's just a bytewise compare of the two chunks and this can be done comfortably within the existing logic, it might be worth adding something like the suggested --verify-hash-collision. There might be environments where people want to use Borg and need to see that it handles this case as well, at least with an error on backup. If by solving you meant handling the hash collision and changing all the internal data structures, that's certainly a non-starter. But that isn't needed, since the chances of a collision are so astronomical.
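
For illustration, a minimal sketch of that "bytewise compare" idea (my own toy code against an in-memory dict, not borg's actual internals or CLI): when a chunk id is already known, the stored chunk is compared byte-for-byte before the new chunk is treated as a duplicate, and a mismatch aborts the backup.

```python
import hashlib

def store_chunk(repo: dict, data: bytes, verify_collisions: bool = False) -> bytes:
    """Deduplicating store into a toy repo keyed by the SHA-256 of the chunk."""
    chunk_id = hashlib.sha256(data).digest()
    existing = repo.get(chunk_id)
    if existing is None:
        repo[chunk_id] = data            # first occurrence: store it
    elif verify_collisions and existing != data:
        # Two different chunks with the same id: report the "lottery tickets" and stop.
        raise RuntimeError(f"hash collision detected for chunk id {chunk_id.hex()}")
    return chunk_id                      # duplicate: deduplicated, nothing stored
```

In a real repository the existing chunk would have to be read back (and decrypted and decompressed) from storage for the comparison, which is exactly the performance cost discussed above.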

I agree that adding that option would have psychological benefits, just as you should always check your return codes (even if a call currently has no failing return codes, it may have some in the future).

However, in that case I would also add to the docs:

If --verify-hash-collision reports a hash collision, DO NOT open a bug for it. The likelihood that it was caused by a hardware error or file system corruption on your machine is many orders of magnitude higher than a real hash collision we could fix.

mrichtarsky commented 1 week ago

P.S.: For completeness, ZFS deduplication also has a verify option:

https://docs.oracle.com/cd/E19120-01/open.solaris/817-2271/gjhav/index.html

If the property is set to verify and two blocks have the same signature, ZFS does a byte-for-byte comparison with the existing block to ensure that the contents are identical.
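
For reference (recalled from the ZFS CLI, not taken from the linked page): this is enabled per dataset with something like `zfs set dedup=verify pool/dataset` (or `dedup=sha256,verify`), trading an extra read-and-compare on every dedup hit for the guarantee that colliding blocks are never silently merged; essentially the behavior proposed here for an optional --verify-hash-collision.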