BurntSushi / goim

Goim is a robust command line utility to maintain and query the Internet Movie Database (IMDb).
The Unlicense
117 stars 9 forks source link

md5 and collisions #1

Open BurntSushi opened 10 years ago

BurntSushi commented 10 years ago

Every entity from IMDb has a unique string, and this uniqueness is captured by an MD5 hash of the string. Goim assumes that there is precisely one unique MD5 hash for each entity in IMDb.

Using a hashing scheme like this admits to the possibility of a collision, which would violate a key invariant assumed by Goim.

However, given that the number of entities in IMDb is in the single-digit millions, I believe the probability of a collision is rare. (But feasible?)

A simple answer to this is to use a better hashing algorithm. (I initially chose md5 because it only requires 16 bytes of storage.)

A correct answer to this is to handle collisions. But I fear for the performance implications.