md5 and collisions - Githubissues

Every entity from IMDb has a unique string, and this uniqueness is captured by an MD5 hash of the string. Goim assumes that there is precisely one unique MD5 hash for each entity in IMDb.

Using a hashing scheme like this admits to the possibility of a collision, which would violate a key invariant assumed by Goim.

However, given that the number of entities in IMDb is in the single-digit millions, I believe the probability of a collision is rare. (But feasible?)

A simple answer to this is to use a better hashing algorithm. (I initially chose md5 because it only requires 16 bytes of storage.)

A correct answer to this is to handle collisions. But I fear for the performance implications.

BurntSushi / goim

md5 and collisions #1