Every entity from IMDb has a unique string, and this uniqueness is captured by an MD5 hash of the string. Goim assumes that there is precisely one unique MD5 hash for each entity in IMDb.
Using a hashing scheme like this admits to the possibility of a collision, which would violate a key invariant assumed by Goim.
However, given that the number of entities in IMDb is in the single-digit millions, I believe the probability of a collision is rare. (But feasible?)
A simple answer to this is to use a better hashing algorithm. (I initially chose md5 because it only requires 16 bytes of storage.)
A correct answer to this is to handle collisions. But I fear for the performance implications.
Every entity from IMDb has a unique string, and this uniqueness is captured by an MD5 hash of the string. Goim assumes that there is precisely one unique MD5 hash for each entity in IMDb.
Using a hashing scheme like this admits to the possibility of a collision, which would violate a key invariant assumed by Goim.
However, given that the number of entities in IMDb is in the single-digit millions, I believe the probability of a collision is rare. (But feasible?)
A simple answer to this is to use a better hashing algorithm. (I initially chose md5 because it only requires 16 bytes of storage.)
A correct answer to this is to handle collisions. But I fear for the performance implications.