Closed by dosumis 8 months ago
Email from @hkir-dev on candidates for short hash strings (suitable as primary IDs):
CRC-32 stands out as the most promising candidate. It produces a 32-bit checksum, which is 8 characters long when written in hexadecimal, making it suitable for our requirements.
I created two repositories to demonstrate its usage in Python and R.
https://github.com/hkir-dev/hashing_algorithms_python https://github.com/hkir-dev/hashing_algorithms_r
In Python, two convenient modules, zlib and binascii, provide easy access to the CRC-32 algorithm. For R, we can use the digest package to achieve the same functionality.
If we generate 10,000 hashes per taxonomy, a 32-bit algorithm has about a 1% probability of a hash collision. In case of a collision, we can add an extra character (9 characters total) to solve the problem. In Python, I added a unit test to demonstrate the collision: https://github.com/hkir-dev/hashing_algorithms_python/blob/main/src/test/collusion_test.py
Based on your feedback and suggestions I can do further investigations.
Best,
Huseyin KIR
See also:
Hash collision probability: https://preshing.com/20110504/hash-collision-probabilities/
Compare algorithms: http://www.sha1-online.com/
Adler-32 vs CRC-32: https://en.wikipedia.org/wiki/Adler-32#Advantages_and_disadvantages
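The roughly 1% figure quoted in the email can be checked with the standard birthday-bound approximation. The sketch below is only an illustration: the function name `collision_probability` is made up here, and the 36-bit case assumes that a ninth hexadecimal character adds 4 bits to the hash space, which the email does not spell out.

```python
import math


def collision_probability(n_items: int, n_bits: int) -> float:
    """Approximate chance that at least two of n_items random hashes collide.

    Uses the birthday bound: P ~= 1 - exp(-n(n-1) / (2 * 2**n_bits)).
    """
    space = 2 ** n_bits
    # expm1 keeps precision when the exponent is very small.
    return -math.expm1(-n_items * (n_items - 1) / (2 * space))


# 10,000 hashes per taxonomy, as in the email.
p32 = collision_probability(10_000, 32)  # 8 hex characters
p36 = collision_probability(10_000, 36)  # 9 hex characters (assumed: 4 extra bits)

print(f"32-bit: {p32:.4%}")
print(f"36-bit: {p36:.4%}")
```

Under these assumptions the 32-bit result lands close to the 1% mentioned above, and the extra character lowers it by roughly a factor of 16.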
hashIDs are derived by running a hash algorithm on the sorted list of cell_ids belonging to the cell set being tagged.
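A minimal sketch of that derivation, assuming CRC-32 is the chosen algorithm. The function name `cell_set_hash`, the newline separator, the UTF-8 encoding, and the example cell_ids are all hypothetical choices made for illustration; the linked repositories may do this differently.

```python
import binascii
import zlib


def cell_set_hash(cell_ids):
    """Return an 8-character hex CRC-32 for a cell set, independent of input order."""
    # Sorting makes the serialized form canonical, so the same set of
    # cell_ids always yields the same hash regardless of listing order.
    canonical = "\n".join(sorted(cell_ids)).encode("utf-8")
    # zlib.crc32 returns an unsigned 32-bit integer; pad to 8 hex digits.
    return format(zlib.crc32(canonical), "08x")


ids = ["cell_003", "cell_001", "cell_002"]  # hypothetical cell_ids
hash_id = cell_set_hash(ids)
print(hash_id)

# Order of the input does not matter.
assert hash_id == cell_set_hash(reversed(ids))
# zlib and binascii expose the same CRC-32 implementation.
assert zlib.crc32(b"cell_001") == binascii.crc32(b"cell_001")
```

The separator matters: joining without one would make the sets ["ab", "c"] and ["a", "bc"] hash identically, so any real implementation needs a delimiter that cannot appear inside a cell_id.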
TBD - do we make these into primary IDs, namespaced on taxonomy, or do they remain secondary IDs used for tracking the identity of cell sets?