brain-bican / taxonomy-development-tools

Tools to build and edit Cell Annotation Schema taxonomies.
Apache License 2.0

Add support for cell set hash IDs #29

Closed: dosumis closed this issue 8 months ago

dosumis commented 1 year ago

Hash IDs are derived by running a hash algorithm on the sorted list of cell_ids belonging to the cell set being tagged.
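A minimal sketch of the idea in Python. The function name `cell_set_hash`, the newline separator, the UTF-8 encoding and the use of CRC-32 (one of the candidates discussed in this thread) are all assumptions for illustration, not a settled design:

```python
import zlib


def cell_set_hash(cell_ids):
    """Return an 8-character hex hash for a cell set (hypothetical helper)."""
    # Sorting first makes the result independent of the order in which
    # the cell IDs were supplied.
    canonical = "\n".join(sorted(cell_ids))  # separator is an assumption
    # zlib.crc32 returns an unsigned 32-bit integer in Python 3;
    # "08x" pads it to 8 lowercase hex characters.
    return format(zlib.crc32(canonical.encode("utf-8")), "08x")


# Placeholder cell IDs, not taken from any real taxonomy.
hash_a = cell_set_hash(["cell_3", "cell_1", "cell_2"])
hash_b = cell_set_hash(["cell_1", "cell_2", "cell_3"])
```

Because of the sort, `hash_a` and `hash_b` are the same string, so the ID tracks the membership of the cell set rather than the order of the list.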

TBD - do we make these into primary IDs, namespaced on taxonomy - or do they remain as secondary IDs used for tracking identity of cell sets?

dosumis commented 1 year ago

Email from @hkir-dev on candidates for short hash strings (suitable as primary IDs):

CRC-32 stands out as the most promising candidate. It produces 32-bit values, which are 8 characters long when written in hexadecimal, making it suitable for our requirements.

I created two repositories to demonstrate its usage in Python and R.

https://github.com/hkir-dev/hashing_algorithms_python
https://github.com/hkir-dev/hashing_algorithms_r

In Python, you'll find that there are two convenient modules, namely zlib and binascii, which provide easy access to the CRC-32 algorithm. For R, we can utilize the digest package to achieve the same functionality.
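As a quick illustration of the Python side, the two standard-library functions can be called like this. The input `b"123456789"` is just the conventional CRC check string, chosen here as an example:

```python
import binascii
import zlib

data = b"123456789"  # conventional CRC test input, used only as an example

# Both modules expose the same CRC-32 and return an unsigned integer.
crc_zlib = zlib.crc32(data)
crc_binascii = binascii.crc32(data)

# Render the 32-bit value as a zero-padded, 8-character hex string.
hash_id = format(crc_zlib, "08x")
```

Either module works; `zlib.crc32` and `binascii.crc32` give the same value for the same bytes.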

If we generate 10,000 hashes per taxonomy, there is a roughly 1% hash collision probability for a 32-bit algorithm. In case of a collision, we can add an extra character (9 characters total) to solve the problem. In Python, I added a unit test to demonstrate a collision: https://github.com/hkir-dev/hashing_algorithms_python/blob/main/src/test/collusion_test.py
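The 1% figure can be checked with the standard birthday-problem approximation, p ≈ 1 - exp(-n(n-1) / 2N), where n is the number of hashes and N = 2^bits is the size of the hash space. A small sketch follows; the function name `collision_probability` is hypothetical, and the formula assumes hash values are uniformly distributed:

```python
import math


def collision_probability(n_hashes, bits=32):
    """Approximate chance that at least two of n_hashes values collide."""
    space = 2 ** bits  # number of distinct hash values
    exponent = -n_hashes * (n_hashes - 1) / (2 * space)
    # -expm1(x) equals 1 - exp(x) and stays accurate for tiny probabilities.
    return -math.expm1(exponent)


p_32 = collision_probability(10_000)           # 8 hex characters
# Assumption: the extra 9th character is a hex digit, i.e. 4 more bits.
p_36 = collision_probability(10_000, bits=36)
```

For 10,000 hashes and 32 bits this comes out at roughly 1.2%, in line with the figure quoted above. Under the stated assumption about the 9th character, the probability drops by about a factor of 16.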

Based on your feedback and suggestions I can do further investigations.

Best,

Huseyin KIR

See also:
Hash collision probabilities: https://preshing.com/20110504/hash-collision-probabilities/
Compare algorithms: http://www.sha1-online.com/
Adler-32 vs CRC-32: https://en.wikipedia.org/wiki/Adler-32#Advantages_and_disadvantages