biothings / mygeneset.info

Create short user-friendly ids for user-created genesets #63

Closed ravila4 closed 1 year ago

ravila4 commented 1 year ago

User-created genesets currently have long random primary _id strings such as f78lQ4MBbFFuJZ6h9lGW, which are generated by default by Elasticsearch.

We could make these ids easier to read and remember by generating custom 5- or 6-character strings drawn from, for example, the Base62 alphabet (0-9, a-z, A-Z).

Base62 uniqueness metrics:

5 chars in Base62 give 62^5 = 916,132,832 unique IDs (~1 billion). At 10k IDs per day, that lasts 91k+ days.

6 chars in Base62 give 62^6 = 56,800,235,584 unique IDs (56+ billion). At 10k IDs per day, that lasts 5+ million days.
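As a quick check of the arithmetic above, here is a small sketch. The function names are only illustrative, and the 10k IDs/day rate is the assumption used in the estimate, not a measured figure.

```python
# Capacity of fixed-length Base62 ids and how long they last at a given rate.
ALPHABET_SIZE = 62      # 0-9, a-z, A-Z
IDS_PER_DAY = 10_000    # rate assumed in the estimate above


def capacity(length: int) -> int:
    """Number of distinct Base62 strings of exactly `length` characters."""
    return ALPHABET_SIZE ** length


def days_until_exhausted(length: int, ids_per_day: int = IDS_PER_DAY) -> int:
    """Whole days until every id of this length has been handed out."""
    return capacity(length) // ids_per_day


for n in (5, 6):
    print(n, capacity(n), days_until_exhausted(n))
# 5 916132832 91613
# 6 56800235584 5680023
```

Note that these figures are the point of full exhaustion. If ids are drawn at random rather than sequentially, collisions would show up well before then, so some uniqueness check would still be needed.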

Additionally, we might want to prepend a CURIE-style prefix to our user-created IDs. This would let us tell the _ids of user-created genesets apart from other _ids on sight.

A proposed string to use as a prefix for these ids is: mygst:. Using base62 strings of length 6, a typical _id would look like: mygst:UPrGAT
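A minimal sketch of how such an id could be generated, assuming random selection from the Base62 alphabet. `new_geneset_id` and the `exists` callback are hypothetical names for illustration; in practice `exists` would be whatever lookup checks the index for an existing _id.

```python
import secrets
import string

BASE62 = string.digits + string.ascii_letters  # 0-9, a-z, A-Z (62 characters)
PREFIX = "mygst:"                              # prefix proposed above


def new_geneset_id(length: int = 6, exists=lambda _id: False) -> str:
    """Return a prefixed random Base62 id that `exists` does not report as taken.

    `exists` is a hypothetical callback (e.g. an index lookup); the default
    treats every candidate as free.
    """
    while True:
        suffix = "".join(secrets.choice(BASE62) for _ in range(length))
        candidate = PREFIX + suffix
        if not exists(candidate):
            return candidate


print(new_geneset_id())  # e.g. something shaped like mygst:UPrGAT
```

Retrying on collision keeps the ids short; with 6 characters the id space is large enough that retries should be rare at the volumes discussed here.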

ravila4 commented 1 year ago

Some example algorithms for Base62 conversion: https://stackoverflow.com/questions/1119722/base-62-conversion#1119769
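For reference, a minimal sketch of integer-to-Base62 conversion along the lines of the approaches in that thread. This is not copied from any particular answer, and the alphabet order (digits, then lowercase, then uppercase) is an assumption; any fixed order works as long as encode and decode agree.

```python
import string

BASE62 = string.digits + string.ascii_letters  # assumed order: 0-9, a-z, A-Z


def encode(num: int) -> str:
    """Encode a non-negative integer as a Base62 string."""
    if num < 0:
        raise ValueError("num must be non-negative")
    if num == 0:
        return BASE62[0]
    digits = []
    while num:
        num, rem = divmod(num, 62)
        digits.append(BASE62[rem])
    return "".join(reversed(digits))  # most significant digit first


def decode(text: str) -> int:
    """Decode a Base62 string back to an integer."""
    num = 0
    for ch in text:
        num = num * 62 + BASE62.index(ch)  # raises ValueError on invalid chars
    return num


assert decode(encode(123456789)) == 123456789
```

Encoding a sequential counter this way would avoid collisions entirely, at the cost of making ids guessable; random generation trades that for a uniqueness check.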