Open newgene opened 1 year ago
Hi Chunlei
I was thinking about this a bit.
I think there are some disadvantages to precomputing the similarities:
As an alternative, maybe we can compute a vector representation of every geneset, store it in an Elasticsearch field, and then do the similarity search on the fly.
A good algorithm for this could be MinHash: it approximates the Jaccard similarity, should be faster to calculate on the fly, and avoids the issue of having high-dimensional vectors. Here is a Python library for computing MinHash signatures: https://ekzhu.com/datasketch/minhash.html
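To make the idea concrete, here is a minimal standard-library sketch of MinHash. It is not the datasketch API linked above; the function names, the seeded BLAKE2b hashing, and the signature length of 256 are all assumptions chosen for illustration. It shows how a fixed-length signature is built per geneset and how the fraction of matching positions approximates the Jaccard similarity.

```python
import hashlib

NUM_PERM = 256  # signature length; an assumed value, not from this thread


def _seeded_hash(seed: int, token: str) -> int:
    """64-bit hash of a token under a seed (stands in for one permutation)."""
    digest = hashlib.blake2b(
        token.encode("utf-8"),
        digest_size=8,
        salt=seed.to_bytes(8, "little"),
    ).digest()
    return int.from_bytes(digest, "little")


def minhash_signature(genes, num_perm=NUM_PERM):
    """Keep the minimum hash value per seed. `genes` must be non-empty."""
    genes = set(genes)
    if not genes:
        raise ValueError("cannot build a signature for an empty geneset")
    return [min(_seeded_hash(seed, g) for g in genes) for seed in range(num_perm)]


def estimate_jaccard(sig_a, sig_b):
    """Fraction of signature positions where both genesets agree."""
    if len(sig_a) != len(sig_b):
        raise ValueError("signatures must have the same length")
    return sum(a == b for a, b in zip(sig_a, sig_b)) / len(sig_a)


def exact_jaccard(a, b):
    """Exact Jaccard similarity, used here only to check the estimate."""
    a, b = set(a), set(b)
    return len(a & b) / len(a | b)


# Hypothetical gene IDs: two sets of 100 genes sharing 50 of them.
set_a = {str(i) for i in range(1, 101)}
set_b = {str(i) for i in range(51, 151)}

sig_a = minhash_signature(set_a)
sig_b = minhash_signature(set_b)

print("exact:   ", round(exact_jaccard(set_a, set_b), 3))
print("estimate:", round(estimate_jaccard(sig_a, sig_b), 3))
```

The signature has the same length regardless of geneset size, which is what would make it practical to store in a single indexed field. A longer signature tightens the estimate at the cost of storage.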
Regarding integration with the API, it would require a custom script for computing the Jaccard similarity from MinHash signatures, and exposing it through an endpoint...
I imagine the endpoint (not sure if it should be the same as the query endpoint) could take a geneset ID as input and a similarity threshold as a parameter.
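As a rough sketch of what such an endpoint might do internally, the function below takes a geneset ID and a threshold and returns the IDs whose stored signatures are similar enough. The name `find_similar`, the in-memory dictionary standing in for the index, and the tiny hand-written signatures are all hypothetical; a real implementation would read signatures from the datastore.

```python
def estimate_jaccard(sig_a, sig_b):
    """Fraction of positions where two equal-length signatures agree."""
    if len(sig_a) != len(sig_b):
        raise ValueError("signatures must have the same length")
    return sum(a == b for a, b in zip(sig_a, sig_b)) / len(sig_a)


def find_similar(geneset_id, signatures, threshold=0.5):
    """Return (id, score) pairs scoring at or above threshold, best first."""
    query = signatures[geneset_id]
    hits = []
    for other_id, sig in signatures.items():
        if other_id == geneset_id:
            continue  # skip the query geneset itself
        score = estimate_jaccard(query, sig)
        if score >= threshold:
            hits.append((other_id, score))
    return sorted(hits, key=lambda hit: hit[1], reverse=True)


# Toy 4-value signatures (real ones would be much longer).
signatures = {
    "gs_a": [3, 8, 1, 7],
    "gs_b": [3, 8, 1, 2],  # agrees with gs_a in 3 of 4 positions
    "gs_c": [9, 9, 9, 7],  # agrees with gs_a in 1 of 4 positions
}

print(find_similar("gs_a", signatures, threshold=0.5))
# [('gs_b', 0.75)]
```

This version scans every stored signature, so it is linear in the number of genesets. Whether that is fast enough on the fly depends on the collection size.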
👍 Good idea! Will have a closer look. If we are able to calculate it on the fly, that would be ideal.
Jaccard similarity score would be a good option. See the example Python code here.
We can compute this score as a scheduled job, define a threshold and store the similar gene sets if any.
This could be useful especially for user gene sets to indicate possible duplications.
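For illustration, a batch job along these lines could look like the sketch below. It is not the example code referenced above; the function names, the 0.8 threshold, and the geneset IDs are assumptions. It computes the exact Jaccard score for every pair and keeps the pairs at or above the threshold.

```python
from itertools import combinations


def jaccard(a, b):
    """Exact Jaccard similarity: |A ∩ B| / |A ∪ B| (0.0 if both are empty)."""
    a, b = set(a), set(b)
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def similar_pairs(genesets, threshold=0.8):
    """Return (id_a, id_b, score) for every pair scoring >= threshold."""
    pairs = []
    for (id_a, genes_a), (id_b, genes_b) in combinations(genesets.items(), 2):
        score = jaccard(genes_a, genes_b)
        if score >= threshold:
            pairs.append((id_a, id_b, score))
    return pairs


# Hypothetical genesets keyed by ID; values are sets of gene IDs.
genesets = {
    "user_1": {"1017", "1018", "1019", "1020"},
    "user_2": {"1017", "1018", "1019", "1020", "1021"},
    "curated_1": {"7157", "672"},
}

print(similar_pairs(genesets, threshold=0.8))
# [('user_1', 'user_2', 0.8)]
```

The pairwise loop grows quadratically with the number of genesets, which is the main cost of the scheduled-job approach compared with computing similarity on the fly.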