biothings / mygeneset.info

Apache License 2.0

identify similar gene sets via similarity scores #96

Open newgene opened 1 year ago

newgene commented 1 year ago

A Jaccard similarity score would be a good option. See the example Python code here.

We can compute this score as a scheduled job, define a threshold and store the similar gene sets if any.

This could be useful especially for user gene sets to indicate possible duplications.
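For illustration, a minimal sketch of the precomputed approach: compute the Jaccard score for every pair of gene sets and keep the pairs at or above a threshold. The function names, the sample gene set IDs, and the 0.5 threshold are assumptions made for this example, not part of the mygeneset.info codebase.

```python
from itertools import combinations


def jaccard(a, b):
    """Jaccard similarity: |A ∩ B| / |A ∪ B|, defined here as 0.0 for two empty sets."""
    a, b = set(a), set(b)
    union = a | b
    if not union:
        return 0.0
    return len(a & b) / len(union)


def similar_pairs(genesets, threshold=0.8):
    """Return (id_a, id_b, score) for every pair scoring at or above the threshold.

    `genesets` maps a gene set ID to a collection of gene IDs (hypothetical layout).
    """
    results = []
    # Sort by ID so the output order is deterministic.
    for (id_a, genes_a), (id_b, genes_b) in combinations(sorted(genesets.items()), 2):
        score = jaccard(genes_a, genes_b)
        if score >= threshold:
            results.append((id_a, id_b, score))
    return results


# Made-up example data.
genesets = {
    "gs1": {"1017", "1018", "1019", "1020"},
    "gs2": {"1017", "1018", "1019", "1021"},
    "gs3": {"7157", "672"},
}
print(similar_pairs(genesets, threshold=0.5))
```

Here `gs1` and `gs2` share 3 of 5 distinct genes, so they are the only pair reported. Note that the loop visits every pair, so the cost grows quadratically with the number of gene sets.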

ravila4 commented 1 year ago

Hi Chunlei

I was thinking about this a bit.

I think there are some disadvantages to precomputing the similarities:

  1. The user may want to customize the similarity threshold
  2. The list of similar genesets would not account for newly created genesets.
  3. Potentially high computational cost of recomputing all NxN pairwise similarities with every refresh.

As an alternative, maybe we can compute a vector representation of every geneset, store it in an Elasticsearch field, and then do similarity searching on the fly.

MinHash could be a good algorithm for this: it approximates the Jaccard similarity, should be faster to calculate on the fly, and avoids the issue of having high-dimensional vectors. Here is a Python library for computing MinHash signatures: https://ekzhu.com/datasketch/minhash.html
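To make the idea concrete, here is a rough standard-library sketch of MinHash. It is not the datasketch implementation linked above; the signature length of 128, the use of salted BLAKE2b as the hash family, and all function names are assumptions chosen for this example.

```python
import hashlib

NUM_PERM = 128  # signature length (assumed value; longer means a better estimate)


def _hash(seed, token):
    """64-bit hash of `token`, salted with `seed` to simulate one of many hash functions."""
    digest = hashlib.blake2b(
        token.encode("utf-8"),
        digest_size=8,
        salt=seed.to_bytes(8, "little"),
    ).digest()
    return int.from_bytes(digest, "little")


def minhash_signature(genes, num_perm=NUM_PERM):
    """Fixed-length signature: the minimum hash of the set under each seeded hash function."""
    genes = list(genes)
    if not genes:
        raise ValueError("cannot compute a signature for an empty gene set")
    return [min(_hash(seed, g) for g in genes) for seed in range(num_perm)]


def estimate_jaccard(sig_a, sig_b):
    """Fraction of matching signature positions, which estimates the Jaccard similarity."""
    if len(sig_a) != len(sig_b):
        raise ValueError("signatures must have the same length")
    matches = sum(x == y for x, y in zip(sig_a, sig_b))
    return matches / len(sig_a)


# Made-up gene sets; their exact Jaccard similarity is 3/5 = 0.6.
sig_a = minhash_signature({"1017", "1018", "1019", "1020"})
sig_b = minhash_signature({"1017", "1018", "1019", "1021"})
print(estimate_jaccard(sig_a, sig_b))  # an approximation of 0.6, not the exact value
```

The signature has a fixed length regardless of gene set size, which is what would make it practical to store in a single indexed field and compare at query time.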

ravila4 commented 1 year ago

Regarding integration with the API, it would require a custom script that computes the Jaccard similarity from MinHash signatures and exposes it through an endpoint...

I imagine the endpoint (not sure if it should be the same as the query endpoint) could take a geneset ID as input and a similarity threshold as a parameter.
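A minimal sketch of the logic such an endpoint might wrap, assuming signatures have already been computed and are available as an in-memory mapping. The function name, the `signatures` layout, and the response fields are hypothetical; a real implementation would read signatures from the index instead.

```python
def estimate_jaccard(sig_a, sig_b):
    """Fraction of positions where two equal-length signatures agree."""
    if len(sig_a) != len(sig_b):
        raise ValueError("signatures must have the same length")
    return sum(x == y for x, y in zip(sig_a, sig_b)) / len(sig_a)


def find_similar(geneset_id, signatures, threshold=0.5):
    """Return gene sets whose estimated similarity to `geneset_id` meets the threshold.

    `signatures` maps gene set ID -> signature (hypothetical in-memory stand-in
    for whatever storage the API would actually use).
    """
    if geneset_id not in signatures:
        raise KeyError(f"unknown gene set: {geneset_id}")
    query = signatures[geneset_id]
    hits = []
    for other_id, sig in signatures.items():
        if other_id == geneset_id:
            continue  # skip the query gene set itself
        score = estimate_jaccard(query, sig)
        if score >= threshold:
            hits.append({"_id": other_id, "similarity": score})
    # Most similar first.
    hits.sort(key=lambda h: h["similarity"], reverse=True)
    return hits


# Toy 4-value signatures, made up for the example.
signatures = {
    "gs1": [3, 8, 15, 42],
    "gs2": [3, 8, 15, 99],
    "gs3": [7, 1, 2, 5],
}
print(find_similar("gs1", signatures, threshold=0.5))
```

Because the threshold is applied at query time, each caller can choose a different cutoff, and newly added gene sets are picked up as soon as their signatures are stored.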

newgene commented 1 year ago

👍 Good idea! Will have a closer look. If we are able to calculate it on the fly, that would be ideal.