biothings / mygene.info

MyGene.info: A BioThings API for gene annotations
http://mygene.info

improve default sorting behavior for antisense genes #152

Closed andrewsu closed 3 months ago

andrewsu commented 3 months ago

As reported to the help email:

---------- Forwarded message ---------
Date: Wed, Jul 31, 2024 at 10:44 PM
Subject: Antisense RNAs

[snip] I did, however, come across something that seems like a bug, and I wanted to give you a heads-up. I was recently using your API to get all known aliases for approved gene symbols (HGNC) and I got puzzled by some results I was getting. It seems that antisense RNAs get higher scores and thus land at the top of the hit list instead of the sense genes they correspond to (see enclosed response for CTNNA2 with CTNNA2-AS1 being the top hit). In my search I got this kind of result for almost 10% of queries.

I confirmed the non-ideal sorting behavior here https://mygene.info/v3/query?q=CTNNA2&species=human (the third result for CTNNA2 should come first):

{
  "took": 5,
  "total": 3,
  "max_score": 104.61278,
  "hits": [
    {
      "_id": "ENSG00000229385",
      "_score": 104.61278,
      "name": "CTNNA2 antisense RNA 1",
      "symbol": "CTNNA2-AS1",
      "taxid": 9606
    },
    {
      "_id": "101927987",
      "_score": 104.61278,
      "entrezgene": "101927987",
      "name": "CTNNA2 antisense RNA 1",
      "symbol": "CTNNA2-AS1",
      "taxid": 9606
    },
    {
      "_id": "1496",
      "_score": 87.88892,
      "entrezgene": "1496",
      "name": "catenin alpha 2",
      "symbol": "CTNNA2",
      "taxid": 9606
    }
  ]
}
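
Until the default ranking changes, a client can work around this by re-sorting the hits itself. The following is a minimal sketch, not part of the MyGene.info API: the helper name `exact_symbol_first` is hypothetical, and the hit list is trimmed from the response above. It moves hits whose `symbol` exactly matches the query to the front and otherwise keeps the server's order.

```python
def exact_symbol_first(hits, query):
    """Return hits with exact (case-insensitive) symbol matches first.

    Hypothetical client-side helper. sorted() is stable, so hits within
    each group keep the order the server returned them in.
    """
    q = query.strip().lower()
    # The key is False (sorts first) for exact symbol matches, True otherwise.
    return sorted(hits, key=lambda h: str(h.get("symbol", "")).lower() != q)


# Hits trimmed from the CTNNA2 response above.
hits = [
    {"_id": "ENSG00000229385", "_score": 104.61278, "symbol": "CTNNA2-AS1"},
    {"_id": "101927987", "_score": 104.61278, "symbol": "CTNNA2-AS1"},
    {"_id": "1496", "_score": 87.88892, "symbol": "CTNNA2"},
]

reranked = exact_symbol_first(hits, "CTNNA2")
print([h["_id"] for h in reranked])
# prints ['1496', 'ENSG00000229385', '101927987']
```

This only reorders what the server already returned, so it cannot help when the sense gene is missing from the first page of results.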

Searching for "antisense" as a keyword turns up many other examples, and this likely applies to all ~12k results.

I will advise the reporter that a fielded search (e.g., https://mygene.info/v3/query?q=symbol:CTNNA2&fields=alias,symbol,taxid) would be useful in this case, but I think there is a clear opportunity to improve our default sorting behavior.
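
For reference, the fielded query can be built programmatically. This is a rough sketch using only the Python standard library; the helper name `fielded_symbol_url` is made up for illustration, and the `q`, `fields`, and `species` parameters are the ones that appear in the URLs in this thread.

```python
from urllib.parse import urlencode

BASE_URL = "https://mygene.info/v3/query"


def fielded_symbol_url(symbol, fields=("alias", "symbol", "taxid"), species=None):
    """Build a query URL that searches only the symbol field.

    Hypothetical helper; it only assembles the URL and makes no request.
    """
    params = {"q": f"symbol:{symbol}", "fields": ",".join(fields)}
    if species:
        params["species"] = species
    # Leave ':' and ',' unescaped so the URL reads like the one quoted above.
    return f"{BASE_URL}?{urlencode(params, safe=':,')}"


print(fielded_symbol_url("CTNNA2"))
# prints https://mygene.info/v3/query?q=symbol:CTNNA2&fields=alias,symbol,taxid
```

Passing `species="human"` appends `&species=human`, matching the filter used in the unfielded query earlier in this issue.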

jal347 commented 3 months ago

I have increased the weight for symbols in our search query and double-checked it against the examples you gave. Let me know if you run into any issues.
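
The actual change is not shown in this thread. Purely as an illustration of the general idea, and assuming an Elasticsearch-style backend (which the `took`, `max_score`, and `_score` fields in the response suggest), a query body that weights exact symbol matches more heavily might look like the sketch below. The function name, the boost value, and the assumption that `symbol` is indexed as an exact-match keyword field are all hypothetical.

```python
import json


def boosted_symbol_query(term, symbol_boost=5.0):
    """Build an illustrative Elasticsearch-style query body.

    Hypothetical sketch, not the query MyGene.info actually runs.
    An exact match on `symbol` gets `symbol_boost` times the weight
    of a full-text match on `name`.
    """
    return {
        "query": {
            "bool": {
                "should": [
                    # Assumes `symbol` is an exact-match (keyword) field.
                    {"term": {"symbol": {"value": term, "boost": symbol_boost}}},
                    {"match": {"name": {"query": term}}},
                ],
                "minimum_should_match": 1,
            }
        }
    }


print(json.dumps(boosted_symbol_query("CTNNA2"), indent=2))
```

With a clause like this, a record whose symbol is exactly CTNNA2 would pick up the boosted score, while CTNNA2-AS1 would match only through its name.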