biothings / mygene.info

MyGene.info: A BioThings API for gene annotations
http://mygene.info

Gene query scores hits poorly #3

Closed dhimmel closed 7 years ago

dhimmel commented 7 years ago

https://mygene.info/v3/query?q=A1BG returns

{
  "total": 4,
  "took": 3,
  "max_score": 27.817915,
  "hits": [
    {
      "_id": "503538",
      "_score": 27.817915,
      "entrezgene": 503538,
      "name": "A1BG antisense RNA 1",
      "symbol": "A1BG-AS1",
      "taxid": 9606
    },
    {
      "_id": "117586",
      "_score": 9.105442,
      "entrezgene": 117586,
      "name": "alpha-1-B glycoprotein",
      "symbol": "A1bg",
      "taxid": 10090
    },
    {
      "_id": "140656",
      "_score": 5.982859,
      "entrezgene": 140656,
      "name": "alpha-1-B glycoprotein",
      "symbol": "A1bg",
      "taxid": 10116
    },
    {
      "_id": "1",
      "_score": 5.10959,
      "entrezgene": 1,
      "name": "alpha-1-B glycoprotein",
      "symbol": "A1BG",
      "taxid": 9606
    }
  ]
}

The first result returned is actually the worst result. The last result is the best (full symbol match with no case differences). So what determines the order of genes in hits? I expected they would be ordered by relevance/score. If the ordering doesn't reflect score, then shouldn't a score field be included for each hit?

The use case I have in mind is that a user will submit a query and we will display the resulting hits in order of best to worst, like Google does for searches.

dhimmel commented 7 years ago

Sorry I missed the _score field. I crossed out the part of my above comment where I overlooked the _score field.

So let's focus this issue on the unexpected scoring for the above query.

newgene commented 7 years ago

@dhimmel In this default query (with a field prefix), we give symbol matches more weight than matches on other fields (like name, alias), as you can see in the query we build here:

https://github.com/SuLab/mygene.info/blob/master/src/utils/es.py#L300

However, in this case, "A1BG" seems to appear more times in the first gene hit than in the other hits, which gave it a much higher score; even the extra weight on the symbol field did not lift the other hits above it:

https://mygene.info/v3/query?q=A1BG&fields=name,symbol,alias,summary

We can fine-tune how we do the weighting on this default query, so that the returned gene hits will be ranked closer to what we expected.

For now, one alternative for you is to make your own customized query like this:

q=symbol:A1BG^10 OR name:A1BG OR alias:A1BG OR summary:A1BG^0.1

As you can see, you can set your own weighting at query time.
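
A minimal sketch of assembling such a weighted query string in Python and turning it into a request URL. The helper name `build_weighted_query` and the choice to omit `^1` for unweighted fields are assumptions made for illustration; only the endpoint and the `field:term^boost` form come from this thread.

```python
from urllib.parse import urlencode

# Endpoint taken from the URLs in this thread.
BASE_URL = "https://mygene.info/v3/query"


def build_weighted_query(term, boosts):
    """Join `field:term^boost` clauses with OR (hypothetical helper).

    A boost of 1 is treated as the default weight, so no `^1` suffix is added.
    """
    clauses = []
    for field, boost in boosts.items():
        clause = f"{field}:{term}"
        if boost != 1:
            clause += f"^{boost}"
        clauses.append(clause)
    return " OR ".join(clauses)


q = build_weighted_query(
    "A1BG", {"symbol": 10, "name": 1, "alias": 1, "summary": 0.1}
)
# urlencode percent-encodes ":" and "^" and turns spaces into "+".
url = f"{BASE_URL}?{urlencode({'q': q})}"
print(q)
print(url)
```

The weights are passed as a plain dict so they can be tuned per request without touching the string-building logic.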

newgene commented 7 years ago

@dhimmel We can now confirm that there is something wrong with the weighting on our v3 API, which causes the ranking to differ from what we expect (basically, the weighting does not take effect). On the v2 API, the hits are ranked as expected (symbol matches go first, and human genes come before mouse, rat, and the rest, etc.):

https://mygene.info/v2/query?q=A1BG

We are now looking into the issue (probably related to the Elasticsearch version upgrade) and will have this fixed ASAP.

dhimmel commented 7 years ago

For now, one alternative for you is to make your own customized query like this:

q=symbol:A1BG^10 OR name:A1BG OR alias:A1BG OR summary:A1BG^0.1

@newgene for the customized query, how should we encode queries with spaces or wildcards in them? For example, how would we search for alpha-1-B glycopro* in the symbol, name, or alias with custom weights? I'm struggling with how the URI encoding should be performed.
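
On the URI-encoding part of this question, a minimal sketch using only Python's standard library. It covers percent-encoding of the `q` value only; how the API's query parser treats a wildcard term that contains a space is not answered in this thread, so the raw query below is just the example text, not a confirmed working query.

```python
from urllib.parse import parse_qs, quote, urlencode, urlsplit

# Example term from the question above; contains a space and a wildcard.
raw_query = "alpha-1-B glycopro*"

# Percent-encode every reserved character: space -> %20, "*" -> %2A.
encoded_percent_style = quote(raw_query, safe="")

# Form-style encoding for a query string: space -> "+", "*" -> %2A.
encoded_form_style = urlencode({"q": raw_query})

url = f"https://mygene.info/v3/query?{encoded_form_style}"

# Decoding the URL again recovers the original text unchanged.
decoded = parse_qs(urlsplit(url).query)["q"][0]
print(encoded_percent_style)
print(url)
print(decoded)
```

Either encoding decodes back to the same string on the server side, so the choice between `%20` and `+` is a matter of which helper is more convenient.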

newgene commented 7 years ago

@dhimmel OK, we have now gotten to the bottom of this issue. A lowercase filter (for the query term) that we had in our previous v2 API was not in place. As a result, https://mygene.info/v3/query?q=A1BG did not return the hits in the right order, while the lowercase query term returned the correct order:

https://mygene.info/v2/query?q=a1bg

We have now fixed the problem, so both queries above return the hits in exactly the same order.
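
One way to check this from the client side is to compare the sequence of `_id` values in two parsed JSON responses. The sketch below assumes the responses have already been fetched and decoded into dicts; the helper names are hypothetical, and the sample dicts are made-up stand-ins that reuse IDs from this thread, not real API output.

```python
def hit_ids(response):
    """Return the `_id` of each hit, in the order the API returned them."""
    return [hit["_id"] for hit in response.get("hits", [])]


def same_ranking(resp_a, resp_b):
    """True when both responses list the same genes in the same order."""
    return hit_ids(resp_a) == hit_ids(resp_b)


# Made-up stand-ins for the upper-case and lower-case query responses.
upper_case = {"hits": [{"_id": "1"}, {"_id": "503538"}, {"_id": "117586"}]}
lower_case = {"hits": [{"_id": "1"}, {"_id": "503538"}, {"_id": "117586"}]}
reordered = {"hits": [{"_id": "503538"}, {"_id": "117586"}, {"_id": "1"}]}

print(same_ranking(upper_case, lower_case))
print(same_ranking(upper_case, reordered))
```

Comparing IDs rather than scores keeps the check stable even if the absolute `_score` values differ between the two queries.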