biothings / mygene.info

MyGene.info: A BioThings API for gene annotations
http://mygene.info

Query genes by alternative names #2

Closed dhimmel closed 7 years ago

dhimmel commented 7 years ago

@cgreene suggested we look into mygene.info for Project Cognoma: https://github.com/cognoma/core-service/issues/29#issuecomment-252601701. My first impression is that this is a really awesome service that will help us a lot.

When I tried searching mygene.info/v3/query by gene name, no results were returned.

By name I mean that A1BG has the following Entrez Gene information:

Preferred Names: alpha-1B-glycoprotein

Names: HEL-S-163pA; epididymis secretory sperm binding protein Li 163pA

Is this feature missing because biologists usually search by symbol? It seems like there would be many situations where name search would help you identify a gene you were interested in.

newgene commented 7 years ago

You can query by a gene symbol directly:

http://mygene.info/v3/query?q=A1BG

or, if you want the match on official symbol only:

http://mygene.info/v3/query?q=symbol:A1BG

Also note that, by default, the query service returns only matches from human, mouse, and rat. Because the service covers every species with annotated genes, returning matches for all species by default would not fit most of our users' use cases.

You can still get the matches for all species if you want:

http://mygene.info/v3/query?q=A1BG&species=all

Or, if you want a specific species:

http://mygene.info/v3/query?q=A1BG&species=mouse

Ref: http://docs.mygene.info/en/v3/doc/query_service.html#species
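
The query variants above can also be assembled programmatically. This is a minimal sketch using only the Python standard library; the helper name `build_query_url` is hypothetical, and it assumes nothing beyond the `q` and `species` parameters shown in this comment:

```python
from urllib.parse import urlencode

BASE_URL = "http://mygene.info/v3/query"  # endpoint quoted in this thread


def build_query_url(q, species=None):
    """Build a query URL; leave `species` unset to keep the service default."""
    params = {"q": q}
    if species is not None:
        params["species"] = species
    # urlencode percent-encodes reserved characters such as ":" in "symbol:A1BG"
    return f"{BASE_URL}?{urlencode(params)}"


if __name__ == "__main__":
    print(build_query_url("A1BG"))
    print(build_query_url("symbol:A1BG"))
    print(build_query_url("A1BG", species="all"))
    print(build_query_url("A1BG", species="mouse"))
```

The colon in a fielded query comes out percent-encoded (`%3A`), which is the standard URL encoding of the same query string.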

cgreene commented 7 years ago

@newgene : Would an autocomplete style search be expected to work on this field (or another one)?

dhimmel commented 7 years ago

@newgene, I'm asking about querying by gene name rather than symbol. A1BG is a symbol. alpha-1B-glycoprotein, HEL-S-163pA, and epididymis secretory sperm binding protein Li 163pA are names. Is querying by gene name supported?

Interestingly, https://mygene.info/v3/query?q=alpha-1B-glycoprotein returns:

{
  "total": 1,
  "took": 3,
  "max_score": 25.8868,
  "hits": [
    {
      "_id": "299963",
      "_score": 25.8868,
      "entrezgene": 299963,
      "name": "similar to alpha 1B-glycoprotein",
      "symbol": "RGD1564515",
      "taxid": 10116
    }
  ]
}

This result is missing the correct gene (entrezgene == 1), even though the query exactly matches one of its names.

newgene commented 7 years ago

@dhimmel yes, I read your post too quickly. I now realize you were asking about querying by gene name, in which case you need to query like this:

http://mygene.info/v3/query?q="alpha-1-B glycoprotein"

or

http://mygene.info/v3/query?q=name:"alpha-1-B glycoprotein"

Looks like the dash in your original query made the difference.
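
Typing the quoted forms above into a browser works, but from code the quotes and the space need percent-encoding. This is a minimal sketch with the standard library; `phrase_query_url` is a hypothetical helper, and it assumes the phrase itself contains no double quotes:

```python
from urllib.parse import quote

BASE_URL = "http://mygene.info/v3/query"  # endpoint quoted in this thread


def phrase_query_url(phrase, field=None):
    """Wrap `phrase` in double quotes, optionally prefixed with a field name."""
    quoted = f'"{phrase}"'
    q = f"{field}:{quoted}" if field else quoted
    # Keep ":" readable; quotes and spaces become %22 and %20
    return f"{BASE_URL}?q={quote(q, safe=':')}"


if __name__ == "__main__":
    print(phrase_query_url("alpha-1-B glycoprotein"))
    print(phrase_query_url("alpha-1-B glycoprotein", field="name"))
```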

newgene commented 7 years ago

@cgreene this might be similar to what you need:

https://bitbucket.org/sulab/mygene.autocomplete/overview

Note that you can customize the query to what you need, like this line:

"q": "(symbol:{term} OR symbol: {term}* OR name:{term}* OR alias: {term}* OR summary:{term}*)",

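To illustrate how such a template expands, here is a minimal sketch that fills the `{term}` placeholders with Python's `str.format`. Whether mygene.autocomplete substitutes the term this way is an assumption, and `autocomplete_query` is a hypothetical name:

```python
# Template copied from the line quoted above
QUERY_TEMPLATE = (
    "(symbol:{term} OR symbol: {term}* OR name:{term}* "
    "OR alias: {term}* OR summary:{term}*)"
)


def autocomplete_query(term):
    """Expand every {term} placeholder with what the user has typed so far."""
    return QUERY_TEMPLATE.format(term=term)


if __name__ == "__main__":
    print(autocomplete_query("cdk"))
```

Dropping or adding `OR` clauses in the template is how the search would be narrowed or widened to other fields.
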
dhimmel commented 7 years ago

@newgene got it. The preferred name section of the Entrez Gene website is confusing. The primary name for A1BG is "alpha-1-B glycoprotein". For some reason, the Entrez Gene webpage contains a field for preferred names that lists "alpha-1B-glycoprotein".

So it looks like MyGene gene queries search primary names but not alternatives. This issue is a feature request to also search the alternative names available in Entrez Gene.

cgreene commented 7 years ago

@newgene : What I'm really asking - is there an ngram tokenizer used for those fields? Trying to figure out if partial queries will return sensible matches. I searched for ngram_filter and didn't find anything in the source.

I poked around in this https://github.com/SuLab/mygene.info/blob/master/src/utils/es.py a bit, but I didn't find anything obvious right off hand and thought you might know.

dhimmel commented 7 years ago

@cgreene I think you're asking about partial search terms. For example, does https://mygene.info/v3/query?q=alpha-1-B%20glycoprot return a superset of the results that https://mygene.info/v3/query?q=alpha-1-B%20glycoprotein returns? It appears not, but I suggest you open a new issue, since this issue is for searching by alternate names.

cgreene commented 7 years ago

Good point @dhimmel. Opened #4 to focus on this.

newgene commented 7 years ago

@dhimmel I confirmed that the alternative names under the "General protein information" section of the NCBI A1BG page are not included in the current MyGene.info API. We will look into including them in a future release, after which you should be able to retrieve those hits using these alternative names.

newgene commented 7 years ago

@dhimmel @cgreene just want to let you guys know that we have now included those alternative names from NCBI for every gene object, under the field name "other_names":

http://mygene.info/v3/gene/1017?fields=other_names

and

http://mygene.info/v3/query?q=other_names:cyclin-dependent%20kinase

(Note that gene 299963 from your original example no longer has any alternative names in NCBI, so it currently has no other_names field.)

For now, the "other_names" field is not included in unfielded queries (that is, when you pass a term directly to "q" without specifying a field), so you will need to add the field name prefix explicitly in the query. We can re-evaluate this based on user feedback.
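
Until then, a client that wants both behaviours could combine an unfielded phrase with an explicit other_names clause. This is a sketch only: it assumes the `OR` operator and quoted-phrase forms used elsewhere in this thread also apply to other_names, and `with_other_names` is a hypothetical helper:

```python
from urllib.parse import quote

BASE_URL = "http://mygene.info/v3/query"  # endpoint quoted in this thread


def with_other_names(term):
    """Return a query string matching `term` unfielded or in other_names."""
    # Assumption: OR and quoted phrases behave here as in the examples above
    return f'"{term}" OR other_names:"{term}"'


def other_names_url(term):
    """Percent-encode the combined query into a full URL."""
    return f"{BASE_URL}?q={quote(with_other_names(term), safe=':')}"


if __name__ == "__main__":
    print(with_other_names("HEL-S-163pA"))
    print(other_names_url("HEL-S-163pA"))
```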

cgreene commented 7 years ago

Thanks @newgene! Cognoma decided to go with mygene.info for this service, so you may hear some more from us 👍

newgene commented 7 years ago

@cgreene Awesome! You should hear from us soon about a feature we are adding to let our users better customize their queries (as in your auto-suggestion use case).

dhimmel commented 7 years ago

@newgene thanks! Confirming the functionality based on the original example.

https://mygene.info/v3/query?q=other_names:HEL-S-163pA is returning (as expected):

{
  "total": 1,
  "max_score": 12.30816,
  "took": 14,
  "hits": [
    {
      "_id": "1",
      "_score": 12.30816,
      "entrezgene": 1,
      "name": "alpha-1-B glycoprotein",
      "symbol": "A1BG",
      "taxid": 9606
    }
  ]
}
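
For anyone consuming this response in code, here is a minimal sketch of pulling the key fields out of the hits. It parses the JSON shown above from a string rather than calling the service, and `hit_summaries` is a hypothetical helper name:

```python
import json

# Response body copied from the output above
RESPONSE_TEXT = """
{
  "total": 1,
  "max_score": 12.30816,
  "took": 14,
  "hits": [
    {
      "_id": "1",
      "_score": 12.30816,
      "entrezgene": 1,
      "name": "alpha-1-B glycoprotein",
      "symbol": "A1BG",
      "taxid": 9606
    }
  ]
}
"""


def hit_summaries(payload):
    """Return (entrezgene, symbol, name) for each hit; missing keys become None."""
    return [
        (hit.get("entrezgene"), hit.get("symbol"), hit.get("name"))
        for hit in payload.get("hits", [])
    ]


if __name__ == "__main__":
    print(hit_summaries(json.loads(RESPONSE_TEXT)))
```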