biothings / mygene.info

MyGene.info: A BioThings API for gene annotations
http://mygene.info

Allow for exact matches only? #41

Open buchanae opened 6 years ago

buchanae commented 6 years ago

Querying something like "BRCA1", I get a lot of seemingly unrelated matches such as "BRAT1".

This is obviously a symptom of the nature of ElasticSearch. In analytical use cases, personally, I think fuzzy matches are dangerous.

Could we add a query parameter to require an exact match? Or maybe it exists and I'm not seeing the docs?

newgene commented 6 years ago

@buchanae A general query like q=BRCA1 matches multiple fields (symbol, name, and others), but fuzzy matching is not used. The "BRAT1" gene matches because "BRCA1" is mentioned in its gene name.

You can get exactly what you need by using the fielded query:

q=symbol:BRCA1

or limited to human only:

q=symbol:BRCA1&species=human
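
The fielded query above can also be assembled programmatically. This is a minimal sketch that assumes the public https://mygene.info/v3/query endpoint; the helper name `build_symbol_query` is hypothetical, and nothing is sent over the network here:

```python
from urllib.parse import urlencode

# Assumed base URL of the MyGene.info v3 query endpoint.
BASE_URL = "https://mygene.info/v3/query"

def build_symbol_query(symbol, species=None):
    """Return a query URL restricted to the `symbol` field (hypothetical helper)."""
    params = {"q": f"symbol:{symbol}"}
    if species is not None:
        params["species"] = species
    # safe=":" keeps the field separator readable instead of encoding it as %3A.
    return f"{BASE_URL}?{urlencode(params, safe=':')}"

print(build_symbol_query("BRCA1", species="human"))
```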

buchanae commented 6 years ago

Ah, ok, thanks!

I actually can't even reproduce the results I mentioned now. Wish I had posted the query.

These are the queries I tried this morning: https://gist.github.com/buchanae/5cba60894e190c35da1ac3e1c7e5e511

buchanae commented 6 years ago

Here's an example I don't understand:

>>> import mygene
>>> mg = mygene.MyGeneInfo()
>>> mg.querymany(["CBLB"], species='human', fields="symbol,alias,ensembl.gene", scopes="symbol,alias")
querying 1-1...done.
Finished.
1 input query terms found dup hits:
    [('CBLB', 2)]
Pass "returnall=True" to return complete lists of duplicate or missing query terms.
[{'query': 'CBLB',
  '_id': '868',
  '_score': 89.78527,
  'alias': ['Cbl-b', 'Nbla00127', 'RNF56'],
  'ensembl': {'gene': 'ENSG00000114423'},
  'symbol': 'CBLB'},
 {'query': 'CBLB',
  '_id': '326625',
  '_score': 9.830278,
  'alias': ['ATR', 'CFAP23', 'cblB', 'cob'],
  'ensembl': {'gene': 'ENSG00000139428'},
  'symbol': 'MMAB'}]

Since I'm not passing returnall=True, shouldn't this return only the best hit?
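
Going by the message in the output, `returnall=True` only controls whether the complete lists of duplicate or missing terms are returned; it does not appear to limit the number of hits per term. If only the top-scoring hit is wanted, one option is to reduce the list on the client. A minimal sketch follows; `best_hits` is a hypothetical helper, the sample data is trimmed from the output above, and skipping entries without `_score` is a defensive assumption:

```python
def best_hits(hits):
    """Keep the highest-scoring hit for each query term (hypothetical helper)."""
    best = {}
    for hit in hits:
        if "_score" not in hit:  # assumption: entries without a score are not real hits
            continue
        term = hit["query"]
        if term not in best or hit["_score"] > best[term]["_score"]:
            best[term] = hit
    return best

hits = [
    {"query": "CBLB", "_id": "868", "_score": 89.78527, "symbol": "CBLB"},
    {"query": "CBLB", "_id": "326625", "_score": 9.830278, "symbol": "MMAB"},
]
print(best_hits(hits)["CBLB"]["symbol"])
```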

buchanae commented 6 years ago

And another.

>>> mg.querymany(["MCM3"], species='human', fields="symbol,alias,ensembl.gene", scopes="symbol,alias")
querying 1-1...done.
Finished.
1 input query terms found dup hits:
    [('MCM3', 2)]
Pass "returnall=True" to return complete lists of duplicate or missing query terms.
[{'query': 'MCM3',
  '_id': '4172',
  '_score': 84.13076,
  'alias': ['HCC5', 'P1-MCM3', 'P1.h', 'RLFB'],
  'ensembl': {'gene': 'ENSG00000112118'},
  'symbol': 'MCM3'},
 {'query': 'MCM3',
  '_id': '4176',
  '_score': 5.8433404,
  'alias': ['CDC47',
   'MCM2',
   'P1.1-MCM3',
   'P1CDC47',
   'P85MCM',
   'PNAS146',
   'PPP1R104'],
  'ensembl': {'gene': 'ENSG00000166508'},
  'symbol': 'MCM7'}]

As far as I can tell, the second match is happening because of a partial match on the alias string "P1.1-MCM3".
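
One way to drop such partial matches is to filter the hits on the client. This is a rough sketch that assumes hits shaped like the output above; `exact_matches` is a hypothetical helper, and accepting `alias` as either a string or a list is a defensive assumption:

```python
def exact_matches(hits, case_sensitive=True):
    """Keep hits whose symbol or one of whose aliases equals the query term exactly."""
    def norm(s):
        return s if case_sensitive else s.lower()

    kept = []
    for hit in hits:
        aliases = hit.get("alias", [])
        if isinstance(aliases, str):  # tolerate a single alias given as a plain string
            aliases = [aliases]
        names = [hit.get("symbol", "")] + list(aliases)
        if norm(hit["query"]) in {norm(n) for n in names}:
            kept.append(hit)
    return kept

hits = [
    {"query": "MCM3", "_id": "4172", "symbol": "MCM3",
     "alias": ["HCC5", "P1-MCM3", "P1.h", "RLFB"]},
    {"query": "MCM3", "_id": "4176", "symbol": "MCM7",
     "alias": ["CDC47", "MCM2", "P1.1-MCM3", "P1CDC47", "P85MCM", "PNAS146", "PPP1R104"]},
]
print([h["symbol"] for h in exact_matches(hits)])
```

Note that with `case_sensitive=False` the earlier CBLB example would still keep MMAB, because its alias "cblB" equals the query when case is ignored.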

newgene commented 6 years ago

@buchanae The "alias" field was indexed as free text because we observed that its values can sometimes contain whitespace. We can inspect the alias field further and optimize the indexing a bit (e.g. by not treating "-" as a word separator).
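
To illustrate why free-text indexing produces the MCM7 hit, here is a deliberately simplified tokenizer. It is a toy model of the idea, not Elasticsearch's actual analyzer: lowercasing and splitting only on whitespace and hyphens are assumptions made for the example.

```python
import re

def tokenize(text, split_on_hyphen=True):
    """Lowercase and split text into tokens (toy model, not Elasticsearch's analyzer)."""
    pattern = r"[\s\-]+" if split_on_hyphen else r"\s+"
    return [tok for tok in re.split(pattern, text.lower()) if tok]

alias = "P1.1-MCM3"
query = "MCM3"

# With "-" as a separator the alias yields a standalone "mcm3" token, so the query matches.
print(query.lower() in tokenize(alias))
# Without it, the alias stays a single token and the query no longer matches.
print(query.lower() in tokenize(alias, split_on_hyphen=False))
```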

sirloon commented 5 years ago

The "alias" field comes from the entrez_gene collection, which currently contains 21M documents: