biothings / mygene.info

MyGene.info: A BioThings API for gene annotations
http://mygene.info

Cannot wildcard search for gene IDs #30

Closed bruggsy closed 5 years ago

bruggsy commented 6 years ago

Hello all. Recently I have had trouble doing a wildcard search for gene IDs. For example, the query

http://mygene.info/v3/query?q=10800454%2A

returns

{
  "max_score": 1.55,
  "took": 7,
  "total": 2,
  "hits": [
    { "_id": "9249", "_score": 1.55, "entrezgene": 9249, "name": "dehydrogenase/reductase 3", "symbol": "DHRS3", "taxid": 9606 },
    { "_id": "ENSMUSG00000075014", "_score": 1.3, "name": "predicted gene 10800", "symbol": "Gm10800", "taxid": 10090 }
  ]
}

even though the Entrez gene ID 108000 maps to Cenpf, which is easy to verify. The proposed workaround was to perform a batch query on the _id field, which is not searched by default, as well as on symbol, to allow for a generalized query.

However, it appears that a recent update has made the _id field unsearchable via prefix. The query

http://mygene.info/v3/query?q=_id:1687*%20OR%20symbol:1687*&species=mouse

now returns

{ "success": false, "error": "Could not execute query due to the following exception(s): ['query_shard_exception Can only use prefix queries on keyword and text fields - not on [_id] which is of type [_id]']" }

The other gene ID field, entrezgene, also cannot be searched with a prefix query, since it is of type 'long':

{ "success": false, "error": "Could not execute query due to the following exception(s): ['query_shard_exception Can only use prefix queries on keyword and text fields - not on [entrezgene] which is of type [long]']" }

I would suggest changing this field to a string, which would allow wildcard and prefix queries. Either way, I would love to see this issue fixed, or to hear from a developer who can suggest another workaround. Thanks!

cyrus0824 commented 6 years ago

I agree with @bruggsy on this. Just because NCBI picks an integer as their ID doesn't mean we should necessarily type the field as an integer. The type of the field should ideally be governed by the type of query you would run on that field (Elasticsearch 6 seems more explicit about this). I would guess that "text"-type queries on the entrezgene field (e.g. regular expression, prefix, etc.) are more common than "integer"-type queries (e.g. range query, numerical comparison, etc.), mostly because the ID is really a keyword: its numerical value is meaningless.

cyrus0824 commented 6 years ago

In a pinch, you could emulate a simple prefix query on "entrezgene" with a bunch of range queries. For example, the following emulates a query of 1687* on "entrezgene" using range queries (it only covers up to 5 digits after 1687):

http://mygene.info/v3/query?q=entrezgene:1687 OR entrezgene:[16870 TO 16879] OR entrezgene:[168700 TO 168799] OR entrezgene:[1687000 TO 1687999] OR entrezgene:[16870000 TO 16879999] OR entrezgene:[168700000 TO 168799999] OR symbol:1687*&species=mouse
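The range-query trick above can be generated for any numeric prefix. Here is a minimal sketch in Python; the function names `prefix_as_ranges` and `build_query_url` and the `max_extra_digits` parameter are hypothetical helpers made up for illustration, not part of mygene.info or any client library. It only builds the query string and URL and does not contact the service.

```python
from urllib.parse import urlencode


def prefix_as_ranges(field: str, prefix: str, max_extra_digits: int = 5) -> str:
    """Emulate the prefix query `field:prefix*` on a numeric field.

    Builds one exact-match clause plus one range clause per number of
    extra digits, up to `max_extra_digits` (hypothetical parameter).
    """
    if not (prefix.isascii() and prefix.isdigit()):
        raise ValueError("prefix must consist of ASCII digits only")
    clauses = [f"{field}:{prefix}"]  # the prefix itself, no extra digits
    for n in range(1, max_extra_digits + 1):
        low = prefix + "0" * n   # smallest ID with n extra digits
        high = prefix + "9" * n  # largest ID with n extra digits
        clauses.append(f"{field}:[{low} TO {high}]")
    return " OR ".join(clauses)


def build_query_url(prefix: str, species: str = "mouse",
                    base: str = "http://mygene.info/v3/query") -> str:
    """Combine the emulated entrezgene prefix with a symbol wildcard."""
    q = prefix_as_ranges("entrezgene", prefix) + f" OR symbol:{prefix}*"
    return base + "?" + urlencode({"q": q, "species": species})


if __name__ == "__main__":
    print(prefix_as_ranges("entrezgene", "1687"))
    print(build_query_url("1687"))
```

With the default of 5 extra digits, `prefix_as_ranges("entrezgene", "1687")` produces the same clauses as the hand-written query above. IDs with more extra digits than `max_extra_digits` would still be missed, so this remains a stopgap rather than a true prefix search.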

newgene commented 6 years ago

I agree. Let's index the entrezgene field as a string (the actual values will still appear as integers).
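As a rough illustration of what "indexed as a string" could look like, here is a sketch of an Elasticsearch field mapping written as a Python dict. This is an assumption about the general approach, not the mapping from the actual fix: the variable names and the example document are made up. In Elasticsearch, a `keyword` field supports prefix and wildcard queries, while the stored `_source` keeps whatever JSON value was submitted, so an integer in the document is still returned as an integer.

```python
import json

# Hypothetical mapping fragment: index entrezgene as a keyword (string)
# so that prefix and wildcard queries are allowed on it.
entrezgene_mapping = {
    "properties": {
        "entrezgene": {"type": "keyword"},
    }
}

# Hypothetical source document: the value is still submitted as an
# integer, and _source returns it unchanged.
example_doc = {"entrezgene": 108000, "symbol": "Cenpf", "taxid": 10090}

if __name__ == "__main__":
    print(json.dumps(entrezgene_mapping, indent=2))
    print(json.dumps(example_doc))
```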

sirloon commented 5 years ago

Fixed in dbb1e1258140d26ea8671dff2e6039aa9d88ba6d, available on prod as of build 20180805.