biothings / mygene.info

MyGene.info: A BioThings API for gene annotations
http://mygene.info

Partial results matching #4

Closed cgreene closed 7 years ago

cgreene commented 7 years ago

We are considering using mygene.info to serve as a search backend for genes in the cognoma project front end (more discussion: https://github.com/cognoma/core-service/issues/29#issuecomment-252601701 ). One use case that we have is an autocomplete style query. For this, we'd need partial queries to be supported. Is it possible to enable this with the current API either through the standard querystring or a specific string?

There is a bit more discussion of an ngram field in https://github.com/SuLab/mygene.info/issues/2

Thanks!

dhimmel commented 7 years ago

Copying over relevant comments from #2.

By @cgreene:

@newgene : Would an autocomplete style search be expected to work on this field (or another one)?

By @cgreene:

@newgene : What I'm really asking - is there an ngram tokenizer used for those fields? Trying to figure out if partial queries will return sensible matches. I searched for ngram_filter and didn't find anything in the source.

I poked around in this https://github.com/SuLab/mygene.info/blob/master/src/utils/es.py a bit, but I didn't find anything obvious right off hand and thought you might know.

By @dhimmel:

@cgreene I think you're asking about partial search terms. For example, does https://mygene.info/v3/query?q=alpha-1-B%20glycoprot return a superset of the results that https://mygene.info/v3/query?q=alpha-1-B%20glycoprotein returns? It appears not, but I suggest you open a new issue, since this issue is for searching by alternate names.

newgene commented 7 years ago

@cgreene We currently do not apply an ngram filter, at least not explicitly, when indexing. The autocomplete feature we implemented in this widget is made possible through a wildcard query (adding "*" at the end of the query term), which seems to work just fine.

If there are enough use cases, I'm also considering exposing a prefix query in our services, which is probably more efficient than a wildcard query.

dhimmel commented 7 years ago

@newgene can you give us an example that uses the wildcard (*)? I'm not getting any hits for https://mygene.info/v3/query?q=alpha-1-B%20glycoprot*.

newgene commented 7 years ago

@dhimmel a wildcard query only works when it targets a specific field:

https://mygene.info/v3/query?q=name:alpha-1-B%20glycoprot*
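A minimal sketch of assembling such a field-scoped wildcard URL in Python. The helper name `wildcard_query_url` is hypothetical; the endpoint and the `name` field are taken from the example URL above, and leaving `:` and `*` unescaped simply mirrors that URL rather than any documented requirement.

```python
from urllib.parse import quote

BASE_URL = "https://mygene.info/v3/query"  # endpoint used in the links above


def wildcard_query_url(field: str, prefix: str) -> str:
    """Build a URL that searches for `prefix*` within a single field.

    Hypothetical helper: it only assembles the URL and does not call the API.
    """
    query = f"{field}:{prefix}*"
    # Keep ':' and '*' readable, as in the example URL; spaces become %20.
    return f"{BASE_URL}?q={quote(query, safe=':*')}"


print(wildcard_query_url("name", "alpha-1-B glycoprot"))
```

For the inputs shown, this reproduces the URL given above.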

newgene commented 7 years ago

For now, one alternative for you is to make your own customized query like this:

q=symbol:A1BG^10 OR name:A1BG OR alias:A1BG OR summary:A1BG^0.1
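As a rough illustration, a query string of this shape could be generated from a mapping of fields to boosts. The function name `boosted_query` and the convention of using `None` for "no boost" are assumptions made for this sketch; only single-token terms such as `A1BG` are handled, since the thread does not settle how terms containing spaces should be written.

```python
from typing import Dict, Optional


def boosted_query(term: str, field_boosts: Dict[str, Optional[float]]) -> str:
    """Join one `field:term` clause per field with OR, appending `^boost` where given.

    Hypothetical helper; a boost of None means the clause gets no `^` suffix.
    """
    clauses = []
    for field, boost in field_boosts.items():
        clause = f"{field}:{term}"
        if boost is not None:
            # ':g' drops trailing zeros, so 10 -> "10" and 0.1 -> "0.1".
            clause += f"^{boost:g}"
        clauses.append(clause)
    return " OR ".join(clauses)


print(boosted_query("A1BG", {"symbol": 10, "name": None, "alias": None, "summary": 0.1}))
```

The resulting string still has to be URL-encoded before it is placed in the `q` parameter.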

> @newgene for the customized query, how should we encode queries with spaces or wildcards in them? For example, how would we search for alpha-1-B glycopro* in the symbol, name, or alias with custom weights? I'm struggling with how the URI encoding should be performed.

@dhimmel @cgreene just want to let you know that we are now working on an improvement to our query endpoint, so that queries like this (for autocompletion) are easier to make. It might look like this:

q=alpha-1-B glycoprot&suggest_from=symbol^10,alias,name,summary^0.1

So stay tuned, we should have this rolled out soon. Let us know if you have any other feedback.

dhimmel commented 7 years ago

> So stay tuned, we should have this rolled out soon. Let us know if you have any other feedback.

@newgene, thanks for the great support.

Your suggested syntax looks nice. We will probably also restrict to human and entrez genes like:

q=alpha-1-B glycoprot&suggest_from=symbol^10,alias,name,summary^0.1&species=human&entrezonly=true

Confirming that the wildcard search is implied by specifying suggest_from so we no longer need *.

One more thing, I think we want to make sure we can encode the query term so it's a valid URL. So some guidance on how we should encode the URL would be appreciated. I.e. in javascript do we use encodeURIComponent or encodeURI and on what portion of the URL?

newgene commented 7 years ago

@dhimmel you should only need to encode the value passed to the "q" parameter. To encode it in JavaScript, this might help: http://stackoverflow.com/questions/332872/encode-url-in-javascript
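A small Python sketch of that advice: percent-encode only the value of `q`, then append it to the endpoint. The helper name `encode_q` is hypothetical. In JavaScript the corresponding call would be `encodeURIComponent` applied to the `q` value alone, not `encodeURI` on the whole URL; Python's `quote(..., safe="")` is a roughly analogous choice that escapes a few more characters.

```python
from urllib.parse import parse_qs, quote, urlsplit

BASE_URL = "https://mygene.info/v3/query"  # endpoint used earlier in the thread


def encode_q(q: str) -> str:
    """Percent-encode a raw query so it is safe as the value of `q`.

    safe="" escapes reserved characters such as ':', '^', '*' and spaces.
    """
    return quote(q, safe="")


raw = "symbol:A1BG^10 OR name:A1BG"
url = f"{BASE_URL}?q={encode_q(raw)}"
print(url)

# Decoding the query string recovers the original value unchanged.
decoded = parse_qs(urlsplit(url).query)["q"][0]
print(decoded == raw)
```

Encoding only the parameter value keeps the `?`, `&` and `=` separators of the URL intact while ensuring that characters inside the query cannot be mistaken for them.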

newgene commented 7 years ago

@cgreene @dhimmel, it took us a while, but we have now rolled out (thanks to @cyrus0824's hard work) a new "user queries" feature on MyGene.info, which is highly relevant to the use case in this issue.

Basically, we now allow users to define a customized query (aka "user query") to fit very specific use cases that the default query feature cannot satisfy perfectly. This is how it works:

All right, let us know what you guys think. The example "prefix" user query was added pretty much for the specific use case you mentioned in this issue. Note that we boosted symbol matches as well. Feel free to make changes to fit what you want (we can add you two to this repo with write permission).

cgreene commented 7 years ago

Thanks @newgene! @bdolly was working yesterday to handle gene search for the cognoma web app. I'll tag him here so he gets notified.