Closed cgreene closed 7 years ago
Copying over relevant comments from #2.
By @cgreene:
@newgene : Would an autocomplete style search be expected to work on this field (or another one)?
By @cgreene:
@newgene : What I'm really asking - is there an ngram tokenizer used for those fields? Trying to figure out if partial queries will return sensible matches. I searched for ngram_filter and didn't find anything in the source.
I poked around in this https://github.com/SuLab/mygene.info/blob/master/src/utils/es.py a bit, but I didn't find anything obvious right off hand and thought you might know.
By @dhimmel:
@cgreene I think you're asking about partial search terms. For example, does https://mygene.info/v3/query?q=alpha-1-B%20glycoprot return a superset of the results that https://mygene.info/v3/query?q=alpha-1-B%20glycoprotein returns? It appears not, but I suggest you open a new issue, since this issue is for searching by alternate names.
@cgreene We currently do not apply an ngram filter, at least not explicitly, when indexing. The autocomplete feature we implemented in this widget is made possible through a wildcard query (by adding "*" at the end of the query term), which seems to work just fine.
If there are enough use cases, I'm also considering exposing a prefix query in our services, which is probably more efficient than a wildcard query.
@newgene can you give us an example that uses the wildcard (*)? I'm not getting any hits for https://mygene.info/v3/query?q=alpha-1-B%20glycoprot*.
@dhimmel a wildcard query works only on a specific field:
For now, one alternative for you is to make your own customized query like this:
q=symbol:A1BG^10 OR name:A1BG OR alias:A1BG OR summary:A1BG^0.1
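A boosted, per-field query string like the one above can be assembled programmatically. The following is a minimal JavaScript sketch; the helper name `buildFieldedQuery` and the shape of its `fieldBoosts` argument are illustrative assumptions, not part of the MyGene.info API. It only builds the unencoded value for the "q" parameter.

```javascript
// Hypothetical helper: builds "field:term^boost OR ..." from a map of
// field names to boost factors. A boost of 1 is written without "^1".
function buildFieldedQuery(term, fieldBoosts) {
  return Object.entries(fieldBoosts)
    .map(([field, boost]) =>
      boost === 1 ? `${field}:${term}` : `${field}:${term}^${boost}`
    )
    .join(" OR ");
}

// Reproduces the query value from the comment above.
const fieldedQuery = buildFieldedQuery("A1BG", {
  symbol: 10,
  name: 1,
  alias: 1,
  summary: 0.1,
});
console.log(fieldedQuery);
// symbol:A1BG^10 OR name:A1BG OR alias:A1BG OR summary:A1BG^0.1
```

The result still has to be URL-encoded before it is placed in a request URL.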
@newgene for the customized query, how should we encode queries with spaces or wildcards in them? For example, how would we search for alpha-1-B glycopro* in the symbol, name, or alias with custom weights? I'm struggling with how the URI encoding should be performed.
@dhimmel @cgreene just want to let you know that we are now working on an improvement to our query endpoint, so that you can make such queries (for autocompletion) more easily. It might look like this:
q=alpha-1-B glycoprot&suggest_from=symbol^10,alias,name,summary^0.1
So stay tuned, we should have this rolled out soon. Let us know if you have any other feedback.
@newgene, thanks for the great support.
Your suggested syntax looks nice. We will probably also restrict to human and entrez genes like:
q=alpha-1-B glycoprot&suggest_from=symbol^10,alias,name,summary^0.1&species=human&entrezonly=true
Confirming that the wildcard search is implied by specifying suggest_from, so we no longer need *.
One more thing, I think we want to make sure we can encode the query term so it's a valid URL. So some guidance on how we should encode the URL would be appreciated. I.e. in JavaScript do we use encodeURIComponent or encodeURI, and on what portion of the URL?
@dhimmel you should only need to encode the value passed to the "q" parameter. To encode in JavaScript, this might help: http://stackoverflow.com/questions/332872/encode-url-in-javascript
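As a concrete illustration of encoding only the parameter values, here is a minimal JavaScript sketch using the standard `encodeURIComponent` function. The helper name `buildQueryUrl` is a hypothetical choice for this example, and the `species=human` parameter is taken from the earlier comment in this thread. `encodeURI` is not used on the value because it leaves characters such as ":" and "&" unescaped.

```javascript
const QUERY_ENDPOINT = "https://mygene.info/v3/query";

// Hypothetical helper: percent-encodes each parameter value (and key)
// with encodeURIComponent and joins them into a query URL.
function buildQueryUrl(q, extraParams = {}) {
  const parts = [`q=${encodeURIComponent(q)}`];
  for (const [key, value] of Object.entries(extraParams)) {
    parts.push(`${encodeURIComponent(key)}=${encodeURIComponent(value)}`);
  }
  return `${QUERY_ENDPOINT}?${parts.join("&")}`;
}

// Spaces become %20; "*" is left as-is by encodeURIComponent.
const wildcardUrl = buildQueryUrl("alpha-1-B glycoprot*");
console.log(wildcardUrl);
// https://mygene.info/v3/query?q=alpha-1-B%20glycoprot*

// ":" and "^" in a fielded query are escaped as %3A and %5E.
const fieldedUrl = buildQueryUrl("symbol:A1BG^10 OR name:A1BG", {
  species: "human",
});
console.log(fieldedUrl);
// https://mygene.info/v3/query?q=symbol%3AA1BG%5E10%20OR%20name%3AA1BG&species=human
```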
@cgreene @dhimmel, it took us a while, but we now have rolled out (thanks to @cyrus0824 's hard work) a new feature of "user queries" to MyGene.info, which is highly relevant to the use case in this issue.
Basically, we now allow users to define a customized query (aka "user query") to fit very specific use cases that the default query feature cannot satisfy perfectly. This is how it works:
First, all user queries will be stored/versioned in this repo. The "mygene" folder holds the user queries for mygene.info.
Each folder under "mygene" is a customized query, and the text file "query.txt" in it contains the query written by the user. See this example.
You will need to know the Elasticsearch query syntax to write a user query. Note that "{{q}}" will be replaced by the value passed in the "q" query parameter.
Users can submit a user query via pull request, or we can give some users commit rights to this repo.
For security reasons, we won't automatically deploy the master branch to production. We will double-check users' commits and make changes with them if needed, then merge the changes into a "production" branch for deployment.
Once it's deployed, users can pass a "userquery" via URL, e.g.:
http://mygene.info/v3/query?q=a1bg&userquery=prefix
To see the actual Elasticsearch query that was executed:
http://mygene.info/v3/query?q=a1bg&userquery=prefix&rawquery=1
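For an autocomplete client, the two URLs above can be generated with the standard `URL` and `URLSearchParams` APIs. This is a minimal sketch; the function name `userQueryUrl` and its `raw` option are assumptions made for this example. Note that `URLSearchParams` encodes spaces as "+" rather than "%20".

```javascript
// Hypothetical helper: builds a query URL that selects a named user query.
// When `raw` is true, rawquery=1 is appended so the service returns the
// Elasticsearch query instead of the hits, as described above.
function userQueryUrl(q, userquery, { raw = false } = {}) {
  const url = new URL("http://mygene.info/v3/query");
  url.searchParams.set("q", q);
  url.searchParams.set("userquery", userquery);
  if (raw) {
    url.searchParams.set("rawquery", "1");
  }
  return url.toString();
}

const prefixUrl = userQueryUrl("a1bg", "prefix");
const rawUrl = userQueryUrl("a1bg", "prefix", { raw: true });
console.log(prefixUrl); // http://mygene.info/v3/query?q=a1bg&userquery=prefix
console.log(rawUrl); // http://mygene.info/v3/query?q=a1bg&userquery=prefix&rawquery=1
```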
A couple of extra notes:
It's possible to pass additional variables from the URL to a user query template. For example, you can use "{{test}}" in the template and pass it from the URL as "uqtest". The "uq" prefix is required for every variable except "q", to avoid conflicts with other possible parameters.
It's also possible to pass a customized filter by adding a "filter.txt" to the user query folder. A possible use case is a website that focuses on only a subset of genes (e.g. all kinases) and wants to implement something like an autocompleting input box. Users can then include all of the gene ids in the filter.
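The placeholder rule described in these notes ("{{q}}" comes from "q", any other "{{name}}" comes from "uqname") can be sketched as follows. This is only an illustration of the naming convention, not the actual MyGene.info implementation: the function `renderUserQuery` and the template string are made up for this example, and the substitution is naive (values are inserted without any escaping).

```javascript
// Hypothetical sketch: fill {{placeholders}} in a user query template
// from URL parameters. "{{q}}" reads the "q" parameter; every other
// placeholder "{{name}}" reads the "uq"-prefixed parameter "uqname".
// Placeholders with no matching parameter are left untouched.
function renderUserQuery(template, params) {
  return template.replace(/\{\{(\w+)\}\}/g, (match, name) => {
    const key = name === "q" ? "q" : `uq${name}`;
    return Object.prototype.hasOwnProperty.call(params, key)
      ? String(params[key])
      : match;
  });
}

// Made-up template, for illustration only.
const exampleTemplate = '{"prefix": {"symbol": "{{q}}"}, "test": "{{test}}"}';

// Corresponds to a URL such as ...?q=a1bg&uqtest=demo
const rendered = renderUserQuery(exampleTemplate, { q: "a1bg", uqtest: "demo" });
console.log(rendered);
// {"prefix": {"symbol": "a1bg"}, "test": "demo"}
```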
All right, let us know what you guys think. The example "prefix" user query was added largely for the specific use case you mentioned in this issue. Note that we boosted symbol matches as well. Feel free to make changes to fit what you want (we can add you two to this repo with write permission).
Thanks @newgene! @bdolly was working yesterday to handle gene search for the cognoma web app. I'll tag him here so he gets notified.
We are considering using mygene.info to serve as a search backend for genes in the cognoma project front end (more discussion: https://github.com/cognoma/core-service/issues/29#issuecomment-252601701 ). One use case that we have is an autocomplete style query. For this, we'd need partial queries to be supported. Is it possible to enable this with the current API either through the standard querystring or a specific string?
There is a bit more discussion of an ngram field in https://github.com/SuLab/mygene.info/issues/2
Thanks!