PathwayCommons / cpath2

Biological pathway data integration and access platform (Pathway Commons)
http://www.pathwaycommons.org/pc2/
MIT License

A search query fails (beta PC9) #269

Closed IgorRodchenkov closed 7 years ago

IgorRodchenkov commented 7 years ago

This works: http://www.pathwaycommons.org/pc2/search?q=%22R-HSA-452723%22&type=pathway

This fails (PC9 depends on the latest Lucene 6, and so went through some refactoring): http://beta.pathwaycommons.org/pc2/search?q=%22R-HSA-452723%22&type=pathway

PS: The result is the same when you escape '-' as '\-'.

IgorRodchenkov commented 7 years ago

This seems to have started once I upgraded to the Lucene 6 library. This query also fails with a similar message.

IgorRodchenkov commented 7 years ago

Fixed! It no longer fails, and the ranking of search hits has improved! This was done by using Lucene's KeywordAnalyzer instead of StandardAnalyzer for the 'name' and 'xrefid' fields (which are now StringField, unlike 'keyword', 'datasource', and 'organism', which are TextField).
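
To illustrate why the choice of analyzer matters for an ID such as R-HSA-452723, here is a toy Python model. It is not Lucene and not the cpath2 code; the function names are made up for this sketch. It assumes a StandardAnalyzer-like tokenizer lowercases the value and splits it on anything that is not a letter or digit, while a KeywordAnalyzer-like one keeps the whole value as a single token (the lowercasing there is also an assumption of the sketch).

```python
import re

def standard_like_tokens(value):
    # Rough approximation only: lowercase, then keep runs of letters/digits,
    # so hyphens, commas and spaces all act as token boundaries.
    return re.findall(r"[a-z0-9]+", value.lower())

def keyword_like_tokens(value):
    # The whole field value becomes one token (lowercased here by assumption).
    return [value.lower()]

print(standard_like_tokens("R-HSA-452723"))  # ['r', 'hsa', '452723']
print(keyword_like_tokens("R-HSA-452723"))   # ['r-hsa-452723']
```

With the single-token form, an ID or a full name can only match as a whole, which is consistent with the exact-match behaviour reported for the 'name' and 'xrefid' fields.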

Let me quickly explain how it works now, using search by 'name' as the example (search by 'xrefid' is similar; everything is case insensitive, which might not be a good idea for IDs but was a fair trade-off, among other minor issues...):

http://beta.pathwaycommons.org/pc2/search?q="cell%20cycle"&type=pathway

Here "cell cycle" matches exactly, at any position, in the name or comment of a pathway or its sub-processes (not going too deep, though).

Next:

http://beta.pathwaycommons.org/pc2/search?q=name:"cell%20cycle"&type=pathway

"cell cycle", when prefixed and quoted, matches one of a pathway's names exactly (i.e., "cell cycle, mitotic" won't match).

Next example:

http://beta.pathwaycommons.org/pc2/search?q=name:cell\%20cycle,\%20mitotic&type=pathway

When a query string is prefixed but not quoted, it matches a pathway name exactly (not partially). Note how spaces (and other "special" characters) are escaped here with a backslash; if they were not, the query would be translated into: name:cell OR cycle, OR mitotic, where the latter two terms can match in any field, not only in name.
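
A minimal Python sketch of building such an escaped, prefixed query is shown here. The helper names (`escape_term`, `name_query_url`) are hypothetical, and the set of escaped characters is an assumption: the special characters of the Lucene classic query syntax, plus the space. Note that `urllib.parse.quote` also percent-encodes the backslash itself (as `%5C`), which the server should decode back before parsing the query.

```python
from urllib.parse import quote

# Assumed set: Lucene classic query syntax special characters, plus the space.
SPECIAL = set('+-&|!(){}[]^"~*?:\\/ ')

def escape_term(text):
    # Put a backslash in front of every character from the assumed set.
    return "".join("\\" + ch if ch in SPECIAL else ch for ch in text)

def name_query_url(name, base="http://beta.pathwaycommons.org/pc2/search"):
    # Prefix with the field name, then percent-encode the whole query string.
    q = "name:" + escape_term(name)
    return base + "?q=" + quote(q, safe="") + "&type=pathway"

print(escape_term("cell cycle, mitotic"))     # cell\ cycle,\ mitotic
print(name_query_url("cell cycle, mitotic"))
```

Do not pass a wildcard query through `escape_term` as written, because it would escape the `*` characters as well.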

Finally:

http://beta.pathwaycommons.org/pc2/search?q=name:*cell\%20cycle*&type=pathway

A prefixed, unquoted query string with wildcard symbols matches anywhere within a pathway name!
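
The difference between the exact and the wildcard form on the 'name' field can be mimicked with a toy Python model. This uses shell-style matching from the standard library, not Lucene, and the sample names are illustrative only (they are not taken from the index).

```python
from fnmatch import fnmatchcase

def name_matches(pattern, name):
    # Whole-value, case-insensitive match; '*' stands for any run of characters.
    return fnmatchcase(name.lower(), pattern.lower())

# Illustrative sample names.
names = ["Cell Cycle", "Cell Cycle, Mitotic", "Mitotic cell cycle checkpoint"]

# Like name:"cell cycle" -> only the exact name matches.
print([n for n in names if name_matches("cell cycle", n)])
# Like name:*cell\ cycle* -> any name containing the phrase matches.
print([n for n in names if name_matches("*cell cycle*", n)])
```

The first call prints only "Cell Cycle"; the second prints all three names.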

@gbader @mj3cheun @jvwong @ugurdogrusoz @cannin @d2fong @emekdemir @ozgunbabur Enjoy, try more...