Open bcampbell opened 9 years ago
So, looking at this issue again with fresh eyes it seems to me like it's functioning as desired. Our whole search works by applying the same analyzer to search terms and documents. The "en" analyzer turns "mother-in-law" into "mother" and "law". It's not clear to me how we would treat this as some sort of obvious exception.
I will try it with Elasticsearch and see what it does.
I'd agree that it's a little obscure (although my users have definitely been really confused by this in the past).
There are two levels of string-splitting going on: the first is where the query parser breaks the input using whitespace. The second is in the Analyser, which could potentially break a string up further, into multiple terms. I think the user intuitively understands that whitespace breaks things up. But I don't think they realise the Analyser might break things up further.
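To make the two levels concrete, here is a rough stdlib-only sketch. It is not bleve code: `splitQuery`, `analyze`, the `Token` type and the four-word stop list are all made up for illustration, and the toy analyser skips the stemming that the real "en" analyser also does.

```go
package main

import (
	"fmt"
	"strings"
	"unicode"
)

// stopwords is a tiny illustrative stop list (an assumption, not the real "en" list).
var stopwords = map[string]bool{"in": true, "the": true, "a": true, "of": true}

// Token is a term plus its zero-based position within the analysed text.
type Token struct {
	Term string
	Pos  int
}

// splitQuery mimics the first level: the query parser breaks the input on whitespace.
func splitQuery(q string) []string {
	return strings.Fields(q)
}

// analyze mimics the second level: lowercase, split on anything that is not a
// letter or digit, and drop stop words while keeping the position gap they leave.
func analyze(text string) []Token {
	words := strings.FieldsFunc(strings.ToLower(text), func(r rune) bool {
		return !unicode.IsLetter(r) && !unicode.IsDigit(r)
	})
	var out []Token
	for i, w := range words {
		if stopwords[w] {
			continue // the stop word is discarded, but later positions are not renumbered
		}
		out = append(out, Token{Term: w, Pos: i})
	}
	return out
}

func main() {
	// Level 1 yields two chunks; level 2 splits the first chunk again.
	for _, chunk := range splitQuery("ill-gotten gains") {
		fmt.Println(chunk, "->", analyze(chunk))
	}
}
```

The user only sees the first split. The second one happens out of sight, which is where the surprise comes from.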
I think the easy and intuitive thing to do is for QueryStringQuery to just use MatchPhraseQuery instead of MatchQuery.
So the query:
ill-gotten gains
Would be treated as:
"ill-gotten" "gains"
thus preserving the ordering of "ill" and "gotten" (using the default 'en' analyser), but not really making any difference to "gains".
I can't think of any cases where using MatchPhraseQuery by default would screw things up, but obviously I could be really wrong about that ;-)
My main concern would be a possible performance implication, but I'd hope that a single-term MatchPhraseQuery would be equivalent to a MatchQuery anyway...
Just for background - as mentioned in the original mailing list thread, my motivation is for matching URLs in fields. My users really expect that:
url:/sport-section/
Would match /articles/sport-section/latest-cricket-scandal
but definitely not /articles/politics-section/terrorists-are-just-bad-sports/
I could try and train them that whitespace isn't the only place where strings are broken up and that they need to quote stuff like this, but it seems better to make the default behaviour "feel" right, if possible.
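Here is a small sketch of the difference for the URL case. It is plain Go rather than bleve, and `tokenize`, `matchAny` and `matchPhrase` are hypothetical helpers that only approximate what MatchQuery and MatchPhraseQuery do (no stemming, no stop words).

```go
package main

import (
	"fmt"
	"strings"
	"unicode"
)

// tokenize lowercases the input and splits on anything that is not a letter or digit.
func tokenize(s string) []string {
	return strings.FieldsFunc(strings.ToLower(s), func(r rune) bool {
		return !unicode.IsLetter(r) && !unicode.IsDigit(r)
	})
}

// matchAny approximates the current behaviour: any single query term is enough.
func matchAny(query, doc string) bool {
	have := map[string]bool{}
	for _, t := range tokenize(doc) {
		have[t] = true
	}
	for _, t := range tokenize(query) {
		if have[t] {
			return true
		}
	}
	return false
}

// matchPhrase approximates the proposed behaviour: the query terms must appear
// in the document consecutively and in the same order.
func matchPhrase(query, doc string) bool {
	q, d := tokenize(query), tokenize(doc)
	if len(q) == 0 {
		return false
	}
	for i := 0; i+len(q) <= len(d); i++ {
		ok := true
		for j := range q {
			if d[i+j] != q[j] {
				ok = false
				break
			}
		}
		if ok {
			return true
		}
	}
	return false
}

func main() {
	query := "/sport-section/"
	docs := []string{
		"/articles/sport-section/latest-cricket-scandal",
		"/articles/politics-section/terrorists-are-just-bad-sports/",
	}
	for _, d := range docs {
		fmt.Printf("%s any=%v phrase=%v\n", d, matchAny(query, d), matchPhrase(query, d))
	}
}
```

In this sketch the politics URL still matches under any-term semantics, because it shares the term "section" with the query. Only the phrase version rejects it.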
(My alternate query parser already uses MatchPhraseQuery by default. I've not noticed any problems, but then I doubt it's been given the workout that QueryStringQuery has had.)
The QueryStringQuery behaviour doesn't always give the results you'd expect. (this is a follow-up of: https://groups.google.com/forum/#!topic/bleve/cxVfZ7VQh3o )
Observed behaviour:
Using the default "en" analyser, you'd expect an unquoted query like:
mother-in-law
to be treated as a single 'thing' and to match as a phrase. Instead, the query is treated as mother OR law (the "in" is discarded as a stopword).
Expected behaviour:
mother-in-law in the above example should be treated as a phrase. The "en" analyzer splits mother-in-law up into [mother (pos 0), law (pos 2)], and the query should be a MatchPhraseQuery instead of the current MatchQuery.
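As a sketch of what "match as a phrase" could mean once the stop word has left a gap: the query tokens have to appear in the document at the same relative positions. This is stdlib-only Go, not bleve internals. `Token`, `analyze`, `phraseMatch` and the three-word stop list are assumptions for illustration, and positions are zero-based to follow the numbering above.

```go
package main

import (
	"fmt"
	"strings"
	"unicode"
)

// Token is a term plus its zero-based position in the analysed text.
type Token struct {
	Term string
	Pos  int
}

// stopwords is a tiny illustrative stop list (an assumption, not the real "en" list).
var stopwords = map[string]bool{"in": true, "of": true, "the": true}

// analyze lowercases, splits on non-alphanumerics, and drops stop words
// without renumbering, so "mother-in-law" becomes mother@0, law@2.
func analyze(text string) []Token {
	words := strings.FieldsFunc(strings.ToLower(text), func(r rune) bool {
		return !unicode.IsLetter(r) && !unicode.IsDigit(r)
	})
	var out []Token
	for i, w := range words {
		if !stopwords[w] {
			out = append(out, Token{Term: w, Pos: i})
		}
	}
	return out
}

// phraseMatch reports whether every query token occurs in doc at the same
// offset from the first query token as it has in the query itself.
func phraseMatch(query, doc []Token) bool {
	if len(query) == 0 {
		return false
	}
	// Index document positions by term.
	positions := map[string]map[int]bool{}
	for _, t := range doc {
		if positions[t.Term] == nil {
			positions[t.Term] = map[int]bool{}
		}
		positions[t.Term][t.Pos] = true
	}
	first := query[0]
	for start := range positions[first.Term] {
		ok := true
		for _, q := range query[1:] {
			if !positions[q.Term][start+q.Pos-first.Pos] {
				ok = false
				break
			}
		}
		if ok {
			return true
		}
	}
	return false
}

func main() {
	q := analyze("mother-in-law")
	for _, doc := range []string{
		"My mother-in-law called",
		"The law protects every mother",
	} {
		fmt.Printf("%q phrase=%v\n", doc, phraseMatch(q, analyze(doc)))
	}
}
```

Under the current OR semantics both sample documents would match, since each contains "mother" and "law" somewhere. With the position check only the first one does.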