blevesearch / bleve

A modern text/numeric/geo-spatial/vector indexing library for Go
Apache License 2.0

QueryStringQuery doesn't properly account for analysers splitting up strings #248

Open bcampbell opened 9 years ago

bcampbell commented 9 years ago

The QueryStringQuery behaviour doesn't always give the results you'd expect. (This is a follow-up to: https://groups.google.com/forum/#!topic/bleve/cxVfZ7VQh3o )

Observed behaviour:

Using the default "en" analyser, you'd expect an unquoted query like: mother-in-law to be treated as a single 'thing' and to match as a phrase. Instead, the query is treated as mother OR law (the "in" is discarded as a stopword).

Expected behaviour:

mother-in-law in the above example should be treated as a phrase. The "en" analyzer splits mother-in-law up into [mother (pos 0), law (pos 2)], and the query should be a MatchPhraseQuery instead of the current MatchQuery.
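
To make the difference concrete, here is a minimal stdlib-only sketch. It does not call bleve: `analyze`, `matchAny`, `matchPhrase` and the four-word stopword list are hypothetical stand-ins for the "en" analyser, MatchQuery and MatchPhraseQuery, so treat it as an illustration of the semantics rather than of bleve's implementation.

```go
package main

import (
	"fmt"
	"strings"
	"unicode"
)

// token is an analysed term plus the position it had in the original text.
type token struct {
	term string
	pos  int
}

// stopwords is a tiny illustrative list, not bleve's real "en" stopword list.
var stopwords = map[string]bool{"in": true, "the": true, "a": true, "of": true}

// analyze is a toy analyser: lowercase, split on anything that is not a
// letter or digit, then drop stopwords while keeping original positions.
func analyze(text string) []token {
	words := strings.FieldsFunc(strings.ToLower(text), func(r rune) bool {
		return !unicode.IsLetter(r) && !unicode.IsDigit(r)
	})
	var out []token
	for i, w := range words {
		if !stopwords[w] {
			out = append(out, token{term: w, pos: i})
		}
	}
	return out
}

// matchAny mimics OR semantics: one shared term is enough for a hit.
func matchAny(query, doc string) bool {
	docTerms := map[string]bool{}
	for _, t := range analyze(doc) {
		docTerms[t.term] = true
	}
	for _, q := range analyze(query) {
		if docTerms[q.term] {
			return true
		}
	}
	return false
}

// matchPhrase mimics phrase semantics: every query term must appear in the
// document at the same relative offset, stopword gaps included.
func matchPhrase(query, doc string) bool {
	q := analyze(query)
	if len(q) == 0 {
		return false
	}
	d := analyze(doc)
	docPos := map[int]string{}
	for _, t := range d {
		docPos[t.pos] = t.term
	}
	for _, start := range d {
		ok := true
		for _, qt := range q {
			if docPos[start.pos+qt.pos-q[0].pos] != qt.term {
				ok = false
				break
			}
		}
		if ok {
			return true
		}
	}
	return false
}

func main() {
	fmt.Println(analyze("mother-in-law")) // [{mother 0} {law 2}]
	for _, doc := range []string{
		"My mother-in-law called",
		"Her mother practises family law",
	} {
		fmt.Printf("%q any=%v phrase=%v\n", doc,
			matchAny("mother-in-law", doc), matchPhrase("mother-in-law", doc))
	}
}
```

In this toy model both documents are hits under OR semantics, but only the first is a hit under phrase semantics. Note that the position gap left by the stopword means the phrase form would still accept any single word between "mother" and "law".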

mschoch commented 8 years ago

So, looking at this issue again with fresh eyes it seems to me like it's functioning as desired. Our whole search works by applying the same analyzer to search terms and documents. The "en" analyzer turns "mother-in-law" into "mother" and "law". It's not clear to me how we would treat this as some sort of obvious exception.

I will try it with Elasticsearch and see what it does.

bcampbell commented 8 years ago

I'd agree that it's a little obscure (although my users have definitely been really confused by this in the past).

There are two levels of string-splitting going on: the first is where the query parser breaks the input using whitespace. The second is in the Analyser, which could potentially break a string up further, into multiple terms. I think the user intuitively understands that whitespace breaks things up. But I don't think they realise the Analyser might break things up further.

I think the easy and intuitive thing to do is for QueryStringQuery to just use MatchPhraseQuery instead of MatchQuery.

So the query:

ill-gotten gains

Would be treated as:

"ill-gotten" "gains"

thus preserving the ordering of "ill" and "gotten" (using the default 'en' analyser), but not really making any difference to "gains". I can't think of any cases where using MatchPhraseQuery by default would screw things up, but obviously I could be really wrong about that ;-) My main concern would be a possible performance implication, but I'd hope that a single-term MatchPhraseQuery would be equivalent to a MatchQuery anyway...
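
As a rough sketch of the two levels of splitting and of the proposed rewrite: `analyzeChunk` and `quoteChunks` below are hypothetical helpers, not part of bleve. The second-level split is approximated by breaking on non-alphanumerics, with no stopword removal or stemming.

```go
package main

import (
	"fmt"
	"strings"
	"unicode"
)

// analyzeChunk is a toy second-level split: lowercase, then break on anything
// that is not a letter or digit. A real analyser would do more than this.
func analyzeChunk(chunk string) []string {
	return strings.FieldsFunc(strings.ToLower(chunk), func(r rune) bool {
		return !unicode.IsLetter(r) && !unicode.IsDigit(r)
	})
}

// quoteChunks applies the proposal at the string level: split on whitespace
// (the first level), then wrap each chunk in quotes so that it would be read
// as a phrase. Chunks that already contain a quote are left untouched.
func quoteChunks(query string) string {
	chunks := strings.Fields(query)
	for i, c := range chunks {
		if !strings.Contains(c, `"`) {
			chunks[i] = `"` + c + `"`
		}
	}
	return strings.Join(chunks, " ")
}

func main() {
	query := "ill-gotten gains"
	// First level: whitespace. Second level: the (toy) analyser.
	for _, chunk := range strings.Fields(query) {
		fmt.Printf("%q -> %q\n", chunk, analyzeChunk(chunk))
	}
	fmt.Println(quoteChunks(query))
}
```

This deliberately ignores field prefixes such as `url:` and operators such as `+` and `-`. A real change would presumably live inside the query parser, choosing the query type per chunk rather than rewriting the string.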

bcampbell commented 8 years ago

Just for background - as mentioned in the original mailing list thread, my motivation is for matching URLs in fields. My users really expect that:

url:/sport-section/

Would match /articles/sport-section/latest-cricket-scandal but definitely not /articles/politics-section/terrorists-are-just-bad-sports/

I could try and train them that whitespace isn't the only place where strings are broken up and that they need to quote stuff like this, but it seems better to make the default behaviour "feel" right, if possible.

(My alternate query parser already uses MatchPhraseQuery by default. I've not noticed any problems, but then I doubt it's been given the workout that QueryStringQuery has had.)
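
The URL case above can be sketched the same way. This is a simplified model under stated assumptions: `terms`, `anyTerm` and `inOrder` are hypothetical names, the tokenizer only lowercases and splits on non-alphanumerics, and there is no stemming (a stemming analyser might additionally reduce "sports" to "sport", which would give the second URL another OR hit).

```go
package main

import (
	"fmt"
	"strings"
	"unicode"
)

// terms is a simplified tokenizer: lowercase and split on non-alphanumerics.
func terms(s string) []string {
	return strings.FieldsFunc(strings.ToLower(s), func(r rune) bool {
		return !unicode.IsLetter(r) && !unicode.IsDigit(r)
	})
}

// anyTerm reports whether doc shares at least one term with query (OR semantics).
func anyTerm(query, doc string) bool {
	seen := map[string]bool{}
	for _, t := range terms(doc) {
		seen[t] = true
	}
	for _, t := range terms(query) {
		if seen[t] {
			return true
		}
	}
	return false
}

// inOrder reports whether the query's terms occur in doc as one contiguous
// run, in the same order (phrase semantics).
func inOrder(query, doc string) bool {
	q, d := terms(query), terms(doc)
	if len(q) == 0 {
		return false
	}
	for i := 0; i+len(q) <= len(d); i++ {
		match := true
		for j := range q {
			if d[i+j] != q[j] {
				match = false
				break
			}
		}
		if match {
			return true
		}
	}
	return false
}

func main() {
	query := "/sport-section/"
	for _, url := range []string{
		"/articles/sport-section/latest-cricket-scandal",
		"/articles/politics-section/terrorists-are-just-bad-sports/",
	} {
		fmt.Printf("%s any=%v phrase=%v\n", url, anyTerm(query, url), inOrder(query, url))
	}
}
```

Under OR semantics both URLs are hits, because "section" appears in each. Under phrase semantics only the first one is, which is the behaviour users expect here.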