blevesearch / bleve

A modern text/numeric/geo-spatial/vector indexing library for go
Apache License 2.0
10.1k stars 686 forks

Use Lucene query syntax for Query string query syntax as much as possible #151

Open gnewton opened 9 years ago

gnewton commented 9 years ago

Unless there is a compelling reason not to, could bleve's query string query syntax (https://github.com/blevesearch/bleve/wiki/Query-String-Query)
adhere to the Lucene syntax as defined in http://lucene.apache.org/core/3_5_0/queryparsersyntax.html?

Differences between them that I see now:

Other things not yet implemented by bleve:

mschoch commented 9 years ago

Yes, we are trying to adhere to this specification. Range query syntax was recently added, which is why it is incomplete (it only supports numeric ranges right now).

Proximity and wildcard searches are not implemented yet, so we cannot yet support them in the query syntax.

Fuzzy syntax is implemented, though we interpret the numerical argument as the edit distance. This is actually what Lucene does in more recent versions (in the doc you linked it's still a float, which most people didn't understand).
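To make the "numerical argument as edit distance" reading concrete, here is a minimal sketch in plain Go. It assumes Levenshtein distance as the metric and a `term~N` suffix form; `parseFuzzy`, `levenshtein` and the default fuzziness are hypothetical names for illustration, not bleve's API, and bleve's real matching may work differently.

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// parseFuzzy splits a query-string term such as "lemon~2" into the bare term
// and an integer edit distance. A bare "~" falls back to defaultFuzziness.
// Hypothetical helper for illustration; not part of bleve.
func parseFuzzy(s string, defaultFuzziness int) (term string, fuzziness int, ok bool) {
	i := strings.LastIndex(s, "~")
	if i <= 0 {
		return s, 0, false // no "~" suffix, or nothing before it
	}
	term, arg := s[:i], s[i+1:]
	if arg == "" {
		return term, defaultFuzziness, true
	}
	n, err := strconv.Atoi(arg)
	if err != nil || n < 0 {
		return s, 0, false // e.g. an old-style float such as "0.5"
	}
	return term, n, true
}

// levenshtein returns the number of single-rune insertions, deletions and
// substitutions needed to turn a into b.
func levenshtein(a, b string) int {
	ra, rb := []rune(a), []rune(b)
	prev := make([]int, len(rb)+1)
	for j := range prev {
		prev[j] = j
	}
	for i := 1; i <= len(ra); i++ {
		cur := make([]int, len(rb)+1)
		cur[0] = i
		for j := 1; j <= len(rb); j++ {
			cost := 1
			if ra[i-1] == rb[j-1] {
				cost = 0
			}
			cur[j] = min3(prev[j]+1, cur[j-1]+1, prev[j-1]+cost)
		}
		prev = cur
	}
	return prev[len(rb)]
}

func min3(a, b, c int) int {
	if b < a {
		a = b
	}
	if c < a {
		a = c
	}
	return a
}

func main() {
	term, fuzz, _ := parseFuzzy("lemon~1", 1)
	for _, candidate := range []string{"lemon", "melon", "lemons"} {
		// A candidate matches when it is within fuzz edits of the term.
		fmt.Println(candidate, levenshtein(term, candidate) <= fuzz)
	}
}
```

With an integer argument the meaning is easy to state: `lemon~1` accepts anything at most one edit away, so "lemons" qualifies while "melon" (two substitutions) does not.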

Also, in more recent versions Lucene has added support for multiple syntaxes, so I don't think there is a single "lucene" syntax any more. However, here is a link to a more recent version that I've been trying to follow: http://lucene.apache.org/core/4_10_3/core/index.html

gnewton commented 9 years ago

That is great to hear!
Apologies for referencing the incorrect documentation. Here is what the Lucene people are calling the "classic" syntax: http://lucene.apache.org/core/4_10_3/queryparser/org/apache/lucene/queryparser/classic/package-summary.html#package_description

Also: http://lucene.apache.org/core/4_10_3/queryparser/org/apache/lucene/queryparser/flexible/standard/StandardQueryParser.html

Thanks for the great work! :-)

bcampbell commented 9 years ago

I've started some work on this. It's still at the mucking-about stage (and I've not yet signed a CLA), but you can track what I'm doing here:

https://github.com/bcampbell/bleve/tree/queryparser

The features I'm particularly interested in supporting are boolean expressions and date ranges, eg:

tags:( (orange AND lemon) OR citrus) published:[2014-01-01 TO 2014-01-07]

mschoch commented 9 years ago

Great, two quick thoughts:

  1. When I looked into the date range support I found a few surprises. First, when I added support for numeric ranges, I thought that was a special case, but in fact it seems that (at least in recent Elasticsearch/Lucene versions) it's a general-purpose range query that could cover numbers, dates, or even terms. So you can do [cat TO dog]... It's a little bit more complex for Bleve, because we don't enforce one type of data on a particular field, so in Bleve it's possible to have dates, numbers and text terms all within a single field. I have ideas for how we can support this, but in short I'm perfectly OK if we don't solve all of this at once. Missing support for date ranges is a big hole, so anything you can do to improve that situation will be welcome.
  2. Regarding AND/OR, it shouldn't be that difficult; it's all about your comfort level working with scanners/parsers. There are 2 basic steps. First, add support for the "(", ")", "AND", and "OR" tokens. Then update the grammar, such that all of our existing "simple" things can be nested inside of these AND/OR/PAREN groupings. Then, and this is why I didn't bother with it at first, there are ambiguities to consider...

See this section of the Elasticsearch documentation, which describes some of the problems:

https://www.elastic.co/guide/en/elasticsearch/reference/current/query-dsl-query-string-query.html#_boolean_operators

If we introduce these operators, it would be nice to try and also get the precedence the same as Elasticsearch/Lucene so that queries behave in a predictable way.

Hopefully these details aren't discouraging, just some things to consider as you work.
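As a rough illustration of the two steps above (new tokens, then a grammar that nests the simple things inside AND/OR/paren groups), here is a small hand-written recursive-descent sketch in plain Go. It is not bleve's nex/yacc grammar: the names `tokenize`, `parser` and `parse` are hypothetical, terms are treated as opaque words, and it assumes the conventional precedence where AND binds tighter than OR, which may not match what Lucene or Elasticsearch actually do.

```go
package main

import (
	"fmt"
	"strings"
)

// tokenize splits a query into "(", ")" and whitespace-separated words.
// Quoted phrases, field prefixes and ranges are deliberately ignored here.
func tokenize(q string) []string {
	q = strings.NewReplacer("(", " ( ", ")", " ) ").Replace(q)
	return strings.Fields(q)
}

type parser struct {
	toks []string
	pos  int
}

// peek returns the current token, or "" at end of input.
func (p *parser) peek() string {
	if p.pos < len(p.toks) {
		return p.toks[p.pos]
	}
	return ""
}

// next returns the current token and advances past it.
func (p *parser) next() string {
	t := p.peek()
	if p.pos < len(p.toks) {
		p.pos++
	}
	return t
}

// parseOr handles the lowest-precedence level: and ("OR" and)*
func (p *parser) parseOr() (string, error) {
	left, err := p.parseAnd()
	if err != nil {
		return "", err
	}
	for p.peek() == "OR" {
		p.next()
		right, err := p.parseAnd()
		if err != nil {
			return "", err
		}
		left = "(OR " + left + " " + right + ")"
	}
	return left, nil
}

// parseAnd handles the next level: primary ("AND" primary)*
func (p *parser) parseAnd() (string, error) {
	left, err := p.parsePrimary()
	if err != nil {
		return "", err
	}
	for p.peek() == "AND" {
		p.next()
		right, err := p.parsePrimary()
		if err != nil {
			return "", err
		}
		left = "(AND " + left + " " + right + ")"
	}
	return left, nil
}

// parsePrimary handles a parenthesised group or a single term.
func (p *parser) parsePrimary() (string, error) {
	switch t := p.next(); t {
	case "":
		return "", fmt.Errorf("unexpected end of query")
	case "(":
		inner, err := p.parseOr()
		if err != nil {
			return "", err
		}
		if p.next() != ")" {
			return "", fmt.Errorf("missing closing parenthesis")
		}
		return inner, nil
	case ")", "AND", "OR":
		return "", fmt.Errorf("unexpected token %q", t)
	default:
		return t, nil
	}
}

// parse returns an S-expression showing how the query was grouped.
func parse(q string) (string, error) {
	p := &parser{toks: tokenize(q)}
	out, err := p.parseOr()
	if err != nil {
		return "", err
	}
	if p.peek() != "" {
		return "", fmt.Errorf("unexpected token %q", p.peek())
	}
	return out, nil
}

func main() {
	for _, q := range []string{"orange AND lemon OR citrus", "orange AND (lemon OR citrus)"} {
		out, err := parse(q)
		fmt.Println(out, err)
	}
}
```

The sketch sidesteps the ambiguities mentioned above: two adjacent terms with no operator between them are simply rejected, whereas a real query string parser has to decide what that juxtaposition means and how it interacts with AND/OR.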

bcampbell commented 9 years ago

  1. Regarding the date/string/number ambiguity: I was pondering this yesterday. I think the information required can be gleaned by delving through the IndexMapping, but that's only available when the search is executed (the mapping is passed in via the Searcher() fn). I did have the idea of adding a GenericRangeQuery type which papers over the differences between NumericRangeQuery and DateRangeQuery (and maybe even alphabetic ranges between text terms like Lucene... not sure how feasible that is in bleve). It's also a little fiddly distinguishing between types at lexing/parsing time (eg 2014-05-01 is obviously a date, but is 20140501 a date or a number? etc...) Anyway. Like I said, still all a bit experimental.
  2. I've already added the extra tokens (and a test). Nex is nice, but a little annoying - the tSTRING definition is getting a bit unwieldy. I'm looking at the grammar now.

mschoch commented 9 years ago

  1. I actually designed Bleve with the explicit idea that you cannot know the type, even if you have the IndexMapping. I did this because I have always been frustrated with Elasticsearch claiming to be schema-free, when in fact in most configurations it chokes when you have heterogeneous data in the same field. If you allow things to be indexed automatically, and you used a generic structure like map[string]interface{}, you could end up with number, date and text data indexed for the same field. I still think this is an OK trade-off; it just means that a generic range search has to search 2 or 3 ranges. In practice this won't affect performance, as all but one of the ranges will be empty.
  2. Yeah, I can't tell you how many times I've tried to ditch nex, and yet I always come back. Yeah, for some reason tSTRING is particularly fussy. I encourage you to keep running (and add to) the unit tests here. The change I made was to allow for terms that happen to start with a number; it took forever to figure this out without breaking something else.

mschoch commented 9 years ago

One other thought: although Bleve lets you customize to handle a variety of date formats, I think it's reasonable for the query string to support one, or possibly a small set of unambiguous ones. My recommendation for now is to keep it simple: a date is simply a tSTRING that also happens to be parseable as RFC3339. That is the default we use in a lot of other places.
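A minimal sketch of that rule, combined with the "search 2 or 3 ranges" idea from earlier in the thread, might look like the following. It uses only the Go standard library; `boundKind`, `kindsOf` and `rangeKinds` are hypothetical names, not bleve types, and how the resulting kinds would map onto actual bleve range queries is left open.

```go
package main

import (
	"fmt"
	"math"
	"strconv"
	"time"
)

// boundKind is one possible interpretation of a range endpoint.
type boundKind int

const (
	kindTerm boundKind = iota
	kindNumber
	kindDate
)

func (k boundKind) String() string {
	return [...]string{"term", "number", "date"}[k]
}

// kindsOf lists every interpretation a single endpoint admits. Any string can
// be a text term; it is also a number if it parses as a finite float, and a
// date if it parses as RFC3339.
func kindsOf(bound string) []boundKind {
	kinds := []boundKind{kindTerm}
	if f, err := strconv.ParseFloat(bound, 64); err == nil && !math.IsNaN(f) && !math.IsInf(f, 0) {
		kinds = append(kinds, kindNumber)
	}
	if _, err := time.Parse(time.RFC3339, bound); err == nil {
		kinds = append(kinds, kindDate)
	}
	return kinds
}

// rangeKinds returns the interpretations shared by both endpoints. A generic
// range query would then run one sub-range per returned kind.
func rangeKinds(lo, hi string) []boundKind {
	var shared []boundKind
	for _, a := range kindsOf(lo) {
		for _, b := range kindsOf(hi) {
			if a == b {
				shared = append(shared, a)
			}
		}
	}
	return shared
}

func main() {
	fmt.Println(rangeKinds("cat", "dog"))
	fmt.Println(rangeKinds("1", "10"))
	fmt.Println(rangeKinds("20140501", "20140601"))
	fmt.Println(rangeKinds("2014-01-01T00:00:00Z", "2014-01-07T00:00:00Z"))
}
```

One consequence of taking RFC3339 strictly: Go's `time.RFC3339` layout requires a time and zone, so a bare `2014-01-01` would be treated as a plain term here and would need to be written as `2014-01-01T00:00:00Z` to count as a date. It also settles the `20140501` question by default: it parses as a number, not a date.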

bcampbell commented 9 years ago

I've just pushed up my progress so far to a branch on my fork: https://github.com/bcampbell/bleve/tree/queryparser

I found myself running round and round in circles with yacc, so in the end decided to go with a noddy hand-rolled parser. If ever deemed worthwhile, I could probably translate it back to yacc and nex without too much hassle. I find the hand-rolled one easier to follow and reason about, but there's value in established conventions.

notes:

I'll be back onto it next week.