apache / lucene

Apache Lucene open-source search software
https://lucene.apache.org/
Apache License 2.0

Add a simple QueryParser to parse human-entered queries. [LUCENE-5336] #6400

Closed asfimport closed 10 years ago

asfimport commented 10 years ago

I would like to add a new simple QueryParser to Lucene that is designed to parse human-entered queries. This parser will operate on an entire entered query using a specified single field or a set of weighted fields (using term boost).

All features/operations in this parser can be enabled or disabled depending on what is necessary for the user. A default operator may be specified as either 'MUST' representing 'and' or 'SHOULD' representing 'or.' The features/operations that this parser will include are the following:

The key differences between this parser and other existing parsers will be the following:


Migrated from LUCENE-5336 by Jack Conradson (@jdconrad), 3 votes, resolved Nov 12 2013 Attachments: LUCENE-5336.patch (versions: 3)

asfimport commented 10 years ago

Jack Conradson (@jdconrad) (migrated from JIRA)

I have attached a patch for this JIRA.

asfimport commented 10 years ago

Michael McCandless (@mikemccand) (migrated from JIRA)

This is AWESOME. I love how the operators (even whitespace!) are optional. And I love the name :) And it's great that it NEVER throws an exception no matter how awful the input is. And I love that it does not use a lexer/parser generator: this makes it much more approachable to devs who don't have experience with parser generators.

Small javadoc fix: instead of "any {@code -} characters beyond the first character in a term may not need to be escaped," I think it should say "any {@code -} characters beyond the first character do not need to be escaped" (and the same for the * operator)?

How does it handle malformed input, e.g. a missing closing " for a phrase query? If I enter "foo bar, will it just make a term query for "foo and a term query for bar? Or does it strip that " and do a query for foo instead? (Same for a missing closing paren?) It looks like it drops the " and ( and does a simple term query (good).

Maybe you could add fangs to the random test by more frequently mixing in these operator characters ...

asfimport commented 10 years ago

Paul Elschot (migrated from JIRA)

A realistic query parser is not likely to be any simpler than this, so why not call it "simple"?

asfimport commented 10 years ago

Jack Conradson (@jdconrad) (migrated from JIRA)

Thanks for the feedback.

To answer the malformed input question:

If "foo bar is given as the query, the double quote will be dropped. If whitespace is an operator, it will make term queries for both 'foo' and 'bar'; otherwise it will make a single term query 'foo bar'.

If foo"bar is given as the query, the double quote will be dropped, and term queries will be made for both 'foo' and 'bar'.

It's done this way because the parser only backtracks as far as the malformed input (in this case the extraneous double quote), so 'foo' would already be part of the query tree. That in turn is because only a single pass is made over each query. The parser could be changed to do two passes to remove extraneous characters, but I believe that only makes the code more complex and doesn't necessarily interpret the query any better for a user, since the malformed character gives no hint as to what he/she really intended to do.
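The behavior described above can be illustrated with a small standalone sketch. This is not the code from the patch: the class name `LenientQuerySketch`, the `parse` signature, and the `term:`/`phrase:` string output are all hypothetical, and only double quotes and whitespace are handled. It scans the query once from left to right, flushes whatever precedes a quote into the result, and drops a quote that has no matching close instead of throwing.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical illustration only; not the SimpleQueryParser implementation.
final class LenientQuerySketch {

    /** Scans the query left to right and never throws on malformed input. */
    static List<String> parse(String query, boolean whitespaceIsOperator) {
        List<String> clauses = new ArrayList<>();
        StringBuilder term = new StringBuilder();
        int i = 0;
        while (i < query.length()) {
            char c = query.charAt(i);
            if (c == '"') {
                // Text read before the quote is already part of the result.
                flush(term, clauses);
                int close = query.indexOf('"', i + 1);
                if (close >= 0) {
                    // Well-formed phrase: emit it and skip past the closing quote.
                    String phrase = query.substring(i + 1, close).trim();
                    if (!phrase.isEmpty()) {
                        clauses.add("phrase:" + phrase);
                    }
                    i = close + 1;
                } else {
                    // Unmatched quote: drop it and keep scanning from the next character.
                    i++;
                }
            } else if (whitespaceIsOperator && Character.isWhitespace(c)) {
                // Whitespace separates terms only when it is enabled as an operator.
                flush(term, clauses);
                i++;
            } else {
                term.append(c);
                i++;
            }
        }
        flush(term, clauses);
        return clauses;
    }

    /** Emits the buffered text as a term clause, if there is any, and resets the buffer. */
    private static void flush(StringBuilder term, List<String> clauses) {
        String text = term.toString().trim();
        if (!text.isEmpty()) {
            clauses.add("term:" + text);
        }
        term.setLength(0);
    }
}
```

With this sketch, `parse("\"foo bar", true)` gives `term:foo` and `term:bar`, `parse("\"foo bar", false)` gives the single clause `term:foo bar`, and `parse("foo\"bar", ...)` gives `term:foo` and `term:bar` with either setting, matching the cases described above.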

I will try to post another patch today or tomorrow.

I plan to do the following:

asfimport commented 10 years ago

Jack Conradson (@jdconrad) (migrated from JIRA)

Attached an updated version of the patch with the three modifications from my previous comment.

asfimport commented 10 years ago

Adrien Grand (@jpountz) (migrated from JIRA)

Javadocs and code seem to disagree on the default operator: the javadocs say "The default operator is AND if no other operator is specified." while the code has "private BooleanClause.Occur defaultOperator = BooleanClause.Occur.SHOULD;"?

Otherwise I agree with Mike that this new query parser is awesome. I will certainly use it!

asfimport commented 10 years ago

Robert Muir (@rmuir) (migrated from JIRA)

I took a swipe at trying to make the javadocs easier to read (just different layout).

Also folded in Adrien's fix.

asfimport commented 10 years ago

Michael McCandless (@mikemccand) (migrated from JIRA)

+1, javadocs and the new test look great!

asfimport commented 10 years ago

Adrien Grand (@jpountz) (migrated from JIRA)

+1

asfimport commented 10 years ago

ASF subversion and git services (migrated from JIRA)

Commit 1541151 from @rmuir in branch 'dev/trunk' https://svn.apache.org/r1541151

LUCENE-5336: add SimpleQueryParser for human-entered queries

asfimport commented 10 years ago

ASF subversion and git services (migrated from JIRA)

Commit 1541158 from @rmuir in branch 'dev/branches/branch_4x' https://svn.apache.org/r1541158

LUCENE-5336: add SimpleQueryParser for human-entered queries

asfimport commented 10 years ago

Robert Muir (@rmuir) (migrated from JIRA)

Thanks Jack!

asfimport commented 10 years ago

ASF subversion and git services (migrated from JIRA)

Commit 1557073 from @mikemccand in branch 'dev/branches/lucene5376' https://svn.apache.org/r1557073

LUCENE-5336, #6440: expose SimpleQueryParser in lucene server

asfimport commented 10 years ago

Marcio Napoli (migrated from JIRA)

I believe it would be interesting to include support for prefix/suffix queries (term* or *term) and also date ranges such as [20120910 TO 20130101]. Thanks!