SpanQueryParser with recursion, analysis and syntax very similar to classic QueryParser [LUCENE-5205]

asfimport commented 10 years ago

This parser extends QueryParserBase and includes functionality from:

Classic QueryParser: most of its syntax
SurroundQueryParser: recursive parsing for "near" and "not" clauses.
ComplexPhraseQueryParser: can handle "near" queries that include multiterms (wildcard, fuzzy, regex, prefix),
AnalyzingQueryParser: has an option to analyze multiterms.

At a high level, there's a first pass BooleanQuery/field parser and then a span query parser handles all terminal nodes and phrases.

Same as classic syntax:

term: test
fuzzy: roam\~0.8, roam\~2
wildcard: te?t, test*, t*st
regex: /[mb]oat/
phrase: "jakarta apache"
phrase with slop: "jakarta apache"\~3
default "or" clause: jakarta apache
grouping "or" clause: (jakarta apache)
boolean and +/-: (lucene OR apache) NOT jakarta; +lucene +apache -jakarta
multiple fields: title:lucene author:hatcher

Main additions in SpanQueryParser syntax vs. classic syntax:

Can require "in order" for phrases with slop with the _{> operator: "jakarta apache"}>3
Can specify "not near": "fever bieber"! ~3,10 :: find "fever" but not if "bieber" appears within 3 words before or 10 words after it.
Fully recursive phrasal queries with [ and ]; as in: [[jakarta apache]\_{3 lucene]}>4 :: find "jakarta" within 3 words of "apache", and that hit has to be within four words before "lucene"
Can also use [] for single level phrasal queries instead of " as in: [jakarta apache]
Can use "or grouping" clauses in phrasal queries: "apache (lucene solr)" ~3 :: find "apache" and then either "lucene" or "solr" within three words.
Can use multiterms in phrasal queries: "jakarta ~1 ap*che" ~2
Did I mention full recursion: [[jakarta ~1 ap*che] _{2 (solr\} /l[ou]\[cs][en]\/)] ~10 :: Find something like "jakarta" within two words of "ap*che" and that hit has to be within ten words of something like "solr" or that "lucene" regex.
Can require at least x number of hits at boolean level: "apache AND (lucene solr tika)\~2
Can use negative only query: -jakarta :: Find all docs that don't contain "jakarta"
Can use an edit distance > 2 for fuzzy query via SlowFuzzyQuery (beware of potential performance issues!).

Trivial additions:

Can specify prefix length in fuzzy queries: jakarta\~1,2 (edit distance =1, prefix =2)
Can specifiy Optimal String Alignment (OSA) vs Levenshtein for distance <=2: (jakarta\~1 (OSA) vs jakarta\~>1(Levenshtein)

This parser can be very useful for concordance tasks (see also #6381 and LUCENE-5318) and for analytical search.

Until #3952 is closed, this might have a use for fans of SpanQuery.

Most of the documentation is in the javadoc for SpanQueryParser.

Any and all feedback is welcome. Thank you.

Until this is added to the Lucene project, I've added a standalone lucene-addons repo (with jars compiled for the latest stable build of Lucene) on github.

Migrated from LUCENE-5205 by Tim Allison (@tballison), 11 votes, resolved Dec 07 2015 Attachments: LUCENE_5205.patch, LUCENE-5205_dateTestReInitPkgPrvt.patch, LUCENE-5205_improve_stop_word_handling.patch, LUCENE-5205_smallTestMods.patch, LUCENE-5205.patch.gz (versions: 2), LUCENE-5205-cleanup-tests.patch, LUCENE-5205-date-pkg-prvt.patch, patch.txt, SpanQueryParser_v1.patch.gz Linked issues:

2898
- 2560
- SOLR-5410
- SOLR-5412
- 6820

asfimport commented 10 years ago

Tim Allison (@tballison) (migrated from JIRA)

If the community thinks this parser is worthwhile to add to Lucene, and if there's someone with the time and willingness to help me shape this into something worthy of the project, I'm more than happy to make modifications (e.g. use javacc and other refactoring). Or, better yet, if there are mods that can be made to current code to include the above functionality, I'd be happy to work on those.

I'll post updates as I add more tests and make bug fixes to my local version.

Thank you!

asfimport commented 10 years ago

Paul Elschot (migrated from JIRA)

I missed this originally, sorry about that, but I jst had a quick look at the patch.

I think this has a lot more possibilities than the surround parser. So much more that this might actually replace the surround parser. Your target should be the queryparser module I think. Hopefully that will bring more users and perhaps even some maintainers.

A few details:

There is no AND query, that is a pity, but I see the point. I remember the struggle I had to combine Boolean and Span queries in surround. A user interface that provides a QueryFilter might well be enough for most users.

Are there test cases for the recursive queries? I may have overlooked them.

The source code indentation is not 2 spaces everywhere.

asfimport commented 10 years ago

Tim Allison (@tballison) (migrated from JIRA)

Paul, Thank you for your feedback. Updated version attached.

1) Some recursion tests were scattered throughout, but I added several more in a new section devoted to this. 2) Fixed the indentation (I think)

As for And query...the simplest hack that I can think of that uses only SpanQuery and Query would be the following.

Have a simple parser that takes something like this:

sq1 AND sq2 AND NOT sq3

Create a filter wrapper around a BooleanQuery that reflects the above for document retrieval and then create a SpanOr query (from sq1 and sq2) for the colorization/concordancing. The return value would be a pair of SpanQuery and Filter (where filter could be null). Or, if doc retrieval were the only goal, return the original BooleanQuery.

My first attempt wouldn't allow grouping, but that should be easy enough to add. By grouping, of course, I mean:

sq1 AND sq2 AND NOT (sq3 AND sq4)

As an integration question...I just came across the test/dev code in the test branch of oal.queryparser.flexible.spans. Is anyone working on that currently? Is there an easy way to add my functionality to that framework?

asfimport commented 10 years ago

Tim Allison (@tballison) (migrated from JIRA)

Refactored to extend QueryParserBase and add boolean and fielded query parsing. Added boosting and range queries. Added nearly all tests from QueryParserTestBase.

asfimport commented 10 years ago

Tim Allison (@tballison) (migrated from JIRA)

Added hooks for easier integration with Solr. Fixed unbounded range queries. Added "negative only" query. Numerous other small refactorings throughout.

Paul Elschot, or any other kind committer, if you have a chance to review and help me get this into shape for committing, I'd greatly appreciate it. Thank you!

asfimport commented 10 years ago

Tim Allison (@tballison) (migrated from JIRA)

SImilar to previous efforts

asfimport commented 10 years ago

Tim Allison (@tballison) (migrated from JIRA)

Thanks to @iorixxx's recommendation, I added tests from TestComplexPhraseQuery. This revealed #6513.

I had to make some small tweaks to the syntax of NOTs within phrases.

If the community thinks this functionality is worthwhile, and if a kind committer would be willing to work with me to get this into shape, I'm happy to make modifications.

Thank you.

asfimport commented 10 years ago

David Smiley (@dsmiley) (migrated from JIRA)

FWIW my concern with any new query parser, particularly big ones like this which will have a non-trivial maintenance burden, is... how many query parsers do we want to ship & support with Lucene? Instead can we make an existing one better with the new capabilities you’ve added?

asfimport commented 10 years ago

Tim Allison (@tballison) (migrated from JIRA)

@dsmiley, I completely agree. Ideally, at some point in the future, this may reduce the number of parsers required – Surround, ComplexPhrase and Analyzing – the functionality of these is covered by this one...I think.

If I were to start with an existing parser, I'd probably target ComplexPhrase. I have no experience with javacc, though, and I'm not sure I can justify the startup costs. As with so much else, those startup costs will likely turn out to be minimal and the benefit worth the costs. In general, I'd rather use a mature cc framework/tool than regexes...but I'm not sure I'll have the time or justification to get started.

At the very least, the description of the functionality and the testcases from LUCENE-5205 could be used as a guide to others proficient in javacc who want to modify CPQP. This code could also scamper off to github, but it would be great (from my perspective) to have the capability available as part of the standard distro.

Thoughts?

asfimport commented 10 years ago

Robert Muir (@rmuir) (migrated from JIRA)

I think its ok to add new parsers and slap experimental or some other label on them, and refactor them later?

If there are concerns about the code, there is the sandbox as an option, too, which makes things totally clear.

I'll take a look at the patch Tim, thanks.

asfimport commented 10 years ago

ASF subversion and git services (migrated from JIRA)

Commit 1569953 from @rmuir in branch 'dev/branches/lucene5205' https://svn.apache.org/r1569953

LUCENE-5205: create branch

asfimport commented 10 years ago

ASF subversion and git services (migrated from JIRA)

Commit 1569969 from @rmuir in branch 'dev/branches/lucene5205' https://svn.apache.org/r1569969

LUCENE-5205: latest patch, synced to trunk, whitespace and braces and javadocs fixes only

asfimport commented 10 years ago

Robert Muir (@rmuir) (migrated from JIRA)

OK I took a look, i put the current state in a branch (https://svn.apache.org/repos/asf/lucene/dev/branches/lucene5205). This way, we dont have to fling enormous patches around on this issue.

A few things i noticed and fixed (or tried to):

formatting: e.g. no space between parens or exception names and braces:
whitespace: e.g. adding a line between each method, between apache license header and imports, and so on
one test class didn't extend lucenetestcase
javadocs errors (these show up as compile errors in my IDE), e.g. \@param foo with no actual text

In the process of going thru the code, i have some initial concerns, maybe we can figure out how to address:

public classes with no javadocs. if its important enough to be a public class, we should at least have a javadoc on it saying what it does.
should we just nuke the spans.tokens package? This doesn't seem useful to any end user and folding this into .spans as package-private classes could greatly reduce the API surface area.
lots of code copy-pasted from elsewhere (maybe with tweaks). this includes test code. we should try to do some refactoring of this. I can try to look at the tests tonight, and see if I can improve that side.

Any patches welcome against the branch.

asfimport commented 10 years ago

ASF subversion and git services (migrated from JIRA)

Commit 1570063 from @rmuir in branch 'dev/branches/lucene5205' https://svn.apache.org/r1570063

LUCENE-5205: clean up some test code dup

asfimport commented 10 years ago

ASF subversion and git services (migrated from JIRA)

Commit 1570066 from @rmuir in branch 'dev/branches/lucene5205' https://svn.apache.org/r1570066

LUCENE-5205: reduce visibility of internal classes to pkg-private

asfimport commented 10 years ago

ASF subversion and git services (migrated from JIRA)

Commit 1570073 from @rmuir in branch 'dev/branches/lucene5205' https://svn.apache.org/r1570073

LUCENE-5205: clean up some craziness in these asserts

asfimport commented 10 years ago

ASF subversion and git services (migrated from JIRA)

Commit 1570074 from @rmuir in branch 'dev/branches/lucene5205' https://svn.apache.org/r1570074

LUCENE-5205: clean up formatting, wildcard imports

asfimport commented 10 years ago

ASF subversion and git services (migrated from JIRA)

Commit 1570075 from @rmuir in branch 'dev/branches/lucene5205' https://svn.apache.org/r1570075

LUCENE-5205: uncomment+fix broken assertions

asfimport commented 10 years ago

ASF subversion and git services (migrated from JIRA)

Commit 1570076 from @rmuir in branch 'dev/branches/lucene5205' https://svn.apache.org/r1570076

LUCENE-5205: nuke ancient commented-out stuff, clean up exception tests

asfimport commented 10 years ago

ASF subversion and git services (migrated from JIRA)

Commit 1570080 from @rmuir in branch 'dev/branches/lucene5205' https://svn.apache.org/r1570080

LUCENE-5205: clean up dead code, asserts, remove warnings

asfimport commented 10 years ago

ASF GitHub Bot (migrated from JIRA)

GitHub user PaulElschot opened a pull request:

https://github.com/apache/lucene-solr/pull/38

Deprecate Surround parser 

LUCENE-5205

Nothing much to say :)

You can merge this pull request into a Git repository by running:

$ git pull https://github.com/PaulElschot/lucene-solr depr-surround

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/lucene-solr/pull/38.patch

To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message:

This closes `#38`

commit 44b504f070c888bf7124939d352b2f47f77f2e8e Author: Robert Muir <rmuir@apache.org> Date: 2014-02-19T23:03:33Z

LUCENE-5205: create branch

git-svn-id: https://svn.apache.org/repos/asf/lucene/dev/branches/lucene5205@1569953 13f79535-47bb-0310-9956-ffa450edef68

commit 97bebc0896fcb47c3f8059c32b25cc288e84b53d Author: Robert Muir <rmuir@apache.org> Date: 2014-02-19T23:20:19Z

LUCENE-5205: latest patch, synced to trunk, whitespace and braces and javadocs fixes only

git-svn-id: https://svn.apache.org/repos/asf/lucene/dev/branches/lucene5205@1569969 13f79535-47bb-0310-9956-ffa450edef68

commit cfb1f36875fd3643c84b3c495641692f6c5f9a26 Author: Robert Muir <rmuir@apache.org> Date: 2014-02-20T03:09:47Z

LUCENE-5205: clean up some test code dup

git-svn-id: https://svn.apache.org/repos/asf/lucene/dev/branches/lucene5205@1570063 13f79535-47bb-0310-9956-ffa450edef68

commit 7d3974d2cd4645a8933e4a667459f9baf0cda34f Author: Robert Muir <rmuir@apache.org> Date: 2014-02-20T03:18:33Z

LUCENE-5205: reduce visibility of internal classes to pkg-private

git-svn-id: https://svn.apache.org/repos/asf/lucene/dev/branches/lucene5205@1570066 13f79535-47bb-0310-9956-ffa450edef68

commit e5dda4a2e13dc9419f61871561a5ca9bb54f9940 Author: Robert Muir <rmuir@apache.org> Date: 2014-02-20T03:29:28Z

LUCENE-5205: clean up some craziness in these asserts

git-svn-id: https://svn.apache.org/repos/asf/lucene/dev/branches/lucene5205@1570073 13f79535-47bb-0310-9956-ffa450edef68

commit 020c91223f534a6ccb61745ff90bc273021bd60a Author: Robert Muir <rmuir@apache.org> Date: 2014-02-20T03:37:58Z

LUCENE-5205: clean up formatting, wildcard imports

git-svn-id: https://svn.apache.org/repos/asf/lucene/dev/branches/lucene5205@1570074 13f79535-47bb-0310-9956-ffa450edef68

commit 53a9649d778e5887796980b51dfd6b644c27b443 Author: Robert Muir <rmuir@apache.org> Date: 2014-02-20T03:43:50Z

LUCENE-5205: uncomment+fix broken assertions

git-svn-id: https://svn.apache.org/repos/asf/lucene/dev/branches/lucene5205@1570075 13f79535-47bb-0310-9956-ffa450edef68

commit 068a17bf03620ccd4f25f619ff118495023f8ef5 Author: Robert Muir <rmuir@apache.org> Date: 2014-02-20T03:51:33Z

LUCENE-5205: nuke ancient commented-out stuff, clean up exception tests

git-svn-id: https://svn.apache.org/repos/asf/lucene/dev/branches/lucene5205@1570076 13f79535-47bb-0310-9956-ffa450edef68

commit 02c98224356d6b0b41acb572d559ab0f4e9dd716 Author: Robert Muir <rmuir@apache.org> Date: 2014-02-20T04:04:27Z

LUCENE-5205: clean up dead code, asserts, remove warnings

git-svn-id: https://svn.apache.org/repos/asf/lucene/dev/branches/lucene5205@1570080 13f79535-47bb-0310-9956-ffa450edef68

commit a5921753a832b79e6c11e51f1af9ad6aa1d47c96 Author: Paul Elschot <paul.j.elschot@gmail.com> Date: 2014-02-20T08:11:25Z

Deprecate Surround parser, use SpanQueryParser instead.

asfimport commented 10 years ago

ASF subversion and git services (migrated from JIRA)

Commit 1570190 from @rmuir in branch 'dev/branches/lucene5205' https://svn.apache.org/r1570190

LUCENE-5205: commit paul's patch, deprecating surround for spans

asfimport commented 10 years ago

Robert Muir (@rmuir) (migrated from JIRA)

Thanks Paul!

asfimport commented 10 years ago

ASF GitHub Bot (migrated from JIRA)

Github user PaulElschot closed the pull request at:

https://github.com/apache/lucene-solr/pull/38

asfimport commented 10 years ago

Tim Allison (@tballison) (migrated from JIRA)

Wow, Robert, thank you!

The attached contains some trivial mods. 1) I uncommented the #6513 test cases in TestComplexPhraseSpanQuery 2) I modified new Searcher(reader) to newSearcher(reader) in a few places.

Please let me know if there are changes that I should make. I'm somewhat afraid to get in your way.

Thank you, again.

asfimport commented 10 years ago

David Smiley (@dsmiley) (migrated from JIRA)

@tballison By the way, the ASF just recently improved the integration with GitHub and the Lucene project. I suggest you fork and send a pull request, assuming both you and @rmuir find that easier. Just a suggestion.

asfimport commented 10 years ago

Robert Muir (@rmuir) (migrated from JIRA)

patches of any sort are fine: svn patches, git patches, pull requests, to me its the same.

Sorry Tim, i just now saw your patch: thanks. I will commit this first and see what the branch looks like and then respond to your question.

asfimport commented 10 years ago

ASF subversion and git services (migrated from JIRA)

Commit 1570280 from @rmuir in branch 'dev/branches/lucene5205' https://svn.apache.org/r1570280

LUCENE-5205: tim's test fixes

asfimport commented 10 years ago

Robert Muir (@rmuir) (migrated from JIRA)

Please let me know if there are changes that I should make. I'm somewhat afraid to get in your way.

Please, dont worry about being in my way. Actually thats why i was working on the tests part: i don't have to understand much about this specific parser to work on them, the changes likely wont be controversial, and i wont be in your way as far as the actual code.

Here's a couple of things:

As far as concerns, i mentioned some things above, some are cleaned up, others are still problems i think. A big one is I think we should try to remove some of this code duplication here, from various places, as much as possible.

A quick thing: which of the abstract classes in the package really needs to be public? Currently all are, if they currently really NEED to be public, can we make some of these package-private? At least, if they are public, they should have some javadocs (e.g. AbstractSpanQueryParser has none on its class).

TestSpanQPBasedonQPTestBase.testDateRange is currently ignored with nocommit. It seems to sometimes fail based on the Locale. its highly possible that I caused this in my refactoring somehow, but i didnt run the tests enough times to know for sure... we should figure that out.

I'm a little worried about stuff like SpanOnlyParser/SpanQueryParser.ReInit: this is confusing. Isn't it only called by the superclass' parse(String s), which is overridden here anyway? maybe we can just replace all that code with something like 'assert false'.

ill investigate more and try to think of more later, but maybe you have some thoughts on these things.

asfimport commented 10 years ago

Tim Allison (@tballison) (migrated from JIRA)

Code duplication. The biggest offenders are in test (I think, let me know if you disagree): 1) TestSpanQPBasedonQPTestBase...I can try to refactor this to extend QPTestBase, but that will require some reworking of QPTestBase as, and I didn't want to touch that (hence the duplication). It would also help to add a getQuery() to SpanMultitermQueryWrapper to test for equality...again, I didn't want to touch anything outside of the parser at the cost of duplication. 2) TestMultiAnalyzer. This relies on testing equality of string representations of queries. Will have to modify TestMultiAnalyzer in way similar to QPTestBase. 3) TestComplexPhraseQuery. Should be straightforward to extend the original, but will need to make checkMatches public so that I can override it. I'll also have to move the tests with slightly different syntax into a different test, but that's easy and would help declutter.

There's other code duplication with AnalyzingQueryParser...should we break that functionality out into a helper class?

Any other major duplication areas?

Y, I don't like the reinit at all. The reason that's there was so that I could extend QueryParserBase, but I'm not sure that that decision buys much anymore. As I remember, it buys date parsing in range queries (which I'm now not sure I actually want) and addBoolean; there may be more, but I'm not sure there is.

It would clean up a fair bit of code if I implement CommonQueryParserConfiguration instead of extending QueryParserBase. I'd still have to leave in some things that don't make sense for the SpanQueryParser, though: lowerCaseExpandedTerms, enablePositionIncrements. Another option would be to abandon CQPC, but I wanted this parser to at least implement that interface. Let me know what makes sense.

As for the public base classes, y, those can go private for now. I made them public in case anyone wanted to extend them, but, as you point out, then I really ought to add javadocs and treat them as if they were public (which they are!).

As for date/locale issues, I'll take a look.

asfimport commented 10 years ago

Robert Muir (@rmuir) (migrated from JIRA)

Code duplication. The biggest offenders are in test (I think, let me know if you disagree):

I guess i wasn't even thinking about test code, i just see code like this method in AnalyzingQueryBase (as an example):

  protected BytesRef analyzeMultitermTermParseEx(String field, String part) throws ParseException {

I know this probably exists elsewhere, i swear there is something in QPBase doing this for range queries. :)

As far as test code:

1) TestSpanQPBasedonQPTestBase...I can try to refactor this to extend QPTestBase, but that will require some reworking of QPTestBase as, and I didn't want to touch that (hence the duplication). It would also help to add a getQuery() to SpanMultitermQueryWrapper to test for equality...again, I didn't want to touch anything outside of the parser at the cost of duplication.

I worked on this last night. Basically i split up QPTestBase into two parts:

test harness, test analyzers, helper assertions, etc (QPTestCase)
actual concrete test methods for historical lucene behavior.

I fixed TestSpanQPBasedonQPTestBase to extend #1, I'm pretty happy with it? It removed all the duplicate test analyzers and assert methods and so on. Let me know if you have any concerns with what i did there.

It would also help to add a getQuery() to SpanMultitermQueryWrapper to test for equality...again, I didn't want to touch anything outside of the parser at the cost of duplication.

Hmm, i guess i'm of the opposite mentality. I can see your point about not touching existing code, as it makes this contribution a 'standalone' change, but personally i'd rather us fix stuff like this! That being said, i did in fact recently add such a method while working on an unrelated issue (#6478):

  /** Returns the wrapped query */
  public Query getWrappedQuery() {
    return query;
  }

Does this work? Want to upload a patch that uses this?

3) TestComplexPhraseQuery. Should be straightforward to extend the original, but will need to make checkMatches public so that I can override it. I'll also have to move the tests with slightly different syntax into a different test, but that's easy and would help declutter.

Ok, this sounds good.

There's other code duplication with AnalyzingQueryParser...should we break that functionality out into a helper class?

Sounds like a good idea, i havent looked at the two, but seems worthwhile.

Y, I don't like the reinit at all. The reason that's there was so that I could extend QueryParserBase, but I'm not sure that that decision buys much anymore. As I remember, it buys date parsing in range queries (which I'm now not sure I actually want) and addBoolean; there may be more, but I'm not sure there is.

It would clean up a fair bit of code if I implement CommonQueryParserConfiguration instead of extending QueryParserBase. I'd still have to leave in some things that don't make sense for the SpanQueryParser, though: lowerCaseExpandedTerms, enablePositionIncrements. Another option would be to abandon CQPC, but I wanted this parser to at least implement that interface. Let me know what makes sense.

I think the current base class is fine. I guess my question is, who is calling reinit :)

As for the public base classes, y, those can go private for now. I made them public in case anyone wanted to extend them, but, as you point out, then I really ought to add javadocs and treat them as if they were public (which they are!).

sounds good, maybe you want to upload a patch making the appropriate things pkg-private? We can always open things back up later, if we need.

asfimport commented 10 years ago

Tim Allison (@tballison) (migrated from JIRA)

I know this probably exists elsewhere, i swear there is something in QPBase doing this for range queries.

Busted...yeah...I meant to include that in my list of duplication. Sorry.

As for Reinit, you're absolutely right...no one is calling that...let's go with documentation and assert false.

Will start some patches on the smaller stuff. Thank you, again!

asfimport commented 10 years ago

Tim Allison (@tballison) (migrated from JIRA)

In TestSpanQPBasedOnQPTestBase, the date issue is caused by slightly different escaping requirements in SpanQueryParser. The fix for that is overriding escapeDateString.

We could get rid of much more duplication if we were willing to add this to the various AssertQueryEquals in QueryParserTestCase:

    if (q instanceof SpanMultiTermQueryWrapper){
      q = ((SpanMultiTermQueryWrapper)q).getWrappedQuery();
    }

Better to override in TestSpanQPBasedOnQPTestBase?

Patch on other issues on way...

asfimport commented 10 years ago

Robert Muir (@rmuir) (migrated from JIRA)

In TestSpanQPBasedOnQPTestBase, the date issue is caused by slightly different escaping requirements in SpanQueryParser. The fix for that is overriding escapeDateString.

Thanks for digging into this!

Better to override in TestSpanQPBasedOnQPTestBase?

I would prefer that approach (override the asserts) !

asfimport commented 10 years ago

Tim Allison (@tballison) (migrated from JIRA)

Back on the issue of reinventing analyzeMultiterm...part of that reinvention was because I was getting a setReader() in wrong state exception in one of my tests. With analyzeMultiterm in QueryParserBase as it stands, the token stream is not consumed if an exception is thrown. Therefore, next time you run the parser (with the same analyzer) you can get:

   [junit4]    > Throwable #1: java.lang.AssertionError: setReader() called in wrong state: INCREMENT_FALSE
   [junit4]    >        at __randomizedtesting.SeedInfo.seed([6E1DC3D6C716BC75:EDEE4C0E5E329586]:0)
   [junit4]    >        at org.apache.lucene.analysis.MockTokenizer.setReaderTestPoint(MockTokenizer.java:266)
   [junit4]    >        at org.apache.lucene.analysis.Tokenizer.setReader(Tokenizer.java:92)
   [junit4]    >        at org.apache.lucene.analysis.Analyzer$TokenStreamComponents.setReader(Analyzer.java:304)
   [junit4]    >        at org.apache.lucene.analysis.Analyzer.tokenStream(Analyzer.java:181)

Should I fix this in QueryParserBase or did I not build the test analyzer correctly?

asfimport commented 10 years ago

Robert Muir (@rmuir) (migrated from JIRA)

I havent looked at the code, but it seems its not consuming everything. Why is this the case? Why are there multiple tokens here?

if someone does a wildcard like "foo*" and you analyze it, and it breaks it into multiple tokens, i think you should 1) consume the rest, 2) throw exception?

asfimport commented 10 years ago

Tim Allison (@tballison) (migrated from JIRA)

Agreed.

This is how it stands in QueryParserBase:

  protected BytesRef analyzeMultitermTerm(String field, String part, Analyzer analyzerIn) {
    if (analyzerIn == null) analyzerIn = getAnalyzer();

    try (TokenStream source = analyzerIn.tokenStream(field, part)) {
      source.reset();

      TermToBytesRefAttribute termAtt = source.getAttribute(TermToBytesRefAttribute.class);
      BytesRef bytes = termAtt.getBytesRef();

      if (!source.incrementToken())
        throw new IllegalArgumentException("analyzer returned no terms for multiTerm term: " + part);
      termAtt.fillBytesRef();
      if (source.incrementToken())
        throw new IllegalArgumentException("analyzer returned too many terms for multiTerm term: " + part);
      source.end();
      return BytesRef.deepCopyOf(bytes);
    } catch (IOException e) {
      throw new RuntimeException("Error analyzing multiTerm term: " + part, e);
    }
  }

In my duplicated version, I modified the if (source.incrementToken) to a while to consume all tokens.

Raise separate issue or fix with this?

asfimport commented 10 years ago

Tim Allison (@tballison) (migrated from JIRA)

Same issue in Solr's TextField.

asfimport commented 10 years ago

Tim Allison (@tballison) (migrated from JIRA)

Small fixes to get date test to work. Also made more classes package private.

Robert, thanks to your lead in refactoring the common query parser test base, I'll do some more dramatic refactoring next week so that the tests for spanqueryparser extend QueryParserTestBase, not just ...Case. Er...I'll see if that is possible and worth the effort.

asfimport commented 10 years ago

Tim Allison (@tballison) (migrated from JIRA)

On the issue of token consumption on exception during multiterm analysis and on deduplicating code generally, I propose raising a separate issue that would:

1) create a static util helper class to analyze multiterm and multitermwildcards...AnalyzingQueryParser and Solr's TextField as well as 5205's would all be calling the same multiterm analysis code. 2) fold 5205's AnalyzingQueryParserBase's functionality into QueryParserBase 3) add analysis as an option into ClassicQueryParser 4) deprecate AnalyzingQueryParser 5) fix the token consumption on exception issue centrally

Once this is committed and merged into 5205, we can get rid of AnalyzingQueryParserBase in 5205.

Thoughts?

asfimport commented 10 years ago

Robert Muir (@rmuir) (migrated from JIRA)

Sounds good Tim, lets do it. Feel free to break it down into more than 1 issue if you want too, that sounds like a lot at once.

asfimport commented 10 years ago

Tim Allison (@tballison) (migrated from JIRA)

This includes the patch that I submitted earlier today.

1) fixed date test 2) made most classes package private 3) cleaned up reinit and TopLevelQuery

Will turn to other patches on Monday. Thank you!

asfimport commented 10 years ago

ASF subversion and git services (migrated from JIRA)

Commit 1570602 from @rmuir in branch 'dev/branches/lucene5205' https://svn.apache.org/r1570602

LUCENE-5205: make some abstract base classes pkg-private, fix date escaping/date test, clean up reInit

asfimport commented 10 years ago

Robert Muir (@rmuir) (migrated from JIRA)

Thanks Tim, I just committed that patch.

asfimport commented 10 years ago

Tim Allison (@tballison) (migrated from JIRA)

This patch focuses on reducing code duplication in the test cases. Reducing duplication in the main part of code will be separate patch.

1) TestSpanQPBasedOnQPTestBase (renamed to TestQPTestBaseSpanQuery)...now subclasses QueryParserTestBase. There were a few handfuls of tests that I couldn't easily modify; mostly these were string equality tests for complex queries. The CJK examples where complex truth queries were built programmatically also had to be rewritten for SpanQueries. Those now exist in testParserSpecificQuery()...solution is not elegant, and I don't like the name. It would be slightly cleaner to move those handfuls of tests up into TestQueryParser, but I want them to be available to the other subclasses of QueryParserTestBase.

2) TestComplexPhraseSpanQuery now subclasses TestComplexPhraseQuery. Again, had to add a testParserSpecificSyntax. However, in this case the syntax btwn the two parsers is slightly different.

3) TestMultiAnalyzer (renamed to TestMultiAnalyzerSpanQuery) now subclasses TestMultiAnalyzer. These tests were mostly string equality tests on complex queries, and I couldn't easily keep any of the tests.

For the above, I'm sure there are more elegant solutions, but this is where the code is for now.

Small clean up in SpanQueryParserBase: 1) got rid of special option to lowercase regex...treat like any other multiterm, and eventually get rid of lowercasing multiterms altogether!

2) got rid of special rounding correction for fuzzy minSims. Behavior is now the same as classic.

Bugs found: 1) failed to set boost on MatchAllDocsQuery 2) fixed analysis of range terms (the irony :( )

Capabilities not currently covered by SQP 1) Known: && ! syntax 2) Unknown: "term phrase term" -> "+term +(+phrase1 +phrase2) +term"

asfimport commented 10 years ago

Tim Allison (@tballison) (migrated from JIRA)

@rmuir, if you have a chance to review and commit the Feb 28 patch for cleaning up the test cases, I'd greatly appreciate it! Thank you, again.

asfimport commented 10 years ago

Robert Muir (@rmuir) (migrated from JIRA)

Sorry Tim! I'll try to get to this today.

asfimport commented 10 years ago

Tim Allison (@tballison) (migrated from JIRA)

You've had far bigger fish to fry...np at all!

asfimport commented 10 years ago

Modassar Ather (migrated from JIRA)

Hi,

Phrases with stop words in them are not getting searched whereas a phrase without it gets searched using SpanQueryParser/ComplexPhraseQueryParser. E.g. "calculator for evaluating" is not showing any result for an indexed document whereas a phrase without stop word like "evaluating mathematical" is getting searched.

The similar search works fine with classic parser which uses PhraseQuery as it includes the position of terms for all the terms including stop words in the query created. Kindly provide your inputs.

Thanks, Modassar

asfimport commented 10 years ago

Tim Allison (@tballison) (migrated from JIRA)

The root of this problem is that SpanNearQuery has no good way to handle stopwords in a way analagous to PhraseQuery.

In SpanQueryParser, this limitation should be well described in the javadocs to SpanQueryParser and in the test cases. Let me know if it isn't. You have the option of throwing an exception when a stopword is found to notify the user about stopwords, but that's exceedingly unsatisfactory.

Without digging into the internals of SpanNearQuery, we can still do better on this. One proposal is to do what the basic highlighter does and risk false positives...behind the scenes modify "calculator for evaluating" to "calculator evaluating"\~>1. This would then falsely match "calculator zebra evaluating." PhraseQuery can have false positives, too, but it guarantees that the false hit has to be a stop word. This solution would not do that. So, is this better than no matches at all?

apache / lucene

SpanQueryParser with recursion, analysis and syntax very similar to classic QueryParser [LUCENE-5205] #6269

2898

2560

6820