apache / lucene

Apache Lucene open-source search software
https://lucene.apache.org/
Apache License 2.0
2.63k stars 1.02k forks source link

Change all multi-term querys so that they extend MultiTermQuery and allow for a constant score mode [LUCENE-1424] #2498

Closed asfimport closed 15 years ago

asfimport commented 15 years ago

Cleans up a bunch of code duplication, closer to how things should be - design wise, gives us constant score for all the multi term queries, and allows us at least the option of highlighting the constant score queries without much further work.


Migrated from LUCENE-1424 by Mark Miller (@markrmiller), resolved Nov 13 2008 Attachments: lucene-1424.patch, LUCENE-1424.patch (versions: 13), LUCENE-1424-dep_rng_cstr_fix.patch Linked issues:

asfimport commented 15 years ago

Mark Miller (@markrmiller) (migrated from JIRA)

What do people think about putting these classes in? They have no max clause limitations, and reports have been that they can be much more efficient on large indexes. Solr has switched from using the boolean rewriting WildcardQuery,PrefixQuery to ConstantScore queries. We already put in the ConstantScoreRange query for just these reasons, so I think it makes since to add the rest of the family. An upside will be that I can add support for these queries to the Span highlighter. That will put our Highlighter in a position of being able to highlight pretty much every major query type, which I think is an important goal.

asfimport commented 15 years ago

Michael McCandless (@mikemccand) (migrated from JIRA)

What do people think about putting these classes in?

+1

I think in general we should more aggressively slurp useful things like this from Solr into Lucene.

Is there any reason to keep the rewrite-to-BooleanQuery version of these (vs, deprecating them in favor of the ConstantScore variety)? Are the score differences caused by the rewrite-to-BooleanQuery implementations ever "useful"?

asfimport commented 15 years ago

Michael McCandless (@mikemccand) (migrated from JIRA)

Woops... patch uses @Override (Java 5 only):

common.compile-core:
    [mkdir] Created dir: /lucene/lucene.constant/build/classes/java
    [javac] Compiling 333 source files to /lucene/lucene.constant/build/classes/java
    [javac] /lucene/lucene.constant/src/java/org/apache/lucene/search/ConstantScoreWildcardQuery.java:69: annotations are not supported in -source 1.4
    [javac] (use -source 5 or higher to enable annotations)
    [javac]   `@Override`
    [javac]    ^
    [javac] /lucene/lucene.constant/src/java/org/apache/lucene/search/WildcardFilter.java:49: annotations are not supported in -source 1.4
    [javac] (use -source 5 or higher to enable annotations)
    [javac]   `@Override`
    [javac]    ^
    [javac] 2 errors

I'll fix in my local checkout.

asfimport commented 15 years ago

Michael McCandless (@mikemccand) (migrated from JIRA)

Patch looks good. I plan to commit in a day or two.

asfimport commented 15 years ago

Mark Harwood (@markharwood) (migrated from JIRA)

>> Are the score differences caused by the rewrite-to-BooleanQuery implementations ever "useful"?

So we need to consider what we are losing - TF, IDF, coordination, length norm, doc boosts.

I can only think of one use case which relates to coordination factor.

If you have a "category" field for a product e.g. given Lucene docs for these books:

Title: Lucene in Action Category: /Books/Computing/Languages/Java /Books/Computing/InformationRetrieval

Title: The Long Tail Category: /Books/Business/Internet /Books/Computing

You might then use a wildcard search of /Books/Computing/* and "Lucene in Action" would rank higher than "The Long Tail" because a BooleanQuery would score a higher coordination factor suggesting LIA got more hits under this "/Books/Computing.." category. There would still be the issue of IDF potentially skewing results but the coordination factor is potentially useful here.

I think in general IDF tends to be useless for "auto-expanded" terms e.g. Wildcard, fuzzy etc. Incidentally, we still see that IDF issue in fuzzy queries ranking rare mis-spellings higher but that's another issue (one I resolved in contrib's FuzzyLikeThisQuery).

I suppose one other consideration is for people who have created any doc boosts e.g. trying to use this to boost by date.

I don't think any of these cases necessarily outweigh the benefit to be obtained from switching "wildcard/prefix to constant score queries"

Cheers, Mark

asfimport commented 15 years ago

Uwe Schindler (@uschindler) (migrated from JIRA)

Just a suggestion: How about not inventing new Class names. Just include both code parts into e.g. (Range|Prefix|Wildcard)Query's rewrite() method. It would then be possible to just switch the QueryParser or other code by e.g. setting a global (static) flag in (Range|Prefix|Wildcard)Query that switches between both implementations).

To your patch: For printing out the boost in toString(): there is a nice helper routine (instead of "if (getBoost()!=1.0f)...."). I think it should be consistent with toString() of all other core query types.

asfimport commented 15 years ago

Chris M. Hostetter (@hossman) (migrated from JIRA)

Is there any reason to keep the rewrite-to-BooleanQuery

In addition to the coord example Mark mentioned, there are also cases where the tf and idf values for terms matching a wildcard/prefix are meaningful ... but the other advantage to keeping the existing implementations is use cases where the number of unique terms a query expands to is is very small, but the number of documents matched is very large ... in theory (but i haven't tested this) the expanding queries should be more efficient in space and equally efficient in time as the ConstantScore equivalents.

It would then be possible to just switch the QueryParser or other code by e.g. setting a global (static) flag.

I'm loath to see more static "setter" methods added to query clauses, but there's no reason this couldn't be a member property on instances of the classes with corresponding properties on QueryParser.

In theory it could even be a tertiary state property: REWRITE_TO_BOOLEAN, REWRITE_TO_CONSTANT_SCORE. REWRITE_GUESS ... where the third option caused the REWRITE method to make it's best guess using a first pass at the TermEnum iteration and aborting if the number of terms get's above some threshold.

(but that kind of "optimization" would be premature without testing .. the point is it would be possible)

asfimport commented 15 years ago

Mark Miller (@markrmiller) (migrated from JIRA)

Sorry bout the @overrides...grabbed/copied from solr and forgot the java 1.4 cleanse. There was also one StringBuilder in the wildcardfilter that this patch pulls.

asfimport commented 15 years ago

Michael McCandless (@mikemccand) (migrated from JIRA)

OK, it sounds like we should preserve the rewrite-to-BooleanQuery option for each of these Query classes since there are clear use cases where it makes sense.

I do like the idea of adding "constant score capability" to the existing query classes, instead of adding new ConstantScoreXXXQuery classes.

I don't really like the REWRITE_GUESS option because it could lead to strange inconsistent results as seen by the end user – eg prefix1 sorts one way but then a relaxed prefix2 sorts another way. I think a simple static binary choice is good?

Couldn't we build this functionality (get/setRewriteMethod(...)) into MultiTermQuery? And then fix those queries that don't already to subclass MultiTermQuery (at least RangeQuery and PrefixQuery)?

Finally, I would also prefer non-static methods to instruct QueryParser which rewrite method to use. In fact we already have setUseOldRangeQuery – maybe change this to setUseConstantScoreRewriteMethod (defaulting to true)?

asfimport commented 15 years ago

Mark Miller (@markrmiller) (migrated from JIRA)

Heres an early review patch.

I switched range and prefix to use multiterm query, but its debatable if thats necessary. Prob a few clock cycles slower, and the benefits are slim beyond overall code design niceness (which counts a lot to me <g>). Just not sure every mutlitermquery would want to have to implement a constantscore version, and the amount of code reuse saved is low (it could do a bit more with something like getConstantScoreQuery, but thats not much either - I may put it in come to think of it).

TODO:

look over some comments maybe some tests for the constant score versions rangequery is not as expressive as constantscorequery, which can have separate inclusive/exclusive ends.

When I was talking about deprecating constantscorequery being awkward, its wasnt really in the implementation sense (i think we can leave it as is), but more the deprecating one of the newest queries already :) Still don't consider that a huge deal though.

asfimport commented 15 years ago

Michael McCandless (@mikemccand) (migrated from JIRA)

Patch looks good! Thanks Mark.

I think we should indeed "factor up" into MultiTermQuery the ability to create a ConstantScoreQuery out of the filter generated by enumerating the terms and walking the docs for those terms? Then we don't need to special case in each of the subclasses.

Maybe we can then fix Wildcard/Prefix/RangeFilter to create the corresponding query and then ask it for its filter (assuming we make a method eg "getDocIdSet" in MultiTermQuery)? Or I guess we could just make a QueryWrapperFilter around the corresponding query, though that seems rather roundabout. I don't think we need to have duplicated code in PrefixFilter, RangeFilter, WildcardFilter.

When I was talking about deprecating constantscorequery being awkward, its wasnt really in the implementation sense (i think we can leave it as is), but more the deprecating one of the newest queries already Still don't consider that a huge deal though.

Well ... this is just how software evolves :) You can't control which code will be the "victim" of a nice refactoring. It's a healthy sign of progress, and progress is good!

rangequery is not as expressive as constantscorequery, which can have separate inclusive/exclusive ends.

If we do deprecate ConstantScoreRangeQuery (I think we should) then we could add a ctor to RangeQuery matching ConstantScoreRangeQuery's more expressive one?

asfimport commented 15 years ago

Mark Miller (@markrmiller) (migrated from JIRA)

I think we should indeed "factor up" into MultiTermQuery...

Ah, I see...I hadn't thought all way up that path - it goes a bit further than was in my mind. The way to go for sure.

All the comments make perfect sense, I'll work up a new patch while I'm at apachcon.

asfimport commented 15 years ago

Mark Miller (@markrmiller) (migrated from JIRA)

bq In favor of new QueryWrapperFilter(new XXXQuery(...))?

I didn't end up going with the wrapper. Check out this patch just so we get back on the same page sooner rather than later...does this look like a doable way of doing it? It kills all the dupe code I think.

We can just leave Prefixfilter since its their anyway, but the prefixquery wont use it.

asfimport commented 15 years ago

Michael McCandless (@mikemccand) (migrated from JIRA)

Excellent!

Looks like PrefixGenerator got lost. Why not have it just get its bits by creating a PrefixQuery and calling getFilter().getDocIdSet()?

WildcardFilter is still in the patch.

Maybe fix getDocIdSet to not call bits() (since that's 2-pass, it's slower). In fact, does that Filter even need to implement bits? Can we throw a not implemented exception, since it's a new Filter?

asfimport commented 15 years ago

Mark Miller (@markrmiller) (migrated from JIRA)

No worries, I'm still in the thick of it - plenty of little strings to wrap up, just wanted to make sure I was on the right track.

asfimport commented 15 years ago

Mark Miller (@markrmiller) (migrated from JIRA)

Ahhhh...satisfying bringing order to all of that. It kaleidoscopes up quite nicely. Thanks for the help Michael.

Back to just needing:

review possibly some comments/comment changes tests for the constantscore side of the queries

asfimport commented 15 years ago

Michael McCandless (@mikemccand) (migrated from JIRA)

Looks good! I think we're almost done here...

Shouldn't we deprecate ConstantScoreRangeQuery, and maybe change it to simply subclass RangeQuery and set useConstantScoreRewrite to true?

Should we match the ctor of ConstantScoreRangeQuery (field, lower, upper, inclLower, inclUpper) with RangeQuery? Right now you have Term for lower & upper, not String.

Can we rename MultiTermQuery.isUseConstantScoreRewrite() to getUseConstantScoreRewrite()? Or maybe set/getConstantScoreRewrite()?

There seems to be a leftover "abstract class TermGenerator implements IdGenerator" at the bottom of Prefixquery.java.

Small whitespace issue: you need to insert a space in "if(..." in a few places.

asfimport commented 15 years ago

Mark Miller (@markrmiller) (migrated from JIRA)

Okay, almost there. Making constantscorerange query extend rangequery exposed a small problem though - there is scoring difference when using multiserver vs single searcher. The simple rangequery test in the multisearcher scoring test fails. The single searcher returns a score of 1 and the multi 1.4 - still trying to figure out how the heck that is happening...there is not a lot going on, so its prob obvious, but not sticking out at the moment...

asfimport commented 15 years ago

Mark Miller (@markrmiller) (migrated from JIRA)

Okay, this should be good. Thanks for the guidance.

asfimport commented 15 years ago

Michael McCandless (@mikemccand) (migrated from JIRA)

I'm having trouble applying the latest patch – I get failed Hunks, and, patch can't find one of the source files:

patching file contrib/highlighter/src/test/org/apache/lucene/search/highlight/HighlighterTest.java
patching file src/java/org/apache/lucene/queryParser/QueryParser.java
patching file src/java/org/apache/lucene/search/ConstantScoreRangeQuery.java
Hunk `#2` FAILED at 29.
1 out of 2 hunks FAILED -- saving rejects to file src/java/org/apache/lucene/search/ConstantScoreRangeQuery.java.rej
patching file src/java/org/apache/lucene/search/MultiTermQuery.java
patching file src/java/org/apache/lucene/search/PrefixFilter.java
patching file src/java/org/apache/lucene/search/PrefixQuery.java
patching file src/java/org/apache/lucene/search/PrefixTermEnum.java
patching file src/java/org/apache/lucene/search/RangeFilter.java
patching file src/java/org/apache/lucene/search/RangeQuery.java
Hunk `#1` FAILED at 21.
1 out of 3 hunks FAILED -- saving rejects to file src/java/org/apache/lucene/search/RangeQuery.java.rej
patching file src/java/org/apache/lucene/search/RangeTermEnum.java
patching file src/java/org/apache/lucene/search/WildcardQuery.java
patching file src/java/org/apache/lucene/search/WildcardTermEnum.java
patching file src/test/org/apache/lucene/queryParser/TestQueryParser.java
patching file src/test/org/apache/lucene/search/TestConstantScoreRangeQuery.java
can't find file to patch at input line 1952
Perhaps you used the wrong -p or --strip option?
The text leading up to this was:
--------------------------
|Index: src/test/org/apache/lucene/search/TestMultiTermConstantScore.java
|===================================================================
|--- src/test/org/apache/lucene/search/TestMultiTermConstantScore.java  (revision 710055)
|+++ src/test/org/apache/lucene/search/TestMultiTermConstantScore.java  (working copy)
--------------------------
File to patch: 

Did you turn off keyword expansion in your local svn checkout? That seems to explain the first to failures (if I manually undo the $Id$ keyword expansion then the patch applies cleanly).

On the 3rd failure, it seems like you renamed TestConstantScoreRangeQuery.java to TestMultiTermConstantScore.java, and then modified it? This confuses my patch client.

I think I can work through these, but I sure hope this is the last patch ;)

asfimport commented 15 years ago

Mark Miller (@markrmiller) (migrated from JIRA)

Don't work through it. I'm just using eclipse And did nothing
different, but don't waste your time. Let me figure it out. I'll make
sure it works without eclipse before putting it up.

That test was just an eclipse name refactor.

Sorry about all the patches by the way. Tends to be my style :) I tried to indicate which were merely progress patches (if someone has the interest, it allows my ,any wrong directions to be pointed out sooner) and which were something that should be considered more final review worthy. I am sure I can do a better job on that though. Certainly not my intention to waste anyones time, especially when that time could be better spent improving Lucene ;)

On Nov 8, 2008, at 12:34 PM, "Michael McCandless (JIRA)" <jira@apache.org

asfimport commented 15 years ago

Michael McCandless (@mikemccand) (migrated from JIRA)

Sorry about all the patches by the way.

Whoa, no need to apologize – I in fact love all the patches and fast feedback/iterations. That's definitely the right way to do it. When I said "...but I sure hope this is the last patch" I meant "if I need to manually fix the patch failures". Since you are fixing the patch problems (thanks!) then it's all good – keep the patch iterations going.

asfimport commented 15 years ago

Mark Miller (@markrmiller) (migrated from JIRA)

Okay, fingers crossed, this is good.

Eclipse was displaying the $id correctly, but it wasn't going out in the patch. Also, I didn't realize refactoring can cause issues like that, so I deleted and created a new class. Personally, I think thats fine, but let me know if you disagree. I also tested applying the patch to a fresh checkout and ran the tests. Should be good.

I think I'm done unless (probably when) you prompt something else / something I missed.

asfimport commented 15 years ago

Mark Miller (@markrmiller) (migrated from JIRA)

This time granting license to ASF

asfimport commented 15 years ago

Michael McCandless (@mikemccand) (migrated from JIRA)

Excellent – patch applies perfectly and all tests pass. I'll review the changes. Thanks Mark!

asfimport commented 15 years ago

Mark Miller (@markrmiller) (migrated from JIRA)

It looks like termContainsWildcard; and the check code can come out of wildcardquery. I pretty much got around that by making wc enum not blow up when the term doesn't contain a wildcard. There are other ways to do it that would keep the check if we prefer (a simple override of rewrite on wildcardquery would be the other option I like). I just liked the idea of WCTermEnum not blowing up like WCQuery doesnt blow up.

asfimport commented 15 years ago

Michael McCandless (@mikemccand) (migrated from JIRA)

OK I made some changes to the patch – I think we are real close now:

Mark can you look this over and see if the changes are OK? Thanks.

asfimport commented 15 years ago

Mark Miller (@markrmiller) (migrated from JIRA)

+1 on all of your changes.

Still had an issue with your patch with the two $id classes - I have no clue whats up with that, so I just manually paste the expanded line into my patch.

Removed some unused imports in new/modified classes and further added/tweaked appropriate javadoc.

Changed RangeQuery tests to use non deprecated constructors, exposed a null pointer issue with RangeQuery.equals and hashcode, fixed.

asfimport commented 15 years ago

Michael McCandless (@mikemccand) (migrated from JIRA)

Excellent! OK I made a few more changes, new patch attached. This is like playing ping-pong ;):

asfimport commented 15 years ago

Mark Miller (@markrmiller) (migrated from JIRA)

looks great!

Sorry to do this, but I had to touch up the javadoc on RangeQuery as it no longer requires at least one term. I promise thats my last!

Oh yeah, and co credit would probably be more accurate.

asfimport commented 15 years ago

Michael McCandless (@mikemccand) (migrated from JIRA)

OK this version looks good! I plan to commit sometime later today... thanks Mark!

asfimport commented 15 years ago

Michael McCandless (@mikemccand) (migrated from JIRA)

Committed revision 712890.

Thanks Mark!

asfimport commented 15 years ago

Michael Busch (migrated from JIRA)

Since this was committed I get a test failure when I run 'ant -Dtag=lucene_2_4_0 test-tag'

[junit] Testcase: testRange(org.apache.lucene.queryParser.TestQueryParser):FAILED
[junit] null
[junit] junit.framework.AssertionFailedError
[junit]     at org.apache.lucene.queryParser.TestQueryParser.testRange(TestQueryParser.java:418)

Seems like this might have broken backwards-compatibility. I haven't looked into the reason yet.

asfimport commented 15 years ago

Michael Busch (migrated from JIRA)

Hmm the culprit is this line in 2.4's TestQueryParser.testRange():

assertTrue(getQuery("[ a TO z]", null) instanceof ConstantScoreRangeQuery);
asfimport commented 15 years ago

Mark Miller (@markrmiller) (migrated from JIRA)

Now it returns a RangeQuery that is set for constant score mode. Its behavior is back compatible with ConstantScoreRangeQuery, but we no longer return it...

asfimport commented 15 years ago

Michael Busch (migrated from JIRA)

Yeah, I understand the reason... and easy "fix" would be in QueryParser#newRangeQuery():

  protected Query newRangeQuery(String field, String part1, String part2, boolean inclusive) {
    RangeQuery query;  

    if (constantScoreRewrite) {
      // TODO: remove in Lucene 3.0
      query = new ConstantScoreRangeQuery(field, part1, part2, inclusive, inclusive, rangeCollator);
    } else {
      query = new RangeQuery(field, part1, part2, inclusive, inclusive, rangeCollator);
    }
    query.setConstantScoreRewrite(constantScoreRewrite);
    return query;
  }

Since ConstantScoreRangeQuery extends RangeQuery this works and should not change any behavior, right? I tried it out, "ant test-core test-tag" passes with this change.

asfimport commented 15 years ago

Mark Miller (@markrmiller) (migrated from JIRA)

Looks like a fine solution to me. Cool tests.

asfimport commented 15 years ago

Michael Busch (migrated from JIRA)

Thanks Mark.

OK, I'll make a patch and commit it later today.

asfimport commented 15 years ago

Michael McCandless (@mikemccand) (migrated from JIRA)

Thanks Michael!

asfimport commented 15 years ago

Mark Miller (@markrmiller) (migrated from JIRA)

For kicks, I tried implementing the MultiTermGenerator so that it returns an inner class DocSetId iterator that iterates over matching terms/docs, thinking we could save some speed by avoiding the OpenBitSet creation, population, read. Instead, even in tests that didn't involve any skip-to (not sure what/if any of these cases actually do, still wrapping my head around that), the non generate the bitset first approach was a tad slower rather than faster. Don't know exactly why I the moment, but thought I'd chronicle the attempt for future optimizers. Probably all the stuff that happens between next calls slows down the enumeration quite a bit, so that doing it all at once saves enough time to make up for the OpenBitSet stuff.

asfimport commented 15 years ago

Michael Busch (migrated from JIRA)

Here's the patch. 'ant test test-tag' passes now.

I'll commit shortly.

asfimport commented 15 years ago

Michael Busch (migrated from JIRA)

Committed revision 713225.

asfimport commented 15 years ago

Mark Miller (@markrmiller) (migrated from JIRA)

Another one: looks like as we were iterating back and forth, after I changed the constructors to the non deprecated, we lost the ability of the deprecated constructors to accept null properly. Doh! I'll try and fix it tonight.

asfimport commented 15 years ago

Mark Miller (@markrmiller) (migrated from JIRA)

Fixes the null problem with deprecated RangeQuery constructors and adds unit test for it.

asfimport commented 15 years ago

Michael McCandless (@mikemccand) (migrated from JIRA)

Thanks Mark – I just committed that.