apache / lucene

Apache Lucene open-source search software
https://lucene.apache.org/
Apache License 2.0
2.63k stars 1.02k forks source link

FilteredQuery ignores boost [LUCENE-698] #1773

Closed asfimport closed 17 years ago

asfimport commented 17 years ago

Filtered query ignores it's own boost.


Migrated from LUCENE-698 by Yonik Seeley (@yonik), resolved May 30 2007 Attachments: lucene-698.patch

asfimport commented 17 years ago

Yonik Seeley (@yonik) (migrated from JIRA)

I just commited hashCode() and equals() changes to take boost into account so that generic tests in QueryUtils.check(query) can pass.

One should arguably be able to set the boost on any query clause, so I'm leaving this open since I think scoring probably ignores the boost too.

asfimport commented 17 years ago

Michael Busch (migrated from JIRA)

With this patch FilteredQuery takes the boost into account for scoring. It includes a test that fails with the trunk version and passes with this patch.

This patch also removes one test from TestSimpleExplanations: testFQ7(). These tests check if the score and the value in the explanation are the same. testFQ7() in particular verifies this for a FilterQuery with a boost of 0. But with a boost of 0 the score and the explanation has the value NaN, which makes assertEquals() fail. So I believe this is a incorrect test case. We just didn't notice it before because FilteredQuery did not take the boost into account.

All unit tests pass.

asfimport commented 17 years ago

Chris M. Hostetter (@hossman) (migrated from JIRA)

i think the test class and test case testFQ7 in particular are "correct" in the sense that they try to verify every conceivable permutation of stock query times has an explanation that matches it's score ... the problem may just be in the CheckHits.ExplanationAsserter class ... perhaps it should test if either the score or the explanation value are NaN before comparing them, and fail if only one is NaN or if neither is NaN but they are not equal)

(after all: if the score is NaN, then the explanation should be NaN as well)

asfimport commented 17 years ago

Michael Busch (migrated from JIRA)

> perhaps it should test if either the score or the explanation value are NaN > before comparing them, and fail if only one is NaN or if neither is NaN but > they are not equal)

Thanks for reviewing, Hoss! You are right, we could do that and I was actually thinking about it already. The problem is if I make this fix than testFQ7 fails for TestSimpleExplanationsOfNonMatches because it is assumed that all non matching docs have a score of 0.0. I can easily change that, so that non matching docs can either have a score of 0.0 or NaN but I was not sure if we want that, because other scoring bugs resulting in a score of NaN (which we will hopefully never have) wouldn't be noticed then anymore.

The reason why I argued that testFQ7 is an invalid test case is that it would fail for any other query with a boost set to 0. Ironically we have this test only for FilteredQuery, the only query class that ignores the boost, which made it pass in the past.

asfimport commented 17 years ago

Chris M. Hostetter (@hossman) (migrated from JIRA)

Ahhhh ... yes, looking back at the comments in #1635 I remember now: I originally thought boosts of 0.0 were legal for all queries, and then discovered i was wrong, and removed a bunch of tests – but i clearly missed this one because it wasn't failing.

we should go ahead and remove the test ... but we should probably also fix FilteredQuery so that a boost of 0 produces some other result then just a NaN score (either an exception, or a score of 0) since as you say: NaN scores are bad.

asfimport commented 17 years ago

Michael Busch (migrated from JIRA)

> but we should probably also fix FilteredQuery so that a boost of 0 > produces some other result then just a NaN score (either an exception, > or a score of 0) since as you say: NaN scores are bad.

TermQuery actually behaves the same way. If boost is zero, then sumOfSquaredWeights() returns zero as well, resulting in a queryNorm of Infinity (due to a div by zero if DefaultSimilarity is used). Then it multiplies boost and queryNorm and 0*Infinity=NaN.

asfimport commented 17 years ago

Michael Busch (migrated from JIRA)

Maybe Query.setBoost() should throw an IllegalArgumentException in case the value is zero?

asfimport commented 17 years ago

Chris M. Hostetter (@hossman) (migrated from JIRA)

Hmmm.. didn't realize that. I withdrawal all previous comments. patch seems fine to me.

asfimport commented 17 years ago

Chris M. Hostetter (@hossman) (migrated from JIRA)

whoops .. comment collision.

i think the patch as it stands is fine for this issue .. but we may want another issue to hollisticly question NaN as a score.

asfimport commented 17 years ago

Doug Cutting (@cutting) (migrated from JIRA)

> Maybe Query.setBoost() should throw an IllegalArgumentException in case the value is zero?

FYI, Nutch uses Query.setBoost(0.0f) to add clauses which affect the set of results but not their ranking. In particular, it uses this to automatically convert query clauses into filters, so that query clauses like "lang:en" can be implemented as cached filters. Note that not all such clauses are so optimized.

http://svn.apache.org/viewvc/lucene/nutch/trunk/src/java/org/apache/nutch/searcher/LuceneQueryOptimizer.java?view=markup

asfimport commented 17 years ago

Michael Busch (migrated from JIRA)

> FYI, Nutch uses Query.setBoost(0.0f) to add clauses which affect the > set of results but not their ranking. In particular, it uses this to > automatically convert query clauses into filters, so that query > clauses like "lang:en" can be implemented as cached filters. Note > that not all such clauses are so optimized.

Thanks for the hint, Doug. OK, I understand how you use boost=0.0f in Nutch. Quite cool and elegant idea actually!

I guess then throwing an IllegalArgumentException in case boost=0 would break this. The question remains if we should fix the scorers to never return NaN. Hmm, I'm not completely sure how to do this. Maybe DefaultSimilarity.queryNorm() should return 0 instead of Infinity in case sumOfSquaredWeights is 0. But then with custom Similarity implemenations we could still end up getting NaN.

A different solution of course is to fix it in the scorers itself, to return a score of 0 in case boost is 0. But then we'd have to add checks in the score() and explain() methods, which might be a performance overhead.

So I'm not sure if we should "fix" this at all considering these difficulties and the fact that nobody complained (I think?) about the NaN so far.

Anyway, I'll go ahead and commit LUCENE-698 since this NaN problem is a separate issue and not only happing for the FilteredQuery.

asfimport commented 17 years ago

Michael Busch (migrated from JIRA)

Patch committed.

asfimport commented 17 years ago

Doug Cutting (@cutting) (migrated from JIRA)

> If boost is zero, then > sumOfSquaredWeights() returns zero as well, resulting in a > queryNorm of Infinity (due to a div by zero if DefaultSimilarity is > used). Then it multiplies boost and queryNorm and 0*Infinity=NaN.

The bug here to me seems that queryNorm is Infinity. A boost of zero has a reasonable interpretation (don't influence scoring), but I don't see how a queryNorm of Infinity is ever useful. So perhaps we can remove the NaN by modifying the default implementation of queryNorm to return 1.0 instead of Infinity when passed zero. Would that cause any harm?

asfimport commented 17 years ago

Yonik Seeley (@yonik) (migrated from JIRA)

> the default implementation of queryNorm to return 1.0 instead of Infinity when passed zero.

That seems like it should be fine, esp since Similarity.queryNorm is only called at the top level when creating a weight.

asfimport commented 17 years ago

Michael Busch (migrated from JIRA)

> So perhaps we can remove the NaN by modifying the default implementation of > queryNorm to return 1.0 instead of Infinity when passed zero. Would that > cause any harm?

Yes I believe this should work, too. This would prevent the NaN score when DefaultSimilarity is used. It will be the responsibility of people who implement their own Similarity then to take care of this in a similar way.

I'll open a new issue for fixing the DefaultSimilarity.