apache / lucene

Apache Lucene open-source search software
https://lucene.apache.org/
Apache License 2.0
2.63k stars 1.02k forks source link

Do not expose full-fledged scorers in LeafCollector.setScorer [LUCENE-6228] #7290

Closed asfimport closed 6 years ago

asfimport commented 9 years ago

Currently LeafCollector.setScorer takes a Scorer, which I don't like because several methods should never be called in the context of a Collector (like nextDoc or advance).

I think it's even more trappy for methods that might seem to work in some particular cases but will not work in the general case, like getChildren which will not work if you have a specialized BulkScorer or iterating over positions which will not work if you are in a MultiCollector and another leaf collector consumes positions too.

So I think we should restrict what can be seen from a collector to avoid such traps.


Migrated from LUCENE-6228 by Adrien Grand (@jpountz), resolved Sep 04 2018 Attachments: LUCENE-6228.patch (versions: 6) Linked issues:

asfimport commented 9 years ago

Adrien Grand (@jpountz) (migrated from JIRA)

Simple patch that adds a new Score interface with only two methods freq() and score(). This interface is extended by Scorer and FakeScore (previously FakeScorer) and is exposed in collectors instead of Scorer. I think it's better?

asfimport commented 9 years ago

Ryan Ernst (@rjernst) (migrated from JIRA)

Having both Score and Scorer is really confusing...can we have the interface still be Scorer and the abstract class be something else...maybe ScoringEnum?

asfimport commented 9 years ago

Adrien Grand (@jpountz) (migrated from JIRA)

This is a great suggestion as it makes collectors easier to migrate and is also consistent with the fact that Scorer(ScoringEnum in the patch) extends PostingsEnum. Here is a new patch.

asfimport commented 9 years ago

Alan Woodward (@romseygeek) (migrated from JIRA)

I like this, particularly if I'm about to start adding more position-aware methods to ScoringEnum.

I think FilterScorer should be FilterScoringEnum now? Similarly with AssertingScorer.

The javadocs on LeafCollector should still refer to Scorer, not ScoringEnum

Can we get rid of the FakeScorers in the taxonomy and expression modules, and in the index Sorter now?

asfimport commented 9 years ago

Adrien Grand (@jpountz) (migrated from JIRA)

Thanks Alan, I applied your suggestions: here is a new patch.

asfimport commented 9 years ago

Robert Muir (@rmuir) (migrated from JIRA)

another maybe more intuitive option is Scorable (interface) and keep Scorer as is, just implementing the interface.

asfimport commented 9 years ago

Ryan Ernst (@rjernst) (migrated from JIRA)

another maybe more intuitive option is Scorable (interface) and keep Scorer as is, just implementing the interface.

I like that idea also. I was only trying to suggest something that would have some larger difference in name between the two.

asfimport commented 6 years ago

Alan Woodward (@romseygeek) (migrated from JIRA)

Here's a patch implementing @rmuir's idea of a Scorable interface. This seems like the least intrusive way of doing this?

As this changes LeafCollector, which is a fairly heavily-used part of the API, I think this should only apply to master

asfimport commented 6 years ago

Robert Muir (@rmuir) (migrated from JIRA)

The casts back to Scorer in some of the tests are concerning, because if code just does this then i'm not certain how much this abstraction is helping.

Can we try something like moving getChildren from Scorer to this interface (it can still have default empty implementation with a default method)? Then maybe the casts would be removed but also ChildScorer would be restricted (like mentioned in the description of the issue) from doing illegal stuff.

We should also check performance, its a bit scary to add interfaces here (maybe an abstract class is an alternative) and we should ensure that this patch isn't causing hotspot to go crazy because of FakeScorer and Scorer impls.

asfimport commented 6 years ago

Alan Woodward (@romseygeek) (migrated from JIRA)

Can we try something like moving getChildren from Scorer to this interface

This just moves the existing problem with getChildren() to the new interface though? I think I'd rather leave it on Scorer and make it accessible through IndexSearcher or something similar in a follow-up issue. I can change the tests in question to not use Collectors.

We should also check performance

Will run it through luceneutil and see what happens

asfimport commented 6 years ago

Robert Muir (@rmuir) (migrated from JIRA)

This just moves the existing problem with getChildren() to the new interface though? I think I'd rather leave it on Scorer and make it accessible through IndexSearcher or something similar in a follow-up issue. I can change the tests in question to not use Collectors.

I don't think it does, i think it fixes the API? Instead of the following on Scorer:

  Collection<ChildScorer> getChildren();
  class ChildScorer {
    Scorer child;
    relationship;
 }

we should have on Scorable:

  Collection<ChildScorer> getChildren();
  class ChildScorer {
    Scorable child;
    relationship;
 }

This means that getChildren becomes "safe" and you can't call methods on Scorer that you shouldnt be invoking. As far as I understand, its part of the motivation behind this whole issue.

asfimport commented 6 years ago

Alan Woodward (@romseygeek) (migrated from JIRA)

The issue with getChildren() is that you can't access it if a BulkScorer is being used - it's a separate problem to this issue, which is preventing people calling iterator() from a Collector. Or rather, it's the same sort of problem, in that we should prevent people from calling getChildren() from a Collector, because they might be holding a FakeScorer which can't support it.

asfimport commented 6 years ago

Robert Muir (@rmuir) (migrated from JIRA)

I don't agree, sorry. We don't need to prevent anyone from calling getChildren() on a FakeScorer, it can just return empty which is the default implementation.

What we need to prevent is someone calling advance() based on the results of getChildren, that is key: and honestly if we can't prevent that with this issue, then the abstraction isn't right, its the whole point of doing this :)

Stuff like what getChildren() on a FakeScorer or BulkScorer does, that stuff is harmless, and those things are the separate issues that we shouldnt worry about.

asfimport commented 6 years ago

Alan Woodward (@romseygeek) (migrated from JIRA)

What we need to prevent is someone calling advance() based on the results of getChildren

But the current patch does that by just not having getChildren() on Scorable. And just as you can get round that by casting, you could cast ChildScorer.child to a Scorer as well if you wanted to.

I guess the disagreement is over what the getChildren() API is actually for. The tests are using it to check that constant score wrappers and the like are still giving us the correct root scorer, so I think casting is OK there?

Another idea I had was to replace getChildren() entirely with a visitor API, and have a method like IndexSearcher.explain(). So you call searcher.visitScorers(Query q, int doc, ScorerVisitor visitor) and it will then visit all the scorers that are positioned on that doc. Maybe I should open another issue to deal with that first?

asfimport commented 6 years ago

Adrien Grand (@jpountz) (migrated from JIRA)

I think this Scorable interface (or base class) is already great progress as it hides advance and other methods that collectors should never call. I tend to agree with Alan that Scorable should not have getChildren either but since this proves controversial, maybe we can keep it on Scorable for now and discuss it in a separate issue?

Another idea I had was to replace getChildren() entirely with a visitor API

For the record, this used to be a visitor API before, see #4403. As far as I'm concerned, I tend to like exposing well-known java collections better than a custom visitor API. Visitors also introduce some challenges in that case due to the tree nature of scorers. For instance should it only expose leaves or internal nodes as well? And what relationship should it assign to a Scorer that is a SHOULD clause of a BooleanQuery which is itself a MUST clause of another BooleanQuery?

asfimport commented 6 years ago

Michael McCandless (@mikemccand) (migrated from JIRA)

Can we access the Query from this new interface?

And I think we do need getChildren here? How else would a collector find all child scorers?

asfimport commented 6 years ago

Alan Woodward (@romseygeek) (migrated from JIRA)

Can we access the Query from this new interface?

Not without casting back to Scorer. But this should let us clean up the hack that lets Scorers take null for their Weights, by making the various FakeScorer implementations extend Scorable instead. That way, if you want to get the parent Query, you check to see if the Scorable is a Scorer instance, and if so, cast and call getWeight().getQuery(). Right now you can get NPE if you've been passed a bulk scorer.

asfimport commented 6 years ago

Michael McCandless (@mikemccand) (migrated from JIRA)

Thanks @romseygeek; that makes sense.

asfimport commented 6 years ago

Alan Woodward (@romseygeek) (migrated from JIRA)

Here's an updated patch:

Will try and run a luceneutil benchmark against this

asfimport commented 6 years ago

Adrien Grand (@jpountz) (migrated from JIRA)

+1 to this patch, let's also make sure we add a note about this to the MIGRATE.txt? It would be a nice addition to Lucene 8 since we added more methods that shouldn't get called from a collector, like advanceShallow and getMaxScore. However we'd need to put setMinCompetitiveScore on Scorable.

asfimport commented 6 years ago

Alan Woodward (@romseygeek) (migrated from JIRA)

Patch updated to latest master.

As a follow-up, we can remove FakeScorer and enforce Scorer.getWeight() to never be null, but this patch is big enough already.

asfimport commented 6 years ago

Alan Woodward (@romseygeek) (migrated from JIRA)

Here's the benchmark results:

TaskQPS baseline      StdDevQPS my_modified_version      StdDev                Pct diff
              AndHighMed      917.55      (4.4%)      873.99      (7.0%)   -4.7% ( -15% -    6%)
              HighPhrase     1045.07      (8.3%)     1009.68      (8.2%)   -3.4% ( -18% -   14%)
             AndHighHigh      937.55      (5.9%)      909.34      (6.1%)   -3.0% ( -14% -    9%)
               LowPhrase     1286.87      (5.6%)     1250.09      (5.8%)   -2.9% ( -13% -    8%)
                 Prefix3      350.81     (11.8%)      341.21     (11.9%)   -2.7% ( -23% -   23%)
                  Fuzzy1      413.16      (4.2%)      402.44      (7.1%)   -2.6% ( -13% -    9%)
        HighSloppyPhrase     1385.38      (6.3%)     1356.77      (8.8%)   -2.1% ( -16% -   13%)
                Wildcard      984.38      (5.2%)      967.81      (3.4%)   -1.7% (  -9% -    7%)
               OrHighMed      884.32      (5.2%)      870.05      (7.3%)   -1.6% ( -13% -   11%)
       HighTermMonthSort     3629.00      (5.4%)     3579.19      (5.3%)   -1.4% ( -11% -    9%)
               MedPhrase     1021.25      (5.7%)     1007.93      (5.1%)   -1.3% ( -11% -   10%)
                 MedTerm     5695.18     (11.1%)     5631.21      (9.8%)   -1.1% ( -19% -   22%)
                  IntNRQ     1679.89      (3.5%)     1661.20      (2.9%)   -1.1% (  -7% -    5%)
                 LowTerm     7382.84      (9.0%)     7327.15     (10.5%)   -0.8% ( -18% -   20%)
         LowSloppyPhrase     1641.31      (7.8%)     1630.47      (6.2%)   -0.7% ( -13% -   14%)
              AndHighLow     3016.41     (11.5%)     3001.77     (12.5%)   -0.5% ( -21% -   26%)
                PKLookup      216.00      (2.0%)      215.21      (2.7%)   -0.4% (  -4% -    4%)
                 Respell      391.05      (6.2%)      390.08      (9.3%)   -0.2% ( -14% -   16%)
             LowSpanNear     1536.24      (7.1%)     1539.79      (5.6%)    0.2% ( -11% -   13%)
            HighSpanNear      722.00      (4.4%)      723.73      (5.5%)    0.2% (  -9% -   10%)
   HighTermDayOfYearSort     1217.56      (7.7%)     1224.19     (10.5%)    0.5% ( -16% -   20%)
              OrHighHigh      529.66      (8.8%)      532.79      (5.6%)    0.6% ( -12% -   16%)
                HighTerm     4502.17      (8.6%)     4535.50      (7.3%)    0.7% ( -13% -   18%)
             MedSpanNear     1194.29      (7.2%)     1206.77      (6.5%)    1.0% ( -11% -   15%)
               OrHighLow     1261.48      (9.2%)     1285.74      (6.7%)    1.9% ( -12% -   19%)
                  Fuzzy2       83.02     (12.8%)       84.71     (15.4%)    2.0% ( -23% -   34%)
         MedSloppyPhrase     1006.45     (11.2%)     1049.12     (11.2%)    4.2% ( -16% -   29%)
asfimport commented 6 years ago

Adrien Grand (@jpountz) (migrated from JIRA)

I think FilterScorable should not delegate setMinCompetitiveScore and keep the default impl (like FilterScorer). Otherwise if you have a FilterScorable that overrides score() but not setMinCompetitiveScore() this would be a bug? +1 otherwise, and +1 to prevent null weights in the Scorer constructor as a follow-up.

asfimport commented 6 years ago

ASF subversion and git services (migrated from JIRA)

Commit 910a0231f6fc668426056e31d43e293248ff5ce1 in lucene-solr's branch refs/heads/master from @romseygeek https://git-wip-us.apache.org/repos/asf?p=lucene-solr.git;h=910a023

LUCENE-6228: Add Scorable class and make LeafCollector.setScorer() take Scorable

asfimport commented 6 years ago

Alan Woodward (@romseygeek) (migrated from JIRA)

I changed FilterScorable accordingly.  Thanks for the reviews!

asfimport commented 6 years ago

Steven Rowe (@sarowe) (migrated from JIRA)

git bisect says that the first failing commit for the below reproducing failures from https://jenkins.thetaphi.de/job/Lucene-Solr-master-MacOSX/4813/ is 910a0231f on this issue:

Checking out Revision d997e8b4a2717e000437953c9d66ad1f84229afd (refs/remotes/origin/master)
[...]
   [junit4]   2> NOTE: reproduce with: ant test  -Dtestcase=TestCollapseQParserPlugin -Dtests.method=testFieldValueCollapseWithNegativeMinMax -Dtests.seed=D4530DFC3968CF99 -Dtests.slow=true -Dtests.locale=my -Dtests.timezone=Europe/Vilnius -Dtests.asserts=true -Dtests.file.encoding=UTF-8
   [junit4] ERROR   0.04s J0 | TestCollapseQParserPlugin.testFieldValueCollapseWithNegativeMinMax <<<
   [junit4]    > Throwable #1: java.lang.RuntimeException: Exception during query
   [junit4]    >    at __randomizedtesting.SeedInfo.seed([D4530DFC3968CF99:F8DDBC49F4316038]:0)
   [junit4]    >    at org.apache.solr.SolrTestCaseJ4.assertQ(SolrTestCaseJ4.java:917)
   [junit4]    >    at org.apache.solr.SolrTestCaseJ4.assertQ(SolrTestCaseJ4.java:877)
   [junit4]    >    at org.apache.solr.search.TestCollapseQParserPlugin.testFieldValueCollapseWithNegativeMinMax(TestCollapseQParserPlugin.java:255)
   [junit4]    >    at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
   [junit4]    >    at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
   [junit4]    >    at java.base/jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
   [junit4]    >    at java.base/java.lang.reflect.Method.invoke(Method.java:564)
   [junit4]    >    at java.base/java.lang.Thread.run(Thread.java:844)
   [junit4]    > Caused by: java.lang.NullPointerException
   [junit4]    >    at org.apache.solr.search.CollapsingQParserPlugin$IntFloatStrategy.collapse(CollapsingQParserPlugin.java:2312)
   [junit4]    >    at org.apache.solr.search.CollapsingQParserPlugin$IntFieldValueCollector.collect(CollapsingQParserPlugin.java:1170)
   [junit4]    >    at org.apache.lucene.search.MatchAllDocsQuery$1$1.score(MatchAllDocsQuery.java:62)
   [junit4]    >    at org.apache.lucene.search.BulkScorer.score(BulkScorer.java:39)
   [junit4]    >    at org.apache.lucene.search.IndexSearcher.search(IndexSearcher.java:652)
   [junit4]    >    at org.apache.lucene.search.IndexSearcher.search(IndexSearcher.java:443)
   [junit4]    >    at org.apache.solr.search.SolrIndexSearcher.buildAndRunCollectorChain(SolrIndexSearcher.java:217)
   [junit4]    >    at org.apache.solr.search.SolrIndexSearcher.getDocListNC(SolrIndexSearcher.java:1622)
   [junit4]    >    at org.apache.solr.search.SolrIndexSearcher.getDocListC(SolrIndexSearcher.java:1438)
   [junit4]    >    at org.apache.solr.search.SolrIndexSearcher.search(SolrIndexSearcher.java:585)
   [junit4]    >    at org.apache.solr.handler.component.QueryComponent.doProcessUngroupedSearch(QueryComponent.java:1436)
   [junit4]    >    at org.apache.solr.handler.component.QueryComponent.process(QueryComponent.java:375)
   [junit4]    >    at org.apache.solr.handler.component.SearchHandler.handleRequestBody(SearchHandler.java:298)
   [junit4]    >    at org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandlerBase.java:199)
   [junit4]    >    at org.apache.solr.core.SolrCore.execute(SolrCore.java:2541)
   [junit4]    >    at org.apache.solr.util.TestHarness.query(TestHarness.java:338)
   [junit4]    >    at org.apache.solr.util.TestHarness.query(TestHarness.java:320)
   [junit4]    >    at org.apache.solr.SolrTestCaseJ4.assertQ(SolrTestCaseJ4.java:891)
   [junit4]    >    ... 40 more
[...]
   [junit4]   2> NOTE: reproduce with: ant test  -Dtestcase=TestRandomCollapseQParserPlugin -Dtests.method=testRandomCollpaseWithSort -Dtests.seed=D4530DFC3968CF99 -Dtests.slow=true -Dtests.locale=ru -Dtests.timezone=America/Jujuy -Dtests.asserts=true -Dtests.file.encoding=UTF-8
   [junit4] ERROR   0.16s J0 | TestRandomCollapseQParserPlugin.testRandomCollpaseWithSort <<<
   [junit4]    > Throwable #1: java.lang.RuntimeException: BUG using params: sort=enum_dv+asc,+collation_en_primary_last+asc,+id+desc&rows=200&fq={!collapse+size%3D7162+nullPolicy%3Dexpand+field%3Dfloat_dv_first+sort%3D'score+desc,+enum_last+asc,+id+asc'} + q=*:*&fl=id,float_dv_first
   [junit4]    >    at __randomizedtesting.SeedInfo.seed([D4530DFC3968CF99:D1EE7E3512C0FC6A]:0)
   [junit4]    >    at org.apache.solr.search.TestRandomCollapseQParserPlugin.testRandomCollpaseWithSort(TestRandomCollapseQParserPlugin.java:202)
   [junit4]    >    at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
   [junit4]    >    at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
   [junit4]    >    at java.base/jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
   [junit4]    >    at java.base/java.lang.reflect.Method.invoke(Method.java:564)
   [junit4]    >    at java.base/java.lang.Thread.run(Thread.java:844)
   [junit4]    > Caused by: org.apache.solr.client.solrj.SolrServerException: org.apache.solr.client.solrj.SolrServerException: java.lang.NullPointerException
   [junit4]    >    at org.apache.solr.client.solrj.embedded.EmbeddedSolrServer.request(EmbeddedSolrServer.java:233)
   [junit4]    >    at org.apache.solr.client.solrj.SolrRequest.process(SolrRequest.java:194)
   [junit4]    >    at org.apache.solr.client.solrj.SolrClient.query(SolrClient.java:942)
   [junit4]    >    at org.apache.solr.client.solrj.SolrClient.query(SolrClient.java:957)
   [junit4]    >    at org.apache.solr.search.TestRandomCollapseQParserPlugin.testRandomCollpaseWithSort(TestRandomCollapseQParserPlugin.java:158)
   [junit4]    >    ... 38 more
   [junit4]    > Caused by: org.apache.solr.client.solrj.SolrServerException: java.lang.NullPointerException
   [junit4]    >    at org.apache.solr.client.solrj.embedded.EmbeddedSolrServer.checkForExceptions(EmbeddedSolrServer.java:301)
   [junit4]    >    at org.apache.solr.client.solrj.embedded.EmbeddedSolrServer.request(EmbeddedSolrServer.java:192)
   [junit4]    >    ... 42 more
   [junit4]    > Caused by: java.lang.NullPointerException
   [junit4]    >    at org.apache.solr.search.CollapsingQParserPlugin$IntSortSpecStrategy.collapse(CollapsingQParserPlugin.java:2513)
   [junit4]    >    at org.apache.solr.search.CollapsingQParserPlugin$IntFieldValueCollector.collect(CollapsingQParserPlugin.java:1170)
   [junit4]    >    at org.apache.lucene.search.MatchAllDocsQuery$1$1.score(MatchAllDocsQuery.java:62)
   [junit4]    >    at org.apache.lucene.search.BulkScorer.score(BulkScorer.java:39)
   [junit4]    >    at org.apache.lucene.search.IndexSearcher.search(IndexSearcher.java:652)
   [junit4]    >    at org.apache.lucene.search.IndexSearcher.search(IndexSearcher.java:443)
   [junit4]    >    at org.apache.solr.search.SolrIndexSearcher.buildAndRunCollectorChain(SolrIndexSearcher.java:217)
   [junit4]    >    at org.apache.solr.search.SolrIndexSearcher.getDocListNC(SolrIndexSearcher.java:1622)
   [junit4]    >    at org.apache.solr.search.SolrIndexSearcher.getDocListC(SolrIndexSearcher.java:1438)
   [junit4]    >    at org.apache.solr.search.SolrIndexSearcher.search(SolrIndexSearcher.java:585)
   [junit4]    >    at org.apache.solr.handler.component.QueryComponent.doProcessUngroupedSearch(QueryComponent.java:1436)
   [junit4]    >    at org.apache.solr.handler.component.QueryComponent.process(QueryComponent.java:375)
   [junit4]    >    at org.apache.solr.handler.component.SearchHandler.handleRequestBody(SearchHandler.java:298)
   [junit4]    >    at org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandlerBase.java:199)
   [junit4]    >    at org.apache.solr.core.SolrCore.execute(SolrCore.java:2541)
   [junit4]    >    at org.apache.solr.client.solrj.embedded.EmbeddedSolrServer.request(EmbeddedSolrServer.java:191)
   [junit4]    >    ... 42 more
asfimport commented 6 years ago

Alan Woodward (@romseygeek) (migrated from JIRA)

Thanks Steve, I'll take a look tomorrow.  I think I know what this is already.

asfimport commented 6 years ago

ASF subversion and git services (migrated from JIRA)

Commit 3b1a335fb3dfc9d4f085740d30095ff07f48f25c in lucene-solr's branch refs/heads/master from @romseygeek https://git-wip-us.apache.org/repos/asf?p=lucene-solr.git;h=3b1a335

LUCENE-6228: Missed refactoring of CollapsingQParserPlugin delegating collector

asfimport commented 6 years ago

Alan Woodward (@romseygeek) (migrated from JIRA)

This should be fixed now, sorry for the noise.