apache / lucene

Apache Lucene open-source search software
https://lucene.apache.org/
Apache License 2.0

Optimizations to TopScoreDocCollector and TopFieldCollector [LUCENE-1593] #2667

Closed asfimport closed 15 years ago

asfimport commented 15 years ago

This is a spin-off of #2649 and proposes to optimize TSDC and TFC code to remove unnecessary checks. The plan is:

  1. Ensure that IndexSearcher returns segments in increasing doc Id order, instead of ordering them by numDocs().
  2. Change TSDC and TFC's code to not use the doc id as a tie breaker. New docs will always have larger ids and therefore cannot compete.
  3. Pre-populate HitQueue with sentinel values in TSDC (score = Float.NEGATIVE_INFINITY) and remove the check if reusableSD == null.
  4. Also move to use "changing top" and then call adjustTop(), in case we update the queue.
  5. Some methods in Sort explicitly add SortField.FIELD_DOC as a "tie breaker" for the last SortField. But doing so should not be necessary (since we already break ties by docID), and is in fact less efficient (once the above optimization is in).
  6. Investigate PQ - can we deprecate insert() and have only insertWithOverflow()? Add an addDummyObjects method which will populate the queue without "arranging" it, just storing the objects in the array (this can be used to pre-populate sentinel values)?
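
Items 2 to 4 can be sketched roughly as follows. This is a minimal, self-contained illustration and not Lucene's HitQueue or TopScoreDocCollector: the names `Hit`, `SentinelHitQueue` and `SentinelQueueDemo` are made up for the example. It assumes docs are collected in increasing doc id order, so `collect` only compares scores, while the queue's internal ordering still breaks ties by doc id.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical stand-in for a scored hit; not Lucene's ScoreDoc. */
final class Hit {
    int doc;
    float score;
    Hit(int doc, float score) { this.doc = doc; this.score = score; }
}

/** Fixed-size min-heap, pre-populated with sentinels so it is always "full". */
final class SentinelHitQueue {
    private final Hit[] heap; // heap[0] is the weakest hit currently kept

    SentinelHitQueue(int size) {
        heap = new Hit[size];
        // All sentinels compare equal, so the plain array is already a valid heap:
        // nothing needs to be "arranged" while pre-populating.
        for (int i = 0; i < size; i++) {
            heap[i] = new Hit(Integer.MAX_VALUE, Float.NEGATIVE_INFINITY);
        }
    }

    /** Lower score is "less"; on equal scores the larger doc id is "less". */
    private static boolean lessThan(Hit a, Hit b) {
        if (a.score != b.score) return a.score < b.score;
        return a.doc > b.doc;
    }

    /** Restores heap order after the caller changed heap[0] in place. */
    void adjustTop() {
        int i = 0;
        Hit node = heap[0];
        while (true) {
            int child = 2 * i + 1;
            if (child >= heap.length) break;
            int right = child + 1;
            if (right < heap.length && lessThan(heap[right], heap[child])) child = right;
            if (!lessThan(heap[child], node)) break;
            heap[i] = heap[child];
            i = child;
        }
        heap[i] = node;
    }

    /** In-order collection: no null check, no "is the queue full" check, no doc id tie-break. */
    void collect(int doc, float score) {
        Hit top = heap[0];
        if (score <= top.score) return; // a tie loses, because this doc id is larger
        top.doc = doc;                  // reuse the top object instead of allocating
        top.score = score;
        adjustTop();
    }

    /** Real hits, best first; sentinels that were never replaced are dropped. */
    List<Hit> results() {
        List<Hit> out = new ArrayList<>();
        for (Hit h : heap) {
            if (h.score != Float.NEGATIVE_INFINITY) out.add(h);
        }
        out.sort((a, b) -> lessThan(a, b) ? 1 : (lessThan(b, a) ? -1 : 0));
        return out;
    }
}

final class SentinelQueueDemo {
    public static void main(String[] args) {
        SentinelHitQueue pq = new SentinelHitQueue(3);
        float[] scores = {1f, 3f, 3f, 2f, 5f, 3f};
        for (int doc = 0; doc < scores.length; doc++) pq.collect(doc, scores[doc]);
        for (Hit h : pq.results()) System.out.println("doc=" + h.doc + " score=" + h.score);
    }
}
```

One consequence of the sentinel approach in this sketch is that a real hit whose score is negative infinity can never enter the queue.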

I will post a patch as well as some perf measurements as soon as I have them.


Migrated from LUCENE-1593 by Shai Erera (@shaie), 1 vote, resolved May 09 2009 Attachments: LUCENE-1593.patch (versions: 5), PerfTest.java

asfimport commented 15 years ago

Shai Erera (@shaie) (migrated from JIRA)

But actually: the thing calling scoresDocsInOrder will in fact only be calling that method if it intends to use the scorer as a toplevel scorer

Are you sure? The way I understand it IndexSearcher will call weight.getQuery().scoresDocsInOrder() in the search methods that create a Collector, in order to know whether to create an "in-order" Collector or "out-of-order" Collector. At this point it does not know whether it will use the scorer as a top-level or not. Unless we duplicate the logic of doSearch into those methods (i.e. if there is a filter know it'll be used as a top-level Collector), but I really don't like to do that.

I still think there are two issues here that need to be addressed separately:

  1. Allowing IS as well as any Collector-creating code to create the right Collector instance - in/out-of order. That is achievable by adding scoresDocsInOrder() to Query, defaulting to false (for back-compat) and override in all Query implementations, where it makes sense. For BQ I think it should remain false, with a TODO to change in 3.0 (see second bullet).
  2. Clearly separate between BS and BS2, i.e. have BW create one of them explicitly without wrapping or anything. That is achievable, I think, by adding topScorer() to Weight and call it from IS. Then in BW we do whatever BS2.score(Collector) does today, hopefully we can inline it in BW. But that can happen only in 3.0. We then change scoresDocsInOrder to return false only if BQ was set to return docs out of order as well as there are 0 required scorers and < 32 prohibited scorers (the same logic as in BS2.score(Collector)).

BTW, #2 above does not mean we cannot optimize initCountingSumScorer - if we add start() to DISI then in BS2 we can override it to initialize CSS, and calling start() from IS.doSearch before it starts iterating. In score(Collector) it will check if it's initialized only once, so it should be ok?

What do you think?

asfimport commented 15 years ago

Michael McCandless (@mikemccand) (migrated from JIRA)

The way I understand it IndexSearcher will call weight.getQuery().scoresDocsInOrder() in the search methods that create a Collector, in order to know whether to create an "in-order" Collector or "out-of-order" Collector. At this point it does not know whether it will use the scorer as a top-level or not. Unless we duplicate the logic of doSearch into those methods (i.e. if there is a filter know it'll be used as a top-level Collector), but I really don't like to do that.

Yeah you're right, it is in two separate places today.

Though since we are reworking how filters are applied, at that point it may very well be in one place.

Allowing IS as well as any Collector-creating code to create the right Collector instance - in/out-of order. That is achievable by adding scoresDocsInOrder() to Query, defaulting to false (for back-compat) and override in all Query implementations, where it makes sense. For BQ I think it should remain false, with a TODO to change in 3.0 (see second bullet).

OK let's tentatively move forwards with Query.scoresDocsInOrder.

Clearly separate between BS and BS2, i.e. have BW create one of them explicitly without wrapping or anything. That is achievable, I think, by adding topScorer() to Weight and call it from IS. Then in BW we do whatever BS2.score(Collector) does today, hopefully we can inline it in BW. But that can happen only in 3.0. We then change scoresDocsInOrder to return false only if BQ was set to return docs out of order as well as there are 0 required scorers and < 32 prohibited scorers (the same logic as in BS2.score(Collector)).

OK let's slate this for 3.0, then.

BTW, #2 above does not mean we cannot optimize initCountingSumScorer - if we add start() to DISI then in BS2 we can override it to initialize CSS, and calling start() from IS.doSearch before it starts iterating. In score(Collector) it will check if it's initialized only once, so it should be ok?

OK let's move forwards with this too?

Phew!

asfimport commented 15 years ago

Shai Erera (@shaie) (migrated from JIRA)

Hmmm .. I think DISI.start() breaks back-compat, since if we optimize the scorers to not check if they're initialized in next() and skipTo(), then you'll get NPE (or something else will happen). Even if we fix IndexSearcher to call start(), someone may still iterate on a Scorer privately, or in custom code (I know I do).

I think this change should go into 3.0 as well, as it's a wider change than I thought initially. It affects more than just BS2, but all of its internal classes, as well as some other Scorers. Also, I see in several scorers different TODOs to get rid of that init() check in next() and skipTo(), and so this smells like a wider change.

Since it breaks back-compat and the change will affect not just BS/BS2, I prefer to leave that optimization out of them for now, and fix it all in 3.0, including the other scorers.

So we have two issues for 3.0:

  1. Introduce start() in DISI and change all the classes that extend DISI to take advantage of it, as well as all the code that uses DISI to call start().
  2. Introduce topScorer() to Weight, and take advantage of it where it makes sense (currently we know of BW), and change all the code that calls scorer.score(Collector) to request a topScorer() from Weight.

Since Scorer extends DISI these often look to be the same usage, but I think they are different, with different use cases. What do you think?

asfimport commented 15 years ago

Shai Erera (@shaie) (migrated from JIRA)

Ok - so after I posted the last comment I took my dog out and thought about this some more. At first, I thought that I was wrong because BS and BS2 are package-private and so we can still add start() to DISI and take advantage of it in BS and BS2 only, under the assumption that they cannot be instantiated. In 3.0 we'll do the wider change in DISIs and Scorers.

But then I realized that someone can do this today:

BooleanQuery bq = new BooleanQuery();

// add some queries/clauses ...

Scorer s = bq.weight(searcher).scorer(searcher.getIndexReader());
// s is of type BS2
while (s.next()) {
  // something ...
}

So I'm now convinced this breaks back-compat.
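
The break can be reproduced in miniature. The sketch below does not use Lucene classes; `DocIterator`, `LazyInitIterator`, `StartRequiredIterator` and `BackCompatDemo` are hypothetical names, and `start()` is the proposed hook under discussion, not an existing method. It shows a caller written against the old contract working with a lazily initialized iterator and failing once the init check is removed from next().

```java
/** Hypothetical iterator contract, loosely modelled on the next()/doc() style above. */
abstract class DocIterator {
    /** Proposed hook; the default does nothing so existing subclasses still compile. */
    void start() {}
    abstract boolean next();
    abstract int doc();
}

/** Current style: every next() pays for an "am I initialized?" check. */
final class LazyInitIterator extends DocIterator {
    private final int[] source;
    private int[] docs; // built on first use
    private int pos = -1;
    LazyInitIterator(int[] source) { this.source = source; }
    @Override boolean next() {
        if (docs == null) docs = source.clone(); // the check the optimization would remove
        return ++pos < docs.length;
    }
    @Override int doc() { return docs[pos]; }
}

/** Optimized style: no check in next(), so calling start() first becomes mandatory. */
final class StartRequiredIterator extends DocIterator {
    private final int[] source;
    private int[] docs;
    private int pos = -1;
    StartRequiredIterator(int[] source) { this.source = source; }
    @Override void start() { docs = source.clone(); }
    @Override boolean next() { return ++pos < docs.length; } // NPE if start() was skipped
    @Override int doc() { return docs[pos]; }
}

final class BackCompatDemo {
    /** A caller written before start() existed: it just iterates. */
    static int countOldStyle(DocIterator it) {
        int n = 0;
        while (it.next()) n++;
        return n;
    }

    public static void main(String[] args) {
        int[] docs = {1, 5, 9};
        System.out.println("lazy init: " + countOldStyle(new LazyInitIterator(docs)));
        try {
            countOldStyle(new StartRequiredIterator(docs));
        } catch (NullPointerException e) {
            System.out.println("start() required: old caller fails with NPE");
        }
    }
}
```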

Right?

asfimport commented 15 years ago

Michael McCandless (@mikemccand) (migrated from JIRA)

So I'm now convinced this breaks back-compat.

Woops, yes it does. Grr.

The thing is... I'm not sure we can make such a change even in 3.0. Ie, all that's "special" about 3.0 is we get to remove deprecated APIs, and begin using Java 1.5 language features. I'm not sure if a sudden change in runtime behavior ("you must call Scorer.init() before calling next or skipTo") is allowed.

Maybe we could make a Weight.initializableScorer, that returns a Scorer that requires init() be first called. But since Weight is an interface, we can't change it. So maybe we can make a new abstract class called AbstractWeight (for lack of a better name), implementing Weight. We would deprecate Weight (and remove it at 3.0). We can make a new "get me a Scorer" API in AbstractWeight, eg, require that Scorers returned from there must have "init" called first, pass in an "isTopScorer" boolean, etc. Query would have an "abstractWeight()" method, emulated by wrapping the "weight()" method. Could something crazy like this work....? Maybe we should break out the two goals: this [new] goal is simply to migrate away from Weight as interface to AbstractWeight as abstract class, then step 2 is to make the optimizations we are discussing here.
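
A rough sketch of the wrapping idea, under stated assumptions: none of these types are Lucene's. `LegacyWeight` stands in for the existing interface, `AbstractWeightSketch` for the proposed abstract class, and `wrap` for what an emulated abstractWeight() might do. The `isTopScorer` flag is the one floated above; a wrapped legacy implementation simply ignores it.

```java
/** Minimal scorer contract for the sketch. */
interface SimpleScorer {
    boolean next();
    int doc();
}

/** Stand-in for the existing interface: adding a method here would break every implementor. */
interface LegacyWeight {
    SimpleScorer scorer();
}

/** Stand-in for the proposed abstract class: later additions can ship with default bodies. */
abstract class AbstractWeightSketch {
    abstract SimpleScorer scorer(boolean isTopScorer);

    /** Emulates the new API on top of an old implementation. */
    static AbstractWeightSketch wrap(final LegacyWeight legacy) {
        return new AbstractWeightSketch() {
            @Override
            SimpleScorer scorer(boolean isTopScorer) {
                // A legacy weight cannot act on the hint, so it is ignored.
                return legacy.scorer();
            }
        };
    }
}

/** Trivial scorer over a fixed list of doc ids. */
final class ArrayScorer implements SimpleScorer {
    private final int[] docs;
    private int pos = -1;
    ArrayScorer(int[] docs) { this.docs = docs; }
    @Override public boolean next() { return ++pos < docs.length; }
    @Override public int doc() { return docs[pos]; }
}

final class WeightMigrationDemo {
    /** New-style caller: only ever sees the abstract class. */
    static int countMatches(AbstractWeightSketch weight) {
        SimpleScorer s = weight.scorer(true);
        int n = 0;
        while (s.next()) n++;
        return n;
    }

    public static void main(String[] args) {
        LegacyWeight old = () -> new ArrayScorer(new int[] {2, 3, 8});
        System.out.println(countMatches(AbstractWeightSketch.wrap(old)));
    }
}
```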

This is like running in a potato sack race!

asfimport commented 15 years ago

Shai Erera (@shaie) (migrated from JIRA)

I'm not sure we can make such a change even in 3.0. Ie, all that's "special" about 3.0 is we get to remove deprecated APIs, and begin using Java 1.5 language features.

I'd like to discuss that in a separate thread, where it will have more visibility ... I'm a bit puzzled by what 3.0 means, but it should be discussed outside the scope of this issue.

So maybe we can make a new abstract class called AbstractWeight ...

I think we should have an issue handling interfaces deprecation in general for 2.9, since just deprecating Weight does not solve it. You'd have to deprecate Searchable.search* methods which accept Weight, but Searchable is an interface, so you might want to deprecate it entirely and create an AbstractSearchable? That I think also deserves its own thread, don't you think?

When I thought about the ambiguity that we have in BS2 between score(Collector) and next()/skipTo() and the proposal to have topScorer() and scorer(), I thought that perhaps we can make the following change (we'd have to solve the Weight-interface problem first):

  1. Define on Weight a score(IndexReader, Collector) API which will be called instead of the topScorer() proposal.
  2. Keep the scorer(IndexReader) API - this should be used for iterating over the Scorer.
  3. Make Scorer.score(Collector) package-private so that it can still be used by Weight.score(IndexReader, Collector), but not by anyone else. That will effectively remove that API from Scorer, but still keep the impl there so we make the least amount of changes to the current Scorers and Weights.
    • We should document that it should not be used, even from inside Lucene's code unless there's a really good reason. Everyone, including Lucene should use the Weight.score(IndexReader, Collector) API.

That should present a clean and clear API, i.e. topScorer() and scorer() might not be understood well, and we'd need to document their usage clearly, and we don't have a way to enforce that once topScorer() is called, score(Collector) will be the only method that's used and not next()/skipTo().

asfimport commented 15 years ago

Michael McCandless (@mikemccand) (migrated from JIRA)

I think we should have an issue handling interfaces deprecation in general for 2.9, since just deprecating Weight does not solve it. You'd have to deprecate Searchable.search* methods which accept Weight, but Searchable is an interface, so you might want to deprecate it entirely and create an AbstractSearchable? That I think also deserves its own thread, don't you think?

Yes, and this presumably depends on the outcome of the first "how much can change in 3.0" thread.

I thought that perhaps we can make the following change

Once again I'm lacking clarity.... there are many related possible improvements to searching:

I'm not yet sure what steps to take now (and how) vs later...

asfimport commented 15 years ago

Michael McCandless (@mikemccand) (migrated from JIRA)

Another "things we should improve about the Scorer API":

Enrich Scorer API to optionally provide more details on positions that caused a match to occur.

This would improve highlighting (#2596) since we'd know exactly why a match occurred (single source) rather than trying to reverse-engineer the match.

It'd also address a number of requests over time by users on "how can I get details on why this doc matched?".

I think if we did this, the *SpanQuery would be able to share much more w/ their "normal" counterparts; this was discussed @ http://www.nabble.com/Re%3A-Make-TermScorer-non-final-p22577575.html. Ie we would have a single TermQuery, just as efficient as the one today, but it would expose a "getMatches" (say) that enumerates all positions that matched.

Then, if one wanted these details for every hit in the topN, we could make an IndexReader impl that wraps TermVectors for the docs in the topN (since TermVectors are basically a single-doc inverted index), run the query on it, and request the match details per doc.

asfimport commented 15 years ago

Shai Erera (@shaie) (migrated from JIRA)

Patch includes:

  1. New scoresDocsInOrder to Query
    • Default to false
    • Override in extensions to return true, except in BQ which still returns false until we resolve how BQ is used explicitly (top-score vs. not). In some queries that delegate the work, I used the delegatee or return true if all sub-queries return true.
  2. Changed TopFieldCollector and TopScoreDocCollector to take a docsScoredInOrder parameter and create the appropriate instance (breaking ties by doc Id or not).
  3. Added TestTopScoreDocCollector and a test case to TestSort which test out-of-order collection (they trigger the use of BooleanScorer, though whether document collection happens truly out of order I cannot tell).
  4. Updates to CHANGES
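
As an illustration of items 1 and 2, here is a minimal sketch with hypothetical stand-in classes (`QuerySketch` and friends are not the patched Lucene classes). It assumes the rules described above: default false, true for queries known to score in order, and "true only if all sub-queries return true" for queries that delegate.

```java
import java.util.Arrays;
import java.util.List;

abstract class QuerySketch {
    /** Conservative default: an unknown subclass is assumed to score out of order. */
    boolean scoresDocsInOrder() { return false; }
}

/** A query whose scorer always advances through doc ids in increasing order. */
final class TermQuerySketch extends QuerySketch {
    @Override boolean scoresDocsInOrder() { return true; }
}

/** Keeps the default (false), as the patch does for BQ. */
final class BooleanQuerySketch extends QuerySketch {}

/** A query that delegates to sub-queries is in order only if all of them are. */
final class CompositeQuerySketch extends QuerySketch {
    private final List<QuerySketch> subs;
    CompositeQuerySketch(QuerySketch... subs) { this.subs = Arrays.asList(subs); }
    @Override boolean scoresDocsInOrder() {
        for (QuerySketch q : subs) {
            if (!q.scoresDocsInOrder()) return false;
        }
        return true;
    }
}

final class CollectorChoice {
    /** Mirrors the docsScoredInOrder parameter: decides which collector variant to build. */
    static String describe(QuerySketch q) {
        return q.scoresDocsInOrder()
            ? "in-order collector (no doc id tie-break)"
            : "out-of-order collector (ties broken by doc id)";
    }

    public static void main(String[] args) {
        System.out.println(describe(new TermQuerySketch()));
        System.out.println(describe(new BooleanQuerySketch()));
        System.out.println(describe(
            new CompositeQuerySketch(new TermQuerySketch(), new BooleanQuerySketch())));
    }
}
```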

All tests pass, including test-tag. BTW, the patch also includes the fix to TestSort in tag, but without the fix for MultiSearcher and ParallelMultiSearcher on tag as I'm not sure if we should back-port the fix as well.

asfimport commented 15 years ago

Shai Erera (@shaie) (migrated from JIRA)

Another "things we should improve about the Scorer API".

Not in this issue though, right?

I like the idea of having Scorer be able to tell why a doc was matched. But I think we should make sure that if a user is not interested in this information, then he should not incur any overhead by it, such as aggregating information in-memory or doing any extra computations. Something like we've done for TopFieldCollector with tracking document scores and maxScore.

asfimport commented 15 years ago

Michael McCandless (@mikemccand) (migrated from JIRA)

Not in this issue though, right?

Right: I'm back into the mode of throwing out all future improvements I know of, to help guide us in picking the right next step. These would all be done in separate issues, and many of them would not be done "today" but still we should try not to preclude them for "tomorrow".

I like the idea of having Scorer be able to tell why a doc was matched. But I think we should make sure that if a user is not interested in this information, then he should not incur any overhead by it, such as aggregating information in-memory or doing any extra computations. Something like we've done for TopFieldCollector with tracking document scores and maxScore.

Exactly, and I think/hope this'd be achievable.

asfimport commented 15 years ago

Michael McCandless (@mikemccand) (migrated from JIRA)

BTW, I wonder if instead of "Query.scoresDocsInOrder" we should allow one to ask the Query for either/or?

Ie, a BooleanQuery can produce a scorer that scores docs in order; it's just lower performance.

Sure, our top doc collectors can accept "in order" or "out of order" collection, but perhaps one has a collector out there that must get the docs in order, so shouldn't we be able to ask the Query to give us docs "always in order" or "doesn't have to be in order"?

Also: I wonder if we would ever want to allow for non-top-scorer usage that does not return docs in order? Ie, next() would be allowed to yield docs out of order. Obviously this is not allowed today... but we are now mixing "top vs not-top" with "out-of-order vs in-order", where maybe they should be independent? But I'm not sure in practice when one would want to use an out-of-order non-top iterator.

asfimport commented 15 years ago

Shai Erera (@shaie) (migrated from JIRA)

BTW, I wonder if instead of "Query.scoresDocsInOrder" we should allow one to ask the Query for either/or?

I'm afraid this might mean a larger change. What will TermQuery do? Today it returns true, and does not have any implementation that can return docs out-of-order. So what should TQ do when outOfOrderScorer is called? Just return what inOrderScorer returns, or throw an exception?

That there might be a Collector out there that requires docs in order is not something I think we should handle. Reason is, there wasn't any guarantee until today that docs are returned in order. So how can someone write a Collector which has a hard assumption on that? Maybe only if he used a Query which he knows always scores in order, such as TQ, but then I don't think this guy will have a problem since TQ returns true.

And if that someone needs docs in order, but the query at hand returns docs out of order, then I'd say tough luck :)? I mean, maybe with BQ we can ensure in/out of order on request, but if there will be a query which returns docs in random, or based on other criteria which causes it to return out of order, what good will forcing it to return docs in order do? I'd say that you should just use a different query in that case?

But I'm not sure in practice when one would want to use an out-of-order non-top iterator.

I agree. I think that iteration on Scorer is dictated to be in order because it extends DISI with next() and skipTo() methods which don't imply in any way they can return something out of order (besides next() maybe, but it will be hard to use such next() with a skipTo()).

asfimport commented 15 years ago

Yonik Seeley (@yonik) (migrated from JIRA)

docsInOrder() would be an implementation detail (and could actually vary per reader or per segment) and should be on the Scorer/DocIdSetIterator rather than the Query or Weight, right?

asfimport commented 15 years ago

Shai Erera (@shaie) (migrated from JIRA)

docsInOrder() would be an implementation detail ... should be on the Scorer/DocIdSetIterator rather than the Query or Weight, right?

There are two problems with that:

  1. IndexSearcher creates the Collector before it obtains a Scorer. Therefore all it has at hand is the Weight. Since Weight is an interface, we can't change it, so I added it to Query with a default of false.
  2. A user might want to know what Collector implementation to create before calling search(Query, Collector), and I don't think we should ask the users to call query.weight.scorer() just to obtain that information.

So I understand why at the end it's a Scorer attribute, but Scorers really belong to Queries and so this can be viewed also as a Query attribute.

asfimport commented 15 years ago

Yonik Seeley (@yonik) (migrated from JIRA)

Query objects are relatively abstract. Weights are created only with respect to a Searcher, and Scorers are created only from within that context with respect to an IndexReader. It really seems like we should maintain this separation and avoid putting implementation details into the Query object (or the Weight object for that matter).

A user might want to know what Collector implementation to create before calling search(Query, Collector)

Having to create a certain type of collector sounds error prone.
Why not reverse the flow of information and tell the Weight.scorer() method if an out-of-order scorer is acceptable via some flags or a context object? This is also not backward compatible because Weight is an interface, so perhaps this optimization will just have to wait.

asfimport commented 15 years ago

Michael McCandless (@mikemccand) (migrated from JIRA)

What will TermQuery do?

Oh: it's fine to return an in-order scorer, always. It's just that if a Query wants to use an out-of-order scorer, it should also implement an in-order one. Ie, there'd be a "mating process" to match the scorer to the collector.

That there might be a Collector out there that requires docs in order is not something I think we should handle. Reason is, there wasn't any guarantee until today that docs are returned in order. So how can someone write a Collector which has a hard assumption on that? Maybe only if he used a Query which he knows always scores in order, such as TQ, but then I don't think this guy will have a problem since TQ returns true.

And if that someone needs docs in order, but the query at hand returns docs out of order, then I'd say tough luck :)? I mean, maybe with BQ we can ensure in/out of order on request, but if there will be a query which returns docs in random, or based on other criteria which causes it to return out of order, what good will forcing it to return docs in order do? I'd say that you should just use a different query in that case?

Well... we have to be careful. EG say we had some great optimization for iterating over matches to PhraseQuery, but it returned docs out of order. In that case, I think we'd preserve the in-order Scorer as well?

But I'm not sure in practice when one would want to use an out-of-order non-top iterator.

One case might be a random access filter AND'd w/ a BooleanQuery. In that case I could ask for a BooleanScorer to return a DISI whose next is allowed to return docs out of order, because 1) my filter doesn't care and 2) my collector doesn't care.

Though, we are thinking about pushing random access filters all the way down to the TermScorer, so this example isn't realistic in that future... but it still feels like "out of order iteration" and "I'm top scorer or not" are orthogonal concepts.

asfimport commented 15 years ago

Michael McCandless (@mikemccand) (migrated from JIRA)

One further optimization can be enabled if we can properly "mate" out-of-orderness between Scorer & Collector: BooleanScorer could be automatically used when appropriate.

Today, one must call "BooleanQuery.setAllowDocsOutOfOrder" which is rather silly (it's very much an "under the hood" detail of how the Scorer interacts w/ the Collector). The vast majority of time it's Lucene that creates the collector, and so now that we can create Collectors that either do or do not care if docs arrive out of order, we should allow BooleanScorer when we can.

Though that means we have two ways to score a BooleanQuery:

We'd need to test which is most performant (I'm guessing the 2nd one).

So maybe we should in fact add an "acceptsDocsOutOfOrder" to Collector.

asfimport commented 15 years ago

Michael McCandless (@mikemccand) (migrated from JIRA)

This is also not backward compatible because Weight is an interface, so perhaps this optimization will just have to wait.

Yonik would you suggest we migrate Weight to be an abstract class instead? (This is also being discussed in a separate thread on java-dev, if you want to respond there...).

asfimport commented 15 years ago

Michael McCandless (@mikemccand) (migrated from JIRA)

IndexSearcher creates the Collector before it obtains a Scorer. Therefore all it has at hand is the Weight. Since Weight is an interface, we can't change it, so I added it to Query with a default of false.

In early iterations on #2557, we allowed Collector.setNextReader to return a new Collector on the possibility that a new segment might require different collector.

We could consider going back to that... and allowing the builtin collectors to receive a Scorer on creation, which they could interact with to figure out in/out of order types of issues. We could then also enrich setNextReader a bit to also receive a Scorer, so that if somehow the Scorer for the next segment switched to be in-order vs out-of-order, the Collector could properly "respond".

Or we could require "homogeneity" for Scorer across all segments (which'd be quite a bit simpler).

Why not reverse the flow of information and tell the Weight.scorer() method if an out-of-order scorer is acceptable via some flags or a context object? This is also not backward compatible because Weight is an interface, so perhaps this optimization will just have to wait.

I tentatively like this approach, ie add an API to Collector for it to declare if it can handle out-of-order collection, and then ask for the right Scorer.

But still internal creation of Collectors could go both ways, and so we should retain the freedom to optimize (the BooleanScorer example above).

asfimport commented 15 years ago

Michael McCandless (@mikemccand) (migrated from JIRA)

Yonik does Solr have any Scorers that iterate on docs out of order? Or is BooleanScorer the only one we all know about?

asfimport commented 15 years ago

Michael McCandless (@mikemccand) (migrated from JIRA)

BooleanScorer could be automatically used when appropriate

If we do this (and I think we should – good perf gains, though I haven't tested just how good, recently), then we should deprecate setAllowDocsOutOfOrder in favor of Weight.scorer(boolean allowDocsOutOfOrder). And make it clear that internally Lucene may ask for either scorer, depending on the collector.
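
The "mating" between collector and scorer could look roughly like this. It is a sketch with hypothetical types (`CollectorSketch`, `ScorerSketch`, `BooleanWeightSketch`), not the eventual Lucene API: the collector declares whether it tolerates out-of-order docs, and the search code passes that answer to scorer(boolean allowDocsOutOfOrder). The "out-of-order" branch stands in for BooleanScorer and delivers docs in reverse purely so the difference is visible.

```java
import java.util.ArrayList;
import java.util.List;

abstract class CollectorSketch {
    final List<Integer> seen = new ArrayList<>();
    /** True if this collector produces correct results whatever the arrival order. */
    abstract boolean acceptsDocsOutOfOrder();
    void collect(int doc) { seen.add(doc); }
}

interface ScorerSketch {
    void score(CollectorSketch collector);
}

final class BooleanWeightSketch {
    private final int[] matches; // matching doc ids, in increasing order
    BooleanWeightSketch(int... matches) { this.matches = matches; }

    /** The flag is a permission: returning an in-order scorer is always acceptable. */
    ScorerSketch scorer(boolean allowDocsOutOfOrder) {
        if (allowDocsOutOfOrder) {
            // Stand-in for the faster out-of-order scorer; reverse order is only a marker.
            return c -> {
                for (int i = matches.length - 1; i >= 0; i--) c.collect(matches[i]);
            };
        }
        // Stand-in for the in-order scorer.
        return c -> {
            for (int doc : matches) c.collect(doc);
        };
    }
}

final class MatingDemo {
    /** The searcher asks the collector, then requests a matching scorer. */
    static void search(BooleanWeightSketch weight, CollectorSketch collector) {
        weight.scorer(collector.acceptsDocsOutOfOrder()).score(collector);
    }

    static CollectorSketch collector(final boolean outOfOrderOk) {
        return new CollectorSketch() {
            @Override boolean acceptsDocsOutOfOrder() { return outOfOrderOk; }
        };
    }

    public static void main(String[] args) {
        BooleanWeightSketch weight = new BooleanWeightSketch(1, 4, 7);
        CollectorSketch strict = collector(false);
        CollectorSketch relaxed = collector(true);
        search(weight, strict);
        search(weight, relaxed);
        System.out.println("strict saw " + strict.seen + ", relaxed saw " + relaxed.seen);
    }
}
```

With this shape, a setter such as setAllowDocsOutOfOrder is no longer needed on the query: the decision is made per search from what the collector can handle.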

asfimport commented 15 years ago

Marvin Humphrey (migrated from JIRA)

I made Weight a subclass of Query and all of a sudden Searcher method signatures got easier to manage.

PS: Is this a good place to discuss why having rambling conversations in the bug tracker is a bad idea, or should I open a new issue?

asfimport commented 15 years ago

Yonik Seeley (@yonik) (migrated from JIRA)

Yonik does Solr have any Scorers that iterate on docs out of order? Or is BooleanScorer the only one we all know about?

Nope. BooleanScorer is the only one I know about. And it's sort of special too... it's not like BooleanScorer can accept out-of-order scorers as sub-scorers itself - the ids need to be delivered in the range of the current bucket. IMO custom out-of-order scorers aren't supported in Lucene.

asfimport commented 15 years ago

Shai Erera (@shaie) (migrated from JIRA)

add an API to Collector for it to declare if it can handle out-of-order collection, and then ask for the right Scorer.

Maybe instead add docsOrderSupportedMode() which returns IN_ORDER, OUT_OF_ORDER, DONT_CARE? I.e., instead of a boolean allow a Collector to say "I don't really care" (like Mike has pointed out, I think, somewhere above) and let the Scorer creation code decide which one to create in case it knows any better. I.e., if we know that BS performs better than BS2, and we get a Collector saying DONT_CARE, we can always return BS. Unless we assume that OUT_OF_ORDER covers DONT_CARE as well, in which case we can leave it as returning boolean and document that if a Collector can support OUT_OF_ORDER, it should always say so, giving the Scorer creator code a chance to decide what is the best Scorer to return.

In IndexSearcher we can then:

  1. Where Collector is given as argument, ask it about orderness and create the appropriate Scorer.
  2. Where we create our own Collector (i.e. TFC and TSDC) decide on our own what is better. Maybe always ask out-of-order? That way a Query which only supports in-order, without any optimization for out-of-order, can return that in-order scorer. I didn't think of it initially, but Mike is right - every in-order scorer is also an out-of-order scorer, so this should be fine.

I like the approach of deprecating Weight and creating an abstract class, though that requires deprecating Searchable and creating an AbstractSearchable as well. Weight can be wrapped with an AbstractWeightWrapper and passed to the AbstractWeight methods (much like we do with AbstractHitCollector from LUCENE-1575), defaulting its scorer(inOrder) method to call scorer()?

This I guess should be done in the scope of that issue, or I revert the changes done to Query (adding scoresDocsInOrder()), but keep those done to TFC and TSDC, and make that optimization in a different issue, which will handle Weight/Searchable and the rest of the changes proposed here?

asfimport commented 15 years ago

Michael McCandless (@mikemccand) (migrated from JIRA)

Unless we assume that OUT_OF_ORDER covers DONT_CARE as well

I think this is the case? Ie a boolean suffices.

For Collector that boolean means "can accept docs out of order". For the Scorer it means "might deliver docs out of order".

Where Collector is given as argument, ask it about orderness and create the appropriate Scorer.

Good. And default Collector.acceptsDocsOutOfOrder should return false.

Where we create our own Collector (i.e. TFC and TSDC) decide on our own what is better. Maybe always ask out-of-order? That way a Query which only supports in-order, without any optimization for out-of-order, can return that in-order scorer. I didn't think of it initially, but Mike is right - every in-order scorer is also an out-of-order scorer, so this should be fine.

I think this is good, though we should 1) ask the Scorer for an out-of-order Scorer, but then once we get the resulting scorer back we should 2) ask that instance if in fact it will ever return out-of-order (all except BS will not), and then 3) pick a collector that's optimized for in-order collection if the scorer always returns in-order docs.

The big problem is the fact that we get Scorers per segment, but Collector once. Actually it may not be a problem: maybe for the first segment we do the logic above, but then for subsequent segments we explicitly ask for an in-order Scorer if the first one was in-order? Ie we can enforce homogeneity ourselves?

This would require deferring creating the Collector until we've seen the Scorer for the first segment.
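
The three steps plus the homogeneity rule might be sketched like this. All names are hypothetical (`SegmentScorer`, `SegmentWeight`, `DeferredCollectorChoice`), and the sketch only decides which collector variant to use; it assumes one weight-like object per segment and that asking for an in-order scorer is always honoured.

```java
interface SegmentScorer {
    /** True if this scorer might ever deliver a doc id lower than a previous one. */
    boolean mayDeliverOutOfOrder();
}

interface SegmentWeight {
    /** The flag is a permission, not a demand: an in-order scorer is always a valid answer. */
    SegmentScorer scorer(boolean allowDocsOutOfOrder);
}

final class DeferredCollectorChoice {
    /** Returns true if the in-order collector variant can serve the whole search. */
    static boolean useInOrderCollector(SegmentWeight[] segments) {
        if (segments.length == 0) return true;
        // 1) Ask the first segment for an out-of-order scorer.
        SegmentScorer first = segments[0].scorer(true);
        // 2) Ask the returned instance whether it will in fact go out of order.
        boolean inOrder = !first.mayDeliverOutOfOrder();
        // 3) The collector is chosen from that answer; later segments are then
        //    asked for in-order scorers if the first one was in-order.
        for (int i = 1; i < segments.length; i++) {
            SegmentScorer s = segments[i].scorer(!inOrder);
            if (inOrder && s.mayDeliverOutOfOrder()) {
                throw new IllegalStateException("segment " + i + " ignored the in-order request");
            }
        }
        return inOrder;
    }

    public static void main(String[] args) {
        SegmentWeight termLike = allow -> () -> false;     // always in order
        SegmentWeight booleanLike = allow -> () -> allow;  // out of order only when allowed
        System.out.println(useInOrderCollector(new SegmentWeight[] {termLike, booleanLike}));
        System.out.println(useInOrderCollector(new SegmentWeight[] {booleanLike, termLike}));
    }
}
```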

asfimport commented 15 years ago

Michael McCandless (@mikemccand) (migrated from JIRA)

> Yonik does Solr have any Scorers that iterate on docs out of order? Or is BooleanScorer the only one we all know about?

Nope. BooleanScorer is the only one I know about. And it's sort of special too... it's not like BooleanScorer can accept out-of-order scorers as sub-scorers itself - the ids need to be delivered in the range of the current bucket. IMO custom out-of-order scorers aren't supported in Lucene.

Actually BS can accept out-of-order sub-scorers? They just have to implement the Scorer.score(Collector, int maxDoc)? So, yes, they have to stay w/in the requested bracket, but inside there they can do things out of order – the collector is an instance of BolleanScorerCollector (hmm – misspelled – I'll fix) which happily accepts out of order but within bracket docs.

But it's good to know that out-of-order scorers are not generally supported even if Lucene uses one internally for better BooleanQuery (OR) performance.

asfimport commented 15 years ago

Marvin Humphrey (migrated from JIRA)

> I think I'd lean towards the 12 impls now.

Thoughts on collapsing down all of these classes to 1 or 2 for Lucy in a post to the Lucy dev list entitled "SortCollector".

asfimport commented 15 years ago

Shai Erera (@shaie) (migrated from JIRA)

And default Collector.acceptsDocsOutOfOrder should return false.

Do you propose it for back-compat reasons or simply because it makes sense? Collector is not released yet so we can define that method abstract.

Thoughts on collapsing down all of these classes to 1 or 2 for Lucy in a post to the Lucy dev list entitled "SortCollector".

I read it, but I'm not sure I agree with everything that you write there. I need to re-read it more carefully though before I can comment on it. One thing that caught my eye is that you write "I found one additional inefficiency in the Lucene implementation: score() is called twice for "competitive" docs". Where exactly did you see it? I checked TFC's code again and score() is never called twice. RelevanceComparator wraps the given Scorer with a ScoreCachingWrapperScorer, so the score() calls return almost immediately, without computing any scores.
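
The caching behaviour described here can be illustrated with a small sketch. It is not the Lucene wrapper itself: `DocScorer`, `ExpensiveScorer` and `CachingScorer` are hypothetical names, and the assumption is simply that the wrapper remembers the score for the current doc so that a second score() call on the same doc does no scoring work.

```java
/** Minimal scorer view for the sketch: current doc and its score. */
interface DocScorer {
    int doc();
    float score();
}

/** Pretends that computing a score is costly, and counts how often it happens. */
final class ExpensiveScorer implements DocScorer {
    int currentDoc = 0;
    int computations = 0;
    @Override public int doc() { return currentDoc; }
    @Override public float score() {
        computations++;
        return 1.0f / (currentDoc + 1);
    }
}

/** Caches the score of the current doc; a repeated call returns the cached value. */
final class CachingScorer implements DocScorer {
    private final DocScorer in;
    private int cachedDoc = -1;
    private float cachedScore;
    CachingScorer(DocScorer in) { this.in = in; }
    @Override public int doc() { return in.doc(); }
    @Override public float score() {
        int doc = in.doc();
        if (doc != cachedDoc) { // only compute once per doc
            cachedScore = in.score();
            cachedDoc = doc;
        }
        return cachedScore;
    }
}

final class CachingDemo {
    public static void main(String[] args) {
        ExpensiveScorer raw = new ExpensiveScorer();
        DocScorer cached = new CachingScorer(raw);
        cached.score(); // e.g. called by the comparator
        cached.score(); // e.g. called again by the collector for the same doc
        System.out.println("computations for one doc: " + raw.computations);
    }
}
```

The second call still costs a method call and a comparison, which is the residual overhead debated in the following comments.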

This was a tradeoff we made because of the TFC instances that don't compute document scores, and so we removed the score parameter from FieldComparator.copy() and compareBottom(). We could have added it back and passed Float.NEG_INF in the non-scoring versions, but that will not work well, since we should really compute the document's score if one of the SortFields is RELEVANCE ... hmm - maybe we can change TFC.create() to check the doc fields and if one of them is RELEVANCE return a ScoringNoMaxScore collector version, and then we should be safe with adding score back to those methods' signatures?

asfimport commented 15 years ago

Michael McCandless (@mikemccand) (migrated from JIRA)

> Do you propose it for back-compat reasons, or simply because it makes sense? Collector is not released yet, so we can define that method as abstract.

Woops, it was for back-compat (I forgot Collector isn't released). So let's simply make it an abstract method?

asfimport commented 15 years ago

Michael McCandless (@mikemccand) (migrated from JIRA)

> maybe we can change TFC.create() to check the doc fields and if one of them is RELEVANCE return a ScoringNoMaxScore collector version, and then we should be safe with adding score back to those methods' signatures?

I don't think we should add score back into the method signatures? Most comparators don't need the score.

asfimport commented 15 years ago

Shai Erera (@shaie) (migrated from JIRA)

> I don't think we should add score back into the method signatures? Most comparators don't need the score.

It's more about efficiency than whether they need it or not. We've added setScorer to FieldComparator with a default impl which does nothing. So in a sense we've already introduced Scorer, although currently the Comps don't know about it. But I think it's strange that you'll ask to sort by SCORE, and call scorer.score() twice, incurring the overhead of each call.

asfimport commented 15 years ago

Michael McCandless (@mikemccand) (migrated from JIRA)

> But I think it's strange that you'll ask to sort by SCORE, and call scorer.score() twice, incurring the overhead of each call.

So this indeed must be what Marvin was referring to? But the 2nd .score() call hits the cache and returns quickly? I wouldn't worry about that.

Making all other field comparators pay the price of passing around unused Float.NEG_INF scores is also wasteful.

asfimport commented 15 years ago

Shai Erera (@shaie) (migrated from JIRA)

Actually, if you request to "sort-by-score" and ask for a scoring Collector, the score() method will be hit twice - once in tfc.collect(), which does not use a caching scorer, and a 2nd time in RelComp.copy()/compareTo(), which does use a caching scorer. If we want to handle it, although that is somewhat more of an edge case, I suggest that we check in TFC.create() whether any of the sort fields is of type SortField.FIELD_SCORE and if so wrap the scorer given to setScorer with ScoreCachingWrapperScorer, and remove that wrapping from RelevanceComparator. That way, both Collector and Comparator will use the same caching scorer.
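For illustration, a minimal sketch of the shared caching idea described above. `SimpleScorer`, `CachingScorer` and `CountingScorer` are hypothetical stand-ins made up for this example, not Lucene's classes; the only point is that when the collector and the comparator share one caching wrapper, the underlying score is computed once per document.

```java
/** Hypothetical stand-in for a scorer positioned on one document at a time. */
interface SimpleScorer {
    int doc();       // id of the current document
    float score();   // potentially expensive score computation
}

/** Caches the score of the current document so repeated score() calls are cheap. */
final class CachingScorer implements SimpleScorer {
    private final SimpleScorer in;
    private int cachedDoc = -1;      // assumption: real doc ids are never negative
    private float cachedScore;

    CachingScorer(SimpleScorer in) { this.in = in; }

    @Override public int doc() { return in.doc(); }

    @Override public float score() {
        int doc = in.doc();
        if (doc != cachedDoc) {      // first call for this document: compute and remember
            cachedScore = in.score();
            cachedDoc = doc;
        }
        return cachedScore;
    }
}

/** Test scorer that counts how often the expensive computation actually runs. */
final class CountingScorer implements SimpleScorer {
    int current = 0;
    int computations = 0;

    @Override public int doc() { return current; }

    @Override public float score() {
        computations++;
        return 1.0f / (current + 1);
    }
}

final class CachingScorerDemo {
    public static void main(String[] args) {
        CountingScorer raw = new CountingScorer();
        SimpleScorer shared = new CachingScorer(raw);
        for (int doc = 0; doc < 3; doc++) {
            raw.current = doc;
            float inCollector = shared.score();   // the call a collector would make
            float inComparator = shared.score();  // the call a comparator would make, served from cache
            System.out.println(doc + ": " + inCollector + " / " + inComparator);
        }
        System.out.println("underlying computations: " + raw.computations);
    }
}
```

With three documents and two `score()` calls each, the underlying scorer in this sketch runs three times rather than six.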

Also, we can always create a ScoringNoMaxScore collector in such cases, since if we're going to compute the score, why not save it? I'm not sure about it since it will violate the API, i.e. you asked for a non-scoring collector and get a scoring one just because one of your sort fields was of type "sort-by-score". But then again, it is really an edge case, and I'm not sure why someone would want to do it.

asfimport commented 15 years ago

Michael McCandless (@mikemccand) (migrated from JIRA)

> Actually, if you request to "sort-by-score" and ask for a scoring Collector, the score() method will be hit twice

OK I see – I think we should make a ScoreCachingWrapperScorer when FIELD_SCORE is one of the SortFields.

I think this may come up fairly often... eg if one sorts by field X and then score, as the tie breaker.

> Also, we can always create a ScoringNoMaxScore collector in such cases, since if we're going to compute the score, why not save it?

I don't think we should do this? (Ie, the API shouldn't do "magic" for you under-the-hood).

asfimport commented 15 years ago

Shai Erera (@shaie) (migrated from JIRA)

Ok, will do that. I would also like to summarize the latest posts here:

  1. Deprecate Weight and create QueryWeight (abstract class) with a new scorer(reader, scoreDocsInOrder), replacing the current scorer(reader) method. QueryWeight implements Weight, while scorer(reader) calls scorer(reader, false /* out-of-order */) and scorer(reader, scoreDocsInOrder) is defined abstract.
    • Also add QueryWeightWrapper to wrap a given Weight implementation. This one will also be deprecated, as well as package-private.
    • Add to Query variants of createWeight and weight which return QueryWeight. For now, I prefer to add a default impl which wraps the Weight variant instead of overriding in all Query extensions, and in 3.0 when we remove the Weight variants - override in all extending classes.
  2. Add to Scorer isOutOfOrder with a default to false, and override in BS to true.
  3. Modify BooleanWeight to extend QueryWeight and implement the new scorer method to return BS2 or BS based on the number of required scorers and setAllowOutOfOrder.
  4. Add to Collector an abstract acceptsDocsOutOfOrder which returns true/false.
    • Use it in IndexSearcher.search methods, that accept a Collector, in order to create the appropriate Scorer, using the new QueryWeight.
    • Provide a static create method to TFC and TSDC which accept this as an argument and creates the proper instance.
    • Wherever we create a Collector (TSDC or TFC), always ask for out-of-order Scorer and check on the resulting Scorer isOutOfOrder(), so that we can create the optimized Collector instance.
  5. Modify IndexSearcher to use all of the above logic.
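A rough sketch of how the pieces in the list above could fit together. Every type here is a simplified, hypothetical stand-in (prefixed `Sketch` or otherwise renamed) rather than the actual Lucene API; the reader parameter is omitted, and the exact rule for choosing the out-of-order scorer (allowed, and no required clauses) is an assumption made only for illustration.

```java
/** Hypothetical stand-in for Scorer (item 2): in-order unless a subclass says otherwise. */
abstract class SketchScorer {
    boolean isOutOfOrder() { return false; }
}

final class InOrderScorer extends SketchScorer { }        // plays the role of BS2

final class BucketScorer extends SketchScorer {           // plays the role of BS
    @Override boolean isOutOfOrder() { return true; }
}

/** Hypothetical stand-in for Collector (item 4): must state whether it tolerates out-of-order docs. */
abstract class SketchCollector {
    abstract boolean acceptsDocsOutOfOrder();
}

/** Hypothetical stand-in for QueryWeight (item 1): the caller states its ordering requirement. */
abstract class SketchQueryWeight {
    abstract SketchScorer scorer(boolean scoreDocsInOrder);
}

/** Hypothetical stand-in for BooleanWeight (item 3). */
final class SketchBooleanWeight extends SketchQueryWeight {
    private final int requiredClauses;

    SketchBooleanWeight(int requiredClauses) { this.requiredClauses = requiredClauses; }

    @Override SketchScorer scorer(boolean scoreDocsInOrder) {
        // Assumed rule: use the out-of-order scorer only if the caller allows it
        // and the query has no required clauses.
        if (!scoreDocsInOrder && requiredClauses == 0) {
            return new BucketScorer();
        }
        return new InOrderScorer();
    }
}

final class SketchSearcher {
    /** Item 4, first bullet: the collector's answer decides which kind of scorer is requested. */
    static SketchScorer scorerFor(SketchQueryWeight weight, SketchCollector collector) {
        return weight.scorer(!collector.acceptsDocsOutOfOrder());
    }

    public static void main(String[] args) {
        SketchCollector relaxed = new SketchCollector() {
            @Override boolean acceptsDocsOutOfOrder() { return true; }
        };
        SketchScorer scorer = scorerFor(new SketchBooleanWeight(0), relaxed);
        System.out.println("out of order: " + scorer.isOutOfOrder());
    }
}
```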

The only class I'm worried about, and would like to verify with you, is Searchable. If we want to deprecate all the search methods on IndexSearcher, Searcher and Searchable which accept Weight and add new ones which accept QueryWeight, we must do the following:

I really hope I covered everything in this summary.

asfimport commented 15 years ago

Yonik Seeley (@yonik) (migrated from JIRA)

Whew - that's a lot of change just to sometimes allow BooleanScorer instead of BooleanScorer2! Another option to consider is use of a thread local to pass this info. A bit of a hack, but it would be more localized.

asfimport commented 15 years ago

Shai Erera (@shaie) (migrated from JIRA)

> Whew - that's a lot of change just to sometimes allow BooleanScorer instead of BooleanScorer2

Some of these changes were discussed elsewhere already, e.g. deprecating Weight and Searchable and making them abstract classes, so that such changes are easier in the future.

Also, it's not just about creating BS2 or BS sometimes ... it's about the changes in this issue, which initially assumed in-order document collection and thus did not break ties on doc IDs at the Collector level. In order to allow this to work with the current BS, we need to have a way to determine which scorer will be used. Or ... we can stop using BS at all and say all scorers must work in-order.

Also, it's not that large a change, just lots of text :) (my fault). In the end we'll achieve some refactoring, and a few more deprecated methods.

> Another option to consider is use of a thread local to pass this info. A bit of a hack, but it would be more localized.

I'm not sure I understand this - where would you set it? On IndexSearcher ctor? search methods (which means changes to the interfaces)? On Scorer (which is most local I can think of)?

asfimport commented 15 years ago

Yonik Seeley (@yonik) (migrated from JIRA)

> Or ... we can stop using BS at all and say all scorers must work in-order.

Well, BS is the odd man out... It's not currently used unless the user specifically sets it up and it doesn't implement skipTo() (although the latter could be fixed presumably).

> > Another option to consider is use of a thread local to pass this info. A bit of a hack, but it would be more localized.
>
> where would you set it?

Haven't seriously thought about it, but it could be set inside the IndexSearcher.search() method and checked in BooleanWeight.scorer().
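To make the thread-local idea concrete, here is a minimal sketch. `OutOfOrderHint` and `ThreadLocalHintDemo` are hypothetical names invented for this example, and the sketch uses `ThreadLocal.withInitial` and lambdas from later Java versions than the JDK discussed in this thread. The search side sets the flag around its work, and the scorer-creation side reads it.

```java
import java.util.function.Supplier;

/** Hypothetical per-thread flag; illustrative only, not part of Lucene. */
final class OutOfOrderHint {
    private static final ThreadLocal<Boolean> ALLOWED =
            ThreadLocal.withInitial(() -> Boolean.FALSE);

    private OutOfOrderHint() { }

    /** What a weight's scorer() method would consult. */
    static boolean isAllowed() {
        return ALLOWED.get();
    }

    /** What a search() method would wrap around its work; always restores the previous value. */
    static <T> T callWith(boolean allowed, Supplier<T> body) {
        boolean previous = ALLOWED.get();
        ALLOWED.set(allowed);
        try {
            return body.get();
        } finally {
            ALLOWED.set(previous);
        }
    }
}

final class ThreadLocalHintDemo {
    /** Stand-in for the scorer choice: returns the name of the scorer that would be built. */
    static String chooseScorer() {
        return OutOfOrderHint.isAllowed() ? "BS" : "BS2";
    }

    public static void main(String[] args) {
        // Inside the wrapped call the flag is visible without changing any method signature.
        System.out.println(OutOfOrderHint.callWith(true, ThreadLocalHintDemo::chooseScorer));
        // Outside it, the default applies again.
        System.out.println(chooseScorer());
    }
}
```

The appeal is that no interface changes; the cost is a hidden dependency between two distant pieces of code, which is the "hack" part.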

asfimport commented 15 years ago

Michael McCandless (@mikemccand) (migrated from JIRA)

Shai, your summary of what needs to be done looks right. But: shouldn't we do the interface -> abstract class migration (of Weight & Searchable) under a separate issue? Ie, under that issue no real functional change to Lucene is happening. Then in this issue we can make the optimizations?

I just ran a quick perf test (best of 5 runs, Linux, JDK 1.6) of the query 1 OR 2 on a large Wikipedia index. Using BS instead of BS2 gives a 27% speedup (2.2 QPS -> 2.8). I'd really like for Lucene to be able to use BS automatically when it can. In fact, I think we should move more scorers to out-of-order, if we can get these kinds of performance gains.

These changes go beyond that, though, and also enable a separate optimization whereby the Collector knows it doesn't have to break ties by docID. TFC would then use that to gain performance for all in-order scorers.

> Some of these changes were discussed elsewhere already, e.g. deprecating Weight and Searchable and making them abstract classes, so that such changes are easier in the future.

In fact I think most of the work above is for this (and not the optimizations), and I think migration from interfaces -> abstract classes is important (for 2.9).

asfimport commented 15 years ago

Shai Erera (@shaie) (migrated from JIRA)

> shouldn't we do the interface -> abstract class migration (of Weight & Searchable) under a separate issue?

I think I asked it already but haven't received a concrete answer.

This issue moved TSDC and TFC to always assume "in-order" collection of documents. If BS is used, they will break for the documents that compare equally to the top of the queue. Therefore we wanted to be able to create the right variant of TFC/TSDC depending on the Scorer we're going to get.
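A small sketch of why the in-order assumption matters for ties. `Hit` and `TopNSketch` are hypothetical classes written for this example (backed by `java.util.PriorityQueue`, not Lucene's queue). The in-order variant rejects any doc whose score merely equals the weakest queued score, which is only correct if later docs always have larger ids.

```java
import java.util.Arrays;
import java.util.Comparator;
import java.util.PriorityQueue;

/** Hypothetical (doc id, score) pair; not Lucene's ScoreDoc. */
final class Hit {
    final int doc;
    final float score;

    Hit(int doc, float score) { this.doc = doc; this.score = score; }
}

/** Keeps the n best hits: higher score wins, and on equal scores the smaller doc id wins. */
final class TopNSketch {
    // Orders the weakest hit first: lowest score, then larger doc id.
    private static final Comparator<Hit> WEAKEST_FIRST = (a, b) -> {
        int byScore = Float.compare(a.score, b.score);
        return byScore != 0 ? byScore : Integer.compare(b.doc, a.doc);
    };

    private final PriorityQueue<Hit> queue = new PriorityQueue<>(WEAKEST_FIRST);
    private final int n;
    private final boolean assumeInOrder;

    TopNSketch(int n, boolean assumeInOrder) {
        if (n < 1) throw new IllegalArgumentException("n must be >= 1");
        this.n = n;
        this.assumeInOrder = assumeInOrder;
    }

    void collect(int doc, float score) {
        if (queue.size() < n) {
            queue.add(new Hit(doc, score));
            return;
        }
        Hit bottom = queue.peek();
        boolean competitive = assumeInOrder
                // In-order: a later doc has a larger id, so a tie can never win.
                ? score > bottom.score
                // Out-of-order: a tie wins if the new doc id is smaller.
                : score > bottom.score || (score == bottom.score && doc < bottom.doc);
        if (competitive) {
            queue.poll();
            queue.add(new Hit(doc, score));
        }
    }

    int[] docsBestFirst() {
        Hit[] hits = queue.toArray(new Hit[0]);
        Arrays.sort(hits, WEAKEST_FIRST.reversed());
        int[] docs = new int[hits.length];
        for (int i = 0; i < hits.length; i++) docs[i] = hits[i].doc;
        return docs;
    }

    public static void main(String[] args) {
        int[][] feed = { {5}, {2}, {1} };   // doc ids delivered out of order, all with equal scores
        TopNSketch inOrder = new TopNSketch(2, true);
        TopNSketch outOfOrder = new TopNSketch(2, false);
        for (int[] d : feed) {
            inOrder.collect(d[0], 1.0f);
            outOfOrder.collect(d[0], 1.0f);
        }
        System.out.println(Arrays.toString(inOrder.docsBestFirst()));
        System.out.println(Arrays.toString(outOfOrder.docsBestFirst()));
    }
}
```

Fed docs 5, 2, 1 with equal scores, the in-order variant keeps docs 2 and 5 while the out-of-order variant keeps docs 1 and 2, which is the breakage described above.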

I offered somewhere above (or at least intended to) that we keep the changes I've made to TSDC and TFC (allowing creation of in-order and out-of-order variants) and have IndexSearcher always ask for out-of-order. Then, in a separate issue, we make all these changes and really take advantage of the in-order variants unless BS (or any other future out-of-order Scorer, for that matter) is used.

I don't mind doing that. I also volunteer to open the next issue and take care of it. Is that what you had in mind?

asfimport commented 15 years ago

Michael McCandless (@mikemccand) (migrated from JIRA)

> I don't mind doing that. I also volunteer to open the next issue and take care of it. Is that what you had in mind?

OK that sounds great!

So for this issue: create the "always in-order" optimization in TSDC/TFC, but leave it "dark" (IndexSearcher always asks for "out of order" collector). In the 2nd issue, make the switch from interface to abstract base class, and add methods so we can track in/out of order Scorer, and finally hook the two up (use an in-order Collector when the returned Scorer is always in-order).

So the current patch on this issue needs to be redone, right? (Eg to remove Query.scoresDocsInOrder, etc.).

asfimport commented 15 years ago

Earwin Burrfoot (migrated from JIRA)

> So for this issue: create the "always in-order" optimization in TSDC/TFC, but leave it "dark" (IndexSearcher always asks for "out of order" collector). In the 2nd issue, make the switch from interface to abstract base class, and add methods so we can track in/out of order Scorer, and finally hook the two up (use an in-order Collector when the returned Scorer is always in-order).

I assume the possibility to guarantee always in-order scorer will remain after all these changes?

asfimport commented 15 years ago

Michael McCandless (@mikemccand) (migrated from JIRA)

> I assume the possibility to guarantee always in-order scorer will remain after all these changes?

Correct.

asfimport commented 15 years ago

Shai Erera (@shaie) (migrated from JIRA)

Patch includes all discussed changes, and defaults TSDC and TFC to out-of-order collection. It also covers the changes to the tag.

Note that currently BS and BS2 check if they should init in next()/skipTo/score - I will fix it in the other issue after I separate between them (i.e., not having BS2 instantiate BS), via a topScorer() or something.

All tests pass.

asfimport commented 15 years ago

Michael McCandless (@mikemccand) (migrated from JIRA)

Patch looks good! I can confirm that all tests pass. (Though the back-compat tag has moved forward – I just carried that part of the patch forward). Thanks Shai.

Some small comments:

asfimport commented 15 years ago

Shai Erera (@shaie) (migrated from JIRA)

asfimport commented 15 years ago

Michael McCandless (@mikemccand) (migrated from JIRA)

> I don't think we should do anything in TFC for now. It will only save one 'if' and adding sentinel values is not so trivial. Maybe leave it for a specializer code?

OK I agree, let's not do this one for now.

New patch looks good – I'll review it some more and then wait a few days and commit. Thanks Shai!

asfimport commented 15 years ago

Michael McCandless (@mikemccand) (migrated from JIRA)

OK I made some tiny fixes:

I think it's ready to commit! I'll wait a day or two.

asfimport commented 15 years ago

Shai Erera (@shaie) (migrated from JIRA)

> MultiSearcher.search was creating too big an array of ScoreDocs, and was incorrectly (because sentinels were not used) avoiding HitQueue.size().

Oh right ... I forgot to roll that back since HitQueue is initialized in those cases to not pre-populate with sentinel values.
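For context, a minimal sketch of why sentinel pre-population and the queue's size interact. `SentinelTopScores` is a hypothetical class for this example, built on `java.util.PriorityQueue` rather than Lucene's HitQueue, and it uses Java's `Float.NEGATIVE_INFINITY` as the sentinel score. With sentinels the queue always holds n entries, so the number of real hits has to be tracked separately; without them the queue's own size is already the right number.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.PriorityQueue;

/** Hypothetical top-n score collector whose queue is pre-filled with sentinel scores. */
final class SentinelTopScores {
    private final PriorityQueue<Float> queue = new PriorityQueue<>(); // min-heap: head is the weakest score
    private int realHits = 0;

    SentinelTopScores(int n) {
        if (n < 1) throw new IllegalArgumentException("n must be >= 1");
        for (int i = 0; i < n; i++) {
            queue.add(Float.NEGATIVE_INFINITY);   // sentinel: loses against every real score
        }
    }

    void collect(float score) {
        // No "is the queue full yet?" branch: a sentinel is simply the weakest entry.
        if (score > queue.peek()) {
            float evicted = queue.poll();
            if (evicted == Float.NEGATIVE_INFINITY) {
                realHits++;                        // a sentinel was replaced by a real hit
            }
            queue.add(score);
        }
    }

    /** Number of real hits; the queue itself always holds n entries. */
    int size() {
        return realHits;
    }

    float[] topScoresBestFirst() {
        Float[] all = queue.toArray(new Float[0]);
        Arrays.sort(all, Collections.reverseOrder());   // sentinels sort to the end
        float[] out = new float[realHits];              // sized by real hits, not by queue capacity
        for (int i = 0; i < realHits; i++) out[i] = all[i];
        return out;
    }

    public static void main(String[] args) {
        SentinelTopScores top = new SentinelTopScores(5);
        top.collect(0.5f);
        top.collect(2.0f);
        System.out.println(top.size() + " real hits: " + Arrays.toString(top.topScoresBestFirst()));
    }
}
```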