apache / lucene

Apache Lucene open-source search software
https://lucene.apache.org/
Apache License 2.0
2.72k stars 1.04k forks source link

Optimizations to TopScoreDocCollector and TopFieldCollector [LUCENE-1593] #2667

Closed asfimport closed 15 years ago

asfimport commented 15 years ago

This is a spin-off of #2649 and proposes to optimize TSDC and TFC code to remove unnecessary checks. The plan is:

  1. Ensure that IndexSearcher returns segements in increasing doc Id order, instead of numDocs().
  2. Change TSDC and TFC's code to not use the doc id as a tie breaker. New docs will always have larger ids and therefore cannot compete.
  3. Pre-populate HitQueue with sentinel values in TSDC (score = Float.NEG_INF) and remove the check if reusableSD == null.
  4. Also move to use "changing top" and then call adjustTop(), in case we update the queue.
  5. some methods in Sort explicitly add SortField.FIELD_DOC as a "tie breaker" for the last SortField. But, doing so should not be necessary (since we already break ties by docID), and is in fact less efficient (once the above optimization is in).
  6. Investigate PQ - can we deprecate insert() and have only insertWithOverflow()? Add a addDummyObjects method which will populate the queue without "arranging" it, just store the objects in the array (this can be used to pre-populate sentinel values)?

I will post a patch as well as some perf measurements as soon as I have them.


Migrated from LUCENE-1593 by Shai Erera (@shaie), 1 vote, resolved May 09 2009 Attachments: LUCENE-1593.patch (versions: 5), PerfTest.java

asfimport commented 15 years ago

Michael McCandless (@mikemccand) (migrated from JIRA)

See https://issues.apache.org/jira/browse/LUCENE-1575?focusedCommentId=12697152&page=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel#action_12697152 for some specifics.

Re #2: docID must still be used as a tie-breaker in the "lessThan" method, since that is used by pq during up/down heap. We don't need to use docID as tie-breaker when testing a new hit for insertion.

asfimport commented 15 years ago

Michael McCandless (@mikemccand) (migrated from JIRA)

Change TSDC and TFC's code to not use the doc id as a tie breaker.

We have to remember that out-of-order docID collectors (BooleanScorer) cannot apply this optimization. When we get to this, lets make sure we have a test case that does out-of-order collection, with a field search, that shows failure if we fail to break ties by docID.

Actually, the optimization can be applied sometimes: if we can push bottomVal testing down to each TermDocs enum (much like unfactoring random-access filter down to the TermDocs), that first pass collection in BooleanScorer is in fact in order. But that's a deeper optimization...

asfimport commented 15 years ago

Michael McCandless (@mikemccand) (migrated from JIRA)

Actually, I think BooleanScorer need not process docs out of order? The only "out of order ness" seems to come from how it appends each new Bucket to the head of the linked list; if instead it appended to the tail, the collector would see docs arrive in order. I think?

asfimport commented 15 years ago

Shai Erera (@shaie) (migrated from JIRA)

This sounds like it should work, but since I'm not fully familiar with that code, I can't back you up :). I guess a test case will clarify it. In the meantime, I read BooleanScorer and BooleanScorer2 code, and came across several possible optimizations:

asfimport commented 15 years ago

Michael McCandless (@mikemccand) (migrated from JIRA)

Those look like great optimizations!

I think we can indeed fix BS to send docs in order to the collector, and that allows us to not fallback to docID on hitting ties, even when using BS.

asfimport commented 15 years ago

Shai Erera (@shaie) (migrated from JIRA)

I noticed that IndexSearcher has a ctor which allows to define whether documents are intended to be collected in-order of out-of-order. I assume we have to remove that ctor and also sortSubReaders? If we change TSDC and TFC to always assume docs come in order, this cannot be configurable.

I thought these were introduced as part of #2557 for performance reasons. Are we sure that removing these, just for the sake of what looks like minor optimizations to TSDC and TFC (not breaking a tie on doc Id) is worth it?

asfimport commented 15 years ago

Michael McCandless (@mikemccand) (migrated from JIRA)

I thought these were introduced as part of #2557 for performance reasons. Are we sure that removing these, just for the sake of what looks like minor optimizations to TSDC and TFC (not breaking a tie on doc Id) is worth it?

Right, this was added in #2557, and I think we should remove it, ie IndexSearcher will always visit docIDs in order again.

The optimization was done for the StringValCollector: that collector "in general" has better performance if you give it the biggest segments first, since the queue should converge faster and require less fallback to by-value comparisons. But I believe that is a minor optimization in general since segments tend to be biggest to smallest, already, except for indexes with many deletes on the older segments.

My feeling is we can proceed without testing it; I believe this "not breaking tie by docID" will be a bigger win.

asfimport commented 15 years ago

Shai Erera (@shaie) (migrated from JIRA)

I would also like to change pq.adjustTop() and pq.put() to return Object, which is heap[1]. This saves calls to top() immediately following these two. Any objection? Both are marked final, so there shouldn't be any back-compat here, and since we're just adding a return value (rather than removing), everyone can ignore it. I'll also fix TSDC and TFC to not call top() after that. Saves 1 more method call.

asfimport commented 15 years ago

Michael McCandless (@mikemccand) (migrated from JIRA)

I would also like to change pq.adjustTop() and pq.put() to return Object, which is heap[1].

Sounds good...

Both are marked final, so there shouldn't be any back-compat here, and since we're just adding a return value (rather than removing), everyone can ignore it.

Hmm what about JAR drop-in-ability? The method signature will have changed. ("ant test-tag" now test JAR drop-in-ability so you could try it).

asfimport commented 15 years ago

Shai Erera (@shaie) (migrated from JIRA)

what about JAR drop-in-ability?

Darn it, you're right. It changes the method's signature and test-tag fails. I'll revert that change. Perhaps we can do that in 3.0 then? Is it acceptable to document this API change in CHANGES and still proceed with it in 2.9?

asfimport commented 15 years ago

Shai Erera (@shaie) (migrated from JIRA)

Wait, I think I was wrong before. I tried ant test-core and noticed many errors such as NoSuchMethod etc. so I figured I should clean first. So I "ant clean" and then "ant test-core" - all tests passed. I then "ant clean-tags" and "ant test-tags" and all tests passed too. So I guess this change is ok and does not break back-compat?

asfimport commented 15 years ago

Michael McCandless (@mikemccand) (migrated from JIRA)

So I "ant clean" and then "ant test-core" - all tests passed

Hmmm something weird is going on.

I ran this test:

and I hit errors like this:

    [junit] java.lang.NoSuchMethodError: org.apache.lucene.search.PhraseQueue.put(Ljava/lang/Object;)V
    [junit]     at org.apache.lucene.search.PhraseScorer.sort(PhraseScorer.java:139)
    [junit]     at org.apache.lucene.search.PhraseScorer.init(PhraseScorer.java:133)
    [junit]     at org.apache.lucene.search.PhraseScorer.next(PhraseScorer.java:76)
    [junit]     at org.apache.lucene.search.Scorer.score(Scorer.java:67)
    [junit]     at org.apache.lucene.search.IndexSearcher.doSearch(IndexSearcher.java:269)
    [junit]     at org.apache.lucene.search.IndexSearcher.search(IndexSearcher.java:257)
    [junit]     at org.apache.lucene.search.IndexSearcher.search(IndexSearcher.java:181)
    [junit]     at org.apache.lucene.search.Searcher.search(Searcher.java:181)
    [junit]     at org.apache.lucene.TestSearch.doTestSearch(TestSearch.java:124)
    [junit]     at org.apache.lucene.TestSearch.testSearch(TestSearch.java:58)
    [junit]     at org.apache.lucene.util.LuceneTestCase.runTest(LuceneTestCase.java:88)

(which is what I expected).

Yet, when I run "ant clean test-tag" with that same change, I don't see a failure... I fear "ant test-tag" is somehow not doing the right thing. I'll go reopen that issue...

asfimport commented 15 years ago

Shai Erera (@shaie) (migrated from JIRA)

According to my understanding, test-tag runs the 2.4 tests against the 2.9 jar. Does it compile the code against the 2.4 sources or against the 2.9 jar? If the latter, then I think that explains it - the compiler sees the updated PQ and therefore there's no problem. If it compiles against the 2.4 source, then I think we should see the problem, which is in my understanding the reason why we see it before calling "clean".

asfimport commented 15 years ago

Michael McCandless (@mikemccand) (migrated from JIRA)

I thought what it was supposed to do is compile the 2.4 tests against the 2.4 JAR, then run the compiled 2.4 build/classes/test/* files against a 2.9 jar; this was the reason #2603 was opened (to explicitly test "jar drop-in-ability").

asfimport commented 15 years ago

Shai Erera (@shaie) (migrated from JIRA)

Ok, so with the latest changes to #2603, I ran "ant clean clean-tags test-tags" and I see this:

    [junit] Testcase: testPQ(org.apache.lucene.util.TestPriorityQueue): Caused an ERROR
    [junit] org/apache/lucene/util/PriorityQueue.put(Ljava/lang/Object;)V
    [junit] java.lang.NoSuchMethodError: org/apache/lucene/util/PriorityQueue.put(Ljava/lang/Object;)V
    [junit]     at org.apache.lucene.util.TestPriorityQueue.testPQ(TestPriorityQueue.java:61)
    [junit]     at org.apache.lucene.util.TestPriorityQueue.testPQ(TestPriorityQueue.java:48)
    [junit] Testcase: testClear(org.apache.lucene.util.TestPriorityQueue):  Caused an ERROR
    [junit] org/apache/lucene/util/PriorityQueue.put(Ljava/lang/Object;)V
    [junit] java.lang.NoSuchMethodError: org/apache/lucene/util/PriorityQueue.put(Ljava/lang/Object;)V
    [junit]     at org.apache.lucene.util.TestPriorityQueue.testClear(TestPriorityQueue.java:90)
    [junit] Test org.apache.lucene.util.TestPriorityQueue FAILED

TestPQ fails as Mike expected.

Since this breaks back-compat, can we still do this in 3.0? If so, I'll add it to #2675 even though it's not a direct followup of 1575. What do you think?

asfimport commented 15 years ago

Mark Miller (@markrmiller) (migrated from JIRA)

In general, 3.0 gives us no extra in regards to back compat freedom. All it gives is the opportunity to remove all the deprecated stuff. So generally, if you can't do it through deprecation, you can't do it. Except that exceptions for certain complicated/beneficial issues have been made I think (ie Field has slightly loosened requirements) with proper notification and minimal impact. That probably comes down to the size of the jolt to the community.

asfimport commented 15 years ago

Shai Erera (@shaie) (migrated from JIRA)

I thought that when we move to 3.0 we're not expected to support jar drop-in-ability? Those two methods are already declared final, so no one could override them. The only problem now is with jar drop-in-ability.

If I'm wrong, and we are suppose to support it, then I can offer new names for these two and deprecate them:

You can suggest yours too. However I do think that if jar drop-in-ability is not required in 3.0, we should not change the names and just document the impl changes in 3.0.

asfimport commented 15 years ago

Mark Miller (@markrmiller) (migrated from JIRA)

I believe if a user upgrades to release XX.9 and removes all code that is using deprecated methods/classes, it needs to be a jar drop in for 3.0.

Here are the back compat requirements in detail: http://wiki.apache.org/jakarta-lucene/BackwardsCompatibility

asfimport commented 15 years ago

Shai Erera (@shaie) (migrated from JIRA)

I believe if a user upgrades to release XX.9 and removes all code that is using deprecated methods/classes, it needs to be a jar drop in for 3.0.

This might work, but 3.0 is also about moving to Java 5 with all the implications. If my app is already on Java 5, then a jar drop is all that'll be required. But if not, I need to update my app anyway. In addition, there are some changes in runtime behavior that are going to be made in 3.0. My point is - I don't know who will actually upgrade to 3.0 by just dropping a jar.

But anyway, I'm not going to argue with policies - you seem to know better than me about Lucene's back-compat requirements. So the question is whether we want to deprecate these methods and add the new ones, and if so, can we agree on the new names (add, updateTop)?

asfimport commented 15 years ago

Earwin Burrfoot (migrated from JIRA)

Lucene failed drop in replacement between 2.3 and 2.4, so I don't see any reason not to do it, this time deliberately, for 3.0. :)

asfimport commented 15 years ago

Michael McCandless (@mikemccand) (migrated from JIRA)

if so, can we agree on the new names (add, updateTop)?

I think it makes sense to add these, returning the min value (and deprecate the old ones).

asfimport commented 15 years ago

Shai Erera (@shaie) (migrated from JIRA)

Few updates (it's been long since I posted on this issue):

I hope you were able to follow my description. That's why I don't know which one is the true output ABE or AEB. I tend to say AEB, since those are the true doc Ids the application will get in the end. Few more comments:

  1. TestSort.runMultiSort contains 3 versions of this test:
    • One that sorts by the INT value only, relying on the doc Id tie breaking. You can see the output was not determined to be exact and so the pattern included is [ABE]{3}, as if the test's output is not predicted. I think that's wrong in the first place, we should always have predicted tests output, since we're not involving any randomness in that code.
    • The second explicitly sets INT followed by DOC Sorts, and expects [AB]{2}E.
    • The third relies on setSort's adding DOC, so it expects the same [AB]{2}E.
  2. The problem in MultiSearcher is that it uses FieldDocSortedHitQueue, but doesn't update the DOC's field value the same as it does to the scoreDoc.doc value (adding the current Searcher's start).

Again, whatever the right output is, changing Sort to not include SortField.FIELD_DOC might result in someone experiencing a change in behavior (if he used MultiSearcher), even if that behavior is buggy.

What do you think?

asfimport commented 15 years ago

Michael McCandless (@mikemccand) (migrated from JIRA)

I tried to move to pre-populate the queue in TFC, but that proved to be impossible (unless I missed something)

I think it should work fine, for most types, because we'd set docID (the tie breaker) to Integer.MAX_VALUE. No special additional if is then required, since that entry would always compare at the bottom?

For String we should be able to use U+FFFF.

So I debugged it and either the test works fine but there's a bug in MultiSearcher, or the test does not work fine (or should be adjusted) but then we'll be changing runtime behavior of Sort (e.g., everyone who used setSort might get undesired behavior, only in MultiSearcher).

Hmm – good catch! I think this is in fact a bug in MultiSearcher (AEB is the right answer): it's failing to "break ties" (by docID) properly. Ie, it will not "match" what IndexSearcher(MultiSegmentReader(...)) will do.

I think we could fix this by allowing one to pass in a docbase when searching? Though that's a more involved change... could you open a new issue for this one?

I think that's wrong in the first place, we should always have predicted tests output, since we're not involving any randomness in that code.

I agree – let's fix it to be a deterministic test?

asfimport commented 15 years ago

Michael McCandless (@mikemccand) (migrated from JIRA)

Actually, I think BooleanScorer need not process docs out of order? The only "out of order ness" seems to come from how it appends each new Bucket to the head of the linked list; if instead it appended to the tail, the collector would see docs arrive in order. I think?

I was wrong about this – you also get "out of order ness" due to 2nd, 3rd, etc. clauses in the boolean query adding in new docs. Re-ordering those will be costly.

But: in the initial collection of each chunk, we do know that any docID in the queue will be less than those being visited, so this could be a good future optimization for immediately ruling out docs during TermScorer. That's a larger change... eg it'd require being able to say "I don't need exact total hit count for this search".

asfimport commented 15 years ago

Shai Erera (@shaie) (migrated from JIRA)

I think it should work fine, for most types, because we'd set docID (the tie breaker) to Integer.MAX_VALUE. No special additional if is then required, since that entry would always compare at the bottom?

That's not so simple. Let's say I initialize the sentinel of IntComp to Integer.MAX_VALUE. That should have guaranteed that any 'int' <MAX_VAL would compare better. But the code in TFC compares the current doc against the 'bottom'. For all Sentinels, it means MAX_VAL. If the input doc's val < MAX_VAL, it compares better. Otherwise, it is rejected, because:

If it is bigger than the bottom, it should be rejected.

If it equals, it's also rejected, since now that we move to returning docs in order, it is assumed that this doc's doc Id is greater than whatever is in the queue, and so it's rejected.

Actually, the tie is broken only after it's in queue, when the latter calls compare(). This assumption removed the 'if' that checked for doc Id value, so if I reinstate it, we don't really gain anything, right (replacing 'if (queueFull)' with 'if (bottom.doc > doc + docBase)')?

For String we should be able to use U+FFFF.

If we resolve the core issues with sentinel values, than this will be the value I'd use for Strings, right.

I think we could fix this by allowing one to pass in a docbase when searching?

I actually would like to propose the following: MultiSearcher already fixes the FD.doc before inserting it to the queue. I can do the same for the FieldDoc.fields() value, in case one of the fields is FIELD_DOC. This can happen just after searcher.search() returns, and before MS adds the results to its own FieldDocSortedHitQueue. I already did it, and all the testMultiSearch cases fail, but that's because they just assert that the bug exists :). If you think a separate issue is still required, I can do it, but that would mean that the tests will fail until we fix it, or I don't change Sort in this issue and do it as part of the other one.

let's fix it to be a deterministic test?

Will do, but it depends - if a new issue is required, I'll do it there.

I was wrong about this

I must say I still didn't fully understand what do you mean here. I intended to keep that to after everything else will work in that issue's scope, and note if there are any tests that fail, or BQ actually behaves properly. So I'll simply count on what you say is true :), since I'm not familiar with that code.

asfimport commented 15 years ago

Michael McCandless (@mikemccand) (migrated from JIRA)

If it equals, it's also rejected, since now that we move to returning docs in order, it is assumed that this doc's doc Id is greater than whatever is in the queue, and so it's rejected.

Argh, you are right... so the approach will fail if any of the first topN (eg 10) hits have a field value equal to the sentinel value. I guess we could do two separate passes: a "startup" pass (while queue is filling up) and a "the rest" pass that knows the queue is full. But that's getting rather ugly; probably we should leave this optimization to source code specialization.

I can do the same for the FieldDoc.fields() value, in case one of the fields is FIELD_DOC.

Excellent! Let's just do this as part of this issue.

I must say I still didn't fully understand what do you mean here.

I also did not understand how BooleanScorer works until Doug explained it (see the comment at the top).

Right now it gathers hits of each clause in a reversed linked list, which it then makes a 2nd pass to collect. So the Collector will see docIDs in reverse order for that clause. I thought we could simply fix the linking to be forward and we'd have docIDs in order. But that isn't quite right because any new docIDs hit by the 2nd clause will be inserted at the end of the linked list, out of order.

asfimport commented 15 years ago

Shai Erera (@shaie) (migrated from JIRA)

Same for BS2.score() and score(HC) - initCountingSumScorer?

I proposed this change, but it is problematic. initCountingSumScorer declares throwing IOE, but neither any of the ctors or add() declares it. So I cannot add a call to initCountingSumScorer without breaking back-compat, unless:

What do you think about the 2nd option?

asfimport commented 15 years ago

Michael McCandless (@mikemccand) (migrated from JIRA)

BooleanScorer2 is package private, so I think it's fine to add "throws IOException" to the ctor & add, as long as that doesn't then percolate up to requiring adding throws IOException to a public API (does it? Seems likely not to).

asfimport commented 15 years ago

Shai Erera (@shaie) (migrated from JIRA)

I'm ready to attach the patch, still need to run some perf measurements with it, but that's orthogonal. I ran "ant clean test" and all tests passed, except for TestSort in tag. The reason is because the test is invalid (asserts that a bug exists), and so we'll need to change TestSort in tag in order for it to pass, when this issue is committed (and the bug in MultiSearcher is fixed).

How should I do it? Just attach a separate patch for tag? This also warrants a change in common-build.xml, to upgrade to the new tag, which I don't know how it will be named yet. Which brings me back to a question I asked recently on #2603 - why can't we test the branch by default in test-tag, which means the latest tag? We already today test against the latest tag, only we need to define it explicitly, which leaves room for errors in case we forget to tag it, or change common-build.xml accordingly. Instead, we can allow one to specify a tag name (as -D option) and if specified, we test against that tag name, or otherwise we test against the branch. Wouldn't that:

Perhaps the latter can even let us to change 2.4's branch a couple of times during a release, but create a single tag that's compliant with that release in its end? E.g., we've already created 1 tag because of changes to the trunk (future 2.9) and now we'll need another change. But from what I understand, When 2.9 comes out, we only need one 2.4 tag which matches the changes done against 2.9?

Anyway, that's a side discussion. For this issue, we need to update TestSort in 2.4 branch regardless. Whatever we decide about branch/tag policy will only affect whether I'll need to update common-build.xml before/after the tag is created.

asfimport commented 15 years ago

Shai Erera (@shaie) (migrated from JIRA)

as long as that doesn't then percolate up to requiring adding throws IOException to a public API

Like I said, I did the change and it was local to BS2. The rest of Lucene's code which uses it already has a throws IOE declaration. Good - then I'll proceed with that change.

asfimport commented 15 years ago

Michael McCandless (@mikemccand) (migrated from JIRA)

How should I do it? Just attach a separate patch for tag?

That would be great! Hopefully I can succeed in committing it, retagging, etc., w/o breaking anything, this time.

why can't we test the branch by default in test-tag, which means the latest tag?

This will cause people's checkouts to suddenly (unexpectedly) start failing "ant test-tag". EG if I have a checkout from 2 weeks ago, where I'm developing something, then suddenly someone commits a fix to the back compat branch, if my checkout simply uses the branch, I'll suddenly start seeing the back compat tests [incorrectly] fail.

asfimport commented 15 years ago

Shai Erera (@shaie) (migrated from JIRA)

Ok, I understand that now. So if you develop something, and your common-build.xml is set against tag X, and then we update it to tag X+1, then the next time you'll update from SVN and run "ant test-tag", it will detect the tag is missing and download it.

So if we work against the branch and when you execute "ant test-tag" we somehow detect you don't have the latest revision and download it, it'll work ok? Maybe add a checksum file, or a revision file to the branch and have the test download it and compare? Or, when the test extracts the branch, it will create a file with the revision information, and when you'll run the test it will compare the branch's revision to your file? I'm sure there's some way to solve it, but unfortunately I'm not overly familiar with Ant's capabilities, so I appologize if I'm not proposing anything too intelligent :).

asfimport commented 15 years ago

Michael McCandless (@mikemccand) (migrated from JIRA)

So if we work against the branch and when you execute "ant test-tag" we somehow detect you don't have the latest revision and download it, it'll work ok? Maybe add a checksum file, or a revision file to the branch and have the test download it and compare? Or, when the test extracts the branch, it will create a file with the revision information, and when you'll run the test it will compare the branch's revision to your file? I'm sure there's some way to solve it, but unfortunately I'm not overly familiar with Ant's capabilities, so I appologize if I'm not proposing anything too intelligent .

Well, if we did branch only (no tag), we can't auto-update your checkout of the branch, unless we also auto-update your main checkout (sometimes we change internal APIs). I don't think we should auto update your main checkout when you run "ant test-tag".

Why not just keep using the current tag approach? It works correctly... and it's not so bad to retag whenever we have to change back compat tests.

asfimport commented 15 years ago

Shai Erera (@shaie) (migrated from JIRA)

Nothing wrong with that. Was just trying to think of a more automated approach, that will remove some responsibilities from the committers. It's not that important. So if there's no easy way to do it, I'm fine if we leave it as is.

asfimport commented 15 years ago

Shai Erera (@shaie) (migrated from JIRA)

The patch implements all that has been suggested except:

The patch also includes the fix for TestSort in the 2.4 back_compat branch. I only fixed TestSort, and not MultiSearcher and ParallelMultiSearcher.

All tests pass.

I also ran some performance measurements (all on SRV 2003):

JRE sort best time (trunk) best time (patch) diff (%)
SUN 1.6 int 1017.59 1015.96 \~1%
SUN 1.6 doc 767.49 763.20 \~1%
IBM 1.5 int 1018.77 1017.39 \~1%
IBM 1.5 doc 768.10 764.14 \~1%

As you can see, there is a slight performance improvement, but nothing too dramatic.

You are welcome to review the patch as well as run the PerfTest I attached. It accepts two arguments: <indexDir> and [sort]. 'sort' is optional and if not defined it sorts by doc. Otherwise, whatever you pass there, it sorts by int.

asfimport commented 15 years ago

Michael McCandless (@mikemccand) (migrated from JIRA)

The patch still has various logic to handle the sentinel values, but we backed away from that optimization (it's not generally safe)?

Also, I fear we need to conditionalize the "don't need to break ties by docID", because BooleanScorer doesn't visit docs in order?

I chose to discard that optimization, which only affects next() and skipTo().

Maybe we should add a "start()" method to Scorer, to handle initializations like this, so that next() doesn't have to check every time?

asfimport commented 15 years ago

Shai Erera (@shaie) (migrated from JIRA)

he patch still has various logic to handle the sentinel values

Are you talking about TSDC? I thought we agreed that initializing to Float.NEG_INF is reasonable for TSDC? If not, then I can remove it from there as well as the changes done to PQ.

Maybe we should add a "start()" method to Scorer

Could be useful - but then we should probably do it on DocIdSetIterator with default impl, and override where it makes sense (BS and BS2)? Also, if we do this, why not adding an end() too, allowing a DISI to release resources? And if we document that calling next() and skipTo() without start() before that may result in an unspecified behavior, it will resemble somewhat to TermPositions, where you have to call next() before anything else.

However, this should be done with caution. BS2 calls initCountingSumScorer in two places: (1) next() and skipTo() and (2) score(Collector). Now, in the latter, it first checks if allowDocsOutOfOrder and if so initializes BS, with adding the Scorers that were added in add(). However those Scorers must not be initalized prior to creating BS, since they will be advanced. So now it gets tricky - upon call to start(), what should BS2 do? Check allowDocsOutOfOrder to determine if to initialize or not? And what if it is true but score(Collector) will not be called, and instead next() and skipTo()? We should also protect against calling start() more than once, and in Scorers that aggregate several scorers, we should make sure their start() is called after all Scorers were added ... gets a bit complicated. What do you think?

Also, I fear we need to conditionalize the "don't need to break ties by docID", because BooleanScorer doesn't visit docs in order?

Yes I kept BS and BS2 in mind ... if we condiionalize anything, it means extra 'if'. If we want to avoid that 'if', we need to create a variant of the class, which might not be so bad in TSDC, but will look awful in TFC (additional 6(?) classes). Perhaps we should still attempt to add to PQ if cmp == 0? Or did you have something else in mind?

asfimport commented 15 years ago

Michael McCandless (@mikemccand) (migrated from JIRA)

I thought we agreed that initializing to Float.NEG_INF is reasonable for TSDC?

Woops, sorry, you're right. I'm just losing my mind.

I think the javadoc for PriorityQueue.addSentinelObjects should state that the Objects must all be logically "equal"? Ie we do a straight copy into the pqueue, so if they are not equal then the pqueue is in a messed up state.

Actually that method is somewhat awkward. I wonder if, instead, we could define an Object getSentinelObject(), returning null by default, and the pqueue on creation would call that and if it's non-null, fill the queue (by calling it maxSize times)?

Could be useful - but then we should probably do it on DocIdSetIterator with default impl, and override where it makes sense (BS and BS2)? Also, if we do this, why not adding an end() too, allowing a DISI to release resources?

Actually.... shouldn't Weight.scorer(...) in general be the place where such "pre-next() initializatoin" is done? EG BooleanWeight.scorer(...) should call BS2's initCountingSumScorer (and/or somehow forward to BS)?

Yes I kept BS and BS2 in mind ... if we condiionalize anything, it means extra 'if'. If we want to avoid that 'if', we need to create a variant of the class, which might not be so bad in TSDC, but will look awful in TFC (additional 6 classes).

Yeah that's (the * 2 splintering) is what I was fearing. At some point we should leave this splintering to source code specialization...it's getting somewhat crazy now.

Perhaps we should still attempt to add to PQ if cmp == 0?

That basically undoes the "don't fallback to docID" optimization right?

Or did you have something else in mind?

The 6 new classes is what I feared we'd need to do. Else, with the changes here (that never break ties by docID), TopFieldCollector can't be used with BooleanScorer (which breaks back compat).

I guess since the 6 classes are hidden under the TopFieldCollector.create it's maybe not so bad?

asfimport commented 15 years ago

Shai Erera (@shaie) (migrated from JIRA)

I wonder if, instead, we could define an Object getSentinelObject(), returning null by default, and the pqueue on creation would call that and if it's non-null, fill the queue (by calling it maxSize times)?

Some extensions of PQ may not know how to construct such a sentinel object. Consider ComparablePQ, which assumes all items are Comparable. Unlike HitQueue, it does not know what will be the items stored in the queue. But ... I guess it can return a Comparable which always prefers the other element ... so maybe that's not a good example. I just have the feeling that a setter method will give us more freedom, rather than having to extend PQ just for that ...

Actually.... shouldn't Weight.scorer(...) in general be the place where such "pre-next() initializatoin" is done?

Ok - so BS's add() is only called from BS2.score(Collector). Therefore BS can be initialized from BS2 directly. Since both are package-private, we should be safe. BS2's add() is called from BooleanWeight.scorer() (I'm sorry if I repeat what you wrote above, but that's just me learning the code), and so can be initialized from there ... hmm I wonder why this wasn't done so far?

I'll give it a try.

That basically undoes the "don't fallback to docID" optimization right?

Right ... it's too late for me :)

I guess since the 6 classes are hidden under the TopFieldCollector.create it's maybe not so bad?

It's just that maintaining that class becomes more and more problematic. It already contains 6 inner classes, which duplicate the code to avoid 'if' statements. Meaning every time a bug is found, all 6 need to be checked and fixed. With that proposal, it means 12 ...

But I wonder from where would we control it ... IndexSearcher no longer has a ctor which allows to define whether docs should be collected in order or not (the patch removes it). The only other place where it's defined is in BooleanQuery's static setter (affects all boolean queries). But BooleanWeight receives the Collector, and does not create it ... So, if we check in IndexSearcher's search() methods whether this parameter is set or not, we can control the creation of TSDC and TFC. And if anyone else instantiates them on his own, he should know whether he executes searches in-order or not. Back-compat-wise, TFC and TSDC are still in trunk and haven't been released, so we shouldn't have a problem right?

Does that sound like a good approach? I still hate to duplicate the code in TFC, but I don't think there's any other choice. Maybe create completely separate classes for TFC and TSDC? although that will make code maintenance even harder.

asfimport commented 15 years ago

Earwin Burrfoot (migrated from JIRA)

Use FMPP? It is pretty nice and integrates well into maven/ant builds. I'm using it for primitive-specialized fieldcaches.

asfimport commented 15 years ago

Shai Erera (@shaie) (migrated from JIRA)

Forgive my ignorance, but what is FMPP? And to which of the above is it related? :)

asfimport commented 15 years ago

Earwin Burrfoot (migrated from JIRA)

Forgive my ignorance, but what is FMPP?

Forgive my laziness, http://fmpp.sourceforge.net/ - "What is FMPP? FMPP is a general-purpose text file preprocessor tool that uses FreeMarker templates."

And to which of the above is it related?

to this

It's just that maintaining that class becomes more and more problematic. It already contains 6 inner classes, which duplicate the code to avoid 'if' statements. Meaning every time a bug is found, all 6 need to be checked and fixed. With that proposal, it means 12 ...

Mike experimented with generated code for specialized search, I see no reasons not to use the same approach for cases where you're already hand-coding N almost-identical classes. You're generating query parser after all :) For official release FMPP is superior to Python as it can be bundled in a crossplatform manner.

asfimport commented 15 years ago

Shai Erera (@shaie) (migrated from JIRA)

bq.EG BooleanWeight.scorer(...) should call BS2's initCountingSumScorer (and/or somehow forward to BS)?

Ok that "somehow forward to BS" is more problematic than I thought so initially. BS2.score(Collector) determines whether to instantiate a new BS, add the scorers and call bs.score(Collector), or to execute the score itself. On the other hand, it uses the same scorers in next() and skipTo(). Therefore there's kind of a mutual exclusiveness here: either the scorers are used by BS or by BS2. They cannot be used by both, unless we clone() them. If we want to clone them, we need to:

hmm I wonder why this wasn't done so far?

I think I understand now ... the decision on which path to take can only be determined after score(Collector) is called, or next()/skipTo(). Before that, i.e., when BW returns BS2 it does not know how it will be used, right? The decision is made by IndexSearcher.doSearch depending on whether there's a filter (next()/skipTo() are used) or not (score(Collector)).

So perhaps we should revert back to having start() on DISI? Since IndexSearcher can call start before iterating over the docs, but not if it uses scorer.score(Collector), which is delegated to the scorer. In that case, we should check whether the countingSumScorer was initialized and if not initialize it outselves.

Am I missing something?

asfimport commented 15 years ago

Michael McCandless (@mikemccand) (migrated from JIRA)

Some extensions of PQ may not know how to construct such a sentinel object. Consider ComparablePQ, which assumes all items are Comparable. Unlike HitQueue, it does not know what will be the items stored in the queue. But ... I guess it can return a Comparable which always prefers the other element ... so maybe that's not a good example. I just have the feeling that a setter method will give us more freedom, rather than having to extend PQ just for that ...

Such extensions shouldn't use a sentinel?

Various things spook me about the separate method: one could easily pass in bad sentinels, and then the queue is in an invalid state; the method can be called at any time (whereas the only time you should do this is on init); you could pass in wrong-sized array; the API is necessarily public (whereas with getSentinel() it'd be protected).

We can mull it over some more... sleep on it ;)

Right ... it's too late for me

I've been starting to wonder if you are a robot...

hmm I wonder why this wasn't done so far?

I don't know! Seems like a simple optimization. So we don't need start/end (now at least).

oIt's just that maintaining that class becomes more and more problematic.

I completely agree: this is the tradeoff we have to mull. But I like that all these classes are private (it hides the fact that there are 12 concrete impls).

I think I'd lean towards the 12 impls now. They are tiny classes.

But I wonder from where would we control it

Hmm.. yeah good point. The only known place in Lucene's core that visits hits out of order is BooleanScorer. But presumably an external Query somewhere may provide a Scorer that does things out of order (maybe Solr does?), and so technically making the core collectors not break ties by docID by default is a break in back-compat.

Maybe we should add a "docsInOrder()" method to Scorer? By default it returns false, but we fix that to return true for all core Lucene queries? And then IndexSearcher consults that to decide whether it can do this?

asfimport commented 15 years ago

Michael McCandless (@mikemccand) (migrated from JIRA)

Use FMPP?

I think the first question we need to answer is whether we cutover to specialization for this. At this point I don't think we need to, yet (I think the 12 classes is tolerable, since they are tiny and private).

The second question is, if we do switch to specialization at some point (which I think we should: the performance gains are sizable), how should we do the generation (Python, Java, FMPP, XSLT, etc.). I think it's a long time before we need to make that decision (many iterations remain on LUCENE-1594).

asfimport commented 15 years ago

Michael McCandless (@mikemccand) (migrated from JIRA)

the decision on which path to take can only be determined after score(Collector) is called, or next()/skipTo().

Oh I see: the Scorer cannot know on creation if it's the "top" scorer (score(Collector) will be called), or a secondary one (next()/skipTo(...) will be called).

Hmm yeah maybe back to DISI.start(). I think as long the actual code that will next()/skipTo(...) through the iterator is the only one that calls start(), the BS/BS2 double-start problem won't happen?

Really, somehow, it should be explicit when a Scorer will be topmost. IndexSearcher knows this when it creates it.

asfimport commented 15 years ago

Shai Erera (@shaie) (migrated from JIRA)

I think I'd lean towards the 12 impls now. They are tiny classes.

If we resolve everything else, that should not hold us back. I'll do it.

We can mull it over some more... sleep on it.

Ok sleeping did help. Originally I thought that the difference between our thinking is that you think that PQ should know how to construct a sentinel object, while I thought the code which uses PQ should know that. Now I realize both are true - the code which uses PQ, or at least instantiates PQ, already knows how to create those sentinel objects, since it determines which PQ impl to instantiate. I forgot for a moment that PQ is not a concrete class, and anyone using it should create his own specialized PQ, or reuse an existing one, but anyway that specialized PQ should know how to create the sentinel objects and compare them to real objects.

So I'm ok with it - I'll make the following changes, as you suggest:

  1. Add a protected getSentinelObject() which returns null. Use it in PQ.init() to fill the queue if it's not null.
  2. Make the necessary changes to HitQueue.
  3. Remove the addSentinelObjects from PQ and the code from TSDC.

BTW, we should be aware that this means anyone using HitQueue needs to know that upon initialization it's filled with sentinel objects, and that its size() will be maxSize etc. Since HQ is package private I don't have a problem with it. Generally speaking, the code which instantiates a PQ and the code that uses it must be in sync ... i.e., if I instantiate a PQ and pass it to some other code which just receives a PQ and adds elements to it, that code should not rely on size() being smaller or anything. I don't feel it complicates things ... and anyway someone can always create a PQ impl which receives a boolean that determines whether sentinel objects should be created or not and if not return null in its getSentinelObject().

Maybe we should add a "docsInOrder()" method to Scorer?

I'm not sure that will solve it .. BS2 consults its allowDocsOutOfOrder only if score(Collector) is called, which it then instantiates a BS and delegates the score(Collector) to. So suppose that BS.docsInOrder return false, what will BS2 return? Remember that it may be used by IndexSearcher in two modes: (1) without a filter - BS2.score(Collector), (2) with filter - BS2.next() and skipTo(). So it cannot consult its own allowDocsOutOfOrder (even though it gets it as a parameter) since depending on how it will be used, the answer is different. BTW, IndexSearch.doSearch creates the Scorer, but already receives the Collector as argument, therefore at this point it's too late to make any decisions regarding orderness of docs, no?

There are few issues that we need to solve:

  1. A user can set BooleanQuery.setAllowDocsOutOfOrder today, which may trigger BS2.score(Collector) to instantiate BS, which may screw up the Collector's logic if it assumes in-order documents. IndexSearcher creates the Collector before it knows whether BQ is used or not so it cannot do any intelligent checks. I see two possible solutions, which only 1 of them may be implemented now and the other in 3.0:
    1. Add docsInOrder to Weight (it's an interface, therefore just in 3.0), since that seems to allow IS to check if the current query may visit documents out-of-order.

      * Actually, maybe we add it to Query, which is abstract, and in IS we do weight.getQuery().docsInOrder()?

      In IS we check BQ.getAllowDocsOutOfOrder() and if true we always create out-of-order collectors. That might impact performance if there are no BQ clauses, but I assume it is not used much? And this doesn't break back-compat since that's the only way to instantiate an out-of-order Scorer today (besides creating your own).

  2. Someone can create his own Collector, but will have no way to know if the docs will be sent in-order or not. Whatever we do, we have to allow people to correctly 'predict' the behavior of their Collectors, that's why I like the BQ static setting of that variant. The user is the one that sets it to true, so he/she should know that and create their appropriate Collector instance.
    • On the other hand, if we choose to add that information to Query, those Collectors may not have that information in hand when they are instantiated ...

So I'm torn here. Adding that information to Query will solve it for those that use the convenient search methods (i.e., those that don't receive a Collector), but provide their own Query impl, since if we add a default impl to Query which returns false (i.e., out-of-order), it should not change the behavior for them. And if they always return docs in-order, they can override it to return true.

About those that pass in Collector ... do we really have a problem? I mean, today that have no choice but to pass in a Collector that expects out-of-order docs, right? We did not make any promises in 2.4.1 regarding documents order? So in the worse case their code will be slightly inefficient by perhaps unnecessarily attempts to insert docs that compare equal to the top of the queue. And since they can always create the Query and only then create the Collector, if we add that info to Query they should have enough information at hand to create the proper Collector instance.

If we do add it to Query, then I'd like to deprecate BQ's static setter and getter of that attribute and provide a docsInOrder() impl, but we need to resolve how it will know whether it will use BS or BS2.

I apologize for the long post, but I'd like to verify that I didn't miss anything regarding back-compat and possible usage, so that we make the right decision.

asfimport commented 15 years ago

Michael McCandless (@mikemccand) (migrated from JIRA)

Ok sleeping did help.

OK...good morning (afternoon)!

BTW, we should be aware that this means anyone using HitQueue needs to know that upon initialization it's filled with sentinel objects, and that its size() will be maxSize etc. Since HQ is package private I don't have a problem with it.

Good point – can you update HitQueue's javadocs stating this new behavior?

BTW, IndexSearch.doSearch creates the Scorer, but already receives the Collector as argument, therefore at this point it's too late to make any decisions regarding orderness of docs, no?

Urgh yeah right. So docsInOrder() should be added to Weight or Query I guess. Maybe name it "scoresDocsInOrder()"?

Add docsInOrder to Weight (it's an interface, therefore just in 3.0)

Aside: maybe for 3.0 we should do a hard cutover of any remaining interfaces, like Weight, Fieldable (if we don't replace it), etc. to abstract classes?

Remember that it may be used by IndexSearcher in two modes: (1) without a filter - BS2.score(Collector), (2) with filter - BS2.next() and skipTo().

I'd really love to find a way to make this more explicit. EG you ask the Weight for a topScorer() vs a scorer(), or something. Clearly the caller of .scorer() knows full well how the instance will be used (top or not)... we keep struggling with BS/2 because this information is not explicit now.

This would also enable BQ.scorer() to directly return a BS vs BS2, rather than the awkward BS2 wrapping a BS internally in its score(Collector) method.

So how about adding a "topScorer()" method, that defaults to scorer()? (Ugh, we can't do that until 3.0 since Weight is an interface).

But actually: the thing calling scoresDocsInOrder will in fact only be calling that method if it intends to use the scorer as a toplevel scorer (ie, it will call scorer.score(Collector), not scorer.next/skipTo)? So BQ.scoresDocsInOrder would first check if it's gonna defer to BS (it's a simple OR query) and then check BS's static setting.

In IS we check BQ.getAllowDocsOutOfOrder() and if true we always create out-of-order collectors. That might impact performance if there are no BQ clauses, but I assume it is not used much? And this doesn't break back-compat since that's the only way to instantiate an out-of-order Scorer today (besides creating your own).

This back compat break worries me a bit. EG maybe Solr has some scorers that run out-of-order?

Also this is not really a clean solution: sure, it's only BQ today that does out-of-order scoring, but I can see others doing it in the future. I can also see making out-of-order-scoring more common and in fact the default (again) for BQ, since it does give good performance gains. Maybe other Query scorers should use it too.

So I think I'd prefer the "add scoresDocsOutOfOrder" to Query. Note that this must be called on the rewritten Query.

And since they can always create the Query and only then create the Collector, if we add that info to Query they should have enough information at hand to create the proper Collector instance.

Right, adding the method to Query gives expert users the tools needed to make their own efficient collectors.

If we do add it to Query, then I'd like to deprecate BQ's static setter and getter of that attribute and provide a docsInOrder() impl, but we need to resolve how it will know whether it will use BS or BS2.

OK, you mean make this a non-static setter/getter? I would actually prefer to default it to "true", as well. (It's better performance). But that should wait for 3.0 (be sure to open a "for 3.0" followon issue for this one, too!).

BTW, even though BS does score docs out of order, it still visits docs in order "by bucket". Meaning when visiting docs, you know the docIDs are always greater than the last bucket's docIDs.

This gives us a source of optimization: if we hold onto the "bottom value in the pqueue as of the last bucket", and then we can first compare that bottom value (with no tie breaking by docID needed) and if that competes, then we do the "true, current bottomValue with breaking tie by docID" comparison to see if it makes it into the queue. But I'm not sure how we'd cleanly model this "docIDs are in order by bucket" case of the scorer, and take advantage of that during collection. I think it'd require extending the FieldComparator API somehow, eg "get me a BottomValueComparator instance as of right now".

This can also be the basis for a separate strong optimization, which is down in the TermScorer for a BooleanScorer/2, skip a docID if its field value is not competitive. This is a bigger change (way outside the scope of this issue, and in fact more related to LUCENE-1536).

asfimport commented 15 years ago

Shai Erera (@shaie) (migrated from JIRA)

Good point - can you update HitQueue's javadocs stating this new behavior?

I actually prefer to add a boolean prePopulate to HitQueue's ctor. Since it's pacakge-private it should be ok and I noticed it is used in several classes like MultiSearcher, ParallelMultiSearcher which rely on its size() value. They just collect the results from the different searchers. I don't think they should be optimized to pre-populate the queue, so instead of changing their code to not rely on size(), I prefer them to init HQ with prePopulate=false.

If changing the ctor is not accepted, then perhaps add another ctor which accepts that argument?

Regarding the rest, I need to read them carefully before I'll comment.

asfimport commented 15 years ago

Michael McCandless (@mikemccand) (migrated from JIRA)

I actually prefer to add a boolean prePopulate to HitQueue's ctor.

+1