
Change IndexSearcher multisegment searches to search each individual segment using a single HitCollector [LUCENE-1483] #2557

Closed asfimport closed 15 years ago

asfimport commented 15 years ago

This issue changes how an IndexSearcher searches over multiple segments. The current method of searching multiple segments is to use a MultiSegmentReader and treat all of the segments as one. This causes filters and FieldCaches to be keyed to the MultiReader and makes reopen expensive. If only a few segments change, the FieldCache is still loaded for all of them.

This patch changes things by searching each individual segment one at a time, but sharing the HitCollector used across each segment. This allows FieldCaches and Filters to be keyed on individual SegmentReaders, making reopen much cheaper. FieldCache loading over multiple segments can be much faster as well - with the old method, all unique terms for every segment are enumerated against each segment; because of the likely logarithmic change in terms per segment, this can be very wasteful. Searching individual segments avoids this cost. The term/document statistics from the multireader are used to score results for each segment.

When sorting, it's more difficult to use a single HitCollector for each sub searcher. Ordinals are not comparable across segments. To account for this, a new field-sort-enabled HitCollector is introduced that is able to collect and sort across segments (because of its ability to compare ordinals across segments). This TopFieldCollector class will collect the values/ordinals for a given segment, and upon moving to the next segment, translate any ordinals/values so that they can be compared against the values for the new segment. This is done lazily.

All in all, the switch seems to provide numerous performance benefits, in both sorted and non-sorted search. We were seeing a sizable loss on indices with lots of segments (1000?) and certain queue sizes/queries, but the latest results seem to show that's been mostly taken care of (you shouldn't be using such a large queue on such a segmented index anyway).
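Conceptually, the per-segment loop ends up looking something like the sketch below (illustrative only; the collector hook and helper interfaces here are made up to show the shape of the change, they are not the patch's actual API):

```java
// Simplified sketch of the per-segment collection loop (illustrative only; the
// real patch goes through Weight/Scorer and the patch's actual collector API).
import java.io.IOException;
import org.apache.lucene.index.IndexReader;

abstract class SegmentSharingCollector {
  /** Called before hits from the next sequential sub-reader are collected, so
   *  FieldCaches/Filters can be keyed to the individual SegmentReader. */
  abstract void setNextReader(IndexReader reader, int docBase) throws IOException;

  /** Called with a docID relative to the current segment. */
  abstract void collect(int segmentDoc, float score) throws IOException;
}

class PerSegmentSearchSketch {
  interface SegmentScorer { void score(SegmentSharingCollector collector) throws IOException; }
  interface ScorerFactory { SegmentScorer scorer(IndexReader reader) throws IOException; }

  /** Search each segment in turn, sharing a single collector across all of them. */
  static void search(IndexReader[] subReaders, int[] docStarts,
                     ScorerFactory scorers, SegmentSharingCollector collector) throws IOException {
    for (int i = 0; i < subReaders.length; i++) {
      collector.setNextReader(subReaders[i], docStarts[i]);
      scorers.scorer(subReaders[i]).score(collector);
    }
  }
}
```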


Migrated from LUCENE-1483 by Mark Miller (@markrmiller), 1 vote, resolved Feb 02 2009
Attachments: LUCENE-1483.patch (versions: 35), LUCENE-1483-backcompat.patch, LUCENE-1483-partial.patch, sortBench.py, sortCollate.py
Linked issues:

asfimport commented 15 years ago

Mark Miller (@markrmiller) (migrated from JIRA)

One small error:

should be

asfimport commented 15 years ago

Mark Miller (@markrmiller) (migrated from JIRA)

That actually had a System.out left in as well. Here's another patch that takes care of the above error and the System.out, and a manual fix of the HitCollector $Id$.

asfimport commented 15 years ago

Michael McCandless (@mikemccand) (migrated from JIRA)

Thanks Mark. FWIW, when I generate a patch (with "svn diff") that touches lines near the $Id$ tag, I too cannot apply the patch. So it seems like "svn diff" has some "smarts" whereby it un-expands a keyword, thus screwing up the patch. We really need that "svn patch" command... (which IIRC is coming in an upcoming svn release).

asfimport commented 15 years ago

Mark Miller (@markrmiller) (migrated from JIRA)

Some more quick comments:

the StringIndex value version does too much array dereferencing, so I'd fix that, or just use the String version.

I'm doing ordinals by keeping a second Double subord array. Every time ords are mapped to the new IndexReader, if an ord doesn't map directly onto the new terms array, then if the subord is not 0, I multiply the old ord mapping into the subord and give it the new ord, e.g. the subord becomes the old ord times the current subord and the ord is updated. When comparing, if two ords are the same, we drop to the subord.

I haven't thought about precision issues (it probably doesn't work, but I don't know), but it works for the tests.

asfimport commented 15 years ago

Michael McCandless (@mikemccand) (migrated from JIRA)

BTW, one important difference w/ the new TopFieldValueDocCollector is it does not track the max score – we probably need to add that back in, until Hits is removed in 3.0 (is it needed beyond that?).

asfimport commented 15 years ago

Michael McCandless (@mikemccand) (migrated from JIRA)

There's something wrong w/ the SortField.STRING_ORD case – I'm trying "sort by title" w/ a Wikipedia index, and while I see the same results in a clean checkout vs STRING_VAL, I get different (wrong) results for STRING_ORD.

Clean checkout & STRING_VAL get:

hit 0: docID=1974521 title="Born into Trouble as the Sparks Fly Upward."
hit 1: docID=688913 title="Into The Open" Exhibition
hit 2: docID=1648 title="Love and Theft"
hit 3: docID=599545 title="Repent, Harlequin!" Said the Ticktockman
hit 4: docID=349499 title="The Spaghetti Incident?"

but STRING_ORD gets this:

hit 0: docID=599545 title="Repent, Harlequin!" Said the Ticktockman
hit 1: docID=688913 title="Into The Open" Exhibition
hit 2: docID=1974521 title="Born into Trouble as the Sparks Fly Upward."
hit 3: docID=992439 title='Abd al-Malik II
hit 4: docID=1951563 title='Auhelawa language

I haven't tried to track it down yet...

asfimport commented 15 years ago

Mark Miller (@markrmiller) (migrated from JIRA)

> I haven't tried to track it down yet...

I wouldn't. While the method I am using passes the tests, it's not likely viable (maybe even on a smaller scale than I would have guessed, based on what you're seeing). But since it's close, I figure a benchmark of it against using values should tell us a lot about whether it makes sense to keep pushing with ord. Ord will no doubt end up a little slower if it's to work properly, but comparing them now should give us a gauge of using values instead. That's what we were looking for, right?

I still have tons I want to look at in this patch (and hopefully some ideas/suggestions). I haven't looked at it at all in that context yet though. I merely sat down and made the tests pass one by one with little consideration for anything else.

asfimport commented 15 years ago

Mark Miller (@markrmiller) (migrated from JIRA)

Another thing I'll do is add another sort test - the tests may not hit all of the edge cases - I don't think they hit compare(ord, doc, score) at all, for one (if I am remembering right).

asfimport commented 15 years ago

Mark Miller (@markrmiller) (migrated from JIRA)

I just made a quick new test for what I'm doing with ords - it seems once I add more than about 500 docs, one or two are out of order, and that problem compounds as the number goes up. Either there's a special case I am missing, or it's precision issues.

asfimport commented 15 years ago

Michael McCandless (@mikemccand) (migrated from JIRA)

OK I ran an initial test, though since the ord approach is a "bit" buggy we can't be sure how well to trust these results.

I indexed the first 2M docs from Wikipedia into a 101-segment index, then search for "text" (hits 97K results), sorting by title, pulling the best 100 hits. I do the search 1000 times in each round.

Current trunk (best 107.1 searches/sec):

Operation            round   runCnt   recsPerRun        rec/s  elapsedSec    avgUsedMem    avgTotalMem
XSearchWarm              0        1            1          0.0       93.64   463,373,760  1,029,046,272
XSearchWithSort_1000     0        1         1000        100.6        9.94   463,373,760  1,029,046,272
XSearchWithSort_1000     1        1         1000        107.1        9.34   572,969,344  1,029,046,272
XSearchWithSort_1000     2        1         1000        105.5        9.48   572,969,344  1,029,046,272
XSearchWithSort_1000     3        1         1000        106.2        9.41   587,068,928  1,029,046,272

Patch STRING_ORD (best 102.0 searches/sec):

Operation            round   runCnt   recsPerRun        rec/s  elapsedSec    avgUsedMem    avgTotalMem
XSearchWarm              0        1            1          0.5        2.16   384,153,600  1,029,046,272
XSearchWithSort_1000     0        1         1000         94.1       10.63   439,173,824  1,029,046,272
XSearchWithSort_1000     1        1         1000        100.7        9.93   439,173,824  1,029,046,272
XSearchWithSort_1000     2        1         1000        101.9        9.81   573,822,208  1,029,046,272
XSearchWithSort_1000     3        1         1000        102.0        9.81   573,822,208  1,029,046,272

Patch STRING_VAL (best 34.6 searches/sec):

Operation            round   runCnt   recsPerRun        rec/s  elapsedSec    avgUsedMem    avgTotalMem
XSearchWarm              0        1            1          0.4        2.24   368,201,088  1,029,046,272
XSearchWithSort_1000     0        1         1000         34.6       28.94   415,107,648  1,029,046,272
XSearchWithSort_1000     1        1         1000         33.9       29.54   415,107,648  1,029,046,272
XSearchWithSort_1000     2        1         1000         33.9       29.46   545,339,904  1,029,046,272
XSearchWithSort_1000     3        1         1000         34.0       29.40   545,339,904  1,029,046,272

Notes:

I think we now need to fix the STRING_ORD bug & retest.

asfimport commented 15 years ago

Mark Miller (@markrmiller) (migrated from JIRA)

Thanks Mike. Yup - that says enough for me. We have to get ords working. I don't think ords will get faster though (you wouldn't surprise me though), I think they will get slower. Certainly they should stay much faster than value though - what a dog. But who knows - we can test falling back to value compare instead of subords as well. My naive ords attempt now keeps the previous mappings' ord order by multiplying as we move to new readers - that's going to explode into some very large numbers pretty fast, and I don't expect we can get by so easily. Either fall back to value will be good enough, or we will probably have to map to new ords rather than simply multiplying to retain each stage's ordering.

I'll keep playing with the ords on my end though - I only got it to pass those tests moments before that patch went up. I try to keep a frantic pace because I never know if my spare cycles will go away - I have juggled a defensive plate for a while :) No doubt I'll squeeze in some more hours tonight though.

asfimport commented 15 years ago

Mark Miller (@markrmiller) (migrated from JIRA)

I'm starting to think fall back to value won't be so bad. I'll give you another cut that does fall back to value, plus whatever I have to do to get the subord stuff right.

asfimport commented 15 years ago

Michael McCandless (@mikemccand) (migrated from JIRA)

We should definitely try fallback compare-by-value.

But, sort by title (presumably unique key) is actually a worst case for us, because all values in the queue will not exist in the next segment. So it's a good test ;) We should also test sorting by an enum field ("country", "state").

Thinking more about how to compute subords... I think we could store ord & subord each as int, and then efficiently translate them to the next segment with a single pass through the queue, in sort key order. This would ensure we hit all the dups (different Strings that map to the same ord in the next segment, but different subords) in one cluster. And, the subord could be easily computed by simply incrementing (starting with 1) in key sort order, until the cluster is done.

It should be simple to step through the pqueue's heap in sort order min->max (w/o removing the entries which is the "normal" heapsort way to sort the elements); you'd need to maintain some sort of queue to keep track of the "frontier" as you walk down the heap. But I haven't found a cookbook example yet... It should be fast since we can use the ord/subords in the queue for all within-queue comparisons.

We could also save time on the binary search by bounding the search by where we just found the last key. It may be worth tracking the max value in the queue, to bound the other end of the search. For a big search the queue should have a fairly tight bound.

asfimport commented 15 years ago

Mark Miller (@markrmiller) (migrated from JIRA)

> We should also test sorting by an enum field ("country", "state").

I'm actually trying on random data that I am creating, and I'm getting different results. Oddly, in many cases Values seem to beat Trunk.

> We could also save time on the binary search by bounding the search by where we just found the last key. It may be worth tracking the max value in the queue, to bound the other end of the search. For a big search the queue should have a fairly tight bound.

Right, I've been thinking of ways to cut that down too. It's definitely binary searching much more than it needs to. That seems to be a big ord slowdown from what I can tell - fallback to compare-by-value is actually appearing slower than by value. I've made a couple of small optimizations, but there's definitely more.

> Thinking more about how to compute subords...

Cool. Great idea to think about. My main search has been to get subord as an int. Getting both to int would certainly be optimal. Everything I've come up with seems too expensive though - I'll try to run with that idea.

asfimport commented 15 years ago

Michael McCandless (@mikemccand) (migrated from JIRA)

Ugh – that approach won't work, because the pqueue in the collector is not necessarily sorted primarily by our field (eg if the String sort field is not the first SortField). So we don't have a fast way to visit the keys in sorted order.

asfimport commented 15 years ago

Mark Miller (@markrmiller) (migrated from JIRA)

Okay, I just full on did it inefficiently, and maybe I can work backwards a little. Seems to be solid.

I just collect the old ords by making a map with the mapped-to index as the key. The map has List values, and when a list gets more than one entry, it is added to a morethanone set. After mapping all the ords, I go through the morethanone set and sort each list - each subord is then set based on its index in the sorted list.

We already knew that was easy enough - I just think it's probably on the terribly inefficient side. Now to think about whacking pieces off. It just makes me not very hopeful to start at something so slow. And still the double ords :( Perhaps the negative int could still come into play though.

Way too many little objects being made...
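Roughly, as a sketch (not the actual patch - the real version still uses double ords at this point, and the array/variable names here are made up):

```java
// Sketch of the remapping described above (illustrative only; the actual code
// differs - e.g. it still uses double ords at this point - and the names here
// are made up).
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

class OrdRemapSketch {
  static void remap(String[] values, int[] ords, int[] subords, String[] newSegmentTerms) {
    Map<Integer, List<Integer>> slotsByOrd = new HashMap<>();  // mapped-to ord -> queue slots
    Set<Integer> moreThanOne = new HashSet<>();
    for (int slot = 0; slot < values.length; slot++) {
      int idx = Arrays.binarySearch(newSegmentTerms, values[slot]);
      if (idx >= 0) {            // exact match in the new segment: directly comparable
        ords[slot] = idx;
        subords[slot] = 0;
      } else {                   // miss: land on the lower-bound ord, subord >= 1
        ords[slot] = -idx - 2;
        subords[slot] = 1;
        List<Integer> slots = slotsByOrd.computeIfAbsent(ords[slot], k -> new ArrayList<>());
        slots.add(slot);
        if (slots.size() > 1) moreThanOne.add(ords[slot]);
      }
    }
    // Second pass: where several queued values landed on the same ord, sort those
    // slots by value and hand out subords 1, 2, 3, ... in that order.
    for (int ord : moreThanOne) {
      List<Integer> slots = slotsByOrd.get(ord);
      slots.sort(Comparator.comparing(s -> values[s]));
      for (int i = 0; i < slots.size(); i++) {
        subords[slots.get(i)] = i + 1;
      }
    }
  }
}
```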

asfimport commented 15 years ago

Michael McCandless (@mikemccand) (migrated from JIRA)

Can you post your inefficient remapping version?

I too created the fallback version (just set subord to -1 when value isn't found in the new index & respect that in the two compare methods). I confirmed it gives correct top 50 results. This then brings ord perf to 97.6 searches/sec (vs trunk 107.1 searches/sec), so that's our number to beat since it seems to be bug-free.
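Roughly, the compare ends up looking like this sketch (illustrative only; the array names and int ords here are assumptions, and the real code also handles reverse sort, etc.):

```java
// Sketch of the ord-compare-with-value-fallback idea (illustrative only; the
// array names and int ords are assumptions). subord == -1 marks a queued value
// that wasn't found in the current segment's terms; on an ord tie involving such
// an entry we can't trust subords, so we fall back to comparing the values.
class OrdFallbackCompareSketch {
  static int compare(int slotA, int slotB, int[] ords, int[] subords, String[] values) {
    if (ords[slotA] != ords[slotB]) {
      return ords[slotA] < ords[slotB] ? -1 : 1;
    }
    if (subords[slotA] == -1 || subords[slotB] == -1) {
      return values[slotA].compareTo(values[slotB]);   // fall back to the actual String values
    }
    return subords[slotA] - subords[slotB];
  }
}
```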

Then I ran "MatchAllDocsQuery" (to test a much larger result set – this returns 2M hits but the previous query "text" returned ~97K hits), sorting by title, queue size=100. Trunk still has unbelievably slow warming (95 sec), and then gets 7.6 searches/sec. Patch ord search (with fallback) gets 30.7 searches/sec.

This is very interesting and odd – I can't explain why ord searching w/ fallback is so much faster than current trunk when the number of hits is large (2M). I think this is very important because it's the big slow queries that are most important to improve here, even if it's at some cost to the queries that are already fast.

Ie, we still need to do more tests, but if this result holds (and we need to explain the difference), I think it's a strong vote for the ord+fallback approach. Not to mention, it also sidesteps the absurdly slow warming time of FieldCache.StringIndex on a Multi*Reader.

asfimport commented 15 years ago

Michael McCandless (@mikemccand) (migrated from JIRA)

When I run MatchAllDocsQuery (2M hits), with a queue size of 10 instead of 100, trunk still gets 7.6 searches/sec and ord w/ fallback gets 33.1 searches/sec.

asfimport commented 15 years ago

Michael McCandless (@mikemccand) (migrated from JIRA)

I tested the query "1", which gets 384K hits. Trunk gets 48.1 searches/sec and ord w/ fallback gets 55.0.

So somehow as the result set gets larger, with a crossover somewhere between 97K hits (query "text") and 384K hits (query "1"), the ord w/ fallback becomes faster and then gets much faster as the result set gets quite large (2M hits).

asfimport commented 15 years ago

Mark Miller (@markrmiller) (migrated from JIRA)

Okay, here is the super inefficient ords version.

SortField.STRING_VAL:   sort by val
SortField.STRING_ORD: sort by ord and subord
SortField.STRING_ORD_VAL: sort by ord fallback to val

Those multireader fieldcache loading times blow me away...crazy.

asfimport commented 15 years ago

Mark Miller (@markrmiller) (migrated from JIRA)

> I too created the fallback version (just set subord to -1 when value isn't found in the new index & respect that in the two compare methods).

Yours may be better than mine then - I got rid of the subord array for fallback, and if the ords are equal, I do a compare on the values.

asfimport commented 15 years ago

Mark Miller (@markrmiller) (migrated from JIRA)

> This is very interesting and odd - I can't explain why ord searching w/ fallback is so much faster than current trunk when the number of hits is large (2M).

I was testing with randomly created data (just a random number of digits (2-8), each digit randomly 0-9), and for a lot of what I was doing, straight values seemed to handily beat straight ords! It depended on how many docs I was making and how segmented I made it, I think. It wasn't very official, and the test was somewhat short, but I ran it over and over... seemed odd.

asfimport commented 15 years ago

Mark Miller (@markrmiller) (migrated from JIRA)

I guess I can't get away with my new index calculation either: ords[i] = ((-index << 1) - 3) / 2.0d;

Index is an int, so it's going to overflow.

EDIT

Wait... that's based on unique terms per reader, not the possible number of docs... guess it can stay.
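For reference, expanding that formula (index being the negative value Arrays.binarySearch returns on a miss, i.e. -(insertionPoint) - 1):

```java
// index is what Arrays.binarySearch returns on a miss: -(insertionPoint) - 1.
int insertionPoint = 1;                     // e.g. banana between apple (ord 0) and orange (ord 1)
int index = -insertionPoint - 1;            // == -2
double ord = ((-index << 1) - 3) / 2.0d;    // == insertionPoint - 0.5 == 0.5
// i.e. the missing value lands halfway between the two neighboring ords.
```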

asfimport commented 15 years ago

Mark Miller (@markrmiller) (migrated from JIRA)

Or can it? Over int.max/2 unique ids and then sorting on id would be broken, right? Okay, it would be kind of nuts to try and sort on that many unique terms, but in the future?...

EDIT

Actually, one seg would need int.max/2, but you know what I mean...

EDIT

Okay, I guess my argument with JIRA cleared up - you'd have to have the second segment or later with over int.max/2 terms. Do we care about such an insane possibility?

asfimport commented 15 years ago

Michael McCandless (@mikemccand) (migrated from JIRA)

> I was testing with randomly created data (just random number of digits (2-8), each digit randomly 0-9)

Can you post a patch for this? Seems handy to fix contrib/benchmark to be able to generate such a field...

> straight values seemed to handily beat straight ords!

I'll try to test this case too; we need to understand why natural data (Wiki titles) shows one thing but synthetic data shows the opposite. And we still need to test the enum case.

asfimport commented 15 years ago

Michael McCandless (@mikemccand) (migrated from JIRA)

> I guess I can't get away with my new index calculation either: ords[i] = ((-index << 1) - 3) / 2.0d;

I think you can use an int for the ords? Now that we have subord, when you get negative index back from binary search, you can set ord to -index-1 which is the "lower bound", and then as long as subord is at least 1 it should compare correctly.

Also, in your 2nd pass, if the list is length 1 then you can immediately set subord to 1 and move on.

In your first pass, in the "else" clause (when the value was found in the next segment) don't you need to set subord to 0?

asfimport commented 15 years ago

Mark Miller (@markrmiller) (migrated from JIRA)

> I think you can use an int for the ords? Now that we have subord, when you get negative index back from binary search, you can set ord to -index-1 which is the "lower bound", and then as long as subord is at least 1 it should compare correctly.

Ah, indeed - no need for the in-the-middle value if you can fall to the subord. I think I may have been thinking it would be nice not to fall through to the subords so much, but surely it's worth losing the double and the averaging. It seems to like -2 not -1 though, and then I start the subords at 1 rather than 0... I'll bring real thought to it later, but that's generating some good looking results.

EDIT

Arg - with -2 one test fails; with -1 it passes, but reverse sort fails :) Prob it's the -1 one then, and I've got a small issue elsewhere.

> In your first pass, in the "else" clause (when the value was found in the next segment) don't you need to set subord to 0?

Hmm... I had that commented out as I was playing... let me think - if you don't set it, and multiple values don't map to clean new ords and they are the same, it will fall to the subords... so right, you wouldn't want an old subord around.

I'm hoping there is a lot more optimization we can do to the pure ords case, but frankly I have low hopes for it - it should be a bit more competitive though.

I'm going to take some time to finish up some other pieces and then come back again, I think. I've polished up the comparators a bit, so once I get some other work in I'll put up another rev.

asfimport commented 15 years ago

Mark Miller (@markrmiller) (migrated from JIRA)

Admittedly, this is a very small (tiny) cost, and I do agree that making HitCollector know about docBase is really an abstraction violation...

I'm not sold either way. Push to scorer?

asfimport commented 15 years ago

Michael McCandless (@mikemccand) (migrated from JIRA)

But, with this new approach (single pqueue that "knows" when we transition to the next segment) aren't we back to HitCollector not only knowing the doc base but also the IndexReader we are now advancing to? (We should remove the setDocBase call).

Or I guess we could pre-add the doc (in Scorer) so that collect is called with the full docID.

Still that "tiny" performance cost nags at me ;) Most of the time the add would not have been necessary since the number of inserts into the pqueue should be a small percentage for a large number of hits. And this is the hotspot of searching for Lucene, so maybe we should not add on this cost even if it's tiny? And we can always wrap a collector that doesn't implement setIndexReader and pre-add the docId for it. It's like an "expert" DocCollector API vs the normal one.

asfimport commented 15 years ago

Michael McCandless (@mikemccand) (migrated from JIRA)

> Prob it's the -1 one then, and I've got a small issue elsewhere.

I think we want ord to be the lower bound. EG if seg 1 has:

apple -> 0
banana -> 1
orange -> 2

and then seg 2 has just apple & orange, then banana should map to ord 0 subord 1, meaning it's between ord 0 & 1, I think?

And an exact match (apple & orange in this case) should have subord 0.

> I'm hoping there is a lot more optimization we can do to the pure ords case, but frankly I have low hopes for it - it should be a bit more competitive though.

I think the perf gains are very compelling, already, for the ord fallback case & title sorting. Small result sets are slower, but large result sets are substantially faster, than current trunk. Not to mention much faster warming time (side stepping the weirdness with Multi*Reader, and, only loading FieldCache for new segments).

asfimport commented 15 years ago

Mark Miller (@markrmiller) (migrated from JIRA)

Okay, I guess that's fair enough. I was going to push down the fallback sorted search (with the old Customs) to a single reader too, but that's not actually worth keeping setDocBase for (or needed, now that I think about it). So setNextReader brings the same abstraction argument though. But what can you do, I guess - the benefits are clearly worth it, and those comparators need access to the current subreader.

asfimport commented 15 years ago

Mark Miller (@markrmiller) (migrated from JIRA)

> I think the perf gains are very compelling, already, for the ord fallback case & title sorting. Small result sets are slower, but large result sets are substantially faster, than current trunk.

Oh, I agree there - I think this patch still makes perfect sense - it brings a lot of gains. I just don't think that ords without fallback is going to get very good. I'm wondering if we should even try too hard if ord with val fallback does so well.

asfimport commented 15 years ago

Mark Miller (@markrmiller) (migrated from JIRA)

> and then seg 2 has just apple & orange, then banana should map to ord 0 subord 1, meaning it's between ord 0 & 1, I think?

apple -> 0
orange -> 1

the binary search gives back -insertionPoint - 1; the insertion point for banana is 1, so -1 - 1 = -2. So I reverse that and subtract 2 to get 0, right? It lands on apple. Then on sort, apple comes first for 0, 1 and then orange is 0, 2.

(I don't remember off hand why subord has to start at 1 not 0, but I remember it didn't work otherwise)

asfimport commented 15 years ago

Michael McCandless (@mikemccand) (migrated from JIRA)

> the binary search gives back -insertionpoint - 1, the insertion point for banana is 1, so -1 -1 = -2. So I reverse that and subtract 2 to get 0 right? It lands on apple.

Hmm – I didn't realize binarySearch is returning the insertion point on a miss. So your logic (negate then subtract 2) makes perfect sense now.

Just to be sure... maybe you should temporarily add asserts, when a negative index is returned, that values[-index-2].compareTo(newValue) < 0 and values[-index-1].compareTo(newValue) > 0 (making sure those array accesses are in bounds)?
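Something like this, with the apple/banana/orange example (just a quick sketch; bounds checks omitted):

```java
// Sketch of that sanity check with the apple/banana/orange example
// (bounds checks omitted for brevity).
String[] newSegmentTerms = { "apple", "orange" };            // ords 0 and 1 in the new segment
String newValue = "banana";
int index = java.util.Arrays.binarySearch(newSegmentTerms, newValue);
// index == -2, i.e. -(insertionPoint) - 1 with insertionPoint == 1
int ord = -index - 2;                                        // == 0: lands on apple's ord
int subord = 1;                                              // keeps banana between apple and orange
assert newSegmentTerms[-index - 2].compareTo(newValue) < 0;  // "apple"  < "banana"
assert newSegmentTerms[-index - 1].compareTo(newValue) > 0;  // "orange" > "banana"
```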

> (I dont remember off hand why subord has to start at 1 not 0, but i remember it didnt work otherwise)

This is very important – that 1 is "equivalent" to the original 0.5 proposal, ie, think of subord as the 2nd digit in a 2-digit number. That 2nd digit being non zero is how we know that even though banana's ord landed on apple's, banana is in fact not equal to apple (because the subord for banana is > 0) and is instead between apple and orange.

asfimport commented 15 years ago

Michael McCandless (@mikemccand) (migrated from JIRA)

> I just don't think that ords without fallback is going to get very good. I'm wondering if we should even try too hard if ord with val fallback does so well.

Maybe we can try a bit more (I'll run perf tests on your next iteration here?) and then start wrapping things up? Progress not perfection! We can further improve this later.

asfimport commented 15 years ago

Mark Miller (@markrmiller) (migrated from JIRA)

I'm on board with whatever you think is best.

I'll keep playing with ords.

I spent some time last night putting in most of the rest of the cleanup/finish-up that was left outside of the comparators. There's a handful of non-SortTest tests that still fail though, so I still have to fix those. I'll do that, give ords a little play time, and then I think the patch will be fairly close. Then we can take it in and bench on a fairly close-to-done version.

asfimport commented 15 years ago

Uwe Schindler (@uschindler) (migrated from JIRA)

I still have one question: Why do we need the new DocCollector? Is this really needed? Would it not be OK to just add the offset before calling collect()?

asfimport commented 15 years ago

Mark Miller (@markrmiller) (migrated from JIRA)

> I still have one question: Why do we need the new DocCollector? Is this really needed? Would it not be OK to just add the offset before calling collect()?

If it's not needed, let's get rid of it. We don't want to deprecate HitCollector if we don't have to. The main reason I can see that we are doing it at the moment is that the TopFieldValueDocCollector needs that hook so that it can set the next IndexReader for each Comparator. The Comparator needs it to create the FieldCaches and map ords from one reader to the next. Also, it lets us do the docBase stuff, which is nice because you add the docBase less often if done in the collector.

asfimport commented 15 years ago

Michael McCandless (@mikemccand) (migrated from JIRA)

> Why do we need the new DocCollector? Is this really needed? Would it be not OK to just add the offset before calling collect()?

I'd like to allow for 'expert' cases, where the collector is told when we advance to the next sequential reader and can do something at that point (like our sort-by-field collector does).

But then still allow for 'normal' cases, where the collector is unchanged with what we have today (ie it receives the "real" docID).

The core collectors would use the expert API to eke out all performance; external collectors can use either, but the 'normal' one would be simplest (and match back compat).

So then how to "implement" this approach... I would actually be fine with keeping HitCollector, adding a default "setNextReader" method, that either throws UOE or (if we are strongly against exceptions) returns "false" indicating it cannot handle sequential readers.

Then when we run searches we simply check if the collector is an "expert" one (does not throw UOE or return false from setNextReader) and if it isn't we wrap it with DocBaseCollector (which adds the doc base for every collect() call).
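In code, that option would look roughly like the sketch below (not a final API; the DocBaseCollector name and the boolean return are just illustrations of the idea):

```java
import java.io.IOException;
import org.apache.lucene.index.IndexReader;

// Sketch only: HitCollector keeps its current collect(int, float) contract and
// grows a default setNextReader that says "I don't handle per-segment docIDs".
public abstract class HitCollector {
  public abstract void collect(int doc, float score);

  /** Expert: called when the search advances to the next sequential sub-reader.
   *  The default returns false, meaning "wrap me so collect() sees real docIDs". */
  public boolean setNextReader(IndexReader reader, int docBase) throws IOException {
    return false;
  }
}

// Non-expert collectors get wrapped so their collect() still sees absolute docIDs.
class DocBaseCollector extends HitCollector {
  private final HitCollector delegate;
  private int docBase;

  DocBaseCollector(HitCollector delegate) { this.delegate = delegate; }

  @Override
  public void collect(int doc, float score) { delegate.collect(docBase + doc, score); }

  @Override
  public boolean setNextReader(IndexReader reader, int docBase) {
    this.docBase = docBase;
    return true;
  }
}
```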

asfimport commented 15 years ago

Mark Miller (@markrmiller) (migrated from JIRA)

Hmmm...we had a reason for deprecating HitCollector though. At first it was to do the capability check (instance of HitCollector would be wrapped), but that didn't pan out. I think we also liked it because people got deprecation warnings though - so that they would know to implement that method for 3.0 when we would take out the wrapper.

asfimport commented 15 years ago

Michael McCandless (@mikemccand) (migrated from JIRA)

> so that they would know to implement that method for 3.0 when we would take out the wrapper.

Right but the new insight (for me at least) is it's OK for external collectors to not code to the expert API.

Ie previously we wanted to force migration to the expert API, but now I think it's OK to allow normal API and expert API to exist together.

asfimport commented 15 years ago

Mark Miller (@markrmiller) (migrated from JIRA)

Okay, I hate the idea of leaving in the wrapper, but it is true that's too difficult of a method for HitCollector (to be required, anyway). setNextReader is a jump in understanding above setDocBase, which was bad enough.

asfimport commented 15 years ago

Mark Miller (@markrmiller) (migrated from JIRA)

Hey Mike, how about this one? BooleanScorer can collect hits out of order if you force it (against the contract). I think it's an issue with the docBase-type stuff.

Actually, I'll clarify that - I think it's an issue with the multiple-reader mojo - I didn't mean to put it solely on adding bases in particular yet.

asfimport commented 15 years ago

Michael McCandless (@mikemccand) (migrated from JIRA)

> BooleanScorer can collect hits out of order if you force it (against the contract).

Hmmm... right. You mean if you pass in allowDocsOutOfOrder=true (defaults to false).

I think this should not be a problem? (Though, I really don't fully understand BooleanScorer!). Since we are running scoring per-segment, each segment might collect its docIDs out of order, but all such docs are still within the current segment. Then when we advance to the new segment, the collector can do something if it needs to, and then collection proceeds again on the next segment's docs, possibly out of order. Ie, the out-of-orderness never jumps across a segment and then back again?

But this is a challenge for #1906, if we go with a primarily iterator-driven API.

asfimport commented 15 years ago

Mark Miller (@markrmiller) (migrated from JIRA)

I didn't think it should be a problem either, since we just push everything to one reader; but it seems to be - the only tests not passing involve allowDocsOutOfOrder=true. Do the search with it true, then the same search with it false, and you get 3 and 4 docs. 2 or 3 tests involving that fail. I don't have time to dig in till tonight though - thought you might shortcut me to the answer :)

asfimport commented 15 years ago

Doug Cutting (@cutting) (migrated from JIRA)

> I would actually be fine with keeping HitCollector, adding a default "setNextReader" method, that either throws UOE or (if we are strongly against exceptions) returns "false" indicating it cannot handle sequential readers.

Could we instead add a new HitCollector subclass, that adds the setNextReader, then use 'instanceof' to decide whether to wrap or not?

> I really don't fully understand BooleanScorer!

The original version of BooleanScorer uses a ~16k array to score windows of docs. So it scores docs 0-16k first, then docs 16k-32k, etc. For each window it iterates through all query terms and accumulates a score in table[doc%16k]. It also stores in the table a bitmask representing which terms contributed to the score. Non-zero scores are chained in a linked list. At the end of scoring each window it then iterates through the linked list and, if the bitmask matches the boolean constraints, collects a hit. For boolean queries with lots of frequent terms this can be much faster, since it does not need to update a priority queue for each posting, instead performing constant-time operations per posting. The only downside is that it results in hits being delivered out-of-order within the window, which means it cannot be nested within other scorers. But it works well as a top-level scorer.

The new BooleanScorer2 implementation instead works by merging priority queues of postings, albeit with some clever tricks. For example, a pure conjunction (all terms required) does not require a priority queue. Instead it sorts the posting streams at the start, then repeatedly skips the first to the last. If the first ever equals the last, then there's a hit. When some terms are required and some terms are optional, the conjunction can be evaluated first, then the optional terms can all skip to the match and be added to the score. Thus the conjunction can reduce the number of priority queue updates for the optional terms.

Does that help any?
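To make the windowed idea concrete, here is a heavily simplified, self-contained sketch (not the real BooleanScorer: it ignores prohibited clauses and sweeps the whole window instead of chaining non-zero buckets in a linked list, which is where the real out-of-order delivery comes from):

```java
import java.util.Arrays;
import java.util.List;

class WindowedBooleanSketch {
  /** Minimal stand-in for one clause's postings (made-up interface). */
  interface Postings {
    int doc();        // current matching docID, or Integer.MAX_VALUE when exhausted
    float score();    // score contribution for the current doc
    void next();      // advance to the next matching doc
  }

  interface Collector { void collect(int doc, float score); }

  static final int WINDOW = 2048;   // illustrative; the real table is ~16k entries

  static void score(List<Postings> clauses, int requiredMask, int maxDoc, Collector out) {
    int[] docs = new int[WINDOW];
    float[] scores = new float[WINDOW];
    int[] bits = new int[WINDOW];
    Arrays.fill(docs, -1);

    for (int base = 0; base < maxDoc; base += WINDOW) {
      int end = base + WINDOW;
      // Constant-time work per posting: accumulate a score and set a clause bit.
      for (int i = 0; i < clauses.size(); i++) {
        Postings p = clauses.get(i);
        for (; p.doc() < end; p.next()) {
          int slot = p.doc() - base;
          if (docs[slot] != p.doc()) { docs[slot] = p.doc(); scores[slot] = 0f; bits[slot] = 0; }
          scores[slot] += p.score();
          bits[slot] |= 1 << i;
        }
      }
      // Sweep the window and emit docs whose bitmask satisfies the required clauses.
      // (The real scorer walks a linked list of non-zero buckets instead, which is
      // why its hits can come back out of docID order within a window.)
      for (int slot = 0; slot < WINDOW; slot++) {
        if (docs[slot] >= base && (bits[slot] & requiredMask) == requiredMask) {
          out.collect(docs[slot], scores[slot]);
        }
      }
    }
  }
}
```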

asfimport commented 15 years ago

Mark Miller (@markrmiller) (migrated from JIRA)

> Could we instead add a new HitCollector subclass, that adds the setNextReader, then use 'instanceof' to decide whether to wrap or not?

Woah! Don't make me switch all that again! I've got wrist injuries here :) The reason I lost the instanceof is that we would have to deprecate the HitCollector implementations because they need to extend HitCollector. Mike seemed against deprecating those if we could get away with it, so I've since dropped that. I've already gone back and forth - what's it going to be? I'll admit I don't like the exception trap I'm using now, but I don't much like the return true/false method either...

Edit

Ah, I see, you have a new tweak this time: extend HitCollector, rather than HitCollector extending the new type...

Nice, I think this is the way to go.

asfimport commented 15 years ago

Doug Cutting (@cutting) (migrated from JIRA)

> Woah! Don't make me switch all that again!

Sorry, I'm just tossing out ideas. Don't take me too seriously...

> The reason I lost the instanceof is that we would have to deprecate the HitCollector implementations because they need to extend HitCollector.

Would we? I was suggesting that, if we're going to have two APIs, one expert and one non-expert, then we could make the expert API a subclass and not deprecate or otherwise alter HitCollector. I do not like using exceptions for normal control flow. Instanceof is better, but not ideal. A default implementation of an expert method that returns 'false', as Mike suggested, isn't bad and might be best. It requires neither deprecation, exceptions nor instanceof. Would we have a subclass that overrides this that's used as a base class for optimized implementations?

asfimport commented 15 years ago

Michael McCandless (@mikemccand) (migrated from JIRA)

> Would we have a subclass that overrides this that's used as a base class for optimized implementations?

If we do this, I don't think we need a new base class for "expert" collectors; they can simply subclass HitCollector & override the setNextReader method?

Though one downside of this approach is the "simple" HitCollector API is polluted with this advanced method, and HitCollector's collect method gets different args depending on what that method returns. It's a somewhat confusing API.

I guess I'd actually prefer subclassing HitCollector (SequentialHitCollector? AdvancedHitCollector? SegmentedHitCollector?), adding setNextReader only to that subclass, and using instanceof to decide when to wrap plain HitCollector instances.
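Roughly (the class and method names here are placeholders, not a decided API):

```java
import java.io.IOException;
import org.apache.lucene.index.IndexReader;
import org.apache.lucene.search.HitCollector;

// Sketch of the subclass-plus-instanceof option (the class name is a placeholder,
// not a decided API): plain HitCollectors are untouched; only the subclass knows
// about sub-readers.
public abstract class MultiReaderHitCollector extends HitCollector {
  /** Called before hits from each sequential sub-reader are collected. */
  public abstract void setNextReader(IndexReader reader, int docBase) throws IOException;

  /** What IndexSearcher would do: use expert collectors directly, wrap plain ones. */
  public static MultiReaderHitCollector wrap(final HitCollector hc) {
    if (hc instanceof MultiReaderHitCollector) {
      return (MultiReaderHitCollector) hc;
    }
    return new MultiReaderHitCollector() {
      private int docBase;
      @Override public void setNextReader(IndexReader reader, int base) { docBase = base; }
      @Override public void collect(int doc, float score) { hc.collect(docBase + doc, score); }
    };
  }
}
```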

asfimport commented 15 years ago

Mark Miller (@markrmiller) (migrated from JIRA)

>> Woah! Don't make me switch all that again!

>Sorry, I'm just tossing out ideas. Don't take me too seriously...

Same here. If you guys have 100 ideas, I'd do it 100 times. No worries. Just wrist frustration :) I misunderstood you anyway.

> It requires neither deprecation, exceptions nor instanceof.

Okay, fair points. I guess my main dislike was having to call it, see what it returns, and then maybe call it again. That turned me off as much as instanceof. I'm still liking the suggestion you just made myself...

Mike?