
Change IndexSearcher multisegment searches to search each individual segment using a single HitCollector [LUCENE-1483] #2557

Closed. asfimport closed this issue 15 years ago.

asfimport commented 15 years ago

This issue changes how an IndexSearcher searches over multiple segments. The current method of searching multiple segments is to use a MultiSegmentReader and treat all of the segments as one. This causes filters and FieldCaches to be keyed to the MultiReader and makes reopen expensive: if only a few segments change, the FieldCache is still reloaded for all of them.

This patch changes things by searching each individual segment one at a time, but sharing the HitCollector used across each segment. This allows FieldCaches and Filters to be keyed on individual SegmentReaders, making reopen much cheaper. FieldCache loading over multiple segments can be much faster as well - with the old method, all unique terms for every segment are enumerated against each segment; because of the likely logarithmic growth in terms per segment, this can be very wasteful. Searching individual segments avoids this cost. The term/document statistics from the MultiReader are used to score results for each segment.
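For illustration, a minimal sketch of why per-segment keying makes reopen cheap (the loop shape and field name are assumptions, not the patch's code; the accessor name follows the rename discussed later in this thread):

// Each SegmentReader gets its own FieldCache entry. After a reopen,
// unchanged segments hit the cache; only new segments pay the load cost.
IndexReader[] subReaders = reader.getSequentialSubReaders();
for (int i = 0; i < subReaders.length; i++) {
  // keyed on subReaders[i], not on the top-level MultiReader
  int[] values = FieldCache.DEFAULT.getInts(subReaders[i], "sort_field");
  // ... search this segment with the shared HitCollector ...
}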

When sorting, it's more difficult to use a single HitCollector for each sub-searcher, because ordinals are not comparable across segments. To account for this, a new field-sort-enabled HitCollector is introduced that is able to collect and sort across segments (because of its ability to compare ordinals across segments). This TopFieldCollector class will collect the values/ordinals for a given segment and, upon moving to the next segment, translate any ordinals/values so that they can be compared against the values for the new segment. This is done lazily.
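As a rough sketch of the ordinal translation (hypothetical names and fields; the patch's actual TopFieldCollector logic differs in detail): keep each queued hit's term value alongside its ordinal, and when crossing into a new segment, re-derive a comparable ordinal by binary-searching that value in the new segment's sorted term table:

// Hypothetical lazy re-basing of queued ordinals at a segment boundary.
String[] terms;                       // sorted term table of the current segment
String[] slotValues;                  // term value for each queued hit
int[] slotOrds;                       // ordinal for each queued hit
int queuedCount;

void setNextSegment(String[] newTerms) {
  terms = newTerms;
  for (int slot = 0; slot < queuedCount; slot++) {
    int ord = java.util.Arrays.binarySearch(terms, slotValues[slot]);
    if (ord < 0) ord = -ord - 1;      // value absent: use insertion point
    slotOrds[slot] = ord;             // now comparable to this segment's ordinals
  }
}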

All in all, the switch seems to provide numerous performance benefits, in both sorted and non-sorted search. We were seeing a noticeable loss on indices with lots of segments (1000?) and certain queue sizes/queries, but the latest results seem to show that's been mostly taken care of (you shouldn't be using such a large queue on such a segmented index anyway).


Migrated from LUCENE-1483 by Mark Miller (@markrmiller), 1 vote, resolved Feb 02 2009. Attachments: LUCENE-1483.patch (versions: 35), LUCENE-1483-backcompat.patch, LUCENE-1483-partial.patch, sortBench.py, sortCollate.py

asfimport commented 15 years ago

Michael McCandless (@mikemccand) (migrated from JIRA)

Mark, did you intend to attach the patch here?

asfimport commented 15 years ago

Mark Miller (@markrmiller) (migrated from JIRA)

I had meant to attach a patch, but then a bunch of stuff wasn't working...

This is still a poor man's patch. I need to switch to using the expose-subreaders patch. This also doesn't include the MultiSearcher sort patch yet, because when I tried the first one (2nd rev) everything broke. I'll work on integrating that later.

I think all tests pass except for the very last sort test.

Some cleanup is needed, including possibly dropping the use of MultiSearcher itself.

Basically, it's still at the proof-of-concept stage.

asfimport commented 15 years ago

Marvin Humphrey (migrated from JIRA)

> Quick micro bench - did it twice and both times came out 17% slower.

I'd guess that all the OO construction/destruction costs in this part of your patch are slowing things down.

+    Searchable[] searchers = new Searchable[readers.length];
+    for(int i = 0; i < readers.length; i++) {
+      searchers[i] = new IndexSearcher(readers[i]);
+    }
+
+    MultiSearcher multiSearcher = new MultiSearcher(searchers);
+    return multiSearcher.search(weight, filter, nDocs, sort);
asfimport commented 15 years ago

Mark Miller (@markrmiller) (migrated from JIRA)

I'll be sure to include that info with the next set of results.

I don't think those results represent getting lucky though: it's 4 rounds and 2 runs with the same results (17% both runs). Nothing scientific, I just did it real quick to get a base feel for the slowdown before the patch is finished up.

EDIT: Just as I forgot to take the optimize out of the sort alg when I pasted it here, it looks like I missed it for the benches as well. Disregard those numbers.

Here is the alg I used:

merge.factor=mrg:50
compound=false

sort.rng=20000:10000:20000:10000

analyzer=org.apache.lucene.analysis.standard.StandardAnalyzer
directory=FSDirectory
#directory=RamDirectory

doc.stored=true
doc.tokenized=true
doc.term.vector=false
doc.add.log.step=100000

docs.dir=reuters-out

doc.maker=org.apache.lucene.benchmark.byTask.feeds.SortableSimpleDocMaker

query.maker=org.apache.lucene.benchmark.byTask.feeds.SimpleQueryMaker

# task at this depth or less would print when they start
task.max.depth.log=2

log.queries=true
# -------------------------------------------------------------------------------------

{ "Rounds"
    { "Run"
      ResetSystemErase

      { "Populate"
        -CreateIndex
        { "MAddDocs" AddDoc(100) > : 500000
        -CloseIndex
      }

      { "TestSortSpeed"
        OpenReader  
        { "LoadFieldCacheAndSearch" SearchWithSort(sort_field:int) > : 1 
        { "SearchWithSort" SearchWithSort(sort_field) > : 5000
        CloseReader 

      }

      NewRound
     } : 4

} 

RepSumByName
asfimport commented 15 years ago

Mark Miller (@markrmiller) (migrated from JIRA)

Okay, I straightened things out, and now it looks like possibly no loss (for few segments, anyway). Last I looked at the index, it had only 6 segments. I've got to put real time into all this later though; I've only been able to give it some very background-ish time this morning.

asfimport commented 15 years ago

Yonik Seeley (@yonik) (migrated from JIRA)

> Okay, I straightened things out, and now it looks like possibly no loss

So if there was a 17% loss on the optimized index, and very little loss on a segmented index, I assume that means that matching/scoring is enough slower on the segmented index that the loss in sorting performance doesn't matter as much?

asfimport commented 15 years ago

Mark Miller (@markrmiller) (migrated from JIRA)

Ignore those first results entirely. It turns out I had the latest 1471 patched in. That shouldn't slow down a single segment though. Neither this nor 1471 should have slowed things down, I thought, because they only affect multi-segment and multi-index searches. Odd, but I just junked all of that and started fresh, did the tests a little closer to right, and the numbers look the same. I didn't want to get too into benching before it's sorted out a bit more. I'll try to get enough time to be more rigorous later though. My free moments are under heavy attack by the female that appears to have made herself at home in my house.

As a side note, 1471 doesn't work in a couple of ways with this patch - it throws both a NullPointerException and a ClassCastException in different circumstances.

asfimport commented 15 years ago

Michael McCandless (@mikemccand) (migrated from JIRA)

I think there should be very little impact to performance, for single or multi segment indices, for the search itself against a warmed reader. (And actually #2545 should make things a wee bit faster, especially if n,m are largeish, though this will typically be in the noise).

But warming after reopen should be much faster with this patch (we should try to measure that too).

asfimport commented 15 years ago

Mark Miller (@markrmiller) (migrated from JIRA)

> I think there should be very little impact to performance, for single or multi segment indices, for the search itself against a warmed reader. (And actually #2545 should make things a wee bit faster, especially if n,m are largeish, though this will typically be in the noise).

That seems to be in line with what I got with 6 segments. I'm running some 30-50 segment range tests on my other laptop now.

> But warming after reopen should be much faster with this patch (we should try to measure that too).

I've got a base alg for that type of thing around somewhere too, from 831. It should be about the same, which means pretty dramatic reopen improvements if you have multiple segments, especially if the new segment is small. It's likely to be small in comparison to all of the segs anyway, which means pretty great improvements.

asfimport commented 15 years ago

Mark Miller (@markrmiller) (migrated from JIRA)

I'll bench again after this issue is polished up, but it looks like at 100 segments I am seeing the 20% drop. I didn't see any drop at 6 segments in a retest.

I'll do some longer, more thought-out benchmarks when the patch is in better shape.

asfimport commented 15 years ago

Michael McCandless (@mikemccand) (migrated from JIRA)

Hmmmmmmm.

OK I think I see what could explain this: insertion into the pqueue is fairly costly. So, because we now make 100 pqueues, each gathering top N results, we are paying much more insertion cost overall than the single queue that IndexSearcher(MultiReader) uses.

So.... how about still doing the searches per-sub-reader(searcher), but, make a HitCollector that gathers the results into a single pqueue, passing that HitCollector to each sub-searcher?
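For illustration, a minimal sketch of that suggestion (subSearchers, query, and nDocs are assumed to be in scope; TopDocCollector and HitCollector are the era's real classes, while the re-basing glue is the new idea being proposed):

// One TopDocCollector - one pqueue - shared across all sub-searchers;
// each sub-search shifts its docIDs into the global docID space.
final TopDocCollector top = new TopDocCollector(nDocs);
int base = 0;
for (int i = 0; i < subSearchers.length; i++) {
  final int docBase = base;
  subSearchers[i].search(query, new HitCollector() {
    public void collect(int doc, float score) {
      top.collect(doc + docBase, score);   // insert into the single shared queue
    }
  });
  base += subSearchers[i].maxDoc();
}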

If that turns out OK, then I think it would make #2545 moot because we should similarly change MultiSearcher to use a single shared pqueue.

Actually I think this approach should be a bit faster, because there is some very small method-call overhead in how MultiReader implements TermDocs/Positions by "concatenating" its sub-readers. So by pushing the Searcher down onto each SegmentReader we should gain a bit, but it could very well be in the noise. For this reason we may in fact want to do this same thing for the "normal" (sort by relevance) IndexSearcher.search.

I wish I'd thought of this sooner. Sorry for the runaround, Mark!

asfimport commented 15 years ago

Doug Cutting (@cutting) (migrated from JIRA)

> make a HitCollector that gathers the results into a single pqueue

That's good when everything's local, but bad when things are distributed. If we move RemoteSearchable to contrib (as discussed in LUCENE-1314) then this may not be a problem, but we might still leave hooks so that someone can write a search that uses a separate top-queue per remote segment.

asfimport commented 15 years ago

Michael McCandless (@mikemccand) (migrated from JIRA)

> > make a HitCollector that gathers the results into a single pqueue
>
> That's good when everything's local, but bad when things are distributed. If we move RemoteSearchable to contrib (as discussed in LUCENE-1314) then this may not be a problem, but we might still leave hooks so that someone can write a search that uses a separate top-queue per remote segment.

Good point; so this means we can't blindly do this optimization to MultiSearcher (w/o having option to do separate queues, merged in the end). But for IndexSearcher(Multi*Reader).search it should be safe?

asfimport commented 15 years ago

Doug Cutting (@cutting) (migrated from JIRA)

> But for IndexSearcher(Multi*Reader).search it should be safe?

Right. Perhaps this is a reason to encourage folks to use MultiReader instead of MultiSearcher. Are there cases, other than distributed, where MultiSearcher is required? If not, perhaps it could be moved to the contrib/distributed layer too.

asfimport commented 15 years ago

Mark Miller (@markrmiller) (migrated from JIRA)

> make a HitCollector that gathers the results into a single pqueue

Well, it certainly made the code cleaner and the patch a bit nicer, but on a first quick test I still see the 20% slowdown with 100 or so segments.

I'm looking through to see where I may have done something funny.

asfimport commented 15 years ago

Michael McCandless (@mikemccand) (migrated from JIRA)

> on a first quick test I still see the 20% slowdown with 100 or so segments.

Argh! Can you post your current patch?

asfimport commented 15 years ago

Mark Miller (@markrmiller) (migrated from JIRA)

Here is what I've got. The final sort test still fails, but the rest should pass.

asfimport commented 15 years ago

Mark Miller (@markrmiller) (migrated from JIRA)

I also did a quick reopen alg. The speed gain on this can really vary depending on index access patterns. I tried adding 500000 (very small) docs with a random sort field, then added 50000 docs and reopened 10 times, repeating it all 4 times. The comparison is of the time it takes to load the FieldCache and do one search. With this patch it came out about 40-50% faster. Obviously it's going to depend on many factors in the real world; in certain applications I'm sure it could be many times faster or slower.

asfimport commented 15 years ago

Mark Miller (@markrmiller) (migrated from JIRA)

Doing a little profiling on the new code, the top results of interest are:

FieldSortedHitQueue.lessThan(Object,Object): ~12%
FieldSortedHitQueue.insertWithOverflow(Object): ~12%
MultiReaderTopFieldDocCollector.collect(int,float): 6.3%
FieldSortedHitQueue$4.compare(): 5.3%

and on...

For Lucene trunk, a day or two ago:

FieldSortedHitQueue.insertWithOverflow(Object): ~11%
TopFieldDocCollector.collect(int,float): 7.1%
FieldSortedHitQueue.lessThan(Object,Object): ~6.7%
FieldSortedHitQueue.updateMaxScore: 3.2%
FieldSortedHitQueue$4.compare(): 3.2%

asfimport commented 15 years ago

Mark Miller (@markrmiller) (migrated from JIRA)

I think I see my mistake. A dumb one - I think I did it while trying to get things to work and didn't realize I left it in. I'll put up another patch after some benches finish.

asfimport commented 15 years ago

Mark Miller (@markrmiller) (migrated from JIRA)

Okay, now it's as fast, if not a bit faster. I was building the priority queue n*readers times, as part of getting tests to pass early. I left it in after fixing things elsewhere, and boom - that explains all the lessThan nonsense in the profiling. Whoops.

Looks good now though - I still need to investigate the failure of the last sort test.

asfimport commented 15 years ago

Michael McCandless (@mikemccand) (migrated from JIRA)

Excellent! That was a sneaky one.

I attached a tiny change to the patch, which is to set the docBase in MultiReaderTopFieldDocCollector; this saves the lookup into starts in each collect call.

asfimport commented 15 years ago

Michael McCandless (@mikemccand) (migrated from JIRA)

OK, I ran a quick perf test on a 100 segment index with 1 million docs (10K docs per segment), for a single TermQuery ("text"), and I'm seeing 11.1% speedup (best of 4: 20.36s -> 18.11s) with this patch, on Mac OS X. On Linux I see 6.3% speedup (best of 4: 23.31s -> 21.84s).

Single segment index shows no difference, as expected.

I think the speedup is due to avoiding the extra method call plus 2nd pass through the int docs[] to add in the doc base, in MultiSegmentReader.MultiTermDocs.read(int[] docs, int[] freqs).

This is a nice "side effect", ie in addition to getting faster reopen performance (the original goal here), we get a bump in single term search performance.

I think given this, we should cut over other search methods (sort-by-relevance, custom HitCollector) to this approach? Maybe if we add a new Scorer.score method that can accept a "docBase" which it adds into the doc() before calling collect()? In fact, if we do that, we may not even need the new MultiReaderTopFieldDocCollector at all?

Hmm, though, a Scorer may override that score(HitCollector), eg BooleanScorer does. Maybe we have to make a wrapper HitCollector that simply adds in the docBase and then invokes the real HitCollector.collect after shifting the docBase? Though that costs us an extra method call per collect().
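A sketch of such a wrapper (a hypothetical class, in the spirit of the DocBaseHitCollector/DocBaseCollector names used later in this thread):

// Wraps any existing HitCollector, shifting per-segment docIDs by docBase.
// This is the "extra method call per collect()" cost mentioned above.
class DocBaseHitCollector extends HitCollector {
  private final HitCollector other;
  private final int docBase;
  DocBaseHitCollector(HitCollector other, int docBase) {
    this.other = other;
    this.docBase = docBase;
  }
  public void collect(int doc, float score) {
    other.collect(doc + docBase, score);  // rebase, then delegate
  }
}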

Here's the alg I used (slightly modified from the one above):

merge.factor=1000
compound=false

analyzer=org.apache.lucene.analysis.standard.StandardAnalyzer
directory=FSDirectory
#directory=RamDirectory

doc.tokenized=true
doc.term.vector=false
doc.add.log.step=100000
max.buffered=10000
ram.flush.mb=1000

work.dir = /lucene/work

doc.maker=org.apache.lucene.benchmark.byTask.feeds.SortableSimpleDocMaker

query.maker=org.apache.lucene.benchmark.byTask.feeds.FileBasedQueryMaker
file.query.maker.file = test.queries

task.max.depth.log=2

log.queries=true

{ "Populate"
  -CreateIndex
  { "MAddDocs" AddDoc(100) > : 1000000
  -CloseIndex
}

{ "Rounds"
  { "Run"
    { "TestSortSpeed"
      OpenReader  
      { "LoadFieldCacheAndSearch" SearchWithSort(sort_field:int) > : 1 
      { "SearchWithSort" SearchWithSort(sort_field) > : 500
      CloseReader 
    }
    NewRound
  } : 4
} 

RepSumByPrefRound SearchWithSort

It creates the index once, then does 4 rounds of searching with the single query "text" in test.queries (SimpleQueryMaker was creating other queries that were getting 0 or 1 hits).

I'm running with "java -Xms1024M -Xmx1024M -Xbatch -server"; java is 1.6.0_07 on Mac Pro OS X 10.5.5 and 1.6.0_10-rc on 2.6.22.1 linux kernel.

asfimport commented 15 years ago

Mark Miller (@markrmiller) (migrated from JIRA)

Thanks Mike.

Are we going to be screwed with this filter stuff though? I'm looking into the last sort test failing - my first thought is that a filter built for, say, 8 docs is getting pushed down onto 4 searches with 2 docs in each reader. A filter that allows the 8th doc won't do very well on a bunch of 2-doc searches... or hopefully I don't know what I'm talking about. I'm looking to see if I can figure my way around it now.

asfimport commented 15 years ago

Mark Miller (@markrmiller) (migrated from JIRA)

Hmmm... well, this makes the test pass (I subtract the base from the filter doc ID)...

Let me know if I'm all wet on this...

EDIT

That can't be quite right... I'll try making a different test.

asfimport commented 15 years ago

Michael McCandless (@mikemccand) (migrated from JIRA)

One thing to fix is ParallelReader: it currently defines a getSubReaders method, but this won't work for searching (because ParallelReader's sub-readers are not "concatenated"). I think we need a more specific name than "getSubReaders" (there's some initial discussion of this in LUCENE-1475). Maybe getConcatenatedReaders? getSequentialReaders? Something else...?

asfimport commented 15 years ago

Mark Miller (@markrmiller) (migrated from JIRA)

> I think given this, we should cut over other search methods (sort-by-relevance, custom HitCollector) to this approach? Maybe if we add a new Scorer.score method that can accept a "docBase" which it adds into the doc() before calling collect()? In fact, if we do that, we may not even need the new MultiReaderTopFieldDocCollector at all?

I like this idea. I'd love to figure out how 'outside' systems could get the reopen benefit as well (Solr caches and such, beyond internal sorting). This seems like a first step towards that possibility (though I admittedly don't see a clear path yet).

> Hmm, though, a Scorer may override that score(HitCollector), eg BooleanScorer does. Maybe we have to make a wrapper HitCollector that simply adds in the docBase and then invokes the real HitCollector.collect after shifting the docBase? Though that costs us an extra method call per collect().

Well, we might as well bench and see what we lose...

asfimport commented 15 years ago

Michael McCandless (@mikemccand) (migrated from JIRA)

Maybe we should add "setDocBase" to HitCollector. Then, fix all core/contrib HitCollectors to respect this. Then add a method "supportsDocBase()" which returns false in the default impl in HitCollector. Then, in search(), if we are dealing w/ a HitCollector that does not supportDocBase() we have to wrap?

Alternatively, we could throw an UnsupportedOperationException in setDocBase() by default, catch that, and fallback to wrapping.

This way we avoid the extra collect() method call in the common cases (builtin HitCollectors). Also, we save an add when the doc is not competitive.
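A hedged sketch of that fallback (setDocBase and supportsDocBase are only proposals at this point, and DocBaseHitCollector is the hypothetical wrapper sketched earlier in the thread):

// Inside search(), once per segment: try the fast path, else wrap.
HitCollector forSegment(HitCollector hc, int docBase) {
  try {
    hc.setDocBase(docBase);    // collector applies the base itself
    return hc;
  } catch (UnsupportedOperationException uoe) {
    return new DocBaseHitCollector(hc, docBase);  // legacy collector: wrap it
  }
}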

asfimport commented 15 years ago

Michael McCandless (@mikemccand) (migrated from JIRA)

You shouldn't need to subtract base from the filter's docID, since the filter is operating in the docID space of the sub-reader.

Is it TestSort.testTopDocScores that you see failing (that's what I see)?

Unfortunately, the filter in that test is assuming that the IndexReader it's passed in is equal to the "full" IndexReader, because it references docs1.scoreDocs[0].doc, which is the docID space of the full reader.

I would say the test is buggy (it's making an assumption about the API that happens to be true but was not guaranteed). However, this could mean similar filters "out there" think they can grab docIDs from the top-level IndexReader, cache them, and then assign them inside the getDocIdSet method.
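To make the contract concrete, here is a hedged sketch of a filter that stays correct under this change (DocIdSet, OpenBitSet, and TermDocs are the era's real classes; the term and field are made up): it derives docIDs only from the reader it is handed, so it works whether that reader is the whole index or a single segment:

public DocIdSet getDocIdSet(IndexReader reader) throws IOException {
  OpenBitSet bits = new OpenBitSet(reader.maxDoc());
  TermDocs td = reader.termDocs(new Term("type", "a"));
  try {
    while (td.next()) {
      bits.set(td.doc());   // this docID is in reader's own space
    }
  } finally {
    td.close();
  }
  return bits;              // OpenBitSet implements DocIdSet
}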

asfimport commented 15 years ago

Mark Miller (@markrmiller) (migrated from JIRA)

A filter fix that's closer to correct, and a new test.

asfimport commented 15 years ago

Mark Miller (@markrmiller) (migrated from JIRA)

> Is it TestSort.testTopDocScores that you see failing (that's what I see)?

Right.

> Unfortunately, the filter in that test is assuming that the IndexReader it's passed in is equal to the "full" IndexReader, because it references docs1.scoreDocs[0].doc, which is the docID space of the full reader.

Right, it's using the full ID space - isn't that legal though? It did work.

> I would say the test is buggy (it's making an assumption about the API that happens to be true but was not guaranteed). However, this could mean similar filters "out there" think they can grab docIDs from the top-level IndexReader, cache them, and then assign them inside the getDocIdSet method.

That was my first thought - are they allowed to build the filter this way? But it did work, right? I like your 'buggy assumption' line of thought though - it suits my lazy side :)

What if someone made a filter that is supposed to only allow the first doc? So it is set for 0... all of the first docs of each sub-reader would match. That seemed to be supported before, right?

EDIT

nm, I guess that still fits the 'assumption' thing. It still feels odd changing this behavior though - I guess you have this same issue over MultiSearchers though...

How about those keeping their own reader-to-filter cache or something? A single filter could correctly apply against a single MultiReader before... now you would need a different filter for each sub-reader, right?

asfimport commented 15 years ago

Michael McCandless (@mikemccand) (migrated from JIRA)

> How about those keeping their own reader-to-filter cache or something? A single filter could correctly apply against a single MultiReader before... now you would need a different filter for each sub-reader, right?

Such cases would then be caching by sub-reader, right? That's the benefit of this approach. EG I had been thinking we'd need to fix the recently added FieldCacheRangeFilter to also "understand" when it's dealing w/ concatenated sub-readers, but in fact with this change that filter is only given the sub-readers, one at a time, and so it only asks FieldCache per sub-reader, and we automatically get faster reopen performance "for free" (no change to FieldCacheRangeFilter required).
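For illustration, the per-reader caching pattern this enables (essentially what CachingWrapperFilter does; `wrapped` is an assumed field holding the underlying filter): since each sub-reader is now handed in directly, cache entries are naturally per segment and survive reopens that leave a segment untouched:

// IndexReader -> DocIdSet; a WeakHashMap lets entries for closed readers go away.
private final Map cache = new WeakHashMap();

public DocIdSet getDocIdSet(IndexReader reader) throws IOException {
  synchronized (cache) {
    DocIdSet cached = (DocIdSet) cache.get(reader);
    if (cached != null) return cached;
  }
  DocIdSet result = wrapped.getDocIdSet(reader);
  synchronized (cache) {
    cache.put(reader, result);   // keyed on the (sub-)reader itself
  }
  return result;
}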

Still, I agree this is probably dangerous to suddenly change, since there could easily be filters out there that are [illegally] using a docID not belonging/corresponding to the reader that was passed in. So maybe we should provide a migration path. EG, add "allowSubReaders" to Filter, defaulting to "return false" so that any external Filter impls still get passed the Multi*Reader, and then fix all core/contrib filters to return true from that method?

asfimport commented 15 years ago

Mark Miller (@markrmiller) (migrated from JIRA)

Okay, I follow now. If you did things correctly, you'll always be passed the right SegmentReader through a hook. Nice. I think we really do want to do this for the other search methods.

> Still, I agree this is probably dangerous to suddenly change, since there could easily be filters out there that are [illegally] using a docID not belonging/corresponding to the reader that was passed in. So maybe we should provide a migration path. EG, add "allowSubReaders" to Filter, defaulting to "return false" so that any external Filter impls still get passed the Multi*Reader, and then fix all core/contrib filters to return true from that method?

This seems reasonable. I am following your [illegal] argument better now though, so I wouldn't feel so bad leaving it out. If it's unsupported behavior, I like the idea of adding backward-compat cruft much less. I had it in my head that you might be caching things based on the top-level MultiReader, but it looks now like you should always be using the reader provided by a hook - which will be the single segment reader.

asfimport commented 15 years ago

Mark Miller (@markrmiller) (migrated from JIRA)

I think this is almost there: I added the new logic to the relevance and HitCollector searches as well.

asfimport commented 15 years ago

Mark Miller (@markrmiller) (migrated from JIRA)

That actually needed a couple of tweaks.

asfimport commented 15 years ago

Michael McCandless (@mikemccand) (migrated from JIRA)

Looks good! Do we really need IndexReader.supportsSequentialReaders? Because the default impl (returning a length-1 array of itself) seems sufficient?
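i.e., something like this as the IndexReader default (a sketch of the suggestion, not the committed code):

// By default a reader is its own single "sequential" sub-reader,
// so callers can always iterate the returned array uniformly.
public IndexReader[] getSequentialSubReaders() {
  return new IndexReader[] { this };
}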

Also, I don't think ParallelReader needs to throw UnsupportedOperationException in that method – it can fall back to the default?

It would be nice to somehow deprecate "supportsDocBase" so that all outside HitCollectors would need to support it on upgrading to 3.0, but, I'm not sure how to cleanly do that. (Ie I'd rather not have that method continue to exist in 3.0).

It's a delightfully small patch now!

asfimport commented 15 years ago

Michael McCandless (@mikemccand) (migrated from JIRA)

Also, back compat tests fail to compile because we renamed Multi*Reader.getSubReaders --> getSequentialSubReaders. So when committing this be sure to also fix the test-tag branch!

asfimport commented 15 years ago

Mark Miller (@markrmiller) (migrated from JIRA)

> Looks good! Do we really need IndexReader.supportsSequentialReaders? Because the default impl (returning a length-1 array of itself) seems sufficient?

Let me investigate. If you try using the default impl, and you change the ParallelReader reopen test to use getSequentialReaders rather than getSubReaders, the test will throw a StackOverflowError when checking if the reader is closed. I put the UnsupportedOperationException in there to make sure I wasn't calling it, because upon switching to getSubReaders the problem goes away. Seemed fair enough, but I guess I'll have to understand what was going on to really respond.

> It would be nice to somehow deprecate "supportsDocBase" so that all outside HitCollectors would need to support it on upgrading to 3.0, but, I'm not sure how to cleanly do that. (Ie I'd rather not have that method continue to exist in 3.0).

+1. I don't see what we can do but release it deprecated with a note explaining. Fair enough for 3.0, I think.

> It's a delightfully small patch now!

Yeah, this one had the great feeling of the multi-term patch - it rolled right up into something nice. Got to love the Lucene API, flaws or not.

asfimport commented 15 years ago

Michael McCandless (@mikemccand) (migrated from JIRA)

> If you try using the default impl, and you change the ParallelReader reopen test to use getSequentialReaders rather than getSubReaders

But you shouldn't change that test (it should continue to call ParallelReader.getSubReaders).

> I don't see what we can do but release it deprecated with a note explaining. Fair enough for 3.0, I think.

The problem is this doesn't create the right sequence. Ie, if we mark supportsDocBase as deprecated, then some ExternalHitCollector won't see the deprecation warning (they have not overridden supportsDocBase), and so then we can't remove it in 3.0 since their code will then silently break. I think the only thing to do would be to deprecate collect() in favor of another method. Or, deprecate HitCollector entirely in favor of a new class DocBaseHitCollector (don't like that name). Sigh...

> Yeah, this one had the great feeling of the multi-term patch - it rolled right up into something nice. Got to love the Lucene API, flaws or not.

I love this kind :)

asfimport commented 15 years ago

Mark Miller (@markrmiller) (migrated from JIRA)

So I don't know what that stack overflow was, but I just put things back to how they should be and all tests pass.

So very close now. I'm not ready to commit myself though. I do most of my thinking after the work :)

asfimport commented 15 years ago

Michael McCandless (@mikemccand) (migrated from JIRA)

Oh one more thing: I think we don't need both supportsDocBase and setDocBase throwing UOE by default? How about just the 2nd one, and you fix search to catch the UOE and wrap in DocBaseCollector?

asfimport commented 15 years ago

Mark Miller (@markrmiller) (migrated from JIRA)

> But you shouldn't change that test (it should continue to call ParallelReader.getSubReaders).

Right, I was coming at it from the wrong angle. I had refactored in the change with Eclipse, and seeing that ParallelReader.getSequentialReaders could cause a stack overflow is why I put in the isSequentialSupported check. Guess it's not a concern though. I didn't take much time to understand it.

> The problem is this doesn't create the right sequence. Ie, if we mark supportsDocBase as deprecated, then some ExternalHitCollector won't see the deprecation warning (they have not overridden supportsDocBase), and so then we can't remove it in 3.0 since their code will then silently break. I think the only thing to do would be to deprecate collect() in favor of another method. Or, deprecate HitCollector entirely in favor of a new class DocBaseHitCollector (don't like that name). Sigh...

That depends, right? Don't we have a lot of latitude with 3.0? I would think we could require that you read some upgrade notes on changes... 3.0 is our chance to change some things we couldn't normally get away with, I thought :) I agree we should be friendly, like 1 to 2, but it's tempting to use 3.0 to do some things more cleanly rather than less.

asfimport commented 15 years ago

Mark Miller (@markrmiller) (migrated from JIRA)

> Oh one more thing: I think we don't need both supportsDocBase and setDocBase throwing UOE by default? How about just the 2nd one, and you fix search to catch the UOE and wrap in DocBaseCollector?

If you'd like... I'm not a fan of flow control with exceptions. I am also not a fan of isSupported methods though... I was leaning that way over exceptions... sounds like you lean the other way...? I guess it has its appeal in this case...

asfimport commented 15 years ago

Doug Cutting (@cutting) (migrated from JIRA)

Could we add a new class like

public abstract class Hitable {
  public abstract void setBase(int base);
  public abstract void hit(int doc, float score);
}

upgrade everything in trunk to use this, and change HitCollector to:

/** @deprecated */
public abstract class HitCollector extends Hitable {
  public void setBase(int base) { throw new UnsupportedOperationException(); }
}

then, for back-compatibility, wrap anything that extends HitCollector to rebase?

Then you'd have neither an isSupported method nor use exceptions for control.

asfimport commented 15 years ago

Mark Miller (@markrmiller) (migrated from JIRA)

That's a great idea, I'll try that route.

Another tiny win in this is that MultiSearcher doesn't have to use a doubled-up HitCollector anymore.

asfimport commented 15 years ago

Mark Miller (@markrmiller) (migrated from JIRA)

> upgrade everything in trunk to use this,

If users are using a HitCollector as a ref to a trunk HitCollector that becomes a Hitable, won't their code break? I really like this idea, but does it get around back compat?

asfimport commented 15 years ago

Doug Cutting (@cutting) (migrated from JIRA)

> If users are using a HitCollector as a ref to a trunk HitCollector that becomes a Hitable, won't their code break?

Lucene never passes someone a HitCollector that they didn't create. So all the classes that folks create may need to remain subclasses of HitCollector, but classes used internally do not need to, and we can deprecate all public HitCollector implementations and provide new versions that extend Hitable.

http://lucene.apache.org/java/2_4_0/api/org/apache/lucene/search/class-use/HitCollector.html

Does that address your concern?

I'm not too fond of Hitable, but can't think of anything better. With that name, the method might better be called hit() than collect(), but that's a more invasive change. DocCollector? Collector?

asfimport commented 15 years ago

Mark Miller (@markrmiller) (migrated from JIRA)

Okay, so subclasses of HitCollector stay subclasses until 3.0, and then they extend Hitable? Sounds good. EDIT: No, that won't work, and I read your comment wrong - we have to provide new impls, and old impls will be wrapped. Got it. Patch coming.

I kind of like Hitable myself. You are collecting a 'hit' with a Hitable object. A hit is a doc and score. Any other opinions? I'm less fond of other Collector variations. hit() could be better than collect(), but collect() is not such a stretch in my mind.

EDIT

One further thought: if we are replacing TopDocCollector, what would it become? TopDocDocCollector? TopDocHitable? I'm liking these names less...

asfimport commented 15 years ago

Michael McCandless (@mikemccand) (migrated from JIRA)

> I'm not a fan of control with exceptions. I am also not a fan of of isSupported methods though...I was leaning that way over exceptions...sounds like you lean the other way...? I guess it has its appeal in this case...

I agree, though I leaned towards using an exception because 1) no new API needing future deprecation would then have been added, and 2) the expectation (but not forced, and this is a problem) is that over time all HitCollector subclasses would implement setBase, so the exception would be the exception (heh) not the rule.

But I like Doug's proposal instead, since on upgrading to 2.9 you'll see your code is deprecated, which then allows us to drop it in 3.0. I have a slight preference for DocCollector.

This means any methods that accept a HitCollector would also be deprecated, and we'd add a new method that takes DocCollector instead, and change the deprecated one to wrap a DocBaseCollector around the HitCollector and invoke the new method.
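A sketch of that deprecation path (DocCollector and DocBaseCollector are the proposed names from this discussion, not released API):

/** @deprecated use the DocCollector variant instead */
public void search(Query query, HitCollector results) throws IOException {
  // wrap the legacy collector so it can be re-based per segment
  search(query, new DocBaseCollector(results));
}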

> we can deprecate all public HitCollector implementations and provide new versions that extend Hitable

Could we leave (eg) TopDocCollector extending HitCollector in 2.9, and then in 3.0 change it to directly extend DocCollector (Hitable)? (This saves having to deprecate TopDocCollector and at least 2 others).

> Don't we have a lot of latitude with 3.0?

I think in 3.0, when changing APIs, we are only allowed to remove deprecated APIs from 2.9? Ie we can't do more drastic changes.

asfimport commented 15 years ago

Mark Miller (@markrmiller) (migrated from JIRA)

> Could we leave (eg) TopDocCollector extending HitCollector in 2.9, and then in 3.0 change it to directly extend DocCollector (Hitable)? (This saves having to deprecate TopDocCollector and at least 2 others).

If we do that, I don't think we can do our check to use the base or not by checking for HitCollector, right?