apache / lucene

Apache Lucene open-source search software
https://lucene.apache.org/
Apache License 2.0

Change IndexSearcher multisegment searches to search each individual segment using a single HitCollector [LUCENE-1483] #2557

asfimport closed this issue 15 years ago

asfimport commented 15 years ago

This issue changes how an IndexSearcher searches over multiple segments. The current method of searching multiple segments is to use a MultiSegmentReader and treat all of the segments as one. This causes filters and FieldCaches to be keyed to the MultiReader and makes reopen expensive. If only a few segments change, the FieldCache is still loaded for all of them.

This patch changes things by searching each individual segment one at a time, but sharing the HitCollector used across each segment. This allows FieldCaches and Filters to be keyed on individual SegmentReaders, making reopen much cheaper. FieldCache loading over multiple segments can be much faster as well - with the old method, all unique terms across every segment are enumerated against each segment; because of the likely logarithmic change in terms per segment, this can be very wasteful. Searching individual segments avoids this cost. The term/document statistics from the multireader are used to score results for each segment.

When sorting, it's more difficult to share a single HitCollector across sub-searchers: ordinals are not comparable across segments. To account for this, a new field-sort-enabled HitCollector is introduced that is able to collect and sort across segments (because of its ability to compare ordinals across segments). This TopFieldCollector class will collect the values/ordinals for a given segment and, upon moving to the next segment, translate any ordinals/values so that they can be compared against the values for the new segment. This is done lazily.

All in all, the switch seems to provide numerous performance benefits, in both sorted and non-sorted search. We were seeing a noticeable loss on indices with lots of segments (1000?) and certain queue sizes/queries, but the latest results seem to show that's been mostly taken care of (you shouldn't be using such a large queue on such a heavily segmented index anyway).


Migrated from LUCENE-1483 by Mark Miller (@markrmiller), 1 vote, resolved Feb 02 2009. Attachments: LUCENE-1483.patch (versions: 35), LUCENE-1483-backcompat.patch, LUCENE-1483-partial.patch, sortBench.py, sortCollate.py

asfimport commented 15 years ago

Mark Miller (@markrmiller) (migrated from JIRA)

Is that right? We better get going on java5 conversion then...

asfimport commented 15 years ago

Michael McCandless (@mikemccand) (migrated from JIRA)

> If we do that, I don't think we can do our check to use base or not by checking for HitCollector right?

Ahh right, so if a user outside creates a TopDocCollector & passes it in, we can't tell that it can handle setBase natively. Or we could test & catch the UOE (since HitCollector is deprecated, using an exception here will go away in 3.0).

asfimport commented 15 years ago

Michael McCandless (@mikemccand) (migrated from JIRA)

> Is that right? We better get going on java5 conversion then...

Ahh right, that too :) But, as of 3.0 we can start accepting/making changes to Lucene that require 1.5 JRE, but still we can't just swap out APIs w/o going through deprecation first?

asfimport commented 15 years ago

Mark Miller (@markrmiller) (migrated from JIRA)

I thought there were some that wanted to change some of the API to Java 5 for the 3.0 release, because I thought back-compat was less restricted going from 2.x to 3.0. I guess maybe that won't end up happening; if it was going to, it seems we'd want to deprecate in 2.9 whatever will be changed.

asfimport commented 15 years ago

Mark Miller (@markrmiller) (migrated from JIRA)

I think it's almost there. I still want to spend a little time looking it over, but I think it's looking good.

asfimport commented 15 years ago

Chris M. Hostetter (@hossman) (migrated from JIRA)

I haven't been following this issue too closely (or truthfully: at all), but if there's talk of deprecating HitCollector and introducing a new superclass to replace it, would it also make sense to revisit the idea of adding a return type to the collect/hit method? ... ie: an enum style result indicating "OK" or "ABORT" (with the potential of adding additional constants later ala FieldSelectorResult)

I remember this coming up as a "wish list" back when TimeLimitedCollector was added (but I don't really remember if it was decided that the current implementation was actually better than if collect() did have a return value)

Anyway, just something to ponder...

public enum HitStatus { OK, ABORT }  // the enum-style result suggested above

public abstract class Hitable {
  public abstract void setBase(int base);
  public abstract HitStatus hit(int doc, float score);
}

/** @deprecated */
public abstract class HitCollector extends Hitable {
  public abstract void collect(int doc, float score);
  public void setBase(int base) { throw new UnsupportedOperationException(); }
  public HitStatus hit(int doc, float score) {
    collect(doc, score);
    return HitStatus.OK;
  }
}
asfimport commented 15 years ago

Michael McCandless (@mikemccand) (migrated from JIRA)

Mark, one hunk (HitCollector) failed when I applied the patch – looks like it's the $Id$ issue again (your area doesn't expand $Id$ tags). No problem – I just applied it manually.

asfimport commented 15 years ago

Michael McCandless (@mikemccand) (migrated from JIRA)

> adding a return type to the collect/hit method? ... ie: an enum style result indicating "OK" or "ABORT" (with the potential of adding additional constants later ala FieldSelectorResult)

I think we should consider this, though this then implies an if stmt checking the return result & doing something, on each hit, so we should test the cost of doing so vs the cost of throwing an exception instead (eg we could define a typed exception in this new interface which means "abort the search now" and maybe another to mean "stop searching & return the results you got so far", etc.).

asfimport commented 15 years ago

Uwe Schindler (@uschindler) (migrated from JIRA)

> adding a return type to the collect/hit method? ... ie: an enum style result indicating "OK" or "ABORT" (with the potential of adding additional constants later ala FieldSelectorResult)

> I think we should consider this, though this then implies an if stmt checking the return result & doing something, on each hit, so we should test the cost of doing so vs the cost of throwing an exception instead (eg we could define a typed exception in this new interface which means "abort the search now" and maybe another to mean "stop searching & return the results you got so far", etc.).

This looks like a really good idea. Currently, to stop an iterator, I use an exception class that extends RuntimeException (to have it unchecked) to cancel a search. It would be very nice to have that supported directly.

asfimport commented 15 years ago

Michael McCandless (@mikemccand) (migrated from JIRA)

Duh: I just realized that when we switched back to a single pqueue for gathering results across the N subreaders, we lost the original intended "benefit" for this issue. Hard to keep the forest in mind when looking at all the trees....

Ie, we are now (again) creating a single FieldSortedHitQueue, which pulls the FieldCache for the entire MultiReader, not per-segment. So warming time is still slow, when sorting by fields.

Really we've "stumbled" on 2 rather different optimizations:

  1. Run Scorer at the "sub reader" level: this gains performance because you save the cost of going through a MultiReader. This requires the new DocCollector class, so we can setDocBase(...).
  2. Do collection (sort comparison w/ pqueue) at the "sub reader" level: this gains warming performance because we only ask for FieldCache for each subreader. But, it seems to hurt search performance (pqueue comparison & insertion cost went up), so it's no longer a no-brainer tradeoff (by default at least).

Given that #1 has emerged as a tentatively fairly compelling gain, I now think we should decouple it from #2. Even though #2 was the original intent here, let's now morph this issue into addressing #1 (since that's what the current patch does), and I'll open a new issue for #2?

asfimport commented 15 years ago

Mark Miller (@markrmiller) (migrated from JIRA)

Ugh...you know I was afraid of that when I was making the change, but I easily convinced myself that FieldSortedHitQueue was just taking that Reader for AUTO detect and didn't really look again. It also makes the comparators. Bummer. I guess let's open a new issue if we can't easily deal with it here (I've got to look at it some more).

asfimport commented 15 years ago

Mark Miller (@markrmiller) (migrated from JIRA)

I've got a quick idea I want to try to fix it.

asfimport commented 15 years ago

Mark Miller (@markrmiller) (migrated from JIRA)

Bah. You can't share that queue and get the reopen benefit without jumping through too many hoops. All of a sudden you can't use ordinals, comparators need to know how to compare across segments, and it just breaks down fast. How disappointing.

asfimport commented 15 years ago

Michael McCandless (@mikemccand) (migrated from JIRA)

Yeah, ugh. This is the nature of "progress"! It's not exactly a straight line from point A to B :) Lots of fits & starts, dead ends, jumps, etc.

We could simply offer both ("collect into single pqueue but pay high warming cost" or "collect into separate pqueues, then merge, and pay low warming cost"), but that sure is an annoying choice to have to make.

Oh, here's another idea: do separate pqueues (again!), but after the first segment is done, grab the values for the worst scoring doc in the pqueue (assuming the queue filled up to its numHits) and use this as the "cutoff" before inserting into the next segment's pqueue.

In grabbing that cutoff we'd have to 1) map ord->value for segment 1, then 2) map value->ord for segment 2, then 3) use that cutoff for segment 2. (And likewise for all segment N -> N+1).

I think this'd greatly reduce the number of inserts & comparisons done in subsequent queues because it mimics how a single pqueue behaves: you don't bother re-considering hits that won't be globally competitive.

We could also maybe merge after each segment is processed; that way the cutoff we carry to the next segment is "true" so we'd reduce comparisons even further.

Would this work? Let's try to think hard before writing code :)

asfimport commented 15 years ago

Mark Miller (@markrmiller) (migrated from JIRA)

> We could simply offer both ("collect into single pqueue but pay high warming cost" or "collect into separate pqueues, then merge, and pay low warming cost"), but that sure is an annoying choice to have to make.

Agreed. I really hope we don't have to settle for this.

> Oh, here's another idea:

Good one! Keep those ideas coming.

> Would this work?

It sounds like you've nailed it to me, but I'll let it float around in my head for a bit while I work on some other things.

> Let's try to think hard before writing code :)

Now there's a new concept for me. My brain will work itself to death trying to avoid real work :)

asfimport commented 15 years ago

Doug Cutting (@cutting) (migrated from JIRA)

> public abstract void setBase(int base);

It occurred to me last night that this really has no place in HitCollector. We're forcing applications to handle an implementation detail that they really shouldn't have to know about. It would be better to pass the base down to the scorer implementations and have them add it on before they call collect(), no?

asfimport commented 15 years ago

Michael McCandless (@mikemccand) (migrated from JIRA)

> It would be better to pass the base down to the scorer implementations and have them add it on before they call collect(), no?

So we'd add Scorer.setDocBase instead?

The only downside I can think of here is that often you will perform the addition when it wasn't necessary.

Ie, if the score is not competitive at all, then you wouldn't need to create the full docID and so you'd save one add opcode.

Admittedly, this is a very small (tiny) cost, and I do agree that making HitCollector know about docBase is really an abstraction violation...

asfimport commented 15 years ago

Mark Miller (@markrmiller) (migrated from JIRA)

> Oh, here's another idea: do separate pqueues (again!), but after the first segment is done, grab the values for the worst scoring doc in the pqueue (assuming the queue filled up to its numHits) and use this as the "cutoff" before inserting into the next segment's pqueue.

We've got to try it. What's the hard part in this? Converting a value to an ord?

EDIT

Okay, I see, we can just find our place by running through new value Comparables.

An added cost of going back to per-reader collection is that all doc IDs (not ords) also need to be adjusted (for the multisearcher).

asfimport commented 15 years ago

Michael McCandless (@mikemccand) (migrated from JIRA)

OK, here's another tweak on the last proposal: maybe we could, instead, take the pqueue produced by segment 1 and "convert" it into the ords matching segment 2, and then do normal searching for segment 2 using that single pqueue (and the same for all seg N -> N+1 transitions)?

For all numeric fields, the conversion is a no-op (their ord is currently the actual numeric byte, short, int, etc. value, though conceivably that could change in the future); only String fields, and custom (hmm) would need to do something.

This should be more efficient than the cutoff approach because it'd result in fewer comparisons/inserts. Ie, it's exactly a single pqueue again, just with some "conversion" between segments. The conversion cost is near zero for numeric fields, and for string fields it'd be O(numHits*log2(numValues)), where numValues is the number of unique string values in the next segment for that sort field. I think for most cases (many more docs than numHits requested) this would be faster than the cutoff approach.

Would that work?

asfimport commented 15 years ago

Yonik Seeley (@yonik) (migrated from JIRA)

segment 1 has terms: apple, banana, orange
segment 2 has terms: apple, orange

What is the ord of banana in segment2?

asfimport commented 15 years ago

Michael McCandless (@mikemccand) (migrated from JIRA)

> What is the ord of banana in segment2?

How about 0.5?

Ie, we just need an ord that means it's in-between two ords for the current segment.

On encountering that, we'd also need to record its real value so that subsequent segments could look it up properly (or, if it survives until the end, to return the correct value "banana").

asfimport commented 15 years ago

Mark Miller (@markrmiller) (migrated from JIRA)

Okay, but how am I going to squeeze between two customs? I guess you'd have to store it as a compare against either side?

EDIT

There is also the problem that all compares are done based on ScoreDocs that index into a single ord array by doc. The previous pq's ScoreDocs will not compare right - they won't index into the ord array for the current Reader - they are indexes into the array for the previous Reader. This is what made me give up on single pq earlier.

EDIT

I guess we put them on the ScoreDoc like we do the values for multisearcher? Then we could use a PQ like FieldDocPQ that used ords rather than vals?

EDIT

Hmmm...How do I get at the ordinals though? The value is exposed, but the ordinals are hidden behind a compare method...

asfimport commented 15 years ago

Mark Miller (@markrmiller) (migrated from JIRA)

Okay, in a pinch I guess we just grab the ordinals straight from the field cache and violate the comparator a bit. But we don't have the score docs until after running the collector, so we can't perch stuff on them.

Hmm - we need something where the current hits work with the standard get-ord mechanism (which has its own problems, because we compare, we don't look at ords) and the last hits work with an ord on the ScoreDoc or something. It's all ugly stuff in my head.

asfimport commented 15 years ago

Michael McCandless (@mikemccand) (migrated from JIRA)

I'm exploring one possible approach, with a new Comparator API that's told when to switch to the next subReader (which gives it the chance to translate the ords in the queue). Not sure it'll work out yet though...

asfimport commented 15 years ago

Mark Miller (@markrmiller) (migrated from JIRA)

That's where I don't follow though - it's not ords in the queue, right? It's ScoreDocs. That's what's getting me at the moment.

asfimport commented 15 years ago

Michael McCandless (@mikemccand) (migrated from JIRA)

Attached initial patch (derived from one of the earlier patches). A lot of work remains. TestSort (and likely others) fail.

> That's where I don't follow though - it's not ords in the queue, right? It's ScoreDocs. That's what's getting me at the moment.

Exactly – so I built a first cut at the alternative "copy value" approach, where the comparator (new FieldComparator abstract class) is responsible for holding the values it needs for docs inserted into the queue. I also added TopFieldValueDocCollector (extends DocCollector), and ByValueFieldSortedHitQueue (extends PriorityQueue) that interacts with the FieldComparators. (We can change these names...). I updated IndexSearcher to use this new queue for field sorting.

This patch only handles SortField.{DOC,SCORE,INT} now, but I think the approach shows surprising early promise: I'm seeing a sizable performance gain for the "sort by int field" case (13.76 sec vs 17.95 sec for 300 queries getting the top 100 hits from 1M results) --> 23% faster. I verified for the test sort alg (above) that it's producing the right results (at least the top 40 docs match).

I didn't expect such a performance gain (I was hoping for not much performance loss, actually). I think it may be that although the initial value copy adds some cost, the within-queue comparisons are then faster because you don't have to deref back to the FieldCache array. It seems we keep accidentally discovering performance gains here :)

If we go forward with this approach I think it'd mean deprecating FieldSortedHitQueue & ScoreDocComparator, because I think there's no back-compatible way to migrate forward. I also like that this approach means we only need an iterator interface to FieldCache values (for LUCENE-831).

Mark can you look this over and see if it makes sense and maybe try to tackle the other sort types? String will be the most interesting but I think very doable.

asfimport commented 15 years ago

Mark Miller (@markrmiller) (migrated from JIRA)

Fantastic Mike! I've started working through the tests and I've corrected a small ordering problem and added most of the other basic types. This is great stuff I think, but I need some more time to digest it all. Still looks like String will be the interesting piece. I think we have to fill fields for the multisearcher as well, which is annoying, because the multisearcher will have to let the indexsearcher know to do it, or we may do it for no reason. Then we just have to clear up the setBase stuff. Light at the end of the tunnel!

asfimport commented 15 years ago

Mark Miller (@markrmiller) (migrated from JIRA)

It seems there are a few hoops to jump through with strings beyond what I was thinking. We don't have the full array of unique terms now, but rather a bunch of smaller arrays of unique terms, with overlap. That makes converting Strings to ords very difficult, right? We almost have to create the full array. Also, we still have to do some fancy footwork to get a proper ord. Then we have to juggle and track things so that we can return the String value in getValue (we are more concerned with ords, so it's not fun). It seems like a lot of spinning/tracking unless we go back to a full StringIndex. Perhaps it's just as good to just compare by String value and give up on ords? It really seems that by the time we jump through every hoop to create ords from Strings on a bunch of smaller StringIndexes, we will have eaten at least as much as it costs to do String.compare? Not a conclusion really though, I am still mucking...

Any insight?

asfimport commented 15 years ago

Mark Miller (@markrmiller) (migrated from JIRA)

I was off on the fillFields stuff, I forgot we are back to a single hit collector. EDIT - can't even follow my own thoughts - I wasn't off, you are just already handling the adjustment that needs to be made. I'd like to avoid filling fields unless we are in a multisearcher still though...

I've got everything working except custom and locale stuff, in a suboptimal way anyway. String values rather than ords, and there is plenty that can probably be improved.

4 tests still fail (3 with custom, 1 with locale). Still trying to lock down the best way to deal with String ords/values.

asfimport commented 15 years ago

Michael McCandless (@mikemccand) (migrated from JIRA)

> I've started working through the tests and I've corrected a small ordering problem and added most of the other basic types.

Great!

> Then we just have to clear up the setBase stuff.

Yeah I think we should remove that (and use setNextReader instead).

> That makes converting Strings to ords very difficult right?

Right (this was the challenging example Yonik brought up above).

How about something like this: at any given time, the slots are filled with an instance that has 1) the ord (that "matches" the current reader) and 2) the actual value. When transitioning readers you have to remap all ords to the new reader (but keep the value unchanged): for each slot, you take its String value and look it up in the new reader w/ binary search. If it's present, assign it the corresponding ord. If it's not present, it must fall between ord X and X+1, so assign it an ord of X+0.5.

Then proceed like normal, since all ords are now "matched" to your current reader. You compare with ords, never with values.

The one caveat is... we have to take care with precision. A float X+0.5 won't be precise enough. We could use double, or, we could use long and "remap" all valid ords to be 2X (even) their value, and all "in between" ords (carried over from the previous reader) to be odd values. I think we'd need long for this since in the worst case, if you have the max number of docs and each doc has a unique string, you would have 2^31 unique ords to represent.

asfimport commented 15 years ago

Uwe Schindler (@uschindler) (migrated from JIRA)

You do not need to map to long if you use negative integers as markers for in-between. So doc 0 has -1, doc 1 has -2, doc 2 has -3, ..., doc (2^31-1 == Integer.MAX_VALUE) would have (-2^31 == Integer.MIN_VALUE)

Just an idea.

asfimport commented 15 years ago

Uwe Schindler (@uschindler) (migrated from JIRA)

Sorry, that does not work because of comparing, but you could subtract 2^31 from the doc number and multiply by two.

asfimport commented 15 years ago

Mark Miller (@markrmiller) (migrated from JIRA)

> How about something like this:

Okay, I actually got a ways down that path before I gave up...it was clearer in my head before I got the new code - that had me trying to map one ord at a time (my fault, not the code of course) - but right, I'd have to map them all on reader change. Clears up the haze a bit.

Locale is in, custom coming. That will make all tests pass. Then I'll get back on this String stuff (switch to ords from the value stuff I have going now). Then cleanup. Very awesome stuff, thanks Mike.

asfimport commented 15 years ago

Mark Miller (@markrmiller) (migrated from JIRA)

What do we do about the old SortComparatorSource et al? I can't figure out a way to make old custom implementations work with this - they implement logic that compares ScoreDocs, and I can't see how I can make a new comparator that works based on custom SortComparators.

asfimport commented 15 years ago

Mark Miller (@markrmiller) (migrated from JIRA)

> You compare with ords, never with values.

But I still have to implement getValue for fillFields. That means either storing the value along with ords (at each slot), or storing the orig ord plus the orig StringIndex it came out of for each slot...right?

EDIT

Okay, that answer is in your earlier comment anyway - we have to track value by slot as well. I'm almost there...

asfimport commented 15 years ago

Michael McCandless (@mikemccand) (migrated from JIRA)

> You do not need to map to long, if you would use negative integers as marker for in between

Actually a variation on this would work: we could logically do the even/odd ord mapping, but shift it down into negative numbers to make use of the full int range. That'd give us enough precision.

asfimport commented 15 years ago

Michael McCandless (@mikemccand) (migrated from JIRA)

> What do we do about the old SortComparatorSource et al?

I think we have to deprecate SortComparatorSource and ScoreDocComparator (in favor of FieldComparator and we should add a FieldComparatorSource, too I guess). Ie, custom sort implementations would need to migrate to the new API?

asfimport commented 15 years ago

Michael McCandless (@mikemccand) (migrated from JIRA)

> Sorry, that does not work because of comparing, but you could subtract 2^31 from the doc number and multiply by two.

Ahh I missed that (I said the same thing 4 comments later... moving fast here!).

asfimport commented 15 years ago

Mark Miller (@markrmiller) (migrated from JIRA)

Does this have to wait for 3.0 then? How can we back-compatibly tell everyone they have to write new custom comparator implementations?

asfimport commented 15 years ago

Michael McCandless (@mikemccand) (migrated from JIRA)

> Does this have to wait for 3.0 then? How can we back-compatibly tell everyone they have to write new custom comparator implementations?

If we deprecate old and introduce new in 2.9, then we can remove old in 3.0, right?

asfimport commented 15 years ago

Mark Miller (@markrmiller) (migrated from JIRA)

But if you are using the old one it won't work with the new Searcher methods? I guess I'll have to think about it some more.

I've got the String stuff working (I think) except for one issue...if a bunch of ords already in the queue map to the same index, they need to be adjusted to be in order (eg when you convert a value, it's not in the terms list for the next binary search). I tried some naive tricks to adjust the number based on the old ord, but no real progress yet...

asfimport commented 15 years ago

Michael McCandless (@mikemccand) (migrated from JIRA)

> But if you are using the old one it won't work with the new Searcher methods? I guess I'll have to think about it some more.

I think the search methods, on seeing that there is a SortField.CUSTOM type and then seeing that it's a SortComparatorSource in there, would fall back to the current impl, but all other Sorts would use the new one?

> if a bunch of ords already in the queue map to the same index, they need to be adjusted to be in order

Hmmm! Perhaps on comparing two "odd" values that are the same we could fall back to a true String.compareTo (I don't like that added if, though; it should be rare). Or, better, we could add a "subord" to break the tie. That subord'd have to be computed by gathering all String values that sorted to the same "odd ord" and sorting them to assign subords.

asfimport commented 15 years ago

Yonik Seeley (@yonik) (migrated from JIRA)

> If it's not present, it must fall between ord X and X+1 so assign it an ord of X+0.5.

I went down this thought path in the past... float doesn't have the precision, double might, could use long and left shift to make space for "inbetween", etc.

One problem is that putting an ord "inbetween" isn't good enough since you may be mapping many values. At that point, one starts wondering if it's worth getting an ord, and what is solved or optimized by using ords.

Why are ords useful: (a) fast comparison when reordering the priority queue (insertion/removal of items); (b) fast comparison to determine if something should be inserted into the priority queue; (c) other (non-sorting) usage, using term numbers instead of term values.

Using ords for (a) seems expensive, as all ords in the queue must be translated. Using ords for (b) means calculating an ord only for the smallest element of the priority queue.

When doing a large search with a diverse set of keys to sort by, there are many more checks to see if an item should be inserted into the queue than are actually inserted. Also, when using an ord for (b), it doesn't have to be exact - it can be rounded since its purpose isn't an exact comparison, but just an optimization to determine if insertion into the priority queue can be skipped.

asfimport commented 15 years ago

Mark Miller (@markrmiller) (migrated from JIRA)

> One problem is that putting an ord "inbetween" isn't good enough since you may be mapping many values.

> I don't like that added if, though, it should be rare

Not rare, right? Let's say there are 20 values in the queue, and now the new reader has 2 values. Lots will map to the same index.

I'm still digesting the rest of what Yonik said. I don't have this String ord stuff working 100%, but I have it working sort of (seems to sort pieces right, but then the pieces are out of order). I don't think the precision exists to do what I'm doing anyway, but since it kind of works, I could probably bench the difference between using values rather than ords.

asfimport commented 15 years ago

Mark Miller (@markrmiller) (migrated from JIRA)

> But if you are using the old one it won't work with the new Searcher methods? I guess I'll have to think about it some more.

>> I think the search methods, on seeing that there is a SortField.CUSTOM type and then seeing that it's a SortComparatorSource in there, would fall back to the current impl, but all other Sorts would use the new one?

Ahhh...I was afraid of the ugliness. All right.

asfimport commented 15 years ago

Michael McCandless (@mikemccand) (migrated from JIRA)

> Not rare, right? Let's say there are 20 values in the queue, and now the new reader has 2 values. Lots will map to the same index.

True, not rare in that case, but then the added cost will be tiny since the new segment has relatively few hits.

asfimport commented 15 years ago

Michael McCandless (@mikemccand) (migrated from JIRA)

> At that point, one starts wondering if it's worth getting an ord, and what is solved or optimized by using ords.

I agree: all this complexity may not be worth it, if simple compare-by-value in fact performs well enough. I think we should benchmark both.

Mark if you get things to a semi-stable state & post a patch we can do some benching. You could actually just have two SortField types (STRING and STRING_ORD) then we could easily swap back & forth.

asfimport commented 15 years ago

Mark Miller (@markrmiller) (migrated from JIRA)

> Or, better, we could add a "subord" to break the tie. That subord'd have to be computed by gathering all String values that sorted to the same "odd ord" and sorting them to assign subords.

That seems like a bit of work to manage. What if I just kept a separate double[] of subords that holds the old mapping? If two slots are equal, we can fall down to that? We would have the whole new array, but the other way we have to track what maps to the same index, sort that, and then have a map or another array anyway. What do we save? We could have an int[] rather than a double[]? But a lot more juggling...

EDIT Nm...that logic doesn't quite work...

EDIT

Or it does...man, I have a scatterbrain and a half. I got it to work (at least partially - the tests pass). I'll now work on getting something up for you to test with. My main worry at this point is that I'm doing things inefficiently enough that it's not a fair test :)

asfimport commented 15 years ago

Mark Miller (@markrmiller) (migrated from JIRA)

Quick question: what should I use for the value version...Strings or StringIndex? StringIndex takes less memory, but requires two array lookups.

asfimport commented 15 years ago

Mark Miller (@markrmiller) (migrated from JIRA)

Here you go. I'm sorry I've made everything so messy :) I'll clean later.

There are now SortField.STRING_VAL and SortField.STRING_ORD.

STRING uses StringIndex, STRING_VAL uses String[], and STRING_ORD uses ordinals.

I haven't benched anything yet myself.

There is still a lot of cleanup to do beyond this String stuff (and the old comparators still don't work).

I haven't gotten over everything post-scramble yet; hope I'm not doing anything too stupid. Sort tests pass other than the custom comparator.