Closed asfimport closed 14 years ago
Uwe Schindler (@uschindler) (migrated from JIRA)
Also the current flex branch produces lots of unchecked warnings... The Generics policeman will visit them and will hopefully help fixing!
Michael McCandless (@mikemccand) (migrated from JIRA)
The Generics policeman will visit them and will hopefully help fixing!
Uh-oh... I sense heavy committing in flex branch's future!
Karthik K (migrated from JIRA)
What would the branch name for flex indexing ?
Uwe Schindler (@uschindler) (migrated from JIRA)
I Mike,
maybe I found an issue in LegacyFieldsEnum & Co:
@Override
): The new abstract Enums have no close method / not implement java.io.Closeable, so the underlying old enums are never closed by client code. Shouldn't all the enum classes not also be Closeable, even when the new Codec API current would implement these as a no-op for core classes. But maybe someone creates a codec that needs close?Uh-oh... I sense heavy committing in flex branch's future!
It is not so much, mainly in code added before the final Java 1.5 switch. Should be easy to fix, will look into it in a few days. So the police inspector only has few minor complaints :-)
Michael McCandless (@mikemccand) (migrated from JIRA)
What would the branch name for flex indexing ?
It's https://svn.apache.org/repos/asf/lucene/java/branches/flex_1458
Michael McCandless (@mikemccand) (migrated from JIRA)
The LegacyTermsEnum correctly seeks to the first Term/TermRef but as the deprec TermEnum still iterates after the last term, the TermsEnum returns all terms from all later fields, too. So seek() and next() should also do a field==Term.field() comparison and set SeekStatus/return value correctly.
Nice catch – I'll open a new issue & fix.
Also (you see it because missing
@Override
): The new abstract Enums have no close method / not implement java.io.Closeable, so the underlying old enums are never closed by client code. Shouldn't all the enum classes not also be Closeable, even when the new Codec API current would implement these as a no-op for core classes. But maybe someone creates a codec that needs close?
I actually intentionally left .close() out of all *Enums.
First, to strongly bias impls from doing such costly things that close is necessary. These enums are used in hotspots during searching.
Second, because for Lucene's core impls, close() is [almost?] always a no-op: these impls are very lightweight.
Third, because in Lucene we don't consistently close the enums we pull, today, so we have confusion (there have been posts to java-user about this – "do I need to close the TermEnum/TermDocs"). I'd rather not add a "close" that for all core impls is a no-op and so Lucene doesn't have to call close.
Fourth, because it complicates our impls if we really must close whenever we pull a enum – eg our Scorers pull enums today, but never close them.
Michael McCandless (@mikemccand) (migrated from JIRA)
Attached patch – will commit soon. I found a bug in the "flex API on non-flex" layer (the preflex codec) – exposed with a new test case in TestBackCompat, and fixed. Also cleaned up some nocommits, added indexDivisor to the loadTermsIndex API, and fixed preflex to actually implement it.
Michael McCandless (@mikemccand) (migrated from JIRA)
Attached patch, which adds some nice test coverage of the back compat layers. We still need more tests but this is a good step forward...
Michael McCandless (@mikemccand) (migrated from JIRA)
New patch attached – strengthened the back compat testing, which uncovered some issues in the back compat layers, that I've now fixed.
I ran the FlexTestUtil.verifyFlexVsPreFlex on a 5M doc pre-flex wikipedia index, and a 1M doc flex wikipedia index, with no problems.
I also ran my NRT stress test, starting from a pre-flex 5M wikipedia, and indexing using flex, so this is a good test of mixed pre/post flex segments, with no problems.
Getting closer...
Michael McCandless (@mikemccand) (migrated from JIRA)
Attached patch, changing oal.index.TermRef -> oal.util.BytesRef.
I think, eventually, we should fix the various places that refer to byte slices, eg Field.get/setBinary*, Payload, UnicodeUtil.UTF8Result, IndexOutput.writeBytes, IndexInput.readBytes, to use BytesRef instead.
Michael McCandless (@mikemccand) (migrated from JIRA)
Attached patch w/ various fixes:
Switch over payloads to use BytesRef, in flex API
DocsEnum.positions now returns null if no positions were indexed (ie omitTFAP was set for the field). Also fixed Phrase/SpanQuery to throw IllegalStateException when run against an omitTFAP field.
Rename PositionsConsumer.addPosition -> .add
Michael McCandless (@mikemccand) (migrated from JIRA)
Attached patch with a major reworking of some parts of flex:
Simplified how the StandardTermsDictReader/Writer interacts with the postings impl. The PostingsReader for the codec is now stateless, capturing all state for a given term in a dedicated TermState class (which also works well w/ caching, since we needed to capture state for that anyway).
Merged docs & positions readers, in the codec's impl and in the exposed flex API. It was just too hairy before, with separate classes for reading docs & positions. This is a step back towards current trunk API, ie, up front you ask for either a DocsEnum or a DocsAndPositionsEnum.
Modified API semantics: if a field or term does not exist, then IndexReader.termDocs/PositionsEnum may now return null (previously they returned a fake empty enum). This means more Weight.scorer() may return null.
I added IndexReader.getSubReaderDocBase (there is a separate jira issue open for this) – this is now more important because a filter can no longer guess its doc base by adding up docCount of all readers it sees since if the scorer for that segment is null, Filter.getDocIdSet will not be called.
Changed the reuse of Docs/AndPositionsEnum to be explicit. Previously the Terms or TermsEnum instance was holding a private reused instance... but that was no good because typically we can share the TermsEnum but cannot share postings enums.
Likeways, changed the public flex reading API, so that you don't separately ask for positions enum at each doc. Instead, up front you either ask for a DocsEnum or a DocsAndPositionsEnum. This matches how the current Lucene APIs work.
Terms dict cache is now at the top level, not per field (this matches how trunk works, ie all fields share the 1024 sized cache)
I cutover all codecs to the new API... all tests pass if you switch the default codec (in oal.index.codec.Codecs.getWriter) to any of the four.
Michael McCandless (@mikemccand) (migrated from JIRA)
New flex patch attached:
I factored out separate public Multi* (Fields, Terms, etc.) from DirectoryReader, DirectoryReader. These classes merge multiple flex "sub readers" into a single flex API on the fly.
Refactored all places that need to merge sub-readers to use this API (DirectoryReader, MultiReader, SegmentMerger). This is cleaner because previously SegmentMerger had its own duplicated code for doing this merging; now we have a single source for it (though merging swaps in its own docs/positions enum, to remap docIDs around deletions).
Changed the semantics of IndexReader.fields() – for a multi reader (any reader that consist of sequential sub readers), fields() now throws UOE.
This is an important change with flex – the caller now bears responsibility for create a MultiFields if they really need it.
My thinking is that primary places in Lucene that consume postings now operate per-segment, so a multi reader (Dir/MultiReader) should not automatically "join up high" because it entails a hidden performance hit. So consumers that must access the flex API at the multi reader level should be explicit about it...
However, to make this simple, I created a sugar static methods on MultiFields (eg, MultiFields.getFields(IndexReader)) to easily do this, and cutover places in Lucene that may need direct postings from a multi-reader to use this method.
I've updated the javadocs explaining this.
Robert Muir (@rmuir) (migrated from JIRA)
Mike, here is a patch for removal of fuzzy nocommits:
Robert Muir (@rmuir) (migrated from JIRA)
btw, i benched that patch with my contrived benchmark for #3165, wierd that flex was slower than trunk before. numbers are stable across many iterations.
unpatched flex | patched flex | trunk |
---|---|---|
4362ms | 3239ms | 3459ms |
Uwe Schindler (@uschindler) (migrated from JIRA)
Mike: I reviewed this EmptyTermsEnum in MTQ. I would leave it in, but simply make EmptyTermsEnum a singleton (which is perfectly fine, because its stateless). Returning null here makes no performance in MTQs, it only makes the code in MTQ#rewrite and MTQWF#getDocIdSet ugly. The biggest problem with returning null here is the backwards layer that must be fixed then (because it checks if getTermsEnum return null and falls back to FilteredTermEnum from trunk). If you really want null, getTermsEnum should per default (if not overriddden) throw UOE and the rewrite code should catch this UOE and only then delegate to backwards layer.
Uwe Schindler (@uschindler) (migrated from JIRA)
Here the EmptyTermsEnum singleton patch (against flex trunk).
Michael McCandless (@mikemccand) (migrated from JIRA)
Robert, I think flex was faster before because the previous impl of Multi*Enums was using the same Docs/AndPositionsEnums before. This patch fixes that.
Uwe Schindler (@uschindler) (migrated from JIRA)
Updated patch for empty TermsEnum. It is now a singleton in TermsEnum class itsself.
Michael McCandless (@mikemccand) (migrated from JIRA)
New patch looks good Uwe – thanks for re-merging!
Michael McCandless (@mikemccand) (migrated from JIRA)
Robert, I think flex was faster before because the previous impl of Multi*Enums was using the same Docs/AndPositionsEnums before. This patch fixes that.
Ahh, and also because your patch switches from String to char[], which should improve perf.
Your patch looks good Robert! Thanks.
Robert Muir (@rmuir) (migrated from JIRA)
Ahh, and also because your patch switches from String to char[], which should improve perf.
actually i didnt apply your LUCENE-2111 when running the benchmark (the improvement is simply the char[]). the test is now actually slightly slower now with the rest of LUCENE-2111
Robert Muir (@rmuir) (migrated from JIRA)
here is a rough patch, that merges BytesRef/UnicodeUtil
there are some breaks (e.g. binary api compat), but its an internal api.
Michael McCandless (@mikemccand) (migrated from JIRA)
Patch looks good Robert! Thanks :)
Why not just remove UTF8Result altogether? (ie don't bother deprecating). It's an internal API...
The new method to compute hash is great, saving the extra pass in THPF.
Robert Muir (@rmuir) (migrated from JIRA)
ok, i ditched UTF8 result entirely, committed in revision 915511
Robert Muir (@rmuir) (migrated from JIRA)
attached is a patch that changes various exposed apis to use @lucene
.experimental
i didnt mess with IndexFileNames as there is an open issue about it right now.
Robert Muir (@rmuir) (migrated from JIRA)
these tags are added in revision 915791.
Michael McCandless (@mikemccand) (migrated from JIRA)
Attached patch, fixing some more nocommits, and renaming BytesRef.toString -> BytesRef.utf8ToString.
Robert Muir (@rmuir) (migrated from JIRA)
Here is a few more toString -> utf8ToString. will look at the backwards tests now
Robert Muir (@rmuir) (migrated from JIRA)
a few more easy nocommits
Michael McCandless (@mikemccand) (migrated from JIRA)
Attached patch, fixes flex APIs to not return null (instead return .EMPTY objects).
Robert Muir (@rmuir) (migrated from JIRA)
patch for review of flex merge
Robert Muir (@rmuir) (migrated from JIRA)
patch for review, backwards tests merge to flex
Michael McCandless (@mikemccand) (migrated from JIRA)
Cuts over to returning .EMPTY instead of null when requesting enums.
Also disallows Multi/DirReader.getDeletedDocs in favor of static convenience method MultiFields.getDeletedDocs.
But... we now need a way to determine that a codec does not store positions. I [illegally, adding nocommits] had to add a few places where I check the return result against .EMPTY.
Michael McCandless (@mikemccand) (migrated from JIRA)
But... we now need a way to determine that a codec does not store positions.
Thinking more about this... I think we should switch back to a null return from .docsAndPositionsEnum if the codec doesn't support positions. We only return .EMPTY if the enum is really just empty.
Robert Muir (@rmuir) (migrated from JIRA)
Patch for MTQ to not use null in getTermsEnum for backwards compat, so null can have some other meaning. Instead it uses VirtualMethod, with the default implementatinos throwing UOE.
Robert Muir (@rmuir) (migrated from JIRA)
Uwe asked for a test for the MTQ back compat... attached is one. I think it looks kinda dumb but if its useful, I'll commit it.
Michael McCandless (@mikemccand) (migrated from JIRA)
Changed .iterator() to never return null, MTQ.getTermsEnum() to never return null, but IR.fields() and Fields.terms(String field), and .docs/.docsAndPositions can return null.
Also whittled down more nocommits – down to 53 now!
Michael McCandless (@mikemccand) (migrated from JIRA)
Back compat test for MTQ looks good Robert... thanks!
Michael McCandless (@mikemccand) (migrated from JIRA)
Down to 15 nocommits!
Michael McCandless (@mikemccand) (migrated from JIRA)
Same patch, just fixes the Java 1.6 only changes (adding @Override
to interface).
Michael McCandless (@mikemccand) (migrated from JIRA)
New rev, just a few changes:
Rename BytesRef.toBytesString -> toString (thanks Robert!)
No more BytesRef.Comparator – just use java.util.Comparator<BytesRef> (thanks Uwe!)
Use Set<String> not Collection<String> when asking Codec for its files/extensions (thanks Mark!)
I'll commit (on flex branch) sometime today. Making me nervous carrying such a large patch...
Michael McCandless (@mikemccand) (migrated from JIRA)
More nocommit reductions and other fixes:
Rename Codecs -> CodecProvider
Try to optimize flex API on pre-flex index (don't clone the SegmentTermEnum during seek)
Use SegmentReadState class when getting fieldsProducer (matches SegmentWriteState)
Cuts over to a new bulk-read API on DocsEnum that's designed to let int block codecs provide direct access to the int[] they have (saves extra copy).
Down to 9 nocommits!!
Michael McCandless (@mikemccand) (migrated from JIRA)
Forgot to add SegmentReadState.java
Michael McCandless (@mikemccand) (migrated from JIRA)
Also adds reuse to pre-flex API when getting a new TermsEnum.
Michael McCandless (@mikemccand) (migrated from JIRA)
Thanks Shai!
Michael McCandless (@mikemccand) (migrated from JIRA)
Duh – wrong issue! I only wish.... ;)
Michael McCandless (@mikemccand) (migrated from JIRA)
Attached patch, eliminating all remaining nocommits on flex branch! I turned most of them into TODOs :)
Michael McCandless (@mikemccand) (migrated from JIRA)
I'm benchmarking flex vs trunk, but uncovered a strange performance loss with WildcardQuery. I'm attaching the python wrapper around contrib/benchmark that I'm using. Hopefully this is something silly...
You have to edit flexBench.py, specificaly the TRUNK_DIR and FLEX_DIR must point to the .../contrib/benchmark of each source area, and you have to edit the WIKI_LINE_FILE and/or WIKI_FILE (I think WIKI_LINE_FILE can be None in which case it should (but I haven't tested recently!) fallback to parsing the .xml.bz2 wikipedia export).
I'll first build an index of the first 5M wikipedia docs, once for flex and once for trunk, and then run the test queries. It also tests the "flex API on trunk index" case, to test perf of the flex emulation layer... this layer is looking a bit slowish now but I'm not sure how much we can do to speed it up...
Run like this:
python -u flexBench.py -run test
I have it set to only test only the wildcard query uni*t right now... and I'm getting this result:
JAVA:
java version "1.6.0_17"
Java(TM) SE Runtime Environment (build 1.6.0_17-b04)
Java HotSpot(TM) 64-Bit Server VM (build 14.3-b01, mixed mode)
OS:
Linux centos 2.6.18-164.6.1.el5 `#1` SMP Tue Nov 3 16:12:36 EST 2009 x86_64 x86_64 x86_64 GNU/Linux
Index /x/lucene/flex.work.wiki.nd5M already exists...
Index /x/lucene/trunk.work.wiki.nd5M already exists...
Index /x/lucene/flex.work.random.nd5M already exists...
Index /x/lucene/trunk.work.random.nd5M already exists...
RUN: source=wiki query=un*t sort=None
run trunk...
cd /root/src/clean/lucene/contrib/benchmark
log: /root/src/clean/lucene/contrib/benchmark/logs/trunk.0
62.49 QPS
run flex on trunk index...
cd /root/src/flex.clean/contrib/benchmark
log: /root/src/flex.clean/contrib/benchmark/logs/flexOnTrunk.1
25.87 QPS [-58.6% worse]
run flex on flex index...
cd /root/src/flex.clean/contrib/benchmark
log: /root/src/flex.clean/contrib/benchmark/logs/flexOnFlex.2
39.30 QPS [-37.1% worse]
124623 hits
Other queries I've tested look OK so far...
Michael McCandless (@mikemccand) (migrated from JIRA)
Towards wrapping up flex, I ran a set of tests to benchmark flex's search performance vs trunk.
All tests are on a 5M doc Wikipedia index, best qps of 5 runs where each run runs the query for 5.0 seconds. Env is:
JAVA:
java version "1.6.0_17"
Java(TM) SE Runtime Environment (build 1.6.0_17-b04)
Java HotSpot(TM) 64-Bit Server VM (build 14.3-b01, mixed mode)
OS:
Linux centos 2.6.18-164.6.1.el5 `#1` SMP Tue Nov 3 16:12:36 EST 2009 x86_64 x86_64 x86_64 GNU/Linux
First table compares trunk against "flex on flex", ie, a flex index (fully reindexed after upgrading to flex):
Query | Tot hits | Sort | QPS trunk | QPS new | Pct change |
---|---|---|---|---|---|
1 | 591225 | 68.36 | 80.64 | 18.0% | |
title | 64.12 | 68.53 | 6.9% | ||
1 OR 2 | 953081 | 19.35 | 20.80 | 7.5% | |
title | 16.50 | 17.48 | 5.9% | ||
1 OR 2 OR 3 | 1131679 | 14.37 | 15.50 | 7.9% | |
title | 12.42 | 13.26 | 6.8% | ||
1 OR 2 OR 3 OR 4 | 1266805 | 10.94 | 12.76 | 16.6% | |
title | 10.36 | 11.05 | 6.7% | ||
1 AND 2 | 239303 | 21.19 | 22.32 | 5.3% | |
title | 22.77 | 24.25 | 6.5% | ||
1 AND 2 AND 3 | 109513 | 18.83 | 19.17 | 1.8% | |
title | 19.30 | 20.06 | 3.9% | ||
1 AND 2 AND 3 AND 4 | 60795 | 16.21 | 17.51 | 8.0% | |
title | 16.75 | 18.29 | 9.2% | ||
"united states" | 528845 | 7.54 | 8.54 | 13.3% | |
title | 7.36 | 8.14 | 10.6% | ||
"united states of america" | 12144 | 20.64 | 21.48 | 4.1% | |
title | 20.45 | 21.06 | 3.0% | ||
un* | 2250238 | 9.31 | 11.54 | 24.0% | |
title | 8.42 | 10.96 | 30.2% | ||
*ent | 2482701 | 0.32 | 0.92 | 187.5% | |
title | 0.32 | 0.91 | 184.4% | ||
u*t | 169192 | 18.53 | 47.97 | 158.9% | |
title | 17.26 | 40.10 | 132.3% | ||
uni* | 1308332 | 18.54 | 23.49 | 26.7% | |
title | 16.28 | 20.02 | 23.0% | ||
un*t | 124623 | 62.13 | 105.23 | 69.4% | |
title | 50.38 | 74.99 | 48.8% | ||
?t | 554722 | 0.51 | 29.31 | 5647.1% | |
title | 0.51 | 26.25 | 5047.1% | ||
??t | 1605437 | 0.60 | 6.69 | 1015.0% | |
title | 0.60 | 6.22 | 936.7% | ||
???t | 3100067 | 0.54 | 1.92 | 255.6% | |
title | 0.53 | 1.89 | 256.6% | ||
????t | 2973045 | 0.51 | 0.71 | 39.2% | |
title | 0.51 | 0.70 | 37.3% | ||
?????t | 2323871 | 0.51 | 0.39 | -23.5% | |
title | 0.50 | 0.39 | -22.0% | ||
??????t | 2459025 | 0.49 | 0.31 | -36.7% | |
title | 0.48 | 0.15 | -68.7% | ||
un?t | 86664 | 92.45 | 241.46 | 161.2% | |
title | 72.59 | 151.28 | 108.4% | ||
un??t | 2860 | 222.11 | 408.52 | 83.9% | |
title | 220.91 | 405.84 | 83.7% | ||
un???t | 5828 | 117.38 | 99.64 | -15.1% | |
title | 111.47 | 98.64 | -11.5% | ||
un????t | 1426 | 207.03 | 100.60 | -51.4% | |
title | 207.23 | 101.36 | -51.1% | ||
united\~0.5 | 872873 | 0.35 | 0.31 | -11.4% | |
title | 0.35 | 0.31 | -11.4% | ||
united\~0.6 | 764041 | 0.46 | 5.22 | 1034.8% | |
title | 0.45 | 5.00 | 1011.1% | ||
united\~0.7 | 695756 | 0.59 | 21.19 | 3491.5% | |
title | 0.60 | 19.10 | 3083.3% | ||
united\~0.8 | 693134 | 0.59 | 21.44 | 3533.9% | |
title | 0.58 | 19.55 | 3270.7% | ||
united\~0.9 | 692299 | 57.06 | 67.80 | 18.8% | |
title | 55.28 | 57.87 | 4.7% |
I also ran the same queries through, but this time using the trunk (pre-flex) index with flex, ie to perf test the "flex on pre-flex" emulation layer. This is the initial experience users will see if they upgrade to flex but don't reindex:
Query | Tot hits | Sort | QPS trunk | QPS new | Pct change |
---|---|---|---|---|---|
1 | 591225 | 68.36 | 66.91 | -2.1% | |
title | 64.12 | 58.47 | -8.8% | ||
1 OR 2 | 953081 | 19.35 | 19.06 | -1.5% | |
title | 16.50 | 16.03 | -2.8% | ||
1 OR 2 OR 3 | 1131679 | 14.37 | 14.14 | -1.6% | |
title | 12.42 | 12.11 | -2.5% | ||
1 OR 2 OR 3 OR 4 | 1266805 | 10.94 | 11.61 | 6.1% | |
title | 10.36 | 10.04 | -3.1% | ||
1 AND 2 | 239303 | 21.19 | 21.12 | -0.3% | |
title | 22.77 | 22.46 | -1.4% | ||
1 AND 2 AND 3 | 109513 | 18.83 | 18.81 | -0.1% | |
title | 19.30 | 19.29 | -0.1% | ||
1 AND 2 AND 3 AND 4 | 60795 | 16.21 | 17.18 | 6.0% | |
title | 16.75 | 17.46 | 4.2% | ||
"united states" | 528845 | 7.54 | 7.63 | 1.2% | |
title | 7.36 | 7.12 | -3.3% | ||
"united states of america" | 12144 | 20.64 | 19.33 | -6.3% | |
title | 20.45 | 19.50 | -4.6% | ||
un* | 2250238 | 9.31 | 9.79 | 5.2% | |
title | 8.42 | 9.65 | 14.6% | ||
*ent | 2482701 | 0.32 | 0.45 | 40.6% | |
title | 0.32 | 0.45 | 40.6% | ||
u*t | 169192 | 18.53 | 24.75 | 33.6% | |
title | 17.26 | 21.96 | 27.2% | ||
uni* | 1308332 | 18.54 | 19.39 | 4.6% | |
title | 16.28 | 15.86 | -2.6% | ||
un*t | 124623 | 62.13 | 59.73 | -3.9% | |
title | 50.38 | 48.51 | -3.7% | ||
?t | 554722 | 0.51 | 23.65 | 4537.3% | |
title | 0.51 | 21.42 | 4100.0% | ||
??t | 1605437 | 0.60 | 5.13 | 755.0% | |
title | 0.60 | 4.61 | 668.3% | ||
???t | 3100067 | 0.54 | 1.28 | 137.0% | |
title | 0.53 | 1.24 | 134.0% | ||
????t | 2973045 | 0.51 | 0.55 | 7.8% | |
title | 0.51 | 0.54 | 5.9% | ||
?????t | 2323871 | 0.51 | 0.29 | -43.1% | |
title | 0.50 | 0.29 | -42.0% | ||
??????t | 2459025 | 0.49 | 0.18 | -63.3% | |
title | 0.48 | 0.21 | -56.2% | ||
un?t | 86664 | 92.45 | 202.48 | 119.0% | |
title | 72.59 | 134.55 | 85.4% | ||
un??t | 2860 | 222.11 | 187.05 | -15.8% | |
title | 220.91 | 186.81 | -15.4% | ||
un???t | 5828 | 117.38 | 69.30 | -41.0% | |
title | 111.47 | 68.59 | -38.5% | ||
un????t | 1426 | 207.03 | 60.98 | -70.5% | |
title | 207.23 | 60.62 | -70.7% | ||
united\~0.5 | 872873 | 0.35 | 0.23 | -34.3% | |
title | 0.35 | 0.23 | -34.3% | ||
united\~0.6 | 764041 | 0.46 | 3.84 | 734.8% | |
title | 0.45 | 3.76 | 735.6% | ||
united\~0.7 | 695756 | 0.59 | 17.45 | 2857.6% | |
title | 0.60 | 15.53 | 2488.3% | ||
united\~0.8 | 693134 | 0.59 | 17.56 | 2876.3% | |
title | 0.58 | 15.97 | 2653.4% | ||
united\~0.9 | 692299 | 57.06 | 56.02 | -1.8% | |
title | 55.28 | 49.26 | -10.9% |
There are alot of numbers to absorb... but here's my take:
Flex is generally faster.
Fuzzy queries and certain wildcard queries (using AutomatonQuery) are insanely faster.
There are certain specific wildcard corner cases where we are slower, but these are likely rarely used in practice (many ?'s followed by a suffix).
Flex API on a trunk index does take a perf hit but it looks contained enough that we don't need to spend any time optimizing that emulation layer...
I also ran an indexing test (index first 10M docs of wikipedia) and flex and trunk had similar times.
I think net/net we are good to land flex!
Spinoff from #2532.
The flex branch is in fairly good shape – all tests pass, initial search performance testing looks good, it survived several visits from the Unicode policeman ;)
But it still has a number of nocommits, could use some more scrutiny especially on the "emulate old API on flex index" and vice/versa code paths, and still needs some more performance testing. I'll do these under this issue, and we should open separate issues for other self contained fixes.
The end is in sight!
Migrated from LUCENE-2111 by Michael McCandless (@mikemccand), resolved Apr 13 2010 Attachments: benchUtil.py, flex_backwards_merge_912395.patch, flex_merge_916543.patch, flexBench.py, LUCENE-2111_bytesRef.patch, LUCENE-2111_experimental.patch, LUCENE-2111_fuzzy.patch, LUCENE-2111_mtqNull.patch, LUCENE-2111_mtqTest.patch, LUCENE-2111_toString.patch, LUCENE-2111.patch (versions: 21), LUCENE-2111-EmptyTermsEnum.patch (versions: 2), Make2BTermsIndex.java Linked issues:
3217
3185
3186
2680