Wrapup flexible indexing [LUCENE-2111]

asfimport commented 14 years ago

Spinoff from #2532.

The flex branch is in fairly good shape – all tests pass, initial search performance testing looks good, it survived several visits from the Unicode policeman ;)

But it still has a number of nocommits, could use some more scrutiny especially on the "emulate old API on flex index" and vice/versa code paths, and still needs some more performance testing. I'll do these under this issue, and we should open separate issues for other self contained fixes.

The end is in sight!

Migrated from LUCENE-2111 by Michael McCandless (@mikemccand), resolved Apr 13 2010 Attachments: benchUtil.py, flex_backwards_merge_912395.patch, flex_merge_916543.patch, flexBench.py, LUCENE-2111_bytesRef.patch, LUCENE-2111_experimental.patch, LUCENE-2111_fuzzy.patch, LUCENE-2111_mtqNull.patch, LUCENE-2111_mtqTest.patch, LUCENE-2111_toString.patch, LUCENE-2111.patch (versions: 21), LUCENE-2111-EmptyTermsEnum.patch (versions: 2), Make2BTermsIndex.java Linked issues:

3217
- 3185
- 3186
- 2680

asfimport commented 14 years ago

Uwe Schindler (@uschindler) (migrated from JIRA)

Also the current flex branch produces lots of unchecked warnings... The Generics policeman will visit them and will hopefully help fixing!

asfimport commented 14 years ago

Michael McCandless (@mikemccand) (migrated from JIRA)

The Generics policeman will visit them and will hopefully help fixing!

Uh-oh... I sense heavy committing in flex branch's future!

asfimport commented 14 years ago

Karthik K (migrated from JIRA)

What would the branch name for flex indexing ?

asfimport commented 14 years ago

Uwe Schindler (@uschindler) (migrated from JIRA)

I Mike,

maybe I found an issue in LegacyFieldsEnum & Co:

The LegacyTermsEnum correctly seeks to the first Term/TermRef but as the deprec TermEnum still iterates after the last term, the TermsEnum returns all terms from all later fields, too. So seek() and next() should also do a field==Term.field() comparison and set SeekStatus/return value correctly.
Also (you see it because missing @Override): The new abstract Enums have no close method / not implement java.io.Closeable, so the underlying old enums are never closed by client code. Shouldn't all the enum classes not also be Closeable, even when the new Codec API current would implement these as a no-op for core classes. But maybe someone creates a codec that needs close?

Uh-oh... I sense heavy committing in flex branch's future!

It is not so much, mainly in code added before the final Java 1.5 switch. Should be easy to fix, will look into it in a few days. So the police inspector only has few minor complaints :-)

asfimport commented 14 years ago

Michael McCandless (@mikemccand) (migrated from JIRA)

What would the branch name for flex indexing ?

It's https://svn.apache.org/repos/asf/lucene/java/branches/flex_1458

asfimport commented 14 years ago

Michael McCandless (@mikemccand) (migrated from JIRA)

The LegacyTermsEnum correctly seeks to the first Term/TermRef but as the deprec TermEnum still iterates after the last term, the TermsEnum returns all terms from all later fields, too. So seek() and next() should also do a field==Term.field() comparison and set SeekStatus/return value correctly.

Nice catch – I'll open a new issue & fix.

Also (you see it because missing @Override): The new abstract Enums have no close method / not implement java.io.Closeable, so the underlying old enums are never closed by client code. Shouldn't all the enum classes not also be Closeable, even when the new Codec API current would implement these as a no-op for core classes. But maybe someone creates a codec that needs close?

I actually intentionally left .close() out of all *Enums.

First, to strongly bias impls from doing such costly things that close is necessary. These enums are used in hotspots during searching.

Second, because for Lucene's core impls, close() is [almost?] always a no-op: these impls are very lightweight.

Third, because in Lucene we don't consistently close the enums we pull, today, so we have confusion (there have been posts to java-user about this – "do I need to close the TermEnum/TermDocs"). I'd rather not add a "close" that for all core impls is a no-op and so Lucene doesn't have to call close.

Fourth, because it complicates our impls if we really must close whenever we pull a enum – eg our Scorers pull enums today, but never close them.

asfimport commented 14 years ago

Michael McCandless (@mikemccand) (migrated from JIRA)

Attached patch – will commit soon. I found a bug in the "flex API on non-flex" layer (the preflex codec) – exposed with a new test case in TestBackCompat, and fixed. Also cleaned up some nocommits, added indexDivisor to the loadTermsIndex API, and fixed preflex to actually implement it.

asfimport commented 14 years ago

Michael McCandless (@mikemccand) (migrated from JIRA)

Attached patch, which adds some nice test coverage of the back compat layers. We still need more tests but this is a good step forward...

asfimport commented 14 years ago

Michael McCandless (@mikemccand) (migrated from JIRA)

New patch attached – strengthened the back compat testing, which uncovered some issues in the back compat layers, that I've now fixed.

I ran the FlexTestUtil.verifyFlexVsPreFlex on a 5M doc pre-flex wikipedia index, and a 1M doc flex wikipedia index, with no problems.

I also ran my NRT stress test, starting from a pre-flex 5M wikipedia, and indexing using flex, so this is a good test of mixed pre/post flex segments, with no problems.

Getting closer...

asfimport commented 14 years ago

Michael McCandless (@mikemccand) (migrated from JIRA)

Attached patch, changing oal.index.TermRef -> oal.util.BytesRef.

I think, eventually, we should fix the various places that refer to byte slices, eg Field.get/setBinary*, Payload, UnicodeUtil.UTF8Result, IndexOutput.writeBytes, IndexInput.readBytes, to use BytesRef instead.

asfimport commented 14 years ago

Michael McCandless (@mikemccand) (migrated from JIRA)

Attached patch w/ various fixes:

Switch over payloads to use BytesRef, in flex API
DocsEnum.positions now returns null if no positions were indexed (ie omitTFAP was set for the field). Also fixed Phrase/SpanQuery to throw IllegalStateException when run against an omitTFAP field.
Rename PositionsConsumer.addPosition -> .add

asfimport commented 14 years ago

Michael McCandless (@mikemccand) (migrated from JIRA)

Attached patch with a major reworking of some parts of flex:

Simplified how the StandardTermsDictReader/Writer interacts with the postings impl. The PostingsReader for the codec is now stateless, capturing all state for a given term in a dedicated TermState class (which also works well w/ caching, since we needed to capture state for that anyway).
Merged docs & positions readers, in the codec's impl and in the exposed flex API. It was just too hairy before, with separate classes for reading docs & positions. This is a step back towards current trunk API, ie, up front you ask for either a DocsEnum or a DocsAndPositionsEnum.
Modified API semantics: if a field or term does not exist, then IndexReader.termDocs/PositionsEnum may now return null (previously they returned a fake empty enum). This means more Weight.scorer() may return null.
I added IndexReader.getSubReaderDocBase (there is a separate jira issue open for this) – this is now more important because a filter can no longer guess its doc base by adding up docCount of all readers it sees since if the scorer for that segment is null, Filter.getDocIdSet will not be called.
Changed the reuse of Docs/AndPositionsEnum to be explicit. Previously the Terms or TermsEnum instance was holding a private reused instance... but that was no good because typically we can share the TermsEnum but cannot share postings enums.
Likeways, changed the public flex reading API, so that you don't separately ask for positions enum at each doc. Instead, up front you either ask for a DocsEnum or a DocsAndPositionsEnum. This matches how the current Lucene APIs work.
Terms dict cache is now at the top level, not per field (this matches how trunk works, ie all fields share the 1024 sized cache)

I cutover all codecs to the new API... all tests pass if you switch the default codec (in oal.index.codec.Codecs.getWriter) to any of the four.

asfimport commented 14 years ago

Michael McCandless (@mikemccand) (migrated from JIRA)

New flex patch attached:

I factored out separate public Multi* (Fields, Terms, etc.) from DirectoryReader, DirectoryReader. These classes merge multiple flex "sub readers" into a single flex API on the fly.
Refactored all places that need to merge sub-readers to use this API (DirectoryReader, MultiReader, SegmentMerger). This is cleaner because previously SegmentMerger had its own duplicated code for doing this merging; now we have a single source for it (though merging swaps in its own docs/positions enum, to remap docIDs around deletions).
Changed the semantics of IndexReader.fields() – for a multi reader (any reader that consist of sequential sub readers), fields() now throws UOE.

This is an important change with flex – the caller now bears responsibility for create a MultiFields if they really need it.

My thinking is that primary places in Lucene that consume postings now operate per-segment, so a multi reader (Dir/MultiReader) should not automatically "join up high" because it entails a hidden performance hit. So consumers that must access the flex API at the multi reader level should be explicit about it...

However, to make this simple, I created a sugar static methods on MultiFields (eg, MultiFields.getFields(IndexReader)) to easily do this, and cutover places in Lucene that may need direct postings from a multi-reader to use this method.

I've updated the javadocs explaining this.

asfimport commented 14 years ago

Robert Muir (@rmuir) (migrated from JIRA)

Mike, here is a patch for removal of fuzzy nocommits:

remove synchronization (not necessary, history here: LUCENE-296)
reuse char[] rather than create Strings
remove unused ctors

asfimport commented 14 years ago

Robert Muir (@rmuir) (migrated from JIRA)

btw, i benched that patch with my contrived benchmark for #3165, wierd that flex was slower than trunk before. numbers are stable across many iterations.

unpatched flex	patched flex	trunk
4362ms	3239ms	3459ms

asfimport commented 14 years ago

Uwe Schindler (@uschindler) (migrated from JIRA)

Mike: I reviewed this EmptyTermsEnum in MTQ. I would leave it in, but simply make EmptyTermsEnum a singleton (which is perfectly fine, because its stateless). Returning null here makes no performance in MTQs, it only makes the code in MTQ#rewrite and MTQWF#getDocIdSet ugly. The biggest problem with returning null here is the backwards layer that must be fixed then (because it checks if getTermsEnum return null and falls back to FilteredTermEnum from trunk). If you really want null, getTermsEnum should per default (if not overriddden) throw UOE and the rewrite code should catch this UOE and only then delegate to backwards layer.

asfimport commented 14 years ago

Uwe Schindler (@uschindler) (migrated from JIRA)

Here the EmptyTermsEnum singleton patch (against flex trunk).

asfimport commented 14 years ago

Michael McCandless (@mikemccand) (migrated from JIRA)

Robert, I think flex was faster before because the previous impl of Multi*Enums was using the same Docs/AndPositionsEnums before. This patch fixes that.

asfimport commented 14 years ago

Uwe Schindler (@uschindler) (migrated from JIRA)

Updated patch for empty TermsEnum. It is now a singleton in TermsEnum class itsself.

asfimport commented 14 years ago

Michael McCandless (@mikemccand) (migrated from JIRA)

New patch looks good Uwe – thanks for re-merging!

asfimport commented 14 years ago

Michael McCandless (@mikemccand) (migrated from JIRA)

Robert, I think flex was faster before because the previous impl of Multi*Enums was using the same Docs/AndPositionsEnums before. This patch fixes that.

Ahh, and also because your patch switches from String to char[], which should improve perf.

Your patch looks good Robert! Thanks.

asfimport commented 14 years ago

Robert Muir (@rmuir) (migrated from JIRA)

Ahh, and also because your patch switches from String to char[], which should improve perf.

actually i didnt apply your LUCENE-2111 when running the benchmark (the improvement is simply the char[]). the test is now actually slightly slower now with the rest of LUCENE-2111

asfimport commented 14 years ago

Robert Muir (@rmuir) (migrated from JIRA)

here is a rough patch, that merges BytesRef/UnicodeUtil

added a fn to UnicodeUtil that computes the hash as it encodes, for termshash
when doing utf16->utf8 conversion, remove an if for every character by simple allocating utf16*4
some method signatures were generalized from String -> CharSequence (i.e. so you could create a BytesRef from a StringBuilder if you want)

there are some breaks (e.g. binary api compat), but its an internal api.

asfimport commented 14 years ago

Michael McCandless (@mikemccand) (migrated from JIRA)

Patch looks good Robert! Thanks :)

Why not just remove UTF8Result altogether? (ie don't bother deprecating). It's an internal API...

The new method to compute hash is great, saving the extra pass in THPF.

asfimport commented 14 years ago

Robert Muir (@rmuir) (migrated from JIRA)

ok, i ditched UTF8 result entirely, committed in revision 915511

asfimport commented 14 years ago

Robert Muir (@rmuir) (migrated from JIRA)

attached is a patch that changes various exposed apis to use @lucene.experimental

i didnt mess with IndexFileNames as there is an open issue about it right now.

asfimport commented 14 years ago

Robert Muir (@rmuir) (migrated from JIRA)

these tags are added in revision 915791.

asfimport commented 14 years ago

Michael McCandless (@mikemccand) (migrated from JIRA)

Attached patch, fixing some more nocommits, and renaming BytesRef.toString -> BytesRef.utf8ToString.

asfimport commented 14 years ago

Robert Muir (@rmuir) (migrated from JIRA)

Here is a few more toString -> utf8ToString. will look at the backwards tests now

asfimport commented 14 years ago

Robert Muir (@rmuir) (migrated from JIRA)

a few more easy nocommits

asfimport commented 14 years ago

Michael McCandless (@mikemccand) (migrated from JIRA)

Attached patch, fixes flex APIs to not return null (instead return .EMPTY objects).

asfimport commented 14 years ago

Robert Muir (@rmuir) (migrated from JIRA)

patch for review of flex merge

asfimport commented 14 years ago

Robert Muir (@rmuir) (migrated from JIRA)

patch for review, backwards tests merge to flex

asfimport commented 14 years ago

Michael McCandless (@mikemccand) (migrated from JIRA)

Cuts over to returning .EMPTY instead of null when requesting enums.

Also disallows Multi/DirReader.getDeletedDocs in favor of static convenience method MultiFields.getDeletedDocs.

But... we now need a way to determine that a codec does not store positions. I [illegally, adding nocommits] had to add a few places where I check the return result against .EMPTY.

asfimport commented 14 years ago

Michael McCandless (@mikemccand) (migrated from JIRA)

But... we now need a way to determine that a codec does not store positions.

Thinking more about this... I think we should switch back to a null return from .docsAndPositionsEnum if the codec doesn't support positions. We only return .EMPTY if the enum is really just empty.

asfimport commented 14 years ago

Robert Muir (@rmuir) (migrated from JIRA)

Patch for MTQ to not use null in getTermsEnum for backwards compat, so null can have some other meaning. Instead it uses VirtualMethod, with the default implementatinos throwing UOE.

asfimport commented 14 years ago

Robert Muir (@rmuir) (migrated from JIRA)

Uwe asked for a test for the MTQ back compat... attached is one. I think it looks kinda dumb but if its useful, I'll commit it.

asfimport commented 14 years ago

Michael McCandless (@mikemccand) (migrated from JIRA)

Changed .iterator() to never return null, MTQ.getTermsEnum() to never return null, but IR.fields() and Fields.terms(String field), and .docs/.docsAndPositions can return null.

Also whittled down more nocommits – down to 53 now!

asfimport commented 14 years ago

Michael McCandless (@mikemccand) (migrated from JIRA)

Back compat test for MTQ looks good Robert... thanks!

asfimport commented 14 years ago

Michael McCandless (@mikemccand) (migrated from JIRA)

Down to 15 nocommits!

asfimport commented 14 years ago

Michael McCandless (@mikemccand) (migrated from JIRA)

Same patch, just fixes the Java 1.6 only changes (adding @Override to interface).

asfimport commented 14 years ago

Michael McCandless (@mikemccand) (migrated from JIRA)

New rev, just a few changes:

Rename BytesRef.toBytesString -> toString (thanks Robert!)
No more BytesRef.Comparator – just use java.util.Comparator<BytesRef> (thanks Uwe!)
Use Set<String> not Collection<String> when asking Codec for its files/extensions (thanks Mark!)

I'll commit (on flex branch) sometime today. Making me nervous carrying such a large patch...

asfimport commented 14 years ago

Michael McCandless (@mikemccand) (migrated from JIRA)

More nocommit reductions and other fixes:

Rename Codecs -> CodecProvider
Try to optimize flex API on pre-flex index (don't clone the SegmentTermEnum during seek)
Use SegmentReadState class when getting fieldsProducer (matches SegmentWriteState)
Cuts over to a new bulk-read API on DocsEnum that's designed to let int block codecs provide direct access to the int[] they have (saves extra copy).
Down to 9 nocommits!!

asfimport commented 14 years ago

Michael McCandless (@mikemccand) (migrated from JIRA)

Forgot to add SegmentReadState.java

asfimport commented 14 years ago

Michael McCandless (@mikemccand) (migrated from JIRA)

Also adds reuse to pre-flex API when getting a new TermsEnum.

asfimport commented 14 years ago

Michael McCandless (@mikemccand) (migrated from JIRA)

Thanks Shai!

asfimport commented 14 years ago

Michael McCandless (@mikemccand) (migrated from JIRA)

Duh – wrong issue! I only wish.... ;)

asfimport commented 14 years ago

Michael McCandless (@mikemccand) (migrated from JIRA)

Attached patch, eliminating all remaining nocommits on flex branch! I turned most of them into TODOs :)

asfimport commented 14 years ago

Michael McCandless (@mikemccand) (migrated from JIRA)

I'm benchmarking flex vs trunk, but uncovered a strange performance loss with WildcardQuery. I'm attaching the python wrapper around contrib/benchmark that I'm using. Hopefully this is something silly...

You have to edit flexBench.py, specificaly the TRUNK_DIR and FLEX_DIR must point to the .../contrib/benchmark of each source area, and you have to edit the WIKI_LINE_FILE and/or WIKI_FILE (I think WIKI_LINE_FILE can be None in which case it should (but I haven't tested recently!) fallback to parsing the .xml.bz2 wikipedia export).

I'll first build an index of the first 5M wikipedia docs, once for flex and once for trunk, and then run the test queries. It also tests the "flex API on trunk index" case, to test perf of the flex emulation layer... this layer is looking a bit slowish now but I'm not sure how much we can do to speed it up...

Run like this:

python -u flexBench.py -run test

I have it set to only test only the wildcard query uni*t right now... and I'm getting this result:

JAVA:
java version "1.6.0_17"
Java(TM) SE Runtime Environment (build 1.6.0_17-b04)
Java HotSpot(TM) 64-Bit Server VM (build 14.3-b01, mixed mode)

OS:
Linux centos 2.6.18-164.6.1.el5 `#1` SMP Tue Nov 3 16:12:36 EST 2009 x86_64 x86_64 x86_64 GNU/Linux

Index /x/lucene/flex.work.wiki.nd5M already exists...
Index /x/lucene/trunk.work.wiki.nd5M already exists...
Index /x/lucene/flex.work.random.nd5M already exists...
Index /x/lucene/trunk.work.random.nd5M already exists...

RUN: source=wiki query=un*t sort=None
  run trunk...
    cd /root/src/clean/lucene/contrib/benchmark
    log: /root/src/clean/lucene/contrib/benchmark/logs/trunk.0
    62.49 QPS
  run flex on trunk index...
    cd /root/src/flex.clean/contrib/benchmark
    log: /root/src/flex.clean/contrib/benchmark/logs/flexOnTrunk.1
    25.87 QPS [-58.6% worse]
  run flex on flex index...
    cd /root/src/flex.clean/contrib/benchmark
    log: /root/src/flex.clean/contrib/benchmark/logs/flexOnFlex.2
    39.30 QPS [-37.1% worse]
  124623 hits

Other queries I've tested look OK so far...

asfimport commented 14 years ago

Michael McCandless (@mikemccand) (migrated from JIRA)

Towards wrapping up flex, I ran a set of tests to benchmark flex's search performance vs trunk.

All tests are on a 5M doc Wikipedia index, best qps of 5 runs where each run runs the query for 5.0 seconds. Env is:

JAVA:
java version "1.6.0_17"
Java(TM) SE Runtime Environment (build 1.6.0_17-b04)
Java HotSpot(TM) 64-Bit Server VM (build 14.3-b01, mixed mode)

OS:
Linux centos 2.6.18-164.6.1.el5 `#1` SMP Tue Nov 3 16:12:36 EST 2009 x86_64 x86_64 x86_64 GNU/Linux

First table compares trunk against "flex on flex", ie, a flex index (fully reindexed after upgrading to flex):

Query	Tot hits	Sort	QPS trunk	QPS new
1	591225	68.36	80.64	18.0%
title	64.12	68.53	6.9%
1 OR 2	953081	19.35	20.80	7.5%
title	16.50	17.48	5.9%
1 OR 2 OR 3	1131679	14.37	15.50	7.9%
title	12.42	13.26	6.8%
1 OR 2 OR 3 OR 4	1266805	10.94	12.76	16.6%
title	10.36	11.05	6.7%
1 AND 2	239303	21.19	22.32	5.3%
title	22.77	24.25	6.5%
1 AND 2 AND 3	109513	18.83	19.17	1.8%
title	19.30	20.06	3.9%
1 AND 2 AND 3 AND 4	60795	16.21	17.51	8.0%
title	16.75	18.29	9.2%
"united states"	528845	7.54	8.54	13.3%
title	7.36	8.14	10.6%
"united states of america"	12144	20.64	21.48	4.1%
title	20.45	21.06	3.0%
un*	2250238	9.31	11.54	24.0%
title	8.42	10.96	30.2%
*ent	2482701	0.32	0.92	187.5%
title	0.32	0.91	184.4%
u*t	169192	18.53	47.97	158.9%
title	17.26	40.10	132.3%
uni*	1308332	18.54	23.49	26.7%
title	16.28	20.02	23.0%
un*t	124623	62.13	105.23	69.4%
title	50.38	74.99	48.8%
?t	554722	0.51	29.31	5647.1%
title	0.51	26.25	5047.1%
??t	1605437	0.60	6.69	1015.0%
title	0.60	6.22	936.7%
???t	3100067	0.54	1.92	255.6%
title	0.53	1.89	256.6%
????t	2973045	0.51	0.71	39.2%
title	0.51	0.70	37.3%
?????t	2323871	0.51	0.39	-23.5%
title	0.50	0.39	-22.0%
??????t	2459025	0.49	0.31	-36.7%
title	0.48	0.15	-68.7%
un?t	86664	92.45	241.46	161.2%
title	72.59	151.28	108.4%
un??t	2860	222.11	408.52	83.9%
title	220.91	405.84	83.7%
un???t	5828	117.38	99.64	-15.1%
title	111.47	98.64	-11.5%
un????t	1426	207.03	100.60	-51.4%
title	207.23	101.36	-51.1%
united\~0.5	872873	0.35	0.31	-11.4%
title	0.35	0.31	-11.4%
united\~0.6	764041	0.46	5.22	1034.8%
title	0.45	5.00	1011.1%
united\~0.7	695756	0.59	21.19	3491.5%
title	0.60	19.10	3083.3%
united\~0.8	693134	0.59	21.44	3533.9%
title	0.58	19.55	3270.7%
united\~0.9	692299	57.06	67.80	18.8%
title	55.28	57.87	4.7%

I also ran the same queries through, but this time using the trunk (pre-flex) index with flex, ie to perf test the "flex on pre-flex" emulation layer. This is the initial experience users will see if they upgrade to flex but don't reindex:

Query	Tot hits	Sort	QPS trunk	QPS new
1	591225	68.36	66.91	-2.1%
title	64.12	58.47	-8.8%
1 OR 2	953081	19.35	19.06	-1.5%
title	16.50	16.03	-2.8%
1 OR 2 OR 3	1131679	14.37	14.14	-1.6%
title	12.42	12.11	-2.5%
1 OR 2 OR 3 OR 4	1266805	10.94	11.61	6.1%
title	10.36	10.04	-3.1%
1 AND 2	239303	21.19	21.12	-0.3%
title	22.77	22.46	-1.4%
1 AND 2 AND 3	109513	18.83	18.81	-0.1%
title	19.30	19.29	-0.1%
1 AND 2 AND 3 AND 4	60795	16.21	17.18	6.0%
title	16.75	17.46	4.2%
"united states"	528845	7.54	7.63	1.2%
title	7.36	7.12	-3.3%
"united states of america"	12144	20.64	19.33	-6.3%
title	20.45	19.50	-4.6%
un*	2250238	9.31	9.79	5.2%
title	8.42	9.65	14.6%
*ent	2482701	0.32	0.45	40.6%
title	0.32	0.45	40.6%
u*t	169192	18.53	24.75	33.6%
title	17.26	21.96	27.2%
uni*	1308332	18.54	19.39	4.6%
title	16.28	15.86	-2.6%
un*t	124623	62.13	59.73	-3.9%
title	50.38	48.51	-3.7%
?t	554722	0.51	23.65	4537.3%
title	0.51	21.42	4100.0%
??t	1605437	0.60	5.13	755.0%
title	0.60	4.61	668.3%
???t	3100067	0.54	1.28	137.0%
title	0.53	1.24	134.0%
????t	2973045	0.51	0.55	7.8%
title	0.51	0.54	5.9%
?????t	2323871	0.51	0.29	-43.1%
title	0.50	0.29	-42.0%
??????t	2459025	0.49	0.18	-63.3%
title	0.48	0.21	-56.2%
un?t	86664	92.45	202.48	119.0%
title	72.59	134.55	85.4%
un??t	2860	222.11	187.05	-15.8%
title	220.91	186.81	-15.4%
un???t	5828	117.38	69.30	-41.0%
title	111.47	68.59	-38.5%
un????t	1426	207.03	60.98	-70.5%
title	207.23	60.62	-70.7%
united\~0.5	872873	0.35	0.23	-34.3%
title	0.35	0.23	-34.3%
united\~0.6	764041	0.46	3.84	734.8%
title	0.45	3.76	735.6%
united\~0.7	695756	0.59	17.45	2857.6%
title	0.60	15.53	2488.3%
united\~0.8	693134	0.59	17.56	2876.3%
title	0.58	15.97	2653.4%
united\~0.9	692299	57.06	56.02	-1.8%
title	55.28	49.26	-10.9%

There are alot of numbers to absorb... but here's my take:

Flex is generally faster.
Fuzzy queries and certain wildcard queries (using AutomatonQuery) are insanely faster.
There are certain specific wildcard corner cases where we are slower, but these are likely rarely used in practice (many ?'s followed by a suffix).
Flex API on a trunk index does take a perf hit but it looks contained enough that we don't need to spend any time optimizing that emulation layer...

I also ran an indexing test (index first 10M docs of wikipedia) and flex and trunk had similar times.

I think net/net we are good to land flex!

apache / lucene

Wrapup flexible indexing [LUCENE-2111] #3187

3217

3185

3186

2680