apache / lucene

Apache Lucene open-source search software
https://lucene.apache.org/
Apache License 2.0
2.71k stars 1.04k forks source link

Wrapup flexible indexing [LUCENE-2111] #3187

Closed asfimport closed 14 years ago

asfimport commented 14 years ago

Spinoff from #2532.

The flex branch is in fairly good shape – all tests pass, initial search performance testing looks good, it survived several visits from the Unicode policeman ;)

But it still has a number of nocommits, could use some more scrutiny especially on the "emulate old API on flex index" and vice/versa code paths, and still needs some more performance testing. I'll do these under this issue, and we should open separate issues for other self contained fixes.

The end is in sight!


Migrated from LUCENE-2111 by Michael McCandless (@mikemccand), resolved Apr 13 2010 Attachments: benchUtil.py, flex_backwards_merge_912395.patch, flex_merge_916543.patch, flexBench.py, LUCENE-2111_bytesRef.patch, LUCENE-2111_experimental.patch, LUCENE-2111_fuzzy.patch, LUCENE-2111_mtqNull.patch, LUCENE-2111_mtqTest.patch, LUCENE-2111_toString.patch, LUCENE-2111.patch (versions: 21), LUCENE-2111-EmptyTermsEnum.patch (versions: 2), Make2BTermsIndex.java Linked issues:

asfimport commented 14 years ago

Uwe Schindler (@uschindler) (migrated from JIRA)

Also the current flex branch produces lots of unchecked warnings... The Generics policeman will visit them and will hopefully help fixing!

asfimport commented 14 years ago

Michael McCandless (@mikemccand) (migrated from JIRA)

The Generics policeman will visit them and will hopefully help fixing!

Uh-oh... I sense heavy committing in flex branch's future!

asfimport commented 14 years ago

Karthik K (migrated from JIRA)

What would the branch name for flex indexing ?

asfimport commented 14 years ago

Uwe Schindler (@uschindler) (migrated from JIRA)

I Mike,

maybe I found an issue in LegacyFieldsEnum & Co:

Uh-oh... I sense heavy committing in flex branch's future!

It is not so much, mainly in code added before the final Java 1.5 switch. Should be easy to fix, will look into it in a few days. So the police inspector only has few minor complaints :-)

asfimport commented 14 years ago

Michael McCandless (@mikemccand) (migrated from JIRA)

What would the branch name for flex indexing ?

It's https://svn.apache.org/repos/asf/lucene/java/branches/flex_1458

asfimport commented 14 years ago

Michael McCandless (@mikemccand) (migrated from JIRA)

The LegacyTermsEnum correctly seeks to the first Term/TermRef but as the deprec TermEnum still iterates after the last term, the TermsEnum returns all terms from all later fields, too. So seek() and next() should also do a field==Term.field() comparison and set SeekStatus/return value correctly.

Nice catch – I'll open a new issue & fix.

Also (you see it because missing @Override): The new abstract Enums have no close method / not implement java.io.Closeable, so the underlying old enums are never closed by client code. Shouldn't all the enum classes not also be Closeable, even when the new Codec API current would implement these as a no-op for core classes. But maybe someone creates a codec that needs close?

I actually intentionally left .close() out of all *Enums.

First, to strongly bias impls from doing such costly things that close is necessary. These enums are used in hotspots during searching.

Second, because for Lucene's core impls, close() is [almost?] always a no-op: these impls are very lightweight.

Third, because in Lucene we don't consistently close the enums we pull, today, so we have confusion (there have been posts to java-user about this – "do I need to close the TermEnum/TermDocs"). I'd rather not add a "close" that for all core impls is a no-op and so Lucene doesn't have to call close.

Fourth, because it complicates our impls if we really must close whenever we pull a enum – eg our Scorers pull enums today, but never close them.

asfimport commented 14 years ago

Michael McCandless (@mikemccand) (migrated from JIRA)

Attached patch – will commit soon. I found a bug in the "flex API on non-flex" layer (the preflex codec) – exposed with a new test case in TestBackCompat, and fixed. Also cleaned up some nocommits, added indexDivisor to the loadTermsIndex API, and fixed preflex to actually implement it.

asfimport commented 14 years ago

Michael McCandless (@mikemccand) (migrated from JIRA)

Attached patch, which adds some nice test coverage of the back compat layers. We still need more tests but this is a good step forward...

asfimport commented 14 years ago

Michael McCandless (@mikemccand) (migrated from JIRA)

New patch attached – strengthened the back compat testing, which uncovered some issues in the back compat layers, that I've now fixed.

I ran the FlexTestUtil.verifyFlexVsPreFlex on a 5M doc pre-flex wikipedia index, and a 1M doc flex wikipedia index, with no problems.

I also ran my NRT stress test, starting from a pre-flex 5M wikipedia, and indexing using flex, so this is a good test of mixed pre/post flex segments, with no problems.

Getting closer...

asfimport commented 14 years ago

Michael McCandless (@mikemccand) (migrated from JIRA)

Attached patch, changing oal.index.TermRef -> oal.util.BytesRef.

I think, eventually, we should fix the various places that refer to byte slices, eg Field.get/setBinary*, Payload, UnicodeUtil.UTF8Result, IndexOutput.writeBytes, IndexInput.readBytes, to use BytesRef instead.

asfimport commented 14 years ago

Michael McCandless (@mikemccand) (migrated from JIRA)

Attached patch w/ various fixes:

asfimport commented 14 years ago

Michael McCandless (@mikemccand) (migrated from JIRA)

Attached patch with a major reworking of some parts of flex:

I cutover all codecs to the new API... all tests pass if you switch the default codec (in oal.index.codec.Codecs.getWriter) to any of the four.

asfimport commented 14 years ago

Michael McCandless (@mikemccand) (migrated from JIRA)

New flex patch attached:

This is an important change with flex – the caller now bears responsibility for create a MultiFields if they really need it.

My thinking is that primary places in Lucene that consume postings now operate per-segment, so a multi reader (Dir/MultiReader) should not automatically "join up high" because it entails a hidden performance hit. So consumers that must access the flex API at the multi reader level should be explicit about it...

However, to make this simple, I created a sugar static methods on MultiFields (eg, MultiFields.getFields(IndexReader)) to easily do this, and cutover places in Lucene that may need direct postings from a multi-reader to use this method.

I've updated the javadocs explaining this.

asfimport commented 14 years ago

Robert Muir (@rmuir) (migrated from JIRA)

Mike, here is a patch for removal of fuzzy nocommits:

asfimport commented 14 years ago

Robert Muir (@rmuir) (migrated from JIRA)

btw, i benched that patch with my contrived benchmark for #3165, wierd that flex was slower than trunk before. numbers are stable across many iterations.

unpatched flex patched flex trunk
4362ms 3239ms 3459ms
asfimport commented 14 years ago

Uwe Schindler (@uschindler) (migrated from JIRA)

Mike: I reviewed this EmptyTermsEnum in MTQ. I would leave it in, but simply make EmptyTermsEnum a singleton (which is perfectly fine, because its stateless). Returning null here makes no performance in MTQs, it only makes the code in MTQ#rewrite and MTQWF#getDocIdSet ugly. The biggest problem with returning null here is the backwards layer that must be fixed then (because it checks if getTermsEnum return null and falls back to FilteredTermEnum from trunk). If you really want null, getTermsEnum should per default (if not overriddden) throw UOE and the rewrite code should catch this UOE and only then delegate to backwards layer.

asfimport commented 14 years ago

Uwe Schindler (@uschindler) (migrated from JIRA)

Here the EmptyTermsEnum singleton patch (against flex trunk).

asfimport commented 14 years ago

Michael McCandless (@mikemccand) (migrated from JIRA)

Robert, I think flex was faster before because the previous impl of Multi*Enums was using the same Docs/AndPositionsEnums before. This patch fixes that.

asfimport commented 14 years ago

Uwe Schindler (@uschindler) (migrated from JIRA)

Updated patch for empty TermsEnum. It is now a singleton in TermsEnum class itsself.

asfimport commented 14 years ago

Michael McCandless (@mikemccand) (migrated from JIRA)

New patch looks good Uwe – thanks for re-merging!

asfimport commented 14 years ago

Michael McCandless (@mikemccand) (migrated from JIRA)

Robert, I think flex was faster before because the previous impl of Multi*Enums was using the same Docs/AndPositionsEnums before. This patch fixes that.

Ahh, and also because your patch switches from String to char[], which should improve perf.

Your patch looks good Robert! Thanks.

asfimport commented 14 years ago

Robert Muir (@rmuir) (migrated from JIRA)

Ahh, and also because your patch switches from String to char[], which should improve perf.

actually i didnt apply your LUCENE-2111 when running the benchmark (the improvement is simply the char[]). the test is now actually slightly slower now with the rest of LUCENE-2111

asfimport commented 14 years ago

Robert Muir (@rmuir) (migrated from JIRA)

here is a rough patch, that merges BytesRef/UnicodeUtil

there are some breaks (e.g. binary api compat), but its an internal api.

asfimport commented 14 years ago

Michael McCandless (@mikemccand) (migrated from JIRA)

Patch looks good Robert! Thanks :)

Why not just remove UTF8Result altogether? (ie don't bother deprecating). It's an internal API...

The new method to compute hash is great, saving the extra pass in THPF.

asfimport commented 14 years ago

Robert Muir (@rmuir) (migrated from JIRA)

ok, i ditched UTF8 result entirely, committed in revision 915511

asfimport commented 14 years ago

Robert Muir (@rmuir) (migrated from JIRA)

attached is a patch that changes various exposed apis to use @lucene.experimental

i didnt mess with IndexFileNames as there is an open issue about it right now.

asfimport commented 14 years ago

Robert Muir (@rmuir) (migrated from JIRA)

these tags are added in revision 915791.

asfimport commented 14 years ago

Michael McCandless (@mikemccand) (migrated from JIRA)

Attached patch, fixing some more nocommits, and renaming BytesRef.toString -> BytesRef.utf8ToString.

asfimport commented 14 years ago

Robert Muir (@rmuir) (migrated from JIRA)

Here is a few more toString -> utf8ToString. will look at the backwards tests now

asfimport commented 14 years ago

Robert Muir (@rmuir) (migrated from JIRA)

a few more easy nocommits

asfimport commented 14 years ago

Michael McCandless (@mikemccand) (migrated from JIRA)

Attached patch, fixes flex APIs to not return null (instead return .EMPTY objects).

asfimport commented 14 years ago

Robert Muir (@rmuir) (migrated from JIRA)

patch for review of flex merge

asfimport commented 14 years ago

Robert Muir (@rmuir) (migrated from JIRA)

patch for review, backwards tests merge to flex

asfimport commented 14 years ago

Michael McCandless (@mikemccand) (migrated from JIRA)

Cuts over to returning .EMPTY instead of null when requesting enums.

Also disallows Multi/DirReader.getDeletedDocs in favor of static convenience method MultiFields.getDeletedDocs.

But... we now need a way to determine that a codec does not store positions. I [illegally, adding nocommits] had to add a few places where I check the return result against .EMPTY.

asfimport commented 14 years ago

Michael McCandless (@mikemccand) (migrated from JIRA)

But... we now need a way to determine that a codec does not store positions.

Thinking more about this... I think we should switch back to a null return from .docsAndPositionsEnum if the codec doesn't support positions. We only return .EMPTY if the enum is really just empty.

asfimport commented 14 years ago

Robert Muir (@rmuir) (migrated from JIRA)

Patch for MTQ to not use null in getTermsEnum for backwards compat, so null can have some other meaning. Instead it uses VirtualMethod, with the default implementatinos throwing UOE.

asfimport commented 14 years ago

Robert Muir (@rmuir) (migrated from JIRA)

Uwe asked for a test for the MTQ back compat... attached is one. I think it looks kinda dumb but if its useful, I'll commit it.

asfimport commented 14 years ago

Michael McCandless (@mikemccand) (migrated from JIRA)

Changed .iterator() to never return null, MTQ.getTermsEnum() to never return null, but IR.fields() and Fields.terms(String field), and .docs/.docsAndPositions can return null.

Also whittled down more nocommits – down to 53 now!

asfimport commented 14 years ago

Michael McCandless (@mikemccand) (migrated from JIRA)

Back compat test for MTQ looks good Robert... thanks!

asfimport commented 14 years ago

Michael McCandless (@mikemccand) (migrated from JIRA)

Down to 15 nocommits!

asfimport commented 14 years ago

Michael McCandless (@mikemccand) (migrated from JIRA)

Same patch, just fixes the Java 1.6 only changes (adding @Override to interface).

asfimport commented 14 years ago

Michael McCandless (@mikemccand) (migrated from JIRA)

New rev, just a few changes:

I'll commit (on flex branch) sometime today. Making me nervous carrying such a large patch...

asfimport commented 14 years ago

Michael McCandless (@mikemccand) (migrated from JIRA)

More nocommit reductions and other fixes:

asfimport commented 14 years ago

Michael McCandless (@mikemccand) (migrated from JIRA)

Forgot to add SegmentReadState.java

asfimport commented 14 years ago

Michael McCandless (@mikemccand) (migrated from JIRA)

Also adds reuse to pre-flex API when getting a new TermsEnum.

asfimport commented 14 years ago

Michael McCandless (@mikemccand) (migrated from JIRA)

Thanks Shai!

asfimport commented 14 years ago

Michael McCandless (@mikemccand) (migrated from JIRA)

Duh – wrong issue! I only wish.... ;)

asfimport commented 14 years ago

Michael McCandless (@mikemccand) (migrated from JIRA)

Attached patch, eliminating all remaining nocommits on flex branch! I turned most of them into TODOs :)

asfimport commented 14 years ago

Michael McCandless (@mikemccand) (migrated from JIRA)

I'm benchmarking flex vs trunk, but uncovered a strange performance loss with WildcardQuery. I'm attaching the python wrapper around contrib/benchmark that I'm using. Hopefully this is something silly...

You have to edit flexBench.py, specificaly the TRUNK_DIR and FLEX_DIR must point to the .../contrib/benchmark of each source area, and you have to edit the WIKI_LINE_FILE and/or WIKI_FILE (I think WIKI_LINE_FILE can be None in which case it should (but I haven't tested recently!) fallback to parsing the .xml.bz2 wikipedia export).

I'll first build an index of the first 5M wikipedia docs, once for flex and once for trunk, and then run the test queries. It also tests the "flex API on trunk index" case, to test perf of the flex emulation layer... this layer is looking a bit slowish now but I'm not sure how much we can do to speed it up...

Run like this:

python -u flexBench.py -run test

I have it set to only test only the wildcard query uni*t right now... and I'm getting this result:

JAVA:
java version "1.6.0_17"
Java(TM) SE Runtime Environment (build 1.6.0_17-b04)
Java HotSpot(TM) 64-Bit Server VM (build 14.3-b01, mixed mode)

OS:
Linux centos 2.6.18-164.6.1.el5 `#1` SMP Tue Nov 3 16:12:36 EST 2009 x86_64 x86_64 x86_64 GNU/Linux

Index /x/lucene/flex.work.wiki.nd5M already exists...
Index /x/lucene/trunk.work.wiki.nd5M already exists...
Index /x/lucene/flex.work.random.nd5M already exists...
Index /x/lucene/trunk.work.random.nd5M already exists...

RUN: source=wiki query=un*t sort=None
  run trunk...
    cd /root/src/clean/lucene/contrib/benchmark
    log: /root/src/clean/lucene/contrib/benchmark/logs/trunk.0
    62.49 QPS
  run flex on trunk index...
    cd /root/src/flex.clean/contrib/benchmark
    log: /root/src/flex.clean/contrib/benchmark/logs/flexOnTrunk.1
    25.87 QPS [-58.6% worse]
  run flex on flex index...
    cd /root/src/flex.clean/contrib/benchmark
    log: /root/src/flex.clean/contrib/benchmark/logs/flexOnFlex.2
    39.30 QPS [-37.1% worse]
  124623 hits

Other queries I've tested look OK so far...

asfimport commented 14 years ago

Michael McCandless (@mikemccand) (migrated from JIRA)

Towards wrapping up flex, I ran a set of tests to benchmark flex's search performance vs trunk.

All tests are on a 5M doc Wikipedia index, best qps of 5 runs where each run runs the query for 5.0 seconds. Env is:

JAVA:
java version "1.6.0_17"
Java(TM) SE Runtime Environment (build 1.6.0_17-b04)
Java HotSpot(TM) 64-Bit Server VM (build 14.3-b01, mixed mode)

OS:
Linux centos 2.6.18-164.6.1.el5 `#1` SMP Tue Nov 3 16:12:36 EST 2009 x86_64 x86_64 x86_64 GNU/Linux

First table compares trunk against "flex on flex", ie, a flex index (fully reindexed after upgrading to flex):

Query Tot hits Sort QPS trunk QPS new Pct change
1 591225 68.36 80.64 18.0%
title 64.12 68.53 6.9%
1 OR 2 953081 19.35 20.80 7.5%
title 16.50 17.48 5.9%
1 OR 2 OR 3 1131679 14.37 15.50 7.9%
title 12.42 13.26 6.8%
1 OR 2 OR 3 OR 4 1266805 10.94 12.76 16.6%
title 10.36 11.05 6.7%
1 AND 2 239303 21.19 22.32 5.3%
title 22.77 24.25 6.5%
1 AND 2 AND 3 109513 18.83 19.17 1.8%
title 19.30 20.06 3.9%
1 AND 2 AND 3 AND 4 60795 16.21 17.51 8.0%
title 16.75 18.29 9.2%
"united states" 528845 7.54 8.54 13.3%
title 7.36 8.14 10.6%
"united states of america" 12144 20.64 21.48 4.1%
title 20.45 21.06 3.0%
un* 2250238 9.31 11.54 24.0%
title 8.42 10.96 30.2%
*ent 2482701 0.32 0.92 187.5%
title 0.32 0.91 184.4%
u*t 169192 18.53 47.97 158.9%
title 17.26 40.10 132.3%
uni* 1308332 18.54 23.49 26.7%
title 16.28 20.02 23.0%
un*t 124623 62.13 105.23 69.4%
title 50.38 74.99 48.8%
?t 554722 0.51 29.31 5647.1%
title 0.51 26.25 5047.1%
??t 1605437 0.60 6.69 1015.0%
title 0.60 6.22 936.7%
???t 3100067 0.54 1.92 255.6%
title 0.53 1.89 256.6%
????t 2973045 0.51 0.71 39.2%
title 0.51 0.70 37.3%
?????t 2323871 0.51 0.39 -23.5%
title 0.50 0.39 -22.0%
??????t 2459025 0.49 0.31 -36.7%
title 0.48 0.15 -68.7%
un?t 86664 92.45 241.46 161.2%
title 72.59 151.28 108.4%
un??t 2860 222.11 408.52 83.9%
title 220.91 405.84 83.7%
un???t 5828 117.38 99.64 -15.1%
title 111.47 98.64 -11.5%
un????t 1426 207.03 100.60 -51.4%
title 207.23 101.36 -51.1%
united\~0.5 872873 0.35 0.31 -11.4%
title 0.35 0.31 -11.4%
united\~0.6 764041 0.46 5.22 1034.8%
title 0.45 5.00 1011.1%
united\~0.7 695756 0.59 21.19 3491.5%
title 0.60 19.10 3083.3%
united\~0.8 693134 0.59 21.44 3533.9%
title 0.58 19.55 3270.7%
united\~0.9 692299 57.06 67.80 18.8%
title 55.28 57.87 4.7%

I also ran the same queries through, but this time using the trunk (pre-flex) index with flex, ie to perf test the "flex on pre-flex" emulation layer. This is the initial experience users will see if they upgrade to flex but don't reindex:

Query Tot hits Sort QPS trunk QPS new Pct change
1 591225 68.36 66.91 -2.1%
title 64.12 58.47 -8.8%
1 OR 2 953081 19.35 19.06 -1.5%
title 16.50 16.03 -2.8%
1 OR 2 OR 3 1131679 14.37 14.14 -1.6%
title 12.42 12.11 -2.5%
1 OR 2 OR 3 OR 4 1266805 10.94 11.61 6.1%
title 10.36 10.04 -3.1%
1 AND 2 239303 21.19 21.12 -0.3%
title 22.77 22.46 -1.4%
1 AND 2 AND 3 109513 18.83 18.81 -0.1%
title 19.30 19.29 -0.1%
1 AND 2 AND 3 AND 4 60795 16.21 17.18 6.0%
title 16.75 17.46 4.2%
"united states" 528845 7.54 7.63 1.2%
title 7.36 7.12 -3.3%
"united states of america" 12144 20.64 19.33 -6.3%
title 20.45 19.50 -4.6%
un* 2250238 9.31 9.79 5.2%
title 8.42 9.65 14.6%
*ent 2482701 0.32 0.45 40.6%
title 0.32 0.45 40.6%
u*t 169192 18.53 24.75 33.6%
title 17.26 21.96 27.2%
uni* 1308332 18.54 19.39 4.6%
title 16.28 15.86 -2.6%
un*t 124623 62.13 59.73 -3.9%
title 50.38 48.51 -3.7%
?t 554722 0.51 23.65 4537.3%
title 0.51 21.42 4100.0%
??t 1605437 0.60 5.13 755.0%
title 0.60 4.61 668.3%
???t 3100067 0.54 1.28 137.0%
title 0.53 1.24 134.0%
????t 2973045 0.51 0.55 7.8%
title 0.51 0.54 5.9%
?????t 2323871 0.51 0.29 -43.1%
title 0.50 0.29 -42.0%
??????t 2459025 0.49 0.18 -63.3%
title 0.48 0.21 -56.2%
un?t 86664 92.45 202.48 119.0%
title 72.59 134.55 85.4%
un??t 2860 222.11 187.05 -15.8%
title 220.91 186.81 -15.4%
un???t 5828 117.38 69.30 -41.0%
title 111.47 68.59 -38.5%
un????t 1426 207.03 60.98 -70.5%
title 207.23 60.62 -70.7%
united\~0.5 872873 0.35 0.23 -34.3%
title 0.35 0.23 -34.3%
united\~0.6 764041 0.46 3.84 734.8%
title 0.45 3.76 735.6%
united\~0.7 695756 0.59 17.45 2857.6%
title 0.60 15.53 2488.3%
united\~0.8 693134 0.59 17.56 2876.3%
title 0.58 15.97 2653.4%
united\~0.9 692299 57.06 56.02 -1.8%
title 55.28 49.26 -10.9%

There are alot of numbers to absorb... but here's my take:

I also ran an indexing test (index first 10M docs of wikipedia) and flex and trunk had similar times.

I think net/net we are good to land flex!