optimize MultiTermEnum/MultiTermDocs [LUCENE-1596]

asfimport commented 15 years ago

Optimize MultiTermEnum and MultiTermDocs to avoid seeks on TermDocs that don't match the term.

Migrated from LUCENE-1596 by Yonik Seeley (@yonik), resolved May 15 2009 Attachments: LUCENE-1596.patch

asfimport commented 15 years ago

Yonik Seeley (@yonik) (migrated from JIRA)

Attaching optimization patch. Results up front: random seeks to common terms with term enumerator: 58% improvement full iteration over all docs matching relatively unique terms: 1595% improvement

The optimizations:

MultiTermEnum keeps track of which segments match... if termDocs.seek(termEnum) is used, then MultiTermDocs will only visit segments that matched the term.
MultiTermEnum defers calling next() on sub enumerators until needed. This allows MultiTermDocs to use the faster seek(enum) since the enumerator is still on the correct term. This also avoids unnecessary calls to next() that may never be used, as well as unnecessary insertions into the priority queue. Using seek(enum) in the sub TermDocs also allows cascading of these optimizations (in the event that one has a MultiReader of MultiReaders).

Test index: this was obviously stacked to show best-case performance for these optimizations. 999,999 documents with maxBufferedDocs=10, resulting in 46 segments. The full iteration test used relatively unique terms (1 or 2 docs matching each), and the random seeks test used very common terms (if rare terms are used in this test, the initial seek dominates and swamps any improvement from the deferral of calls to next().)

asfimport commented 15 years ago

Paul Elschot (migrated from JIRA)

Do I interpret correctly that getting the docIds for terms that are (almost) primary keys on a non optimized index will become a lot faster with this patch?

asfimport commented 15 years ago

Yonik Seeley (@yonik) (migrated from JIRA)

Yes, if you are doing low level stuff directly on the MultiReader, like using TermEnum/TermDocs. Or calling Filter.getBits(multiReader). As you probably know, if you pass Filters or Queries to an IndexSearcher, it's now dropping down to the segment level already (a scorer is created per-segment) so it won't hit the MultiTermEnum code.

asfimport commented 15 years ago

Michael McCandless (@mikemccand) (migrated from JIRA)

Patch look good Yonik!

asfimport commented 15 years ago

Yonik Seeley (@yonik) (migrated from JIRA)

Hmmm, unfortunately, this patch does nothing for RangeFilter.getDocIdSet(MultiReader), as the term enumerator used there is no longer a MultiTermEnum.

asfimport commented 15 years ago

Uwe Schindler (@uschindler) (migrated from JIRA)

Really? It's just a FilteredTermEnum on top of the TermEnum returned by MultiReader - so even when the filtered enum filters some terms, it should use the MultiTermEnum methods to iterate the terms.

asfimport commented 15 years ago

Yonik Seeley (@yonik) (migrated from JIRA)

It's just a FilteredTermEnum on top of the TermEnum returned by MultiReader

Yes, but the MultiTermEnum has the internal knowledge about what segments matched the term, and actually has the sub-TermEnums positioned on that term.

asfimport commented 15 years ago

Uwe Schindler (@uschindler) (migrated from JIRA)

Yes I understand, the problematic call is TermDocs.seek(TermEnum) which does an instanceof check. If it is a FilteredTermEnum or anything other, which is wrapped, the optimization is not used.

asfimport commented 15 years ago

Yonik Seeley (@yonik) (migrated from JIRA)

Getting back to this... although this unfortunately won't currently help classes like RangeFilter used directly on a MultiReader because a MultiTermEnum is no longer used, I still think this is worth committing as-is for users of MultiTermEnum.

Thoughts/objections?

asfimport commented 15 years ago

Michael McCandless (@mikemccand) (migrated from JIRA)

although this unfortunately won't currently help classes like RangeFilter used directly on a MultiReader because a MultiTermEnum is no longer used, I still think this is worth committing as-is for users of MultiTermEnum.

+1

It'd be great to somehow have the optimization apply to RangeFilter as well, though since Lucene has now moved to per-segment searching, it's low priority.

asfimport commented 15 years ago

Yonik Seeley (@yonik) (migrated from JIRA)

Committed.

asfimport commented 15 years ago

Michael McCandless (@mikemccand) (migrated from JIRA)

I'm seeing this new AIOOBE when tracking down the intermittent failure in TestStressIndexing2:

1) testRandomIWReader(org.apache.lucene.index.TestStressIndexing2)
java.lang.ArrayIndexOutOfBoundsException: 6
    at org.apache.lucene.index.MultiSegmentReader$MultiTermDocs.next(MultiSegmentReader.java:672)
    at org.apache.lucene.index.TestStressIndexing2.verifyEquals(TestStressIndexing2.java:292)
    at org.apache.lucene.index.TestStressIndexing2.verifyEquals(TestStressIndexing2.java:250)
    at org.apache.lucene.index.TestStressIndexing2.testRandomIWReader(TestStressIndexing2.java:67)
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
    at java.lang.reflect.Method.invoke(Method.java:585)
    at junit.framework.TestCase.runTest(TestCase.java:168)
    at org.apache.lucene.util.LuceneTestCase.runTest(LuceneTestCase.java:88)
    at junit.framework.TestCase.runBare(TestCase.java:134)
    at junit.framework.TestResult$1.protect(TestResult.java:110)
    at junit.framework.TestResult.runProtected(TestResult.java:128)
    at junit.framework.TestResult.run(TestResult.java:113)
    at junit.framework.TestCase.run(TestCase.java:124)
    at junit.framework.TestSuite.runTest(TestSuite.java:232)
    at junit.framework.TestSuite.run(TestSuite.java:227)
    at org.junit.internal.runners.JUnit38ClassRunner.run(JUnit38ClassRunner.java:81)
    at org.junit.internal.runners.CompositeRunner.runChildren(CompositeRunner.java:33)
    at org.junit.internal.runners.CompositeRunner.run(CompositeRunner.java:28)
    at org.junit.runner.JUnitCore.run(JUnitCore.java:130)
    at org.junit.runner.JUnitCore.run(JUnitCore.java:109)
    at org.junit.runner.JUnitCore.run(JUnitCore.java:100)
    at org.junit.runner.JUnitCore.runMain(JUnitCore.java:81)
    at org.junit.runner.JUnitCore.main(JUnitCore.java:44)

I think it's because this optimization isn't admissible in the case when one calls MultiTermDocs.seek on a MultiTermEnum derived from a different MultiSegmentReader. Ie, I think there needs to be another check that verifies in fact the MultiTermEnum passed to MultiTermDocs.seek share the same MultiSegmentReader?

Before this optimiztion, this was OK (only the term was used from the MultiTermEnum).

asfimport commented 15 years ago

Yonik Seeley (@yonik) (migrated from JIRA)

Gah... I forgot it was permissible (or at least not disallowed) to pass an Enum not derived from the same reader. I'll fix.

asfimport commented 15 years ago

Yonik Seeley (@yonik) (migrated from JIRA)

I just committed the fix (since trunk was broken) and a test that failed w/o the fix, but if anyone has a better idea how to handle/fix, we can certainly still discuss. I just did the obvious - store the multi-reader in the TermEnum and TermDocs instances and compare in TermDocs.seek(TermEnum)

apache / lucene

optimize MultiTermEnum/MultiTermDocs [LUCENE-1596] #2670