Closed asfimport closed 15 years ago
Yonik Seeley (@yonik) (migrated from JIRA)
Attaching optimization patch. Results up front: random seeks to common terms with term enumerator: 58% improvement full iteration over all docs matching relatively unique terms: 1595% improvement
The optimizations:
Test index: this was obviously stacked to show best-case performance for these optimizations. 999,999 documents with maxBufferedDocs=10, resulting in 46 segments. The full iteration test used relatively unique terms (1 or 2 docs matching each), and the random seeks test used very common terms (if rare terms are used in this test, the initial seek dominates and swamps any improvement from the deferral of calls to next().)
Paul Elschot (migrated from JIRA)
Do I interpret correctly that getting the docIds for terms that are (almost) primary keys on a non optimized index will become a lot faster with this patch?
Yonik Seeley (@yonik) (migrated from JIRA)
Yes, if you are doing low level stuff directly on the MultiReader, like using TermEnum/TermDocs. Or calling Filter.getBits(multiReader). As you probably know, if you pass Filters or Queries to an IndexSearcher, it's now dropping down to the segment level already (a scorer is created per-segment) so it won't hit the MultiTermEnum code.
Michael McCandless (@mikemccand) (migrated from JIRA)
Patch look good Yonik!
Yonik Seeley (@yonik) (migrated from JIRA)
Hmmm, unfortunately, this patch does nothing for RangeFilter.getDocIdSet(MultiReader), as the term enumerator used there is no longer a MultiTermEnum.
Uwe Schindler (@uschindler) (migrated from JIRA)
Really? It's just a FilteredTermEnum on top of the TermEnum returned by MultiReader - so even when the filtered enum filters some terms, it should use the MultiTermEnum methods to iterate the terms.
Yonik Seeley (@yonik) (migrated from JIRA)
It's just a FilteredTermEnum on top of the TermEnum returned by MultiReader
Yes, but the MultiTermEnum has the internal knowledge about what segments matched the term, and actually has the sub-TermEnums positioned on that term.
Uwe Schindler (@uschindler) (migrated from JIRA)
Yes I understand, the problematic call is TermDocs.seek(TermEnum) which does an instanceof check. If it is a FilteredTermEnum or anything other, which is wrapped, the optimization is not used.
Yonik Seeley (@yonik) (migrated from JIRA)
Getting back to this... although this unfortunately won't currently help classes like RangeFilter used directly on a MultiReader because a MultiTermEnum is no longer used, I still think this is worth committing as-is for users of MultiTermEnum.
Thoughts/objections?
Michael McCandless (@mikemccand) (migrated from JIRA)
although this unfortunately won't currently help classes like RangeFilter used directly on a MultiReader because a MultiTermEnum is no longer used, I still think this is worth committing as-is for users of MultiTermEnum.
+1
It'd be great to somehow have the optimization apply to RangeFilter as well, though since Lucene has now moved to per-segment searching, it's low priority.
Yonik Seeley (@yonik) (migrated from JIRA)
Committed.
Michael McCandless (@mikemccand) (migrated from JIRA)
I'm seeing this new AIOOBE when tracking down the intermittent failure in TestStressIndexing2:
1) testRandomIWReader(org.apache.lucene.index.TestStressIndexing2)
java.lang.ArrayIndexOutOfBoundsException: 6
at org.apache.lucene.index.MultiSegmentReader$MultiTermDocs.next(MultiSegmentReader.java:672)
at org.apache.lucene.index.TestStressIndexing2.verifyEquals(TestStressIndexing2.java:292)
at org.apache.lucene.index.TestStressIndexing2.verifyEquals(TestStressIndexing2.java:250)
at org.apache.lucene.index.TestStressIndexing2.testRandomIWReader(TestStressIndexing2.java:67)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
at java.lang.reflect.Method.invoke(Method.java:585)
at junit.framework.TestCase.runTest(TestCase.java:168)
at org.apache.lucene.util.LuceneTestCase.runTest(LuceneTestCase.java:88)
at junit.framework.TestCase.runBare(TestCase.java:134)
at junit.framework.TestResult$1.protect(TestResult.java:110)
at junit.framework.TestResult.runProtected(TestResult.java:128)
at junit.framework.TestResult.run(TestResult.java:113)
at junit.framework.TestCase.run(TestCase.java:124)
at junit.framework.TestSuite.runTest(TestSuite.java:232)
at junit.framework.TestSuite.run(TestSuite.java:227)
at org.junit.internal.runners.JUnit38ClassRunner.run(JUnit38ClassRunner.java:81)
at org.junit.internal.runners.CompositeRunner.runChildren(CompositeRunner.java:33)
at org.junit.internal.runners.CompositeRunner.run(CompositeRunner.java:28)
at org.junit.runner.JUnitCore.run(JUnitCore.java:130)
at org.junit.runner.JUnitCore.run(JUnitCore.java:109)
at org.junit.runner.JUnitCore.run(JUnitCore.java:100)
at org.junit.runner.JUnitCore.runMain(JUnitCore.java:81)
at org.junit.runner.JUnitCore.main(JUnitCore.java:44)
I think it's because this optimization isn't admissible in the case when one calls MultiTermDocs.seek on a MultiTermEnum derived from a different MultiSegmentReader. Ie, I think there needs to be another check that verifies in fact the MultiTermEnum passed to MultiTermDocs.seek share the same MultiSegmentReader?
Before this optimiztion, this was OK (only the term was used from the MultiTermEnum).
Yonik Seeley (@yonik) (migrated from JIRA)
Gah... I forgot it was permissible (or at least not disallowed) to pass an Enum not derived from the same reader. I'll fix.
Yonik Seeley (@yonik) (migrated from JIRA)
I just committed the fix (since trunk was broken) and a test that failed w/o the fix, but if anyone has a better idea how to handle/fix, we can certainly still discuss. I just did the obvious - store the multi-reader in the TermEnum and TermDocs instances and compare in TermDocs.seek(TermEnum)
Optimize MultiTermEnum and MultiTermDocs to avoid seeks on TermDocs that don't match the term.
Migrated from LUCENE-1596 by Yonik Seeley (@yonik), resolved May 15 2009 Attachments: LUCENE-1596.patch