Wrapup flexible indexing [LUCENE-2111]

asfimport commented 14 years ago

Spinoff from #2532.

The flex branch is in fairly good shape – all tests pass, initial search performance testing looks good, it survived several visits from the Unicode policeman ;)

But it still has a number of nocommits, could use some more scrutiny especially on the "emulate old API on flex index" and vice/versa code paths, and still needs some more performance testing. I'll do these under this issue, and we should open separate issues for other self contained fixes.

The end is in sight!

Migrated from LUCENE-2111 by Michael McCandless (@mikemccand), resolved Apr 13 2010 Attachments: benchUtil.py, flex_backwards_merge_912395.patch, flex_merge_916543.patch, flexBench.py, LUCENE-2111_bytesRef.patch, LUCENE-2111_experimental.patch, LUCENE-2111_fuzzy.patch, LUCENE-2111_mtqNull.patch, LUCENE-2111_mtqTest.patch, LUCENE-2111_toString.patch, LUCENE-2111.patch (versions: 21), LUCENE-2111-EmptyTermsEnum.patch (versions: 2), Make2BTermsIndex.java Linked issues:

3217
- 3185
- 3186
- 2680

asfimport commented 14 years ago

Robert Muir (@rmuir) (migrated from JIRA)

I think net/net we are good to land flex!

+1. The tests have been passing for some time now, and Solr tests pass too.

It would be nice to look at merging flex into the trunk soon so that it gets more exposure.

asfimport commented 14 years ago

Michael McCandless (@mikemccand) (migrated from JIRA)

Small fixes for flex – fixes SpanTermQuerty to throw exc if it's run on a field that omitTFAPs (matches PhraseQuery), fixes all jdoc warnings, spells out back compat breaks in changes.

asfimport commented 14 years ago

Michael Busch (migrated from JIRA)

Flex is generally faster.

Awesome work! What changes make those queries run faster with the default codec? Mostly terms dict changes and automaton for fuzzy/wildcard?

How's the indexing performance?

I think net/net we are good to land flex!

+1! Even if there are still small things to change/fix I think it makes sense to merge with trunk now.

asfimport commented 14 years ago

Robert Muir (@rmuir) (migrated from JIRA)

There are certain specific wildcard corner cases where we are slower, but these are likely rarely used in practice (many ?'s followed by a suffix).

I think it would be good to fix this in the future, but I certainly think its a rare case. The problem is similar to where an SQL engine decides to just table-scan instead of using a btree index... In this case we are trying to be too smart and just seek to the correct term based on the query instead of scanning, but this causes too many seeks.

At the same time, you have to be careful or you make the wrong decision and give O(n) performance instead of O(log n).

In my opinion it would be better to think in the future how we can improve lucene in the following ways:

The term dictionary should be more "DFA-friendly", e.g. the whole concept of TermsEnum is wrong, linear enumeration of terms is inefficient for any big index. we should get away from it.
Instead it would be nice to think of the index like an FST, and instead of enumerating things and filtering them, we provide a DFA and enumerate the transduced results.
We need to eliminate the UTF-8/UTF-16 impedence mismatch which causes so much complication and unnecessary hairy code today.

All this being said, I think flex is a great move forward for multitermqueries, at least we have a seeking-friendly API! One step at a time.

asfimport commented 14 years ago

Michael McCandless (@mikemccand) (migrated from JIRA)

Awesome work! What changes make those queries run faster with the default codec? Mostly terms dict changes and automaton for fuzzy/wildcard?

The AutomatonQuery (for fuzzy/wildcard) gives the biggest gains :) Other MTQs (prefix) see gains I think because of more efficient terms enum. The TermQuery speedup surprises me – that can't be a terms dict thing (just one lookup); i'm not sure offhand why it's faster. That code is not very different than trunk.

How's the indexing performance?

Unchanged – I indexed first 10M docs of wikipedia and the times were nearly identical.

asfimport commented 14 years ago

Michael McCandless (@mikemccand) (migrated from JIRA)

The term dictionary should be more "DFA-friendly", e.g. the whole concept of TermsEnum is wrong, linear enumeration of terms is inefficient for any big index. we should get away from it. Instead it would be nice to think of the index like an FST, and instead of enumerating things and filtering them, we provide a DFA and enumerate the transduced results. We need to eliminate the UTF-8/UTF-16 impedence mismatch which causes so much complication and unnecessary hairy code today.

+1 – we already see these limitations now in making AutomatonQuery consume the straight enum. If we flipped the problem around (you pass a DFA to the codec and it does the intersection & enums the result), and we used byte-based DFAs, I think we'd get a good speedup.

asfimport commented 14 years ago

Michael McCandless (@mikemccand) (migrated from JIRA)

Flex has some trouble making > 2B terms – attached patch creates such an index. I'm still getting to the bottom of it...

asfimport commented 14 years ago

Michael McCandless (@mikemccand) (migrated from JIRA)

Patch fixes standard codec's terms dict to handle > 2B terms; I also strengthened CheckIndex to verify that .ord() of the TermsEnum always returns the right result, for codecs that implement .ord. I'll commit shortly...

apache / lucene

Wrapup flexible indexing [LUCENE-2111] #3187

3217

3185

3186

2680