apache / lucene

Apache Lucene open-source search software
https://lucene.apache.org/
Apache License 2.0

Add auto-prefix terms to block tree terms dict [LUCENE-5879] #6941

Closed asfimport closed 9 years ago

asfimport commented 10 years ago

This cool idea to generalize numeric/trie fields came from Adrien:

Today, when we index a numeric field (LongField, etc.) we pre-compute (via NumericTokenStream) outside of indexer/codec which prefix terms should be indexed.

But this can be inefficient: you set a static precisionStep, and always add those prefix terms regardless of how the terms in the field are actually distributed. Yet typically in real-world applications the terms have a non-random distribution.
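To illustrate the static scheme, here is a minimal sketch in plain Java (no Lucene dependency). It assumes the usual trie layout where one term is emitted per shift in steps of precisionStep. The class and method names are hypothetical, and the sketch skips the sortable-bits encoding that the real NumericTokenStream/NumericUtils apply.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical illustration only; not Lucene's NumericTokenStream.
final class StaticPrefixTermsSketch {

    /** Returns "shift:prefixBits" labels for every term indexed for one long value. */
    static List<String> prefixTerms(long value, int precisionStep) {
        List<String> terms = new ArrayList<>();
        // One term per shift, no matter how the field's values are distributed.
        for (int shift = 0; shift < 64; shift += precisionStep) {
            terms.add(shift + ":" + Long.toHexString(value >>> shift));
        }
        return terms;
    }

    public static void main(String[] args) {
        // precisionStep=16 yields 4 terms per value, precisionStep=4 yields 16.
        System.out.println(prefixTerms(0x1234L, 16));
        System.out.println(prefixTerms(0x1234L, 4).size());
    }
}
```

The number of terms per value depends only on precisionStep, which is the inefficiency described above.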

So, it should be better if instead the terms dict decides where it makes sense to insert prefix terms, based on how dense the terms are in each region of term space.
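A rough sketch of the density idea, in plain Java: count how many terms share each proper prefix and keep only the prefixes that cover enough terms. This is a simplified stand-in, not the algorithm in the patch. The names `choosePrefixes` and `minItemsInPrefix` are assumptions made for illustration, and the input terms are assumed to be distinct.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical illustration only; the real block tree terms dict works differently.
final class DensityPrefixSketch {

    /** Returns, in sorted order, every non-empty proper prefix shared by at least minItemsInPrefix terms. */
    static List<String> choosePrefixes(List<String> terms, int minItemsInPrefix) {
        Map<String, Integer> counts = new TreeMap<>();
        for (String term : terms) {
            // Count each non-empty proper prefix of the term once.
            for (int len = 1; len < term.length(); len++) {
                counts.merge(term.substring(0, len), 1, Integer::sum);
            }
        }
        List<String> chosen = new ArrayList<>();
        for (Map.Entry<String, Integer> e : counts.entrySet()) {
            if (e.getValue() >= minItemsInPrefix) {
                chosen.add(e.getKey()); // dense region: worth a prefix term
            }
        }
        return chosen;
    }

    public static void main(String[] args) {
        List<String> terms = Arrays.asList("10", "11", "12", "13", "20", "300");
        // Only the "1" region is dense enough when at least 3 terms are required.
        System.out.println(choosePrefixes(terms, 3));
    }
}
```

Sparse regions such as "2" and "30" get no prefix term here, so the index only pays for prefixes where they can save work at query time.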

This way we can speed up both term range queries (e.g. for the infix suggester) and numeric range queries, and it should also let us use less index space.

This would also mean that min/maxTerm for a numeric field would now be correct, vs today where the externally computed prefix terms are placed after the full precision terms, causing hairy code like NumericUtils.getMaxInt/Long. So optimizations like #6922 become feasible.

The terms dict can also do tricks not possible if you must live on top of its APIs, e.g. to handle the adversarial/over-constrained case when a given prefix has too many terms following it but finer prefixes have too few (what block tree calls "floor term blocks").


Migrated from LUCENE-5879 by Michael McCandless (@mikemccand), 2 votes, resolved Apr 07 2015
Attachments: LUCENE-5879.patch (versions: 14)
Linked issues:

asfimport commented 9 years ago

Robert Muir (@rmuir) (migrated from JIRA)

> I.e., with the patch as it is now, PFs like SimpleText will use a PrefixTermsEnum for PrefixQuery, but if I fix PrefixQuery to subclass AutomatonQuery (and remove AUTOMATON_TYPE.PREFIX) then SimpleText would use AutomatonTermsEnum (on a prefix automaton) which I think will be somewhat less efficient? Maybe it's not so bad in practice? ATE would realize it's in a "linear" part of the automaton...

We cannot continue writing code in this way.

Please let intersect take care of how to intersect and get this shit out of the Query. The default Terms.intersect() method can specialize the PREFIX case with a PrefixTermsEnum if it is faster.

asfimport commented 9 years ago

Michael McCandless (@mikemccand) (migrated from JIRA)

> We cannot continue writing code in this way.
>
> Please let intersect take care of how to intersect and get this shit out of the Query. The default Terms.intersect() method can specialize the PREFIX case with a PrefixTermsEnum if it is faster.

Can you maybe be more specific? I'm having trouble following exactly what you're objecting to.

Terms.intersect default impl is already specializing to PrefixTermsEnum in the patch.

You don't want the added ctor that takes a prefix term in CompiledAutomaton but you are OK with PREFIX/RANGE in CA.AUTOMATON_TYPE?

If I 1) remove the added ctor that takes the prefix term in CA, and 2) fix PrefixQuery to subclass AutomatonQuery (meaning CA must "autodetect" when it receives a prefix automaton), would that address your concerns? Or something else...?

I still wonder if just using AutomatonTermsEnum for prefix/range will be fine. Then we don't need PREFIX nor RANGE in CA.AUTOMATON_TYPE.

I'll open a separate issue for this...

asfimport commented 9 years ago

Robert Muir (@rmuir) (migrated from JIRA)

Yes, everything you propose makes sense (especially the last point you make, that would be fantastic!)

At a high level, I just feel that this is the second use case where the codec can do "special" stuff with intersect, so we should be removing these specializations in our code and just passing the structure to the codec. I do realize this is already messy in trunk, but I think we need to remove a lot of this complexity.

At the very least I think PrefixQuery shouldn't be a "backdoor automaton query" :)

asfimport commented 9 years ago

Michael McCandless (@mikemccand) (migrated from JIRA)

New patch, folding in changes after #7427, and also cutting over TermRangeQuery to AutomatonQuery.

Now the changes to CompiledAutomaton are minimal, just the addition of the sinkState.
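For readers unfamiliar with the idea, a sink state can be pictured as a state that accepts every suffix, so matching can stop as soon as it is reached. The sketch below models a prefix match as such an automaton in plain Java. It is an assumption-laden illustration, not the CompiledAutomaton code, and all names in it are hypothetical.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical illustration only; not Lucene's CompiledAutomaton.
final class PrefixAutomatonSketch {
    private final byte[] prefix;

    PrefixAutomatonSketch(String prefix) {
        this.prefix = prefix.getBytes(StandardCharsets.UTF_8);
    }

    /** State i means "i prefix bytes matched"; the last state accepts all suffixes. */
    int sinkState() {
        return prefix.length;
    }

    /** Returns the next state, or -1 if the byte is rejected. */
    int step(int state, byte b) {
        if (state == sinkState()) {
            return state; // the sink loops back to itself on every byte
        }
        return prefix[state] == b ? state + 1 : -1;
    }

    boolean accepts(String term) {
        int state = 0;
        for (byte b : term.getBytes(StandardCharsets.UTF_8)) {
            if (state == sinkState()) {
                return true; // early exit: everything after the prefix matches
            }
            state = step(state, b);
            if (state == -1) {
                return false;
            }
        }
        return state == sinkState();
    }

    public static void main(String[] args) {
        PrefixAutomatonSketch foo = new PrefixAutomatonSketch("foo");
        System.out.println(foo.accepts("foobar")); // true
        System.out.println(foo.accepts("fo"));     // false
    }
}
```

Knowing the sink state lets an enumerator treat a prefix automaton like a plain prefix scan once the prefix has been consumed.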

I have a few minor nocommits left ... otherwise I think this is ready.

asfimport commented 9 years ago

Michael McCandless (@mikemccand) (migrated from JIRA)

New patch, fixing the nocommits. I think it's ready ... I'll beast tests for a while on it.

I don't think we should rush this into 5.1.

asfimport commented 9 years ago

Adrien Grand (@jpountz) (migrated from JIRA)

+1 to the patch

> I don't think we should rush this into 5.1.

+1

asfimport commented 9 years ago

Robert Muir (@rmuir) (migrated from JIRA)

We should be able to add a trivial test to lucene/codecs that extends BasePostingsFormatTestCase for this new PF? What about "putting it into rotation" in RandomCodec? These things would give us a lot of testing.

asfimport commented 9 years ago

Robert Muir (@rmuir) (migrated from JIRA)

+1 to the patch. I can add the tests later if you want, but they should be trivial.

asfimport commented 9 years ago

Michael McCandless (@mikemccand) (migrated from JIRA)

Another iteration, adding a test case as Rob suggested. It was somewhat tricky because BasePFTestCase tests all IndexOptions but this new PF only supports DOCS.

So, I factored out a RandomPostingsTester (in test-framework) from BasePFTestCase, which lets you specify which IndexOptions to test, and then added TestAutoPrefixPF to use that.

This process managed to find a couple latent bugs in BasePostingsFormatTestCase's SeedPostings!

Tests seem to pass ... I ran distributed beasting for 17 iters.

I think it's ready.

asfimport commented 9 years ago

Robert Muir (@rmuir) (migrated from JIRA)

+1

I thought those tests were gonna be easy... but the refactoring of the test is great. Thanks!

asfimport commented 9 years ago

ASF subversion and git services (migrated from JIRA)

Commit 1670918 from @mikemccand in branch 'dev/trunk' https://svn.apache.org/r1670918

LUCENE-5879: add auto-prefix terms to block tree, and experimental AutoPrefixTermsPostingsFormat

asfimport commented 9 years ago

Michael McCandless (@mikemccand) (migrated from JIRA)

I committed to trunk ... I'll let it bake a bit before backporting to 5.x (5.2).

asfimport commented 9 years ago

ASF subversion and git services (migrated from JIRA)

Commit 1670923 from @rmuir in branch 'dev/trunk' https://svn.apache.org/r1670923

LUCENE-5879: fix test compilation (this enum no longer exists)

asfimport commented 9 years ago

Michael McCandless (@mikemccand) (migrated from JIRA)

Woops ... the "term range checking" I added to CheckIndex here is way, way too costly: https://people.apache.org/~mikemccand/lucenebench/checkIndexTime.html

That's on an index that has zero auto-prefix terms ...

I'll turn this off for now and mull how to fix it. We could at least skip this when that field has no auto-prefix terms, but it will still be costly when the field does have auto-prefix terms because it does a rather exhaustive comparison of the "full" term space iteration (does not use auto-prefix terms) against the Terms.intersect one (does use them)....

asfimport commented 9 years ago

ASF subversion and git services (migrated from JIRA)

Commit 1671380 from @mikemccand in branch 'dev/trunk' https://svn.apache.org/r1671380

LUCENE-5879: turn off too-slow term range checking for now

asfimport commented 9 years ago

Michael McCandless (@mikemccand) (migrated from JIRA)

I'll backport this to 5.2 soon ... trunk jenkins hasn't uncovered any serious issues so far ...

asfimport commented 9 years ago

ASF subversion and git services (migrated from JIRA)

Commit 1671765 from @mikemccand in branch 'dev/branches/branch_5x' https://svn.apache.org/r1671765

LUCENE-5879: add auto-prefix terms to block tree, and experimental AutoPrefixTermsPostingsFormat

asfimport commented 9 years ago

ASF subversion and git services (migrated from JIRA)

Commit 1671766 from @mikemccand in branch 'dev/trunk' https://svn.apache.org/r1671766

LUCENE-5879: move CHANGES entry to 5.2.0

asfimport commented 9 years ago

ASF subversion and git services (migrated from JIRA)

Commit 1672037 from @mikemccand in branch 'dev/branches/branch_5x' https://svn.apache.org/r1672037

LUCENE-5879: fix corner case in auto-prefix intersect

asfimport commented 9 years ago

ASF subversion and git services (migrated from JIRA)

Commit 1672042 from @mikemccand in branch 'dev/trunk' https://svn.apache.org/r1672042

LUCENE-5879: fix corner case in auto-prefix intersect

asfimport commented 9 years ago

ASF subversion and git services (migrated from JIRA)

Commit 1672262 from @mikemccand in branch 'dev/trunk' https://svn.apache.org/r1672262

LUCENE-5879: fix empty string corner case

asfimport commented 9 years ago

ASF subversion and git services (migrated from JIRA)

Commit 1672263 from @mikemccand in branch 'dev/branches/branch_5x' https://svn.apache.org/r1672263

LUCENE-5879: fix empty string corner case

asfimport commented 9 years ago

ASF subversion and git services (migrated from JIRA)

Commit 1672749 from @mikemccand in branch 'dev/branches/branch_5x' https://svn.apache.org/r1672749

LUCENE-5879: turn off too-slow term range checking for now

asfimport commented 9 years ago

ASF subversion and git services (migrated from JIRA)

Commit 1673009 from @mikemccand in branch 'dev/trunk' https://svn.apache.org/r1673009

LUCENE-5879: fix finite case of Automata.makeBinaryInterval, improve tests

asfimport commented 9 years ago

ASF subversion and git services (migrated from JIRA)

Commit 1673075 from @mikemccand in branch 'dev/trunk' https://svn.apache.org/r1673075

LUCENE-5879: fix off-by-one that caused OOME in test when min and max auto-prefix terms were 2; attempt to simplify empty string case

asfimport commented 9 years ago

ASF subversion and git services (migrated from JIRA)

Commit 1674027 from @mikemccand in branch 'dev/branches/branch_5x' https://svn.apache.org/r1674027

LUCENE-5879: fix test bug: we cannot enforce max term count for empty-string prefix query since we [intentionally] do not create an empty-string auto-prefix term at index time

asfimport commented 9 years ago

ASF subversion and git services (migrated from JIRA)

Commit 1674029 from @mikemccand in branch 'dev/trunk' https://svn.apache.org/r1674029

LUCENE-5879: fix test bug: we cannot enforce max term count for empty-string prefix query since we [intentionally] do not create an empty-string auto-prefix term at index time

asfimport commented 9 years ago

Anshum Gupta (@anshumg) (migrated from JIRA)

Bulk close for 5.2.0.