apache/lucene: Apache Lucene open-source search software (https://lucene.apache.org/, Apache License 2.0)

Add auto-prefix terms to block tree terms dict [LUCENE-5879] #6941

Closed (asfimport closed this 9 years ago)

asfimport commented 10 years ago

This cool idea to generalize numeric/trie fields came from Adrien:

Today, when we index a numeric field (LongField, etc.) we pre-compute (via NumericTokenStream), outside of the indexer/codec, which prefix terms should be indexed.

But this can be inefficient: you set a static precisionStep, and those prefix terms are always added regardless of how the terms in the field are actually distributed. Yet typically, in real-world applications, the terms have a non-random distribution.

So it would be better if the terms dict instead decided where it makes sense to insert prefix terms, based on how dense the terms are in each region of term space.

This way we can speed up query time for both term ranges (e.g. the infix suggester) and numeric ranges, and it should let us use less index space while getting faster range queries.

This would also mean that min/maxTerm for a numeric field would now be correct, vs today where the externally computed prefix terms are placed after the full-precision terms, causing hairy code like NumericUtils.getMaxInt/Long. So optimizations like #6922 become feasible.

The terms dict can also do tricks not possible if you must live on top of its APIs, e.g. handling the adversarial/over-constrained case when a given prefix has too many terms following it but finer prefixes have too few (what block tree calls "floor term blocks").


Migrated from LUCENE-5879 by Michael McCandless (@mikemccand), 2 votes, resolved Apr 07 2015. Attachments: LUCENE-5879.patch (versions: 14)

asfimport commented 10 years ago

Michael McCandless (@mikemccand) (migrated from JIRA)

Initial work-in-progress patch: tests do NOT consistently pass, there are still bugs/corner cases (e.g. when an auto-prefix term is the last term in a block...).

This change very much requires #6332 (the pull API for postings formats), which I'd like to backport to 4.x along with this. The writer makes an initial pass through all terms to identify "good" prefixes, using the same algorithm block tree uses to assign terms to blocks, just with different block sizes. I haven't picked defaults yet, but e.g. you could state that an auto-prefix term should expand to 100-200 terms, and then the first pass picks prefix terms to handle that.

The problem is inherently over-constrained: a given set of prefixes like fooa*, foob*, fooc*, etc. may have too few terms each, while their common prefix foo* would have way too many. For this case it creates "floored" prefix terms, e.g. foo[a-e]*, foo[f-p]*, foo[q-z]*.

On the second pass, when it writes the actual terms, it inserts these auto-prefix terms at the right places.
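
To make the two-pass scheme concrete, here is a deliberately simplified sketch of the first pass (not the patch's actual code): scan the sorted terms and record every prefix that covers between a min and max number of terms. The real writer operates on byte[] terms, replaces the covered terms with the new auto-prefix term so the process can recurse, and falls back to the floored prefixes described above in the over-constrained case; the 100-200 thresholds are the hypothetical targets mentioned earlier.

```java
import java.util.ArrayList;
import java.util.List;

// Simplified first-pass sketch: find prefixes covering MIN..MAX sorted terms.
// The real implementation recurses (prefixes of prefixes) and emits floored
// prefixes like foo[a-e]* when a prefix is over-constrained; both are omitted.
class AutoPrefixSketch {
  static final int MIN = 100, MAX = 200; // hypothetical targets

  static List<String> pickPrefixes(List<String> sortedTerms) {
    List<String> prefixes = new ArrayList<>();
    int maxLen = 0;
    for (String t : sortedTerms) {
      maxLen = Math.max(maxLen, t.length());
    }
    // Longest prefixes first; a shorter prefix covers strictly more terms.
    for (int len = maxLen - 1; len >= 1; len--) {
      int i = 0;
      while (i < sortedTerms.size()) {
        String p = prefixOf(sortedTerms.get(i), len);
        int j = i;
        while (j < sortedTerms.size() && prefixOf(sortedTerms.get(j), len).equals(p)) {
          j++; // count the run of terms sharing this prefix
        }
        if (j - i >= MIN && j - i <= MAX) {
          prefixes.add(p + "*"); // would become an auto-prefix term
        }
        i = j;
      }
    }
    return prefixes;
  }

  private static String prefixOf(String s, int len) {
    return s.length() <= len ? s : s.substring(0, len);
  }
}
```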

Currently it only works for DOCS_ONLY fields, and it uses a FixedBitSet(maxDoc) when writing each prefix term.

These auto-prefix terms are fully hidden from all the normal Terms/TermsEnum APIs, statistics, etc. They are only used in Terms.intersect, if you pass a new flag allowing them to be used.

I haven't done anything about the document / searching side of things: this is just a low level change at this point, for the terms dict. Maybe we need a new FieldType boolean "computeAutoPrefixTerms" or some such; currently it's just exposed as additional params to the block tree terms dict writer.

I think this would mean NumericRangeQuery/Filter can just rewrite to ordinary TermRangeQuery/Filter, and the numeric fields just become sugar for encoding their numeric values as sortable binary terms.
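As a sketch of what "sortable binary terms" could look like for a long (the standard sign-bit trick, mirroring what NumericUtils does internally rather than any new API): flip the sign bit so negative values compare below positive ones under unsigned byte-wise order, then write big-endian.

```java
// Minimal sketch: encode a long so that unsigned byte-wise order matches
// numeric order. Flipping the sign bit makes negatives sort first.
static byte[] longToSortableBytes(long value) {
  long flipped = value ^ 0x8000000000000000L;
  byte[] out = new byte[8];
  for (int i = 7; i >= 0; i--) {
    out[i] = (byte) (flipped & 0xFFL);
    flipped >>>= 8;
  }
  return out;
}
```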

asfimport commented 10 years ago

Michael McCandless (@mikemccand) (migrated from JIRA)

Another iteration ... still tons of nocommits. I added Automata.makeBinaryInterval, so you can easily make an automaton that matches a min / max term range. I plan to switch the read-time API to use a new CompiledAutomaton.AUTOMATON_TYPE.RANGE, and fix e.g. PrefixQuery (and eventually maybe the infix suggester, and numeric range query) to use this API.
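
For illustration, the read-time path might then look roughly like this; this is a sketch based on the APIs named in this comment, so the exact signatures may differ from what was finally committed:

```java
import java.io.IOException;
import org.apache.lucene.index.Terms;
import org.apache.lucene.index.TermsEnum;
import org.apache.lucene.util.BytesRef;
import org.apache.lucene.util.automaton.Automata;
import org.apache.lucene.util.automaton.Automaton;
import org.apache.lucene.util.automaton.CompiledAutomaton;

// Sketch: build an automaton accepting every term in [minTerm, maxTerm) and
// intersect it with the field's terms dict.
static TermsEnum rangeEnum(Terms terms, BytesRef minTerm, BytesRef maxTerm) throws IOException {
  Automaton a = Automata.makeBinaryInterval(minTerm, true, maxTerm, false);
  CompiledAutomaton ca = new CompiledAutomaton(a);
  return terms.intersect(ca, null); // null start term: iterate from the beginning
}
```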

asfimport commented 10 years ago

Adrien Grand (@jpountz) (migrated from JIRA)

Thanks Mike for working on this, this is a very exciting issue! I'm very curious what the space/speed trade-off will look like compared to static prefix encoding like NumericUtils does.

> Maybe we need a new FieldType boolean "computeAutoPrefixTerms"

Why would it be needed? If the search APIs use intersect, this should be transparent?

asfimport commented 10 years ago

Michael McCandless (@mikemccand) (migrated from JIRA)

> Maybe we need a new FieldType boolean "computeAutoPrefixTerms"

> Why would it be needed? If the search APIs use intersect, this should be transparent?

I think being totally transparent would be the best solution! This would mean BT will always index auto-prefix terms for DOCS_ONLY fields ... I'll just have to test what the indexing time / disk usage cost is. If we need to make it optional at indexing time, I'm not sure what the API should look like to make it easy...

asfimport commented 10 years ago

Robert Muir (@rmuir) (migrated from JIRA)

If the existing API is confusing, making it worse by adding a confusing boolean won't really fix the problem.

asfimport commented 10 years ago

Uwe Schindler (@uschindler) (migrated from JIRA)

In fact, as far as I see: NumericRangeQuery with that would then just be a standard TermRangeQuery with a special binary upper/lower term?

Otherwise very cool!

asfimport commented 10 years ago

Michael McCandless (@mikemccand) (migrated from JIRA)

> In fact, as far as I see: NumericRangeQuery with that would then just be a standard TermRangeQuery with a special binary upper/lower term?

I think so! We should be able to use NumericUtils.XToSortableInt/Long I think?

But that's phase 2 here... it's hard enough just getting these terms working low-level...

asfimport commented 10 years ago

Michael McCandless (@mikemccand) (migrated from JIRA)

Latest patch: tests are passing, but still many nocommits ... I think net/net the approach is done and now I'm gonna wrap up / clean nocommits.

asfimport commented 10 years ago

Michael McCandless (@mikemccand) (migrated from JIRA)

New patch, resolving a few nocommits (but more remain!). Tests pass.

I created a simple benchmark, generating random longs over the full range, and random ranges as queries. For each test, I build the index once (10M longs), and then run all queries (10K random ranges) 10 times and record best time. I also print out the number of terms visited for the first query as a coarse measure of the prefix terms density.

NF is numeric field and AP is auto-prefix. For AP I just index the long values as an 8-byte binary term, using the same logic as NumericUtils.longToPrefixCodedBytes to make the binary sort match the numeric sort.

Source code for benchmark is here:

https://code.google.com/a/apache-extras.org/p/luceneutil/source/browse/src/python/autoPrefixPerf.py

And here:

https://code.google.com/a/apache-extras.org/p/luceneutil/source/browse/src/main/perf/AutoPrefixPerf.java

The indexer uses a single thread and SerialMergeScheduler, to try to isolate the added indexing cost of writing terms.

Currently the auto-prefix terms are configured like term blocks in block tree, with a min/max number of items (terms, or "child" auto-prefix terms) per auto-prefix term.

Here are the raw results ... net/net it looks like AP can match NF's performance with a ~50% smaller index and slightly faster indexing time. The two approaches are quite different: NF uses TermsEnum.seek, but AP uses Terms.intersect, which for block tree does no seeking.

However, there's a non-trivial indexing time & space cost vs the baseline... not sure we can/should just always enable this by default for DOCS_ONLY fields ...

| Configuration | Index sec | Index MB | Term count | Search msec |
|---|---|---|---|---|
| Baseline (8-byte binary terms like AP) | 20.312 | 123.4 | | |
| NF precStep=4 | 225.06 | 1275.70 | 560 | 6056 |
| NF precStep=8 | 115.44 | 675.55 | 2983 | 5547 |
| NF precStep=12 | 80.77 | 470.96 | 37405 | 6080 |
| NF precStep=16 (default) | 61.13 | 363.19 | 125906 | 10466 |
| AP min=5 max=8 | 194.21 | 272.06 | 315 | 5715 |
| AP min=5 max=12 | 179.86 | 256.88 | 295 | 5771 |
| AP min=5 max=16 | 168.91 | 254.32 | 310 | 5727 |
| AP min=5 max=20 | 157.48 | 252.04 | 321 | 5742 |
| AP min=5 max=2147483647 | 64.03 | 164.55 | 3955 | 6168 |
| AP min=10 max=18 | 106.52 | 215.26 | 552 | 5792 |
| AP min=10 max=27 | 99.00 | 212.45 | 533 | 5814 |
| AP min=10 max=36 | 88.45 | 207.43 | 505 | 5850 |
| AP min=10 max=45 | 79.15 | 194.73 | 650 | 5681 |
| AP min=10 max=2147483647 | 42.68 | 162.64 | 6077 | 6199 |
| AP min=15 max=28 | 84.83 | 204.29 | 641 | 5763 |
| AP min=15 max=42 | 74.20 | 193.24 | 753 | 5828 |
| AP min=15 max=56 | 63.69 | 190.06 | 662 | 5839 |
| AP min=15 max=70 | 62.53 | 185.96 | 866 | 5846 |
| AP min=15 max=2147483647 | 40.94 | 162.52 | 6258 | 6156 |
| AP min=20 max=38 | 69.26 | 192.11 | 839 | 5837 |
| AP min=20 max=57 | 60.75 | 186.18 | 1034 | 5877 |
| AP min=20 max=76 | 60.69 | 185.21 | 980 | 5866 |
| AP min=20 max=95 | 59.87 | 184.20 | 985 | 5940 |
| AP min=20 max=2147483647 | 41.64 | 162.52 | 6258 | 6196 |
| AP min=25 max=48 | 61.81 | 187.08 | 806 | 5790 |
| AP min=25 max=72 | 58.69 | 183.02 | 929 | 5894 |
| AP min=25 max=96 | 56.80 | 178.29 | 841 | 5938 |
| AP min=25 max=120 | 55.81 | 177.75 | 1044 | 5883 |
| AP min=25 max=2147483647 | 40.99 | 162.52 | 6258 | 6189 |
| AP min=30 max=58 | 56.99 | 182.91 | 1012 | 5891 |
| AP min=30 max=87 | 55.22 | 178.16 | 1065 | 5921 |
| AP min=30 max=116 | 54.78 | 177.39 | 1142 | 5811 |
| AP min=30 max=145 | 54.22 | 176.64 | 1340 | 5798 |
| AP min=30 max=2147483647 | 40.76 | 162.44 | 6445 | 6191 |
| AP min=35 max=68 | 56.75 | 182.45 | 1265 | 5403 |
| AP min=35 max=102 | 53.87 | 177.99 | 1302 | 5924 |
| AP min=35 max=136 | 51.10 | 177.25 | 1547 | 6082 |
| AP min=35 max=170 | 50.77 | 176.93 | 1402 | 5994 |
| AP min=35 max=2147483647 | 40.73 | 161.75 | 8045 | 6271 |
| AP min=40 max=78 | 55.75 | 185.57 | 1499 | 5917 |
| AP min=40 max=117 | 53.12 | 180.57 | 1501 | 5961 |
| AP min=40 max=156 | 50.93 | 179.38 | 1806 | 5487 |
| AP min=40 max=195 | 50.31 | 178.81 | 1763 | 5484 |
| AP min=40 max=2147483647 | 38.59 | 158.98 | 12573 | 6281 |

asfimport commented 10 years ago

Adrien Grand (@jpountz) (migrated from JIRA)

The numbers look great, it's quite interesting to see the impact of the auto-prefix parameters on the number of visited terms!

asfimport commented 10 years ago

Michael McCandless (@mikemccand) (migrated from JIRA)

Another patch, with all real nocommits removed (the only ones that remain are for DEBUG prints... I'll remove those before committing). I think this is close.

I picked a "gentle" (does not add too many auto-prefix terms) default of minItemsInPrefix=100 and maxItemsInPrefix=Integer.MAX_VALUE.

With these defaults, on the simple "random longs" test, the increase in index size for DOCS_ONLY fields is small (123.4 MB vs 137.69 MB: ~12%), much smaller than the current default precStep=16 for long NumericField (363.19 MB vs 137.69 MB: ~62% smaller), and search time is faster (10.5 sec vs 6.0 sec: ~43% faster). Indexing time is also ~2X faster (29.96 sec vs 61.13 sec) than using default NumericField, but ~48% slower (20.31 sec vs 29.96 sec) than indexing simple binary terms.

I think it's OK to just turn this on by default for all DOCS_ONLY fields... apps can always disable by passing minItemsInPrefix=0 to Lucene41PF or BlockTreeTermsWriter if necessary.
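
As a sketch of that opt-out, assuming the min/max auto-prefix thresholds are exposed as extra constructor arguments alongside the usual block sizes (this four-argument constructor is an assumption from the discussion, not a released API):

```java
import org.apache.lucene.codecs.PostingsFormat;
import org.apache.lucene.codecs.lucene41.Lucene41PostingsFormat;

// Hypothetical wiring: block-tree block sizes plus auto-prefix thresholds;
// minItemsInPrefix=0 would disable auto-prefix terms entirely.
PostingsFormat pf = new Lucene41PostingsFormat(
    25, 48,                   // minItemsInBlock, maxItemsInBlock
    100, Integer.MAX_VALUE);  // minItemsInPrefix, maxItemsInPrefix
```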

Note that this optimization currently only kicks in for TermRangeQuery and PrefixQuery. It's also possible to enable it for certain wildcard (e.g. foo?x*) and regexp (e.g. foo[m-p]*) queries but this can wait.

There is plenty more to explore here in how auto-prefix terms are assigned, e.g., like the multi-level postings skip lists, we could have different thresholds for "level 1" prefixes (the prefix matches real terms) vs "level 2+" prefixes (the prefix also matches other prefix terms), but we can iterate later / get feedback from users.

With this change an application can simply index numbers as binary terms (carefully flipping the sign bit so negative numbers sort first) or as fixed-width terms (zero or space left-filled, e.g. 000042) and then run "ordinary" TermRangeQuery on them, and should see good improvements in index size, indexing time, and search performance vs NumericRangeFilter.
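
For example, a sketch of the fixed-width variant (assuming non-negative values, and a postings format that indexes auto-prefix terms for this field):

```java
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.document.StringField;
import org.apache.lucene.search.TermRangeQuery;

// Index a number as a fixed-width, zero-padded term (non-negative values only;
// negative values would need the sign-bit-flipped binary encoding instead).
Document doc = new Document();
doc.add(new StringField("price", String.format("%010d", 42L), Field.Store.NO));

// Search time: an "ordinary" term range matching 17 <= price < 1000.
TermRangeQuery q = TermRangeQuery.newStringRange(
    "price", String.format("%010d", 17L), String.format("%010d", 1000L), true, false);
```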

asfimport commented 10 years ago

Michael McCandless (@mikemccand) (migrated from JIRA)

Hmm, one problem here is ... with this patch we are inserting auto-prefix terms for numeric fields which is sort of silly (prefix terms of prefix terms). Not sure if/how to prevent that... I guess we could do something hacky and try to detect when writing the postings that it's a numeric field ... hmm.

I removed the DEBUG nocommits and ran a trunk vs patch perf test on all English wikipedia "medium" docs (33.3M) and it didn't look like there was any real change, just noise.

asfimport commented 10 years ago

Michael McCandless (@mikemccand) (migrated from JIRA)

New patch: I added an option (default off) to FieldType for the application to "hint" that the field should be indexed for range querying/filtering if possible. It's experimental, only allowed if the field is DOCS_ONLY, and the javadocs explain that this is just a hint (postings format can ignore it entirely).

This value is propagated to the "auto write-once schema" FieldInfo.getIndexRanges(), which has the same semantics as omitNorms ("once off, always off"). This means if you want to try out this new feature on a field you must fully re-index (or use FilterAtomicReader to reprocess your entire index).

Since this is now an "opt-in" feature, I also made the defaults more aggressive (25/48: the same defaults as for pinching off a block of terms, i.e. one prefix term per leaf term block).

At these defaults the index is 52% larger for random longs (187.08 MB vs baseline 123.4 MB) but 48% smaller than the default (precStep=16) numeric field, and search time is 45% faster (5790 msec vs 10466 msec with NF precStep=16). Indexing speed is about the same as NF...

Separately I also tested the "sequential IDs" case, indexing 100M ids in left-zero-prefixed sequential form (0001, 0002, ...); normally one wouldn't enable indexRanges for such a field (the "normal" use case is just PK lookup) but I still wanted to see how auto-prefix terms behave on such a "dense" term space:

Net/net I think sequential IDs / densely packed terms is an "easy" case since block tree can easily hit the max (default now 48) terms in each auto-prefix term. Also the postings compress very well since the docIDs are all adjacent (I indexed with a single thread).

Tests pass, "ant precommit" passes, I think it's ready.

asfimport commented 10 years ago

Robert Muir (@rmuir) (migrated from JIRA)

Can experimental please be added to IndexableFieldType and everywhere else?

I am afraid if even a single spot is missed, it could prevent a future refactoring or cleanup that removes this (which really in the future belongs as a codec parameter, but this is good for now).

asfimport commented 10 years ago

Michael McCandless (@mikemccand) (migrated from JIRA)

> Can experimental please be added to IndexableFieldType and everywhere else?

Woops I missed that one; I'll add it, and double check all the other places this boolean crawls to...

asfimport commented 10 years ago

Michael McCandless (@mikemccand) (migrated from JIRA)

I plan to commit this soon to 5.0... we can iterate on the "phase 2" items (making it easier during indexing to note that this field will be used for range searching).

asfimport commented 10 years ago

Uwe Schindler (@uschindler) (migrated from JIRA)

And how about NRQ? The same? Currently it's unsupported, because we have to figure out how to auto-disable trie terms, and how to make NRQ auto-rewrite to a simple TermRangeQuery when no trie terms are available and there are only auto-prefix terms instead...

asfimport commented 10 years ago

Robert Muir (@rmuir) (migrated from JIRA)

We should think about a migration plan for numerics?

This should be a followup issue.

Here are some thoughts.

  1. keep the current trie encoding for terms; it just uses precisionStep=Inf and lets the terms dictionary do it automatically.
  2. create a FilterAtomicReader that, for a previously trie-encoded field, removes the "fake" terms on merge.

Users could continue to use NumericRangeQuery, just with the infinite precision step, and it will always work, just execute slower for old segments as it doesn't take advantage of the trie terms that are not yet merged away.

One issue to making it really nice is that Lucene doesn't know for sure that a field is numeric, so it cannot be "full-auto". Apps would have to use their schema or whatever to wrap with this reader in their merge policy.

Maybe we could provide some sugar for this, such as a wrapping merge policy that takes a list of field names that are numeric, or sugar to pass this to IWC in IndexUpgrader to force it, and so on.

I think it's complicated enough for a followup issue though.

asfimport commented 10 years ago

David Smiley (@dsmiley) (migrated from JIRA)

Wow, awesome work Mike! And fantastic idea Adrien!

I just read through the comments but don't have time to dig into the source yet.

asfimport commented 10 years ago

David Smiley (@dsmiley) (migrated from JIRA)

Is the auto-prefixing done on a per-segment basis or is it something different that has to do with Codec internals? It appears to be the latter.

I asked that in a confusing way. I mean, are the intervals that are computed from the data determined and fixed within a given segment, or is it variable throughout the segment?

asfimport commented 10 years ago

Michael McCandless (@mikemccand) (migrated from JIRA)

I like that plan Rob, except the term encoding that numeric field uses is somewhat wasteful: 1 byte is used to encode the "shift", and then only 7 of 8 bits are used for subsequent bytes ... but maybe we just live with that since it simplifies migration (both old and new can co-exist in one index).

I'll open an issue for this.

asfimport commented 10 years ago

Michael McCandless (@mikemccand) (migrated from JIRA)

OK I created #7028 for the migration plan...

asfimport commented 10 years ago

Michael McCandless (@mikemccand) (migrated from JIRA)

> Wow, awesome work Mike! And fantastic idea Adrien!

Thanks @dsmiley

> I mean, are the intervals that are computed from the data determined and fixed within a given segment, or is it variable throughout the segment?

It's per-segment, so each segment will look at how its terms fall and find "good" places to insert the auto-prefix terms.

> Is this applicable to variable-length String fields that you might want to do range queries on for whatever reason? Such as... A, B, C* or A-G, H-P, ... etc.? It appears this is applicable.

I don't quite understand the question ... the indexed terms can be any variable length.

> Would any CompiledAutomaton (e.g. a wildcard query) that has a leading prefix benefit from this or is it strictly Prefix & Range queries? Mike's comments suggest it will someday but not yet. Can you create an issue for it, Mike? This would be especially useful in Lucene-spatial; I'm excited at the prospects!

Currently auto-prefix terms are only used for PrefixQuery and TermRangeQuery, or for any automaton query that "becomes" a PrefixQuery on rewrite (e.g. WildcardQuery("foo*")).

Enabling them for WildcardQuery and RegexpQuery should be fairly easy, however they will only kick in in somewhat exotic situations, where there is a portion of the term space accepted by the automaton which "suddenly" accepts any suffix. E.g. foo*bar will never use auto-prefix terms, but foo?b* will.
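
Concretely, illustrative examples of which queries would qualify:

```java
import org.apache.lucene.index.Term;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.WildcardQuery;

// Rewrites to a prefix, so auto-prefix terms already apply:
Query q1 = new WildcardQuery(new Term("body", "foo*"));
// After matching "foo?b", any suffix is accepted, so auto-prefix terms could kick in:
Query q2 = new WildcardQuery(new Term("body", "foo?b*"));
// Never reaches an "accept any suffix" state, so auto-prefix terms never help:
Query q3 = new WildcardQuery(new Term("body", "foo*bar"));
```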

I'll open an issue!

> When you iterate a TermsEnum, will the prefix terms be exposed or is it internal to the Codec?

No, these auto-prefix terms are invisible in all APIs, except when you call Terms.intersect.

asfimport commented 10 years ago

Michael McCandless (@mikemccand) (migrated from JIRA)

OK I opened #7029 to allow Wildcard/RegexpQuery to use auto-prefix terms...

asfimport commented 10 years ago

David Smiley (@dsmiley) (migrated from JIRA)

> Is this applicable to variable-length String fields that you might want to do range queries on for whatever reason? Such as... A, B, C* or A-G, H-P, ... etc.? It appears this is applicable.

> I don't quite understand the question ... the indexed terms can be any variable length.

I simply mean is this useful outside of numeric fields?

Another question I have is, does the automatic prefix length calculation do it at byte boundaries or is it intra-byte?

asfimport commented 10 years ago

Michael McCandless (@mikemccand) (migrated from JIRA)

> I simply mean is this useful outside of numeric fields?

Oh, it's for any terms ... numeric or not.

> Another question I have is, does the automatic prefix length calculation do it at byte boundaries or is it intra-byte?

It's currently byte-boundary only, though this is an impl detail and we could do e.g. nibbles in the future ...

asfimport commented 10 years ago

Michael McCandless (@mikemccand) (migrated from JIRA)

New patch, folding in feedback from Rob.

First off, nothing (!!) was testing that you could use one of the BOOLEAN_QUERY_REWRITE methods in MTQ when auto-prefix terms were enumerated.

When I added that, it tripped this assert in ScoringRewrite:

        assert reader.docFreq(term) == termStates[pos].docFreq();

This trips when the term returned by IntersectTermsEnum is an auto-prefix term, in which case reader.docFreq cannot find the term since auto-prefix terms are invisible to all APIs except intersect.

I fixed this by adding a hackish boolean isRealTerm to BlockTermState, and changed the assert to only check if it knows it's looking at real terms.

However, I also had to fix AssertingAtomicReader.AssertingTermsEnum, to override seekExact(BytesRef, TermState) and termState() to delegate to the delegate (in) instead of to super ... super will fail because it falls back to seeking by term, which cannot work here because you can't seek by an auto-prefix term.

This is a little scary ... because you are allowed to seekExact(BytesRef, TermState) for an auto-prefix term, meaning auto-prefix terms are in fact NOT invisible in this API. This is a possibly nasty trap, for any FilterAtomicReaders out there, that rely on super.... not sure what to do.

Also, Rob noticed that CheckIndex no longer checks all bits ... this is pretty bad ... I put a nocommit to think about what to do ...

asfimport commented 10 years ago

David Smiley (@dsmiley) (migrated from JIRA)

Some more questions:

> It's per-segment, so each segment will look at how its terms fall and find "good" places to insert the auto-prefix terms.

So for the whole segment, does it decide to insert auto-prefixes at specific byte lengths (e.g. 3, 5, and 7)? Or does it vary based on specific terms? I'm hoping it's smart enough to vary based on specific terms. For example, if hypothetically there were lots of terms that had this common prefix: "BGA", then it might decide "BGA" makes a good auto-prefix, but not necessarily all terms at length 3, since many others might not make good prefixes. Make sense?

At a low level, do I take advantage of this in the same way that I might do so at a high level using PrefixQuery and then getting the weight then getting the scorer to iterate docIds? Or is there a lower-level path? Although there is some elegance to not introducing new APIs, I think it's worth exploring having prefix & range capabilities be on the TermsEnum in some way.

Do you envision other posting formats being able to re-use the logic here? That would be nice.

In your future tuning, I suggest you give the ability to vary the conservative vs aggressive prefixing based on the very beginning and very end (assuming known common lengths). In the FlexPrefixTree Varun (GSOC) worked on, the leaves per level are configurable at each level (i.e. prefix length)... and it's better to have little prefixing at the very top and little at the bottom too. At the top, prefixes only help for queries that span massive portions of the possible term space (which in spatial is rare; likely other apps too). And at the bottom (long prefixes) just shy of the maximum length (say 7 bytes out of 8 for a double), there is marginal value because in the spatial search algorithm, the bottom detail is scan'ed over (e.g. TermsEnum.next()) instead of seek'ed, because the data is less dense and it's adjacent. This principle may apply to numeric-range queries depending on how they are coded; I'm not sure.

asfimport commented 10 years ago

Robert Muir (@rmuir) (migrated from JIRA)

> Although there is some elegance to not introducing new APIs, I think it's worth exploring having prefix & range capabilities be on the TermsEnum in some way.

It's not that; it's that there is absolutely no reason to make TermsEnum more complex in this way. It would accomplish nothing and would not make anything more efficient.

These APIs really should be minimal.

asfimport commented 10 years ago

Michael McCandless (@mikemccand) (migrated from JIRA)

> So for the whole segment, does it decide to insert auto-prefixes at specific byte lengths (e.g. 3, 5, and 7)? Or does it vary based on specific terms? I'm hoping it's smart enough to vary based on specific terms. For example, if hypothetically there were lots of terms that had this common prefix: "BGA", then it might decide "BGA" makes a good auto-prefix, but not necessarily all terms at length 3, since many others might not make good prefixes. Make sense?

It's dynamic, based on how terms occupy the space.

Today (and we can change this: it's an impl. detail) it assigns prefixes just like it assigns terms to blocks. I.e., when it sees that a given prefix matches 25-48 terms, it inserts a new auto-prefix term. That auto-prefix term replaces those 25-48 other terms with one term, and then the process "recurses": you can then have a shorter auto-prefix term matching 25-48 other normal and/or auto-prefix terms.

> At a low level, do I take advantage of this in the same way that I might do so at a high level using PrefixQuery and then getting the weight then getting the scorer to iterate docIds? Or is there a lower-level path? Although there is some elegance to not introducing new APIs, I think it's worth exploring having prefix & range capabilities be on the TermsEnum in some way.

What this patch does is generalize/relax Terms.intersect: that method no longer guarantees that you see "true" terms. Rather, it now guarantees only that the docIDs you visit will be the same.

So to take advantage of it, you need to pass an Automaton to Terms.intersect and then not care about which terms you see, only the docIDs after visiting the DocsEnum across all terms it returned to you.
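
A sketch of that consumption pattern; it uses the 5.x-style postings API (the patch era used DocsEnum), so treat the exact method names as illustrative:

```java
import java.io.IOException;
import org.apache.lucene.index.PostingsEnum;
import org.apache.lucene.index.Terms;
import org.apache.lucene.index.TermsEnum;
import org.apache.lucene.search.DocIdSetIterator;
import org.apache.lucene.util.FixedBitSet;
import org.apache.lucene.util.automaton.CompiledAutomaton;

// Collect the docIDs matched by an automaton. With auto-prefix terms enabled,
// term() may return a prefix term, so only the visited docIDs are meaningful.
static FixedBitSet matchingDocs(Terms terms, CompiledAutomaton ca, int maxDoc) throws IOException {
  FixedBitSet bits = new FixedBitSet(maxDoc);
  TermsEnum te = terms.intersect(ca, null);
  PostingsEnum postings = null;
  while (te.next() != null) {
    postings = te.postings(postings, PostingsEnum.NONE);
    for (int doc = postings.nextDoc(); doc != DocIdSetIterator.NO_MORE_DOCS; doc = postings.nextDoc()) {
      bits.set(doc);
    }
  }
  return bits;
}
```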

> Do you envision other posting formats being able to re-use the logic here? That would be nice.

I agree it would be nice ... and the index-time logic that identifies the auto-prefix terms is quite standalone, so e.g. we could pull it out and have it "wrap" the incoming Fields to insert the auto-prefix terms. This way it's nearly transparent to any postings format ...

But the problem is, at search time, there's tricky logic in intersect() to use these prefix terms ... factoring that out so other formats can use it is trickier I think... though maybe we could fold it into the default Terms.intersect() impl...

> In your future tuning, I suggest you give the ability to vary the conservative vs aggressive prefixing based on the very beginning and very end (assuming known common lengths). In the FlexPrefixTree Varun (GSOC) worked on, the leaves per level are configurable at each level (i.e. prefix length)... and it's better to have little prefixing at the very top and little at the bottom too. At the top, prefixes only help for queries that span massive portions of the possible term space (which in spatial is rare; likely other apps too). And at the bottom (long prefixes) just shy of the maximum length (say 7 bytes out of 8 for a double), there is marginal value because in the spatial search algorithm, the bottom detail is scan'ed over (e.g. TermsEnum.next()) instead of seek'ed, because the data is less dense and it's adjacent. This principle may apply to numeric-range queries depending on how they are coded; I'm not sure.

I agree this (how auto-prefix terms are assigned) needs more control / experimenting. Really the best prefixes are a function not only of how the terms were distributed, but also of how queries will "likely" ask for ranges.

I think it's similar to postings skip lists, where we have different frequency of a skip pointer on the "leaf" level vs the "upper" skip levels.

asfimport commented 10 years ago

Michael McCandless (@mikemccand) (migrated from JIRA)

New patch, just resolving conflicts from recent trunk changes ...

I have an idea on improving CheckIndex to ferret out these auto prefix terms .... I'll explore it.

asfimport commented 10 years ago

Michael McCandless (@mikemccand) (migrated from JIRA)

New patch, enhancing CheckIndex to test term ranges of varying sizes (number of terms). This uncovered a back-compat bug in the patch, now fixed. This isn't guaranteed to touch all auto-prefix terms but it should touch most.

Net/net I think we should commit what we have here now, and get jenkins testing it. The new FieldType option is marked experimental (and has warnings in the javadocs) ... there are many things still to explore, and we need to sort out the migration from NumericField, but I think it's important to get this starting point to where users can try it out and then we iterate from there.

Tests pass, precommit passes.

I plan to commit soon...

asfimport commented 10 years ago

Michael McCandless (@mikemccand) (migrated from JIRA)

> But the problem is, at search time, there's tricky logic in intersect() to use these prefix terms ...

One approach we might be able to take here is to more strongly separate out the auto-prefix terms from the ordinary terms (e.g., so they feel like two different fields, in the impl), such that intersect becomes something like a merge between the postings format's "normal" intersect, with these auto-prefix terms pulling from a shadow field.

If we did this then other postings formats could also more easily add auto-prefix handling, and maybe even the default impl for Terms.intersect could use auto-prefix terms ...

But we can try this later; this is a big enough change already.

asfimport commented 10 years ago

Robert Muir (@rmuir) (migrated from JIRA)

> One approach we might be able to take here is to more strongly separate out the auto-prefix terms from the ordinary terms (e.g., so they feel like two different fields, in the impl), such that intersect becomes something like a merge between the postings format's "normal" intersect, with these auto-prefix terms pulling from a shadow field.

> If we did this then other postings formats could also more easily add auto-prefix handling, and maybe even the default impl for Terms.intersect could use auto-prefix terms ...

> But we can try this later; this is a big enough change already.

I agree it can be deferred, but I am worried it could get dropped completely.

Here, in order to make this easy to use, the feature is "temporarily" placed in FieldInfos with a warning. However, this is kind of screwed up when it's currently a block tree impl detail, and, well, it's kinda complicated stuff to implement (no offense!). Realistically, if in 3 months block tree is still the only impl, how will people feel if I want to nuke this FieldInfo and FieldType?

Will they be -1 that I break spatial/Solr/whatever that starts using it (possibly even their source code compile)? Because I think the following situations are ok:

Result #1

Result #2

But the following is not an ok "permanent situation": Result #3

How can we prevent this from happening?

asfimport commented 10 years ago

Robert Muir (@rmuir) (migrated from JIRA)

Patch looks good except for FixedBitSetTermsEnum. What is this doing? Can we remove it? I think it's bogus how it does 'super(null)'; its superclass should not even allow such a thing.

asfimport commented 10 years ago

Michael McCandless (@mikemccand) (migrated from JIRA)

Thanks for reviewing Rob.

> How can we prevent this from happening?

I think we shouldn't add the FI option at this time? We should only add it once we make it more generic so that all codec impls can easily support it?

asfimport commented 10 years ago

Michael McCandless (@mikemccand) (migrated from JIRA)

> Patch looks good except for FixedBitSetTermsEnum. What is this doing? Can we remove it? I think it's bogus how it does 'super(null)'; its superclass should not even allow such a thing.

I agree it's abusing FilterTermsEnum ... I'll fix these FilterLeafReader.FilterXXX classes to barf if they get null.

We need this stub class because of the API the terms dict uses when asking the postings format to write one term's postings: we pass TermsEnum to PostingsWriterBase.writeTerm. This is e.g. for PF's that may want to iterate docs/positions multiple times when writing one term ...

I'll fix it to directly subclass TermsEnum and override all methods...

asfimport commented 10 years ago

Robert Muir (@rmuir) (migrated from JIRA)

> We need this stub class because of the API the terms dict uses when asking the postings format to write one term's postings: we pass TermsEnum to PostingsWriterBase.writeTerm. This is e.g. for PF's that may want to iterate docs/positions multiple times when writing one term ...

I think that's the root of the problem causing my confusion? I guess TermsEnum is ok here, but it's much more than "docsEnum that you can rewind" (and not obvious for that!). I think that's why I freaked out when I saw FixedBitSetTermsEnum passing null and only implementing one method; it just didn't make a lot of sense.

asfimport commented 10 years ago

Michael McCandless (@mikemccand) (migrated from JIRA)

Maybe we can somehow change the PostingsWriterBase API, so it only gets a "thingy that lets you pull docs/docsAndPositions as many times as you want" ... but I don't think that should block committing here?

asfimport commented 10 years ago

Robert Muir (@rmuir) (migrated from JIRA)

I think it's enough to just add a comment to explain what is happening.

It's similar to seeing a "BitsetHashMap" or something. People are going to be very confused unless they have the 'rewind' explanation above.

asfimport commented 10 years ago

Michael McCandless (@mikemccand) (migrated from JIRA)

OK I'll add a comment explaining it...

asfimport commented 10 years ago

Michael McCandless (@mikemccand) (migrated from JIRA)

> I think we shouldn't add the FI option at this time?

New patch with FieldType.setIndexRanges removed, but I don't think we should commit this approach: the feature is basically so ridiculously expert to use that only like 3 people in the world will figure out how.

Sure, the servers built on top of Lucene can expose a simple API, since they "know" the schema and can open up a "index for range searching" boolean on a field and validate you are using a PF that supports that... but I don't think it's right/fair to make new, strong features of Lucene ridiculously hard to use by direct Lucene users.

It's wonderful Lucene has such pluggable codecs now, letting users explore all sorts of custom formats, etc., but the nasty downside of all this freedom is that new, complex features like this one, which offer powerful improvements to the default codec that 99% of Lucene users would have used, must either be implemented across the board for all codecs (a very tall order) in order to have an intuitive API, or must be exposed only via ridiculously expert codec-specific APIs.

I don't think either choice is acceptable.

So ... I tried exploring an uber helper/utility class, that lets you add "optimized for range/prefix" fields to docs, and "spies" on you to determine which fields should then use a customized PF, and then gives you sugar APIs to build range/prefix/equals queries... but even as best/simple as I can make this class it still feels way too weird/heavy/external/uncomittable.

Maybe we should just go back to the "always index auto-prefix terms on DOCS_ONLY fields" even though 1) I had to then choose "weaker" defaults (less added index size; less performance gains), and 2) it's a total waste to add such terms to NumericFields and probably spatial fields which already build their own prefixes outside of Lucene. This is not a great solution either...

asfimport commented 10 years ago

Robert Muir (@rmuir) (migrated from JIRA)

> but the nasty downside of all this freedom is that new, complex features like this one, which offer powerful improvements to the default codec that 99% of Lucene users would have used, must either be implemented across the board for all codecs (a very tall order) in order to have an intuitive API, or must be exposed only via ridiculously expert codec-specific APIs.

I don't think it's a downside of the freedom, it's just other problems.

However, there are way, way too many experimental codecs. These are even more costly to maintain than backwards ones in some ways: they are rotated in all tests! For many recent changes I have spent just as much time fixing those as I have on backwards codecs. If we ever want to provide backwards compatibility for experimental codecs (which seems to confuse users constantly that we can't do this), then we have to tone them down anyway.

The existing trie encoding is difficult to use, too. I don't think it should serve as your example for this feature. Remember that simple numeric range queries don't work with QP without the user doing subclassing, and numerics don't really work well with the parser at all because the analyzer is completely unaware of them (because, for some crazy reason, it is implemented as a tokenstream/special fields rather than being a more "ordinary" analysis chain integration).

The .document API is overengineered. I don't understand why it needs to be so complicated. Because of it taking on more than it can chew already, it's impossible to even think about how it could work with the codec API: and I think this is causing a lot of your frustration.

The whole way that "Lucene is schemaless" is fucking bogus, and only means that it's on you, the user, to record and manage and track all this stuff yourself. It's no freedom to anyone, just pain. For example, we don't even know which fields have trie-encoded terms here, to do any kind of nice migration strategy from "old numerics" to this at all. That's really sad and will cause users just more pain and confusion.

FieldInfo is a hardcore place to add an experimental option when we aren't even sure how it should behave yet (e.g. should it really be limited to DOCS_ONLY? who knows?)

I can keep complaining too, we can rant about this stuff on this issue, but maybe you should commit what you have (yes, with the crappy hard-to-use codec option) so we can try to do something on another issue instead.

asfimport commented 10 years ago

Michael McCandless (@mikemccand) (migrated from JIRA)

I think we can salvage parts of this patch by crumbling it up and breaking out separately committable pre-cursors ... I opened a follow-up issue to make it easier to index a single byte[] token.

After that maybe we could break out the CheckIndex improvements for testing the Codec's impl of .intersect, adding Automata.makeBinaryRange.

But there's still a huge gaping API hole, which is how the app can "easily"/"cleanly" notify Lucene that it intends to do range filtering on a given field: this is still a hard open question.

asfimport commented 9 years ago

Michael McCandless (@mikemccand) (migrated from JIRA)

I think for now we should simply commit auto-prefix as a dark feature (my last patch, which I hate!) ... progress not perfection.

I.e. it's available only as an optional, disabled by default, postings format, and if you want to use it in your app you must use PerFieldPF to enable it for certain fields. You'll have to figure out how to index byte[] tokens, etc.

I modernized the last patch (tons of stuff had changed) and carried over a fix for an issue I hit in #7067. Patch applies to trunk, and bumps block tree's index format.

With auto-prefix, numeric fields can simply be indexed as their obvious byte[] encoding and you just use TermRangeQuery at search time; TestAutoPrefix shows this.

There are maybe some problems with the patch: is it horrible to store CompiledAutomaton on PrefixQuery/TermRangeQuery because of query caching...? Should I instead recompute it for every segment in .getTermsEnum? Or store it on the weight? Hmm or in a shared attribute, like FuzzyQuery (what a hack)?

It allocates one FixedBitSet(maxDoc) at write time, per segment, to hold all docs matching each auto-prefix term ... maybe that's too costly? I could switch to more sparse impls (roaring, sparse, BitDocIdSet.Builder?) but I suspect typically we will require fairly dense bitsets anyway for the short prefixes. We end up OR'ing many terms together at write time...

I created a FixedBitPostingsEnum, FixedBitTermsEnum, both package private under oal.index, so I can send the bit set to PostingsConsumer at write time. Maybe there's a cleaner way?

Maybe the changes should be moved to lucene/misc or lucene/codecs, not core? But this would mean yet another fork of block tree...

It only works for IndexOptions.DOCS fields; I think that's fine?

The added auto-prefix terms are not seen by normal postings APIs, they do not affect index stats, etc. They only kick in, as an implementation detail, when you call Terms.intersect(Automaton). The returned TermsEnum.term() can return an auto-prefix term, but #7000 improves this since we now use MultiTermQueryConstantScoreWrapper by default.

asfimport commented 9 years ago

Robert Muir (@rmuir) (migrated from JIRA)

> There are maybe some problems with the patch: is it horrible to store CompiledAutomaton on PrefixQuery/TermRangeQuery because of query caching...? Should I instead recompute it for every segment in .getTermsEnum? Or store it on the weight? Hmm or in a shared attribute, like FuzzyQuery (what a hack)?

Why would this be an issue? It's still index-independent, e.g. AutomatonQuery does this too.

> It allocates one FixedBitSet(maxDoc) at write time, per segment, to hold all docs matching each auto-prefix term ... maybe that's too costly? I could switch to more sparse impls (roaring, sparse, BitDocIdSet.Builder?) but I suspect typically we will require fairly dense bitsets anyway for the short prefixes. We end up OR'ing many terms together at write time...

If you use BitDocIdSet.Builder, I think it works well either way. SparseFixedBitSet also has optimized or(DISI).

> It only works for IndexOptions.DOCS fields; I think that's fine?

Yes, I think so, since the user gets an exception if they screw this up.

> I created a FixedBitPostingsEnum, FixedBitTermsEnum, both package private under oal.index, so I can send the bit set to PostingsConsumer at write time. Maybe there's a cleaner way?

I don't understand why these need to be tied to FixedBitSet. Seems like they can use the more generic BitSet API at least (nothing fixed-specific about them).

> Maybe the changes should be moved to lucene/misc or lucene/codecs, not core? But this would mean yet another fork of block tree...

Alternatively, the code could stay where it is. Lucene50PF wires in zeros and doesn't have any options for now. In codecs/ we could have an AutoPrefixPF that exposes it and make it experimental or something? This way, when we feel comfortable, we can "expose" it in the default index format by adding ctors there and removing the experimental one.

asfimport commented 9 years ago

Robert Muir (@rmuir) (migrated from JIRA)

> The added auto-prefix terms are not seen by normal postings APIs, they do not affect index stats, etc. They only kick in, as an implementation detail, when you call Terms.intersect(Automaton). The returned TermsEnum.term() can return an auto-prefix term, but #7000 improves this since we now use MultiTermQueryConstantScoreWrapper by default.

I can't parse this. If you use a scoring rewrite, it still "works" right? It's just that the generated TermQueries will contain pseudo-terms, but their statistics etc. are all correct?

asfimport commented 9 years ago

Robert Muir (@rmuir) (migrated from JIRA)

Can you fill in more detail about CompiledAutomaton.PREFIX?

I definitely understand the RANGE case, it's difficult to make the equivalent automaton. But for the PREFIX case, it's trivial. Why not just make PrefixQuery subclass AutomatonQuery?

asfimport commented 9 years ago

Michael McCandless (@mikemccand) (migrated from JIRA)

Thanks Rob.

> If you use BitDocIdSet.Builder, I think it works well either way. SparseFixedBitSet also has optimized or(DISI).

I coded this up, but then this Builder is angry because FreqProxDocsEnum throws UOE from its cost method, because this is actually not easy to implement (we don't track docFreq for each term inside IW's RAM buffer). I think for now we should just keep reusing a single FBS?

Looking at BitDocIdSet.Builder, it looks like it could do better here? E.g. use this cost as just an approximation, but then, since it had to .or() the actual set bits, record that as the "true" cost?

asfimport commented 9 years ago

Michael McCandless (@mikemccand) (migrated from JIRA)

> I don't understand why these need to be tied to FixedBitSet

I'll cutover to BitSet.

> Alternatively, the code could stay where it is

I'll leave it as is (dark addition to BlockTree).

> In codecs/ we could have an AutoPrefixPF that exposes it and make it experimental or something?

Good idea! I'll do this... this way Lucene50PF is unchanged.

> I can't parse this. If you use a scoring rewrite, it still "works" right?

It does "work" (you get the right hits) ... TestPrefixQuery/TestTermRangeQuery randomly use SCORING_BOOLEAN_REWRITE and CONSTANT_SCORE_BOOLEAN_REWRITE.

> It's just that the generated TermQueries will contain pseudo-terms, but their statistics etc. are all correct?

Right: they will use the auto-prefix terms, which have "correct" stats (i.e. docFreq is the number of docs containing any term with this prefix). Is this too weird? It means you get different scores than you get today...

We could maybe turn off auto-prefix if you use these rewrite methods? But this would need an API change to Terms, e.g. a new boolean allowAutoPrefix to Terms.intersect.

> I definitely understand the RANGE case, it's difficult to make the equivalent automaton.

It's not so bad; I added Operations.makeBinaryInterval in the patch for this. It's like the decimal ranges that Automata.makeInterval already does.

> Why not just make PrefixQuery subclass AutomatonQuery?

I explored this, but it turns out to be tricky for those PFs that don't have auto-prefix terms (i.e. don't use block tree)...

I.e., with the patch as it is now, PFs like SimpleText will use a PrefixTermsEnum for PrefixQuery, but if I fix PrefixQuery to subclass AutomatonQuery (and remove AUTOMATON_TYPE.PREFIX) then SimpleText would use AutomatonTermsEnum (on a prefix automaton), which I think will be somewhat less efficient? Maybe it's not so bad in practice? ATE would realize it's in a "linear" part of the automaton...

Maybe we can somehow simplify things here ... I agree both PrefixQuery and TermRangeQuery should ideally just subclass AutomatonQuery.