apache / lucene

Apache Lucene open-source search software
https://lucene.apache.org/
Apache License 2.0

Reduce size of FSTs due to use of direct-addressing encoding [LUCENE-8920] #9963

Closed asfimport closed 4 years ago

asfimport commented 5 years ago

Some data can lead to worst-case ~4x RAM usage due to this optimization. Several ideas were suggested to combat this on the mailing list:

I think we can improve the situation here by tracking, per-FST instance, the size increase we're seeing while building (or perhaps do a preliminary pass before building) in order to decide whether to apply the encoding.

we could also make the encoding a bit more efficient. For instance I noticed that arc metadata is pretty large in some cases (in the 10-20 byte range), which makes gaps very costly. Associating each label with a dense id and having an intermediate lookup, ie. lookup label -> id and then id -> arc offset instead of doing label -> arc directly, could save a lot of space in some cases? Also it seems that we are repeating the label in the arc metadata when array-with-gaps is used, even though it shouldn't be necessary since the label is implicit from the address?
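A toy sketch of that intermediate-lookup idea (hypothetical layout and names, nothing like the actual FST byte format): the direct-addressed label table only holds small dense ids, so each gap costs one short slot instead of a full 10-20 byte arc-metadata slot.

```java
// Hypothetical two-level index: label -> dense id (direct-addressed, cheap
// gaps) and dense id -> arc offset (no gaps at all).
public class TwoLevelArcIndex {
  private final int minLabel;
  private final short[] labelToId; // direct-addressed over the label range; -1 marks a gap
  private final long[] idToOffset; // dense: one entry per existing arc

  TwoLevelArcIndex(int minLabel, short[] labelToId, long[] idToOffset) {
    this.minLabel = minLabel;
    this.labelToId = labelToId;
    this.idToOffset = idToOffset;
  }

  /** Returns the arc offset for this label, or -1 if there is no such arc. */
  long offsetForLabel(int label) {
    int slot = label - minLabel;
    if (slot < 0 || slot >= labelToId.length) return -1;
    short id = labelToId[slot];          // label -> dense id
    return id < 0 ? -1 : idToOffset[id]; // dense id -> arc offset
  }
}
```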


Migrated from LUCENE-8920 by Michael Sokolov (@msokolov), 1 vote, resolved Nov 15 2019
Attachments: TestTermsDictRamBytesUsed.java
Linked issues:

asfimport commented 5 years ago

ASF subversion and git services (migrated from JIRA)

Commit d8b510bead86d4c6ec59063519894d207ee99d5e in lucene-solr's branch refs/heads/branch_8_2 from Michael Sokolov https://gitbox.apache.org/repos/asf?p=lucene-solr.git;h=d8b510b

LUCENE-8920: disable FST direct-addressing pending size reduction; revert to FST version 6; removed CHANGES entry

asfimport commented 5 years ago

Chris M. Hostetter (@hossman) (migrated from JIRA)

@msokolov - your revert on branch_8_2 seems to have broken most of the lucene/analysis/kuromoji tests with a common root cause...

  [junit4] ERROR   0.44s J0 | TestFactories.test <<<
   [junit4]    > Throwable #1: java.lang.ExceptionInInitializerError
   [junit4]    >    at __randomizedtesting.SeedInfo.seed([B1B94D34D92CDA93:39ED72EE77D0B76B]:0)
   [junit4]    >    at org.apache.lucene.analysis.ja.dict.TokenInfoDictionary.getInstance(TokenInfoDictionary.java:62)
   [junit4]    >    at org.apache.lucene.analysis.ja.JapaneseTokenizer.<init>(JapaneseTokenizer.java:215)
   [junit4]    >    at org.apache.lucene.analysis.ja.JapaneseTokenizerFactory.create(JapaneseTokenizerFactory.java:150)
   [junit4]    >    at org.apache.lucene.analysis.ja.JapaneseTokenizerFactory.create(JapaneseTokenizerFactory.java:82)
   [junit4]    >    at org.apache.lucene.analysis.ja.TestFactories$FactoryAnalyzer.createComponents(TestFactories.java:174)
   [junit4]    >    at org.apache.lucene.analysis.Analyzer.tokenStream(Analyzer.java:199)
   [junit4]    >    at org.apache.lucene.analysis.BaseTokenStreamTestCase.checkResetException(BaseTokenStreamTestCase.java:427)
   [junit4]    >    at org.apache.lucene.analysis.BaseTokenStreamTestCase.checkRandomData(BaseTokenStreamTestCase.java:546)
   [junit4]    >    at org.apache.lucene.analysis.ja.TestFactories.doTestTokenizer(TestFactories.java:81)
   [junit4]    >    at org.apache.lucene.analysis.ja.TestFactories.test(TestFactories.java:60)
   [junit4]    >    at java.lang.Thread.run(Thread.java:748)
   [junit4]    > Caused by: java.lang.RuntimeException: Cannot load TokenInfoDictionary.
   [junit4]    >    at org.apache.lucene.analysis.ja.dict.TokenInfoDictionary$SingletonHolder.<clinit>(TokenInfoDictionary.java:71)
   [junit4]    >    ... 46 more
   [junit4]    > Caused by: org.apache.lucene.index.IndexFormatTooNewException: Format version is not supported (resource org.apache.lucene.store.InputStreamDataInput@5f0dbb2f): 7 (needs to be between 6 and 6)
   [junit4]    >    at org.apache.lucene.codecs.CodecUtil.checkHeaderNoMagic(CodecUtil.java:216)
   [junit4]    >    at org.apache.lucene.codecs.CodecUtil.checkHeader(CodecUtil.java:198)
   [junit4]    >    at org.apache.lucene.util.fst.FST.<init>(FST.java:275)
   [junit4]    >    at org.apache.lucene.util.fst.FST.<init>(FST.java:263)
   [junit4]    >    at org.apache.lucene.analysis.ja.dict.TokenInfoDictionary.<init>(TokenInfoDictionary.java:47)
   [junit4]    >    at org.apache.lucene.analysis.ja.dict.TokenInfoDictionary.<init>(TokenInfoDictionary.java:54)
   [junit4]    >    at org.apache.lucene.analysis.ja.dict.TokenInfoDictionary.<init>(TokenInfoDictionary.java:32)
   [junit4]    >    at org.apache.lucene.analysis.ja.dict.TokenInfoDictionary$SingletonHolder.<clinit>(TokenInfoDictionary.java:69)
   [junit4]    >    ... 46 more

...perhaps due to "conflicting reverts" w/ #9950 / #9822 ? /cc @mocobeta

asfimport commented 5 years ago

Tomoko Uchida (@mocobeta) (migrated from JIRA)

FYI: This is not related to my revert. I think the kuromoji dictionary data (o.a.l.a.ja.dict.TokenInfoDictionary$fst.dat) should also be reverted (regenerated) when the FST version is reverted. This works for me:

$ ant -f lucene/analysis/kuromoji/build.xml regenerate
$ git status
Changes not staged for commit:
  (use "git add <file>..." to update what will be committed)
  (use "git checkout -- <file>..." to discard changes in working directory)

    modified:   lucene/analysis/kuromoji/src/resources/org/apache/lucene/analysis/ja/dict/TokenInfoDictionary$fst.dat
$ ant -f lucene/analysis/kuromoji/build.xml test
BUILD SUCCESSFUL
Total time: 33 seconds

asfimport commented 5 years ago

Michael Sokolov (@msokolov) (migrated from JIRA)

Yes, that makes sense. Because we reverted the "current version" in FST.java, we can no longer read FSTs created with the newer version, so we need to revert the dictionary file. I'll do that and run a full suite of tests just to make sure something else isn't still broken. Thanks for pointing this out, @hossman and finding the fix @mocobeta, and sorry for not being more careful with the "fix" the first time!

asfimport commented 5 years ago

Michael Sokolov (@msokolov) (migrated from JIRA)

Note: I pushed the old-format Kuromoji dictionary and it seems to have fixed the build

asfimport commented 5 years ago

Michael Sokolov (@msokolov) (migrated from JIRA)

Before digging in earnest into FST size reduction, I'd like to tighten up the FST.Arc contract. Right now it has all public members and no methods to speak of, so the abstraction boundary is not well defined, and in fact we see consumers modifying Arc members in a few places outside the FST class itself. This makes it harder to reason about the code and make provably valid changes. My plan is to do some nonfunctional commits:

  1. Add accessors (mostly getters, a few setters will be needed temporarily) to Arc, and make all of its members private. It seems as if we often write accessors with the same name as the member (rather than the bean standard), so I'll go with that.
  2. Eliminate the setters; this will require some light refactoring in FSTEnum, and a few changes to the memory codec, which keeps a list of Arcs locally and updates them for its own purposes.
  3. Some refactoring and general cleanup (tightening up access, whitespace fixes, etc)
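Concretely, step 1 could look roughly like this (a sketch only; the real FST.Arc has many more members): fields become private, with accessors named after the member rather than bean-style getFlags()/getLabel().

```java
// Illustrative shape of an encapsulated Arc; not the actual Lucene class.
public class Arc {
  private byte flags;
  private int label;

  Arc(byte flags, int label) {
    this.flags = flags;
    this.label = label;
  }

  // Accessors reuse the member name, so call sites change
  // from arc.flags to arc.flags() with minimal churn.
  public byte flags() { return flags; }
  public int label() { return label; }
}
```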

Because that first step is going to touch a lot of files, I'll keep it strictly about introducing the accessors, so there won't be anything beyond changing things like arc.flags to arc.flags(), in a lot of places.

Once these changes are in, the fun can begin again :) I'll add Adrien's worst-case test and work on getting the size down for that, pursuing the ideas in the description.

asfimport commented 5 years ago

Adrien Grand (@jpountz) (migrated from JIRA)

This sounds good Mike. I'm making it a blocker for 8.3 since we haven't reverted from branch_8x.

asfimport commented 5 years ago

Mike Sokolov (@msokolov) (migrated from JIRA)

I'm making it a blocker for 8.3 since we haven't reverted from branch_8x.

Fair enough. I have the commits prepared for the refactoring steps, but I foolishly did them on a home PC that I cannot access right now; will post tonight.

asfimport commented 5 years ago

ASF subversion and git services (migrated from JIRA)

Commit 760f2dbdcb29b993aab8f981d84ccbf2e20e9fa5 in lucene-solr's branch refs/heads/master from Michael Sokolov https://gitbox.apache.org/repos/asf?p=lucene-solr.git;h=760f2dbd

LUCENE-8920: encapsulate FST.Arc data

asfimport commented 5 years ago

ASF subversion and git services (migrated from JIRA)

Commit fe0c042470dc1a1ba7ffd27f91ac7bc96c3254a0 in lucene-solr's branch refs/heads/master from Michael Sokolov https://gitbox.apache.org/repos/asf?p=lucene-solr.git;h=fe0c042

LUCENE-8920: remove Arc setters, moving implementations into Arc, or copying data into consumers

asfimport commented 5 years ago

ASF subversion and git services (migrated from JIRA)

Commit 92d4e712d5d50d745c5a6c10dacda66198974116 in lucene-solr's branch refs/heads/master from Michael Sokolov https://gitbox.apache.org/repos/asf?p=lucene-solr.git;h=92d4e71

LUCENE-8920: refactor FST binary search

asfimport commented 5 years ago

ASF subversion and git services (migrated from JIRA)

Commit d1706b36babda2dc17fd82d3165607f5c44bd83e in lucene-solr's branch refs/heads/master from Michael Sokolov https://gitbox.apache.org/repos/asf?p=lucene-solr.git;h=d1706b3

LUCENE-8920: Fix bug preventing FST duplicate tails from being shared when encoded as array-with-gaps

asfimport commented 5 years ago

ASF subversion and git services (migrated from JIRA)

Commit a929ac5dacba1f36c88f78412520ea1cd5f6ba1e in lucene-solr's branch refs/heads/master from Noble Paul https://gitbox.apache.org/repos/asf?p=lucene-solr.git;h=a929ac5

LUCENE-8920: precommit errors

asfimport commented 5 years ago

Michael Sokolov (@msokolov) (migrated from JIRA)

@noblepaul thanks for fixing - I thought I had been watching the mailing list and would have seen any build fails from this, but somehow I did not!

asfimport commented 5 years ago

ASF subversion and git services (migrated from JIRA)

Commit a929ac5dacba1f36c88f78412520ea1cd5f6ba1e in lucene-solr's branch refs/heads/jira/SOLR-13650 from Noble Paul https://gitbox.apache.org/repos/asf?p=lucene-solr.git;h=a929ac5

LUCENE-8920: precommit errors

asfimport commented 5 years ago

ASF subversion and git services (migrated from JIRA)

Commit a929ac5dacba1f36c88f78412520ea1cd5f6ba1e in lucene-solr's branch refs/heads/SOLR-13105-visual from Noble Paul https://gitbox.apache.org/repos/asf?p=lucene-solr.git;h=a929ac5

LUCENE-8920: precommit errors

asfimport commented 5 years ago

Bruno Roustant (@bruno-roustant) (migrated from JIRA)

@msokolov  There may be another option to speed-up FST arc lookup while limiting the memory increase.

The Direct-Addressing option looks up a label with a single direct access, and costs up to (num labels x 4 x num bytes to encode) bytes.

The Label-List option is the opposite: lookup needs on average N/2 label comparisons, and costs (num labels x var bytes to encode) bytes.

 

Another option is to use open-addressing. Lookup would take <= L comparisons, where we can fix L < log(N)/2 (to be faster than binary search), and would cost < (num labels x 2 x num bytes to encode) bytes.

The idea is to have an array of size 2^p such that 2^(p-1) < N < 2^p. We hash the labels and store them in the array using the open-addressing idea: if a slot is occupied, try the next one. If we can't store a label in fewer than L tries, then abort and fall back to the Label-List or Binary-Search option. At lookup we hash the input label and know that we have fewer than L slots to compare.

This is another compromise speed/memory: faster than binary search (constant L), with at least 2x less memory than Direct-Addressing.
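A minimal sketch of this bounded-probe open-addressing idea (hypothetical code, not from Lucene; the hash function, class, and method names are made up): insertion gives up after a fixed number of probes, signalling the builder to fall back to another encoding.

```java
import java.util.Arrays;

// Open-addressing table with linear probing and a hard probe bound ("L").
public class BoundedProbeTable {
  private final int[] slots;   // -1 marks an empty slot; labels are 0..255
  private final int maxProbes; // the bound L on slot comparisons

  BoundedProbeTable(int powerOfTwoSize, int maxProbes) {
    slots = new int[powerOfTwoSize]; // size must be a power of two
    Arrays.fill(slots, -1);
    this.maxProbes = maxProbes;
  }

  /** Returns false if the label could not be placed within maxProbes probes. */
  boolean insert(int label) {
    int mask = slots.length - 1;
    int slot = hash(label) & mask;
    for (int i = 0; i < maxProbes; i++) {
      if (slots[slot] == -1) {
        slots[slot] = label;
        return true;
      }
      slot = (slot + 1) & mask; // linear probing: try the next slot
    }
    return false; // abort: caller falls back to list / binary search
  }

  boolean contains(int label) {
    int mask = slots.length - 1;
    int slot = hash(label) & mask;
    for (int i = 0; i < maxProbes; i++) {
      if (slots[slot] == label) return true;
      if (slots[slot] == -1) return false; // empty slot: can stop early
      slot = (slot + 1) & mask;
    }
    return false;
  }

  private static int hash(int label) {
    int h = label * 0x9E3779B1; // arbitrary multiplicative mix
    return h ^ (h >>> 16);
  }
}
```

The early exit on an empty slot is why lookups usually cost fewer than L comparisons in practice.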

On the Binary-Search side, it could be possible to support variable length encoding, by finding the first byte starting a label based on the bit used to encode the var length additional bytes.

asfimport commented 5 years ago

Adrien Grand (@jpountz) (migrated from JIRA)

This is a cool idea.

asfimport commented 5 years ago

Bruno Roustant (@bruno-roustant) (migrated from JIRA)

Under some heuristics, Direct-Addressing is clearly the right choice. For example, if num labels / (max label - min label) >= 75%.

asfimport commented 5 years ago

Michael Sokolov (@msokolov) (migrated from JIRA)

I like this! I would be happy to review if you want to post a patch. I may try eventually too if you don't get to it. I hope it should be a bit easier to try now that we have done some refactoring here, too. One potential complication is we are running out of bits to signal different encodings, but I think there should be one or two left?

asfimport commented 5 years ago

Bruno Roustant (@bruno-roustant) (migrated from JIRA)

I'd love to work on that, but I'm pretty busy so I can't start immediately. If you can start on it soon I'll be happy to help and review.

I'll try to think more about the subject. Where should I post my remarks/ideas? Here in the thread or in an attached doc?

Some additional thoughts:

asfimport commented 5 years ago

Michael Sokolov (@msokolov) (migrated from JIRA)

If I understand you correctly, T1 is the threshold we introduced earlier this year (or its inverse DIRECT_ARC_LOAD_FACTOR in fst.Builder). It's currently set to 4, or (1/4 as T1 in your formulation). There was pre-existing logic to decide (var-encoded) list vs. the (fixed-size, packed) array encoding; my change was piggy-backed on that. It's a threshold on N that depends on the depth in the FST. See FST.shouldExpand.

If you want to write up the open addressing idea in more detail, it's fine to add comments here unless you think they are too long / inconvenient to write in this form, then maybe attach a doc? I think that goes directly to the point of reducing space consumption, so this issue seems like a fine place for it.

asfimport commented 5 years ago

Bruno Roustant (@bruno-roustant) (migrated from JIRA)

I tried a quick benchmark to evaluate the load factor (num labels / array size) we can expect for open-addressing. This gives us a good estimate of the space requirement.

This load factor depends on the number of tries (called L above) we accept, which defines the worst-case perf. For each candidate array size, the benchmark attempts various L and outputs the max load factor (above this load factor, the open-addressing aborts and we fall back to binary search); the benchmark output is in the next post below.

Outcome:

This means we have to expect a space increase of 2x (compared to binary search) to get better perf than binary search (L = log(N)/2 + 1 is the worst case; most of the time the open-addressing stops earlier, on encountering an empty slot before L probes).

 

To me this balance between space and performance cannot be hardcoded. This depends on the use-case. There should be a balance tuning parameter in the FST constructor (e.g. max-perf, perf-over-space, space-over-perf, min-space). And based on this balance we could set the values of a couple of thresholds that define when to use direct-addressing, open-addressing, binary-search, compact-list.

asfimport commented 5 years ago

Bruno Roustant (@bruno-roustant) (migrated from JIRA)

Open-addressing benchmark to store byte labels in an array of size between 8 and 256.

"Array" means array size.

"tries" means max number of slot comparisons allowed (when a slot we want to put a label to is not empty - move to the next slot).

"loadFactor" = max load factor (num labels / array size) for which we can add all the labels while respecting "tries". Above this max load factor, the open-addressing fails to store all labels and falls back to binary search. This max load factor is an average over 1000 random constructions done by the benchmark.

"memory" = space increase factor = 1 / loadFactor.
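The benchmark code itself isn't posted; my rough reconstruction of what it could look like, under stated assumptions (linear probing, byte labels, an arbitrary hash — all of these are my choices, not necessarily what was measured):

```java
import java.util.Arrays;
import java.util.Random;

// Measures the average load factor reached before some insertion needs
// more than "tries" probes, averaged over many random constructions.
public class LoadFactorBench {
  static double avgMaxLoadFactor(int arraySize, int tries, int trials, long seed) {
    Random random = new Random(seed);
    double sum = 0;
    for (int t = 0; t < trials; t++) {
      int[] slots = new int[arraySize];
      Arrays.fill(slots, -1);          // -1 marks an empty slot
      boolean[] used = new boolean[256];
      int inserted = 0;
      while (inserted < arraySize) {
        int label;
        do { label = random.nextInt(256); } while (used[label]); // distinct byte labels
        used[label] = true;
        if (!insert(slots, label, tries)) break; // first failure ends the construction
        inserted++;
      }
      sum += (double) inserted / arraySize;
    }
    return sum / trials;
  }

  private static boolean insert(int[] slots, int label, int tries) {
    int slot = ((label * 0x9E3779B1) >>> 24) % slots.length; // illustrative hash
    for (int i = 0; i < tries; i++) {
      if (slots[slot] == -1) { slots[slot] = label; return true; }
      slot = (slot + 1) % slots.length; // linear probing
    }
    return false; // could not place within "tries" probes
  }
}
```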

 

Array=8, tries=1, loadFactor=0.400125, memory x 2.499219
Array=8, tries=2, loadFactor=0.63825, memory x 1.5667841
Array=8, tries=3, loadFactor=0.78675, memory x 1.2710518
Array=12, tries=1, loadFactor=0.34391674, memory x 2.9076805
Array=12, tries=2, loadFactor=0.54183316, memory x 1.8455865
Array=12, tries=3, loadFactor=0.6964172, memory x 1.4359208
Array=16, tries=1, loadFactor=0.29825, memory x 3.352892
Array=16, tries=2, loadFactor=0.5039375, memory x 1.9843731
Array=16, tries=3, loadFactor=0.646625, memory x 1.5464914
Array=16, tries=4, loadFactor=0.7493125, memory x 1.3345567
Array=24, tries=1, loadFactor=0.25141662, memory x 3.9774618
Array=24, tries=2, loadFactor=0.44929162, memory x 2.225726
Array=24, tries=3, loadFactor=0.58083373, memory x 1.7216631
Array=24, tries=4, loadFactor=0.6867924, memory x 1.4560441
Array=32, tries=1, loadFactor=0.22528125, memory x 4.4388957
Array=32, tries=2, loadFactor=0.4059375, memory x 2.4634335
Array=32, tries=3, loadFactor=0.53625, memory x 1.8648019
Array=32, tries=4, loadFactor=0.63653123, memory x 1.5710148
Array=32, tries=5, loadFactor=0.7165625, memory x 1.3955517
Array=48, tries=1, loadFactor=0.18899995, memory x 5.2910066
Array=48, tries=2, loadFactor=0.36193773, memory x 2.762906
Array=48, tries=3, loadFactor=0.49154142, memory x 2.0344167
Array=48, tries=4, loadFactor=0.5871248, memory x 1.7032154
Array=48, tries=5, loadFactor=0.6700212, memory x 1.4924902
Array=64, tries=1, loadFactor=0.17101562, memory x 5.8474193
Array=64, tries=2, loadFactor=0.33707812, memory x 2.9666712
Array=64, tries=3, loadFactor=0.46217188, memory x 2.1636972
Array=64, tries=4, loadFactor=0.56409377, memory x 1.7727549
Array=64, tries=5, loadFactor=0.6392813, memory x 1.5642567
Array=64, tries=6, loadFactor=0.6963125, memory x 1.4361368
Array=96, tries=1, loadFactor=0.15294789, memory x 6.5381746
Array=96, tries=2, loadFactor=0.31204194, memory x 3.2046974
Array=96, tries=3, loadFactor=0.42829168, memory x 2.3348575
Array=96, tries=4, loadFactor=0.5277291, memory x 1.8949116
Array=96, tries=5, loadFactor=0.60920805, memory x 1.6414753
Array=96, tries=6, loadFactor=0.66278124, memory x 1.5087935
Array=128, tries=1, loadFactor=0.15130469, memory x 6.6091805
Array=128, tries=2, loadFactor=0.3054219, memory x 3.2741597
Array=128, tries=3, loadFactor=0.43253124, memory x 2.3119717
Array=128, tries=4, loadFactor=0.5283203, memory x 1.8927912
Array=128, tries=5, loadFactor=0.6043203, memory x 1.6547517
Array=128, tries=6, loadFactor=0.66267186, memory x 1.5090425
Array=128, tries=7, loadFactor=0.70567185, memory x 1.4170892
Array=192, tries=1, loadFactor=0.14355215, memory x 6.9661093
Array=192, tries=2, loadFactor=0.32579714, memory x 3.0693946
Array=192, tries=3, loadFactor=0.42888057, memory x 2.3316514
Array=192, tries=4, loadFactor=0.5238486, memory x 1.9089485
Array=192, tries=5, loadFactor=0.59191173, memory x 1.6894411
Array=192, tries=6, loadFactor=0.65411395, memory x 1.5287856
Array=192, tries=7, loadFactor=0.6975258, memory x 1.4336387
Array=256, tries=1, loadFactor=1.0, memory x 1.0
Array=256, tries=2, loadFactor=1.0, memory x 1.0
Array=256, tries=3, loadFactor=1.0, memory x 1.0
Array=256, tries=4, loadFactor=1.0, memory x 1.0
Array=256, tries=5, loadFactor=1.0, memory x 1.0
Array=256, tries=6, loadFactor=1.0, memory x 1.0
Array=256, tries=7, loadFactor=1.0, memory x 1.0
Array=256, tries=8, loadFactor=1.0, memory x 1.0

asfimport commented 5 years ago

Michael Sokolov (@msokolov) (migrated from JIRA)

This is cool. Regarding the strategy for which encoding to apply, I'll just call out the current heuristics:

doFixedArray = (node depth (distance from root node) <= 3 and N >= 5) or N >= 10

writeDirectly = doFixedArray && (max_label - min_label) < 4 * N
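Transcribed literally (method and parameter names are mine; the real logic lives in FST.shouldExpand and the Builder):

```java
// Sketch of the current encoding heuristics quoted above.
public class ArcEncodingHeuristics {
  /** Use the fixed-size packed array encoding (vs the var-encoded list). */
  static boolean doFixedArray(int depth, int n) {
    return (depth <= 3 && n >= 5) || n >= 10;
  }

  /** Additionally use direct addressing when the label range is dense enough. */
  static boolean writeDirectly(int depth, int n, int minLabel, int maxLabel) {
    return doFixedArray(depth, n) && (maxLabel - minLabel) < 4 * n;
  }
}
```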

 

I think we can still consider that we would apply list-encoding for small N, and consider open addressing as a variant within "doFixedArray", where we can now choose among direct addressing (for the lowest load factors), open addressing (for the intermediate case), and binary search (for the highest). Does that sound right?

 

I wonder if we could work backwards from a single parameter L: maximum memory cost (vs list encoding). The API would guarantee that no set of arcs is ever encoded using more than L * the minimum possible, and then internally we choose the best (ie fastest lookup) encoding that achieves that, possibly with some tweak for higher-order arcs (ie near the root).

asfimport commented 5 years ago

Bruno Roustant (@bruno-roustant) (migrated from JIRA)

list-encoding for small N, and consider open addressing as a variant within "doFixedArray"

Yes, sounds right.

single parameter: maximum memory cost [...] never encoded using more than L * the min possible

The min space possible is obtained with every node's transitions encoded as a list with variable-length encoding. I don't see a way to determine the min space possible (except by encoding it, which would cost too much for the benefit) in order to give guarantees. It is however possible to compare to the "fixed array" encoding, but in this case we don't have a clear picture to give in the API contract.

I like the idea of a single parameter to work with. That's why I was thinking of a slider between most-compact to most-performant. In this case we don't give guarantees but best effort intent.

My next step is to work on the heuristics. Once we have that clear, we can start implementation.

asfimport commented 5 years ago

Bruno Roustant (@bruno-roustant) (migrated from JIRA)

Here is a proposal for the heuristic to select the encoding of a FST node.

The idea is to have threshold values that vary according to a parameter, let's call it "timeSpaceBalance". timeSpaceBalance can have 4 values: MORE_COMPACT, COMPACT (default, current FST), FAST, FASTER. Keep the current FST encoding/behavior for COMPACT. Only try open-addressing encoding for the FAST or FASTER balance. Be very demanding about direct-addressing for COMPACT or MORE_COMPACT. Do we need the MORE_COMPACT mode? Even if I don't see the use-case now, since it's easy and does not involve more code, I would say yes.

4 rules, one per possible encoding, ordered from top to bottom. The first encoding which condition matches is selected.

n: number of labels (num sub-nodes)
depth: depth of the node in the tree

[list-encoding] if n <= L1 || (depth >= L2 && n <= L3)

[direct-addressing] if n / (max label - min label) >= D1

[try open-addressing] if depth <= O1 || n >= O2

[binary search] otherwise

And below are the threshold values for each timeSpaceBalance:

timeSpaceBalance = MORE_COMPACT (memory < x1): L1 = 6, L2 = 4, L3 = 11, D1 = 0.8, O1 = -1, O2 = infinite

timeSpaceBalance = COMPACT (memory x1): L1 = 4, L2 = 4, L3 = 9, D1 = 0.66, O1 = -1, O2 = infinite

timeSpaceBalance = FAST (memory x2 ?): L1 = 4, L2 = 4, L3 = 7, D1 = 0.5, O1 = 3, O2 = 10

timeSpaceBalance = FASTER (memory x3 ?): L1 = 3, L2 = 4, L3 = 5, D1 = 0.33, O1 = infinite, O2 = 0
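The four rules read top to bottom, with (for illustration) the COMPACT thresholds plugged in, could be sketched like this (the enum and method names are hypothetical, not proposed API):

```java
// Sketch of the proposed per-node encoding selection with COMPACT thresholds.
public class EncodingSelector {
  enum Encoding { LIST, DIRECT_ADDRESSING, OPEN_ADDRESSING, BINARY_SEARCH }

  static Encoding select(int n, int depth, int minLabel, int maxLabel) {
    // COMPACT: L1=4, L2=4, L3=9, D1=0.66, O1=-1, O2=infinite
    final int L1 = 4, L2 = 4, L3 = 9, O1 = -1;
    final double D1 = 0.66;
    final int O2 = Integer.MAX_VALUE; // "infinite": never triggers
    if (n <= L1 || (depth >= L2 && n <= L3)) return Encoding.LIST;
    if ((double) n / (maxLabel - minLabel) >= D1) return Encoding.DIRECT_ADDRESSING;
    if (depth <= O1 || n >= O2) return Encoding.OPEN_ADDRESSING;
    return Encoding.BINARY_SEARCH;
  }
}
```

With O1 = -1 and O2 = infinite, open-addressing is never selected under COMPACT, matching the proposal.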

Thoughts?

asfimport commented 5 years ago

Michael Sokolov (@msokolov) (migrated from JIRA)

> Here is a proposal for the heuristic to select the encoding of a FST node.

I like the overall structure of the proposal. I'm unsure about the proposed levels. For example, I believe the current FST does not have D1=0.66; rather it is 0.25. I'm not saying that's the right choice, merely that I think it is what we have on master, and the discrepancy here makes me wonder if I'm reading the proposal correctly. How did you come up with the 0.66 number?

asfimport commented 5 years ago

Bruno Roustant (@bruno-roustant) (migrated from JIRA)

I believe the current FST does not have D1=0.66 rather it is 0.25. [...] How did you come up with the 0.66 number?

Right, you read correctly: I indeed meant 0.66 instead of 0.25.

I started my reflection from the description of this Jira issue: "Some data can lead to worst-case ~4x RAM usage due to this optimization". Actually, is 4x RAM an issue if we gain perf? It depends on the use-case. If the use-case is to have an FST as usual, with the same perf as before the direct-addressing optim, then yes, 4x RAM is an issue. If the use-case is to focus on improving perf, then no, or maybe we can optimize a bit to have memory x3 or x2.

Given the proposal to have 4 levels of timeSpaceBalance, I think the level COMPACT should not risk x4 RAM. I think this x4 factor is due to the worst-case memory overhead of storing n direct-addressing labels in an n x 4 (or n / 0.25) array. So if we ensure a worst case of n x 1.5 (or n / 0.66), then we can say we have the FST "as before" (before the direct-addressing optim), while still benefiting from direct-addressing perf in some cases.

Actually I think the 0.25 factor for direct-addressing pays too much memory for the perf improvement, compared to open-addressing. I think a 0.5 factor (x2 memory) is clearly better than open-addressing, and a 0.33 factor is good if we accept x3 memory. That's why I set these values for the FAST and FASTER levels respectively.

 

asfimport commented 5 years ago

Bruno Roustant (@bruno-roustant) (migrated from JIRA)

I should invert D1 for clarity, as you did:

[direct-addressing] if (max label - min label) / n <= D1

with

timeSpaceBalance = MORE_COMPACT (memory < x1) D1 = 1.25

timeSpaceBalance = COMPACT (memory x1) D1 = 1.5

timeSpaceBalance = FAST (memory x2 ?) D1 = 2

timeSpaceBalance = FASTER (memory x3 ?) D1 = 3

asfimport commented 5 years ago

Michael Sokolov (@msokolov) (migrated from JIRA)

OK, I just wanted to make sure we were talking apples/apples. To get specific values for these parameters, it would be really helpful to measure the impacts on some typical workloads. Roughly the numbers you're proposing seem sensible.

asfimport commented 4 years ago

Bruno Roustant (@bruno-roustant) (migrated from JIRA)

I'm starting to work on the implementation today, to give a try. I'll start by ramping up on the FST class.

asfimport commented 4 years ago

David Smiley (@dsmiley) (migrated from JIRA)

You might want to start with a bit of cleanup/refactoring that makes what you're doing easier. Both from my own experience looking through the FST code and from seeing Michael Sokolov's input on the code readability (Dawid too?)... I wish the code were easier to read.

asfimport commented 4 years ago

Bruno Roustant (@bruno-roustant) (migrated from JIRA)

Good advice. I'll still first start to ramp up, and maybe propose clean up before the effective PR.

When reading the code, I found a comment pointing to a previous discussion around this subject: #5747.

This discussion focused on reducing the FST size waste in the case of fixed arrays. I think it is still relevant, and maybe easier to tackle now that we are ready to have a timeSpaceBalance parameter. I'll try to include it in the work.

asfimport commented 4 years ago

Ignacio Vera (@iverase) (migrated from JIRA)

With the upcoming release of Lucene 8.3.0, this issue is still creating memory issues:

 

https://elasticsearch-benchmarks.elastic.co/#tracks/geopoint/nightly/default/30d

 

Should this change be reverted from branch_8_3 and branch_8x until we find a better solution?

asfimport commented 4 years ago

Michael Sokolov (@msokolov) (migrated from JIRA)

The previous report was of a 3-4x increase. I think what that chart shows is a 50% increase, which is not unexpected from this change: the whole idea is a space/time tradeoff, so we do expect to see increased memory usage. Is that seen as a blocker? Can you explain a bit more about the use case in the ES test results: what do they represent?

asfimport commented 4 years ago

Adrien Grand (@jpountz) (migrated from JIRA)

@msokolov The 3x-4x increase was me trying to reason about the worst-case scenario. Fortunately in practice the increase in memory usage is lower, like in this example. +50% is still significant and I know many Elasticsearch clusters whose memory usage is mostly driven by the terms index, so not addressing this memory usage increase would make Lucene 8.3 a challenging upgrade for us.

asfimport commented 4 years ago

Bruno Roustant (@bruno-roustant) (migrated from JIRA)

Can we simply change the constant fst.Builder.DIRECT_ARC_LOAD_FACTOR to 1.5f (making it float)?

This should clearly reduce the memory impact (and at the same time the performance benefit also).

BTW what is the performance benefit with current fst.Builder.DIRECT_ARC_LOAD_FACTOR = 4?

asfimport commented 4 years ago

Michael Sokolov (@msokolov) (migrated from JIRA)

> Can we simply change the constant fst.Builder.DIRECT_ARC_LOAD_FACTOR to 1.5f (making it float)?

That seems fine as an interim measure. It's hard to know what the impact would be on the ES benchmarks though.

I saw performance improvements primarily in Suggesters which are heavy users of FSTs; numbers are posted in #9825

 

asfimport commented 4 years ago

Adrien Grand (@jpountz) (migrated from JIRA)

Changing the constant would work for me; I just wonder whether it would be easier to revert, in order to have fewer version numbers of the FST class to deal with in the future.

Maybe another way we could fix the worst-case memory usage while keeping the improved runtime would be to have the factor depend on how deep we are in the FST, since this change is more useful on frequently accessed nodes, which are likely the nodes closer to the root?

I wouldn't want to hold the release too long because of this change so I'm suggesting reverting from all branches on Monday, and we can work on some of the options that have been mentioned above to keep the worst-case scenario more contained. Any objections?

asfimport commented 4 years ago

Michael Sokolov (@msokolov) (migrated from JIRA)

Fine by me. I find it too difficult to iterate on a more refined solution given limited access to the benchmarking tools we are using for evaluation.

asfimport commented 4 years ago

Adrien Grand (@jpountz) (migrated from JIRA)

Ah sorry it was not clear to me this was blocking you. I should be able to make a standalone test that reproduces the memory usage increase.

asfimport commented 4 years ago

Michael Sokolov (@msokolov) (migrated from JIRA)

I think you had previously created a test case for this, @jpountz, that demonstrated larger memory usage than we wanted. I was referring to the fact that it used a somewhat artificial data distribution, while the main issue that seems to arise is from some regression tests at ES that may (?) have a more realistic distribution of terms. I'm just not convinced that we need to handle every adversarial case?

asfimport commented 4 years ago

Michael Sokolov (@msokolov) (migrated from JIRA)

For posterity, this is the worst-case test that spreads out terms:

    for (int i = 0; i < 1000000; ++i) {
      byte[] b = new byte[5];
      random().nextBytes(b);
      for (int j = 0; j < b.length; ++j) {
        b[j] &= 0xfc; // make this byte a multiple of 4
      }
      entries.add(new BytesRef(b));
    }
    buildFST(entries).ramBytesUsed();

asfimport commented 4 years ago

Adrien Grand (@jpountz) (migrated from JIRA)

Right, this is what I had in mind, trying to reproduce the issue with values that look more real.

asfimport commented 4 years ago

Adrien Grand (@jpountz) (migrated from JIRA)

@msokolov I added a test case that simulates indexing with Flake IDs. I tuned some parameters so that the memory usage increase would be close to the worst-case scenario, but even random realistic parameters may trigger a significant increase. With this test case, I get 5257801 bytes used for terms when direct writing is disabled, and 13777538 (more than 2x more) when it is enabled.

asfimport commented 4 years ago

Bruno Roustant (@bruno-roustant) (migrated from JIRA)

Update about my try with open-addressing.

In fact open-addressing is not compatible with the FST contract. Indeed the FST not only needs a label-to-value mapping, but also requires ordering of the label-value entries (for FSTEnum, for example, to get the next arc). Open-addressing does not keep the ordering. Dead end; I missed that point.

So I'd like to continue helping on this, by leveraging @jpountz's test and trying to find a heuristic that takes advantage of direct-addressing while limiting the memory overhead.

Medium term I'd like to explore another option: allow binary search on a variable-length list. If we can do this efficiently, maybe we could reduce the memory and still have good perf, potentially to buy memory to use more direct-addressing.

The idea is to encode the VInt/VLong differently, specially for FST key-value entries. Currently VInt encoding uses a bit to tell if there is an additional byte next. We could use this bit instead to flag the label bytes (let's say this bit is 0) and to flag the value bytes differently (this bit is 1). Let's call it the keyflag bit. So the new VInt-for-FST encoding is a series of bytes having the same keyflag bit, until this bit changes (and the last byte read is ignored).

This does not require more memory, and it allows binary search: jump to the middle of the whole byte range of the node (indistinctly keys and values), there read the keyflag and move next/previous until the first byte of a key is found, then read the VInt-for-FST, and so on, like a binary search. There is a little byte-reading overhead compared to fixed-length binary search, but it is more compact, so maybe more IO cache hits on average. So perf is to be evaluated.
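A sketch of this keyflag-bit encoding (illustrative only, not a real Lucene format; it relies on keys and values strictly alternating, so each varint ends exactly where the flag bit flips):

```java
import java.io.ByteArrayOutputStream;

// Varint where the high bit of every byte marks key (0) vs value (1)
// instead of marking "continuation": a reader landing anywhere in the
// stream can scan for a flag transition to find a key boundary.
public class KeyFlagVarint {
  /** Writes v as 7-bit groups; the high bit of each byte is the keyflag. */
  static void write(ByteArrayOutputStream out, long v, boolean valueFlag) {
    do {
      int b = (int) (v & 0x7F);
      if (valueFlag) b |= 0x80; // 1 = value byte, 0 = key byte
      out.write(b);
      v >>>= 7;
    } while (v != 0);
  }

  /** Reads one varint at pos[0]: consume bytes while the flag bit matches. */
  static long read(byte[] data, int[] pos, boolean valueFlag) {
    long v = 0;
    int shift = 0;
    while (pos[0] < data.length && (((data[pos[0]] & 0x80) != 0) == valueFlag)) {
      v |= (long) (data[pos[0]++] & 0x7F) << shift;
      shift += 7;
    }
    return v;
  }
}
```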

@mikemccand what is your opinion on the above idea? Could it fix #5747 if it shows good enough perf?

asfimport commented 4 years ago

Adrien Grand (@jpountz) (migrated from JIRA)

Open-addressing does not keep the ordering. Dead-end, I missed that point.

Maybe we could make some adjustments to allow it to store data in order, e.g. by using a hash function that preserves order, like division, and expanding the backing array instead of wrapping around to its beginning when handling collisions at the end of the array?

Since this problem looks very similar to the encoding of postings (provide random-access into a sequence of increasing integers), maybe we should also consider some encoding techniques like Elias Fano.
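For reference, a toy sketch of the Elias-Fano idea applied to a sorted label sequence (illustrative only; real implementations pack the low bits and use constant-time select on the high bits, not a boolean array with a linear scan):

```java
// Elias-Fano: low bits stored verbatim, high bits unary-coded in a
// bitvector; supports random access to the i-th element of a sorted list.
public class EliasFano {
  private final int lowBits;
  private final long[] low;     // low halves (one long each, for clarity)
  private final boolean[] high; // unary-coded high halves

  EliasFano(long[] sorted, long upperBound) {
    int n = sorted.length;
    // l = floor(log2(upperBound / n)), clamped to >= 0
    lowBits = Math.max(0, 63 - Long.numberOfLeadingZeros(Math.max(1, upperBound / n)));
    low = new long[n];
    high = new boolean[n + (int) (upperBound >>> lowBits) + 1];
    for (int i = 0; i < n; i++) {
      low[i] = sorted[i] & ((1L << lowBits) - 1);
      high[(int) (sorted[i] >>> lowBits) + i] = true; // set bit (high_i + i)
    }
  }

  /** Returns the i-th element (select on the high bitvector by linear scan). */
  long get(int i) {
    int seen = -1, pos = -1;
    while (seen < i) {
      pos++;
      if (high[pos]) seen++;
    }
    long highPart = pos - i; // position of the i-th set bit minus i
    return (highPart << lowBits) | low[i];
  }
}
```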

asfimport commented 4 years ago

ASF subversion and git services (migrated from JIRA)

Commit 81f598c2e6731fdd7d615eceb679dfc849c5446d in lucene-solr's branch refs/heads/master from Adrien Grand https://gitbox.apache.org/repos/asf?p=lucene-solr.git;h=81f598c

LUCENE-8920: Disable direct addressing of arcs. (#950)

asfimport commented 4 years ago

ASF subversion and git services (migrated from JIRA)

Commit da73a6aff511b85df306f8066c2d7d7b6453bc77 in lucene-solr's branch refs/heads/branch_8x from Adrien Grand https://gitbox.apache.org/repos/asf?p=lucene-solr.git;h=da73a6a

LUCENE-8920: Disable direct addressing of arcs. (#950)