Reduce size of FSTs due to use of direct-addressing encoding [LUCENE-8920]

asfimport commented 5 years ago

Some data can lead to worst-case \~4x RAM usage due to this optimization. Several ideas were suggested to combat this on the mailing list:

I think we can improve thesituation here by tracking, per-FST instance, the size increase we're seeing while building (or perhaps do a preliminary pass before building) in order to decide whether to apply the encoding.

we could also make the encoding a bit more efficient. For instance I noticed that arc metadata is pretty large in some cases (in the 10-20 bytes) which make gaps very costly. Associating each label with a dense id and having an intermediate lookup, ie. lookup label -> id and then id->arc offset instead of doing label->arc directly could save a lot of space in some cases? Also it seems that we are repeating the label in the arc metadata when array-with-gaps is used, even though it shouldn't be necessary since the label is implicit from the address?

Migrated from LUCENE-8920 by Michael Sokolov (@msokolov), 1 vote, resolved Nov 15 2019 Attachments: TestTermsDictRamBytesUsed.java Linked issues:

10326

asfimport commented 4 years ago

Adrien Grand (@jpountz) (migrated from JIRA)

I reverted the change until we can better handle the worst-case scenario, and removed the blocker label.

asfimport commented 4 years ago

ASF subversion and git services (migrated from JIRA)

Commit f023961ded315077921a30ac032763ec60e6bb26 in lucene-solr's branch refs/heads/branch_8_3 from Adrien Grand https://gitbox.apache.org/repos/asf?p=lucene-solr.git;h=f023961

LUCENE-8920: Disable direct addressing of arcs. (#950)

asfimport commented 4 years ago

Bruno Roustant (@bruno-roustant) (migrated from JIRA)

store data in order, e.g. by using a hash function that preserves order

The hash function must have a good distribution otherwise there are too many collisions and this defeats the performance gain of hash mapping. I don't know a hash function that preserve order and still has good distribution.

asfimport commented 4 years ago

Bruno Roustant (@bruno-roustant) (migrated from JIRA)

maybe we should also consider some encoding techniques like Elias Fano

@jpountz this is a nice idea. So the labels could be encoded in a succinct list with random access. What about the values? If they were positive and monotonic, they could indeed be Elias Fano encoded also, and a binary search on the label could provide the index of the corresponding value. In the generic case FST values are Outputs, so no specific property. In the case of PositiveIntOutputs, the javadoc says positive value, but nothing seems to ensure they are monotonic. But maybe in practice the values are monotonic for a FST node of BlockTree?

asfimport commented 4 years ago

Adrien Grand (@jpountz) (migrated from JIRA)

I was thinking we could store outputs in a parallel array, ordered by their associated label. When looking up a label, the FST could give us the index of the associated output?

asfimport commented 4 years ago

Michael Sokolov (@msokolov) (migrated from JIRA)

> store outputs in a parallel array

This could save quite a bit when encoding array-style since then each Arc would only have flags + a single int in the graph part of the data structure, which is where the gaps are bloating things, removing output (although many arcs have no output) - can we also offload target and label into this sidecar array? Indeed, if we can make the Arc size constant it will greatly simplify the construction logic (no need for two passes to "inflate" smaller Arcs to full size. We would be adding the array reference, so there is some tradeoff, but I bet it would be net positive.

asfimport commented 4 years ago

Bruno Roustant (@bruno-roustant) (migrated from JIRA)

I have added PR #980 to reduce the memory used by direct-addressing of arcs - no commit, just discuss at this point.

When testing with the "worst case test" above (see TestFST.testWorstCaseForDirectAddressing() in the PR), I get ramBytesUsed=5.24 MB with direct-addressing without optimization, and I get 4.06 MB with direct-addressing with optimization. Given that without using direct-addressing at all I get 4.03 MB, and without using fixed-array I get 3.68 MB. (tests.seed=6450920EFF5276F0, see the nocommit constant helpers I added at the top of FST)

So this optimization seems to get rid of the worst case, and provide nice memory reduction.

The idea is the following. We reduce the memory by using "arc presence" bits, one bit per arc. The node only stores the present arcs (no "missing arc" anymore). The offset to a given arc is computed by getting the number of "presence bits" set up to the corresponding bit for the arc (and multiplied by the number of bytes per arc). So the storage is quite compact (only 1 bit per arc in addition to fixed-array storage), and the performance to random access to an arc is very good (count bits sets up to an index, in an array of long - below the nanosecond). And in addition, accessing the next arc is faster because we don't have to skip "missing arc" as there is no "missing arc" anymore.

I plan to clean a bit the code, and implement the logic to select better each node encoding (now I understand your point @msokolov about comparing the memory usage to the compact-list encoding, now that I have fully read the code). Ideally I would like to have FST refactored because it becomes too complex. It would deserve some variant classes for the arcs for example. But for the moment I prefer to continue on the algorithm/data structure, and probably try to optimize also the fixed-array structure.

asfimport commented 4 years ago

Bruno Roustant (@bruno-roustant) (migrated from JIRA)

I have pushed more commits to PR#980 to clean the code and always have the presence bits in the direct-addressing code. This time it is ready to finalize the review, no more nocommit.

I ran some benchmarks to compare before and after the presence bits optimization:

Previous heuristic (direct-addressing if labelRange < 4 * numArcs) made only 1.3% of the fixed array nodes with direct addressing, while having memory issue specially for worst cases.

New heuristic (direct-addressing if sizeWithDirectAddressing <= 2.3 x sizeWithFixedArray) makes 47.6% of the fixed array nodes with direct addressing, while keeping the overall FST memory increase at 23%. So this should both improve performance and control memory. And I think there is no need for an additional parameter in the FST constructor.

asfimport commented 4 years ago

Adrien Grand (@jpountz) (migrated from JIRA)

Maybe we should update the naming with your proposed approach, e.g. "indexed" or something along those lines instead of "direct addressing" which suggests nodes are directly addressed by their label like in Mike's previous patch?

Out of curiosity have you looked into no longer explicitly serializing labels with this new approach since labels can be inferred from the bitset?

asfimport commented 4 years ago

Bruno Roustant (@bruno-roustant) (migrated from JIRA)

Maybe we should update the naming with your proposed approach

In “direct-addressing” I like the “direct” because the goal of this structure is to access an arc directly in constant time (not linear or logarithmic). With this approach we still access the arcs based on their label within the range but indirectly. “indexed” seems too general to me, maybe because I’m thinking of another use of bitTable for fixed array binary search. I’m ok to keep direct addressing, or another option for example “range addressing”.

no longer explicitly serializing labels

Initially I thought we could remove labels to save memory. Indeed we can infer the label from the index in the range (this was already the case with the previous direct-addressing patch). But the label per arc seems still required to support FST.readNextArcLAbel() which peeks the label of the next arc without changing the current arc (so in this case we don’t have the bitTable or label range metadata of the next arc). This method is used by FSTEnum to seekFloor.

asfimport commented 4 years ago

Adrien Grand (@jpountz) (migrated from JIRA)

Wouldn't the bitTable be the same as the bitTable of the current arc?

asfimport commented 4 years ago

Bruno Roustant (@bruno-roustant) (migrated from JIRA)

Hum, I was confused by the special case of END_LABEL. Yes, maybe we can rely on the bitTable of the current arc. So we need the label range, the bit table and the first label. I'm going to try to remove the labels.

asfimport commented 4 years ago

Bruno Roustant (@bruno-roustant) (migrated from JIRA)

It works. PR#980 updated. I removed the labels for direct-addressing, except the first label of the arc. It makes direct-addressing encoding even more compact. Now we have 48% of the nodes with fixed length arcs encoded with direct-encoding by using only +12% memory (instead of +23% with my previous patch).

I also did several things: The presence bits table is now a reused structure (does not create long[] anymore each time we read an arc). I touched more code, but I think the code is now a bit clearer with more comments. I moved the bit twiddling methods to BitUtil.

Thanks reviewers!

I noticed the direct-addressing limitation: it is less performant than binary search for reverse lookup (lookup by output).

Is there a periodic automated monitoring of the performance / memory of commits? Could you give me the link? I would like to watch the change with this PR.

I definitely would like to clean more FST. But that will be the subject of another Jira issue later. Too much duplicated code that makes reasoning and modifications hard. Also missing docs and method contracts (e.g. expected position in the byte input).

asfimport commented 4 years ago

Adrien Grand (@jpountz) (migrated from JIRA)

I quickly skimmed through the patch, the approach looks good to me.

With this new approach to compute labels from the bitset, maybe we should set a less aggressive default oversizing factor, e.g. 1? For instance if you index lowercase text, the range would most of the time be less than 24, which takes 3 bytes, plus one byte that we need for the first label. So this new encoding would already be used if you have 4 arcs or more, which sounds good to me already?

asfimport commented 4 years ago

Bruno Roustant (@bruno-roustant) (migrated from JIRA)

maybe we should set a less aggressive default oversizing factor, e.g. 1?

Indeed now direct addressing can be quite often more compact than binary search. If we set default oversizing factor 1, we will effectively reduce the FST size, but we will have less improvement on perf.

If we set oversizing factor to something greater than 1, let's say 1.5. Then we try to use the bytes gained on one side as a credit to oversize sometime, with the goal to use the same memory but with more performance gain.

I tried to get more visible improvement by accepting some memory increase. It seems what you're trying to achieve is same FST memory with some perf improvement. In this case we could implement the credit idea above, to keep track of memory gain and allow to use that gain to oversize some nodes. But even in this case it seems to me we should keep a default oversizing factor above 1.

asfimport commented 4 years ago

Adrien Grand (@jpountz) (migrated from JIRA)

If we set default oversizing factor 1, we will effectively reduce the FST size, but we will have less improvement on perf.

Out of curiosity, have you confirmed this? My intuition is that this new encoding is going to be used anyway on dense nodes because of how it tends to be more space-efficient on dense nodes. I worry that values greater than 1 might mostly make this new encoding used on nodes that don't have that many arcs, where using binary search or direct addressing doesn't matter much, so it wouldn't be a great use of the memory overhead?

I tried to get more visible improvement by accepting some memory increase. It seems what you're trying to achieve is same FST memory with some perf improvement.

I do indeed care about memory usage because I know of several users who already have gigabytes of memory spent on the terms index per node, so even something like a 10% increase could translate to hundreds of megabytes. There is definitely a speed/memory trade-off here and I could see value in spending a bit more memory on the terms index to achieve greater speed, but if you want to speed up access to the terms dictionary, it's unclear to me whether increasing this oversizing ratio is a better way to spend memory than e.g. decreasing the min/max block sizes in BlockTreeTermsWriter in order to be able to more often figure out that a terms doesn't exist without going to disk.

asfimport commented 4 years ago

Bruno Roustant (@bruno-roustant) (migrated from JIRA)

Out of curiosity, have you confirmed this?

It's by design. The oversizing factor is just a multiplier. The rule is "encode with direct addressing if size-of-direct-addressing <= oversizing-factor x size-of-binary-search". So if oversizing factor is 1 we don't encode with direct addressing unless we reduce or equal the size. So the whole FST memory can only reduce, never increase.

I worry that values greater than 1 might mostly make this new encoding used on nodes that don't have that many arcs

There is still the old rule that encodes with fixed length arcs only nodes with enough arcs and with low depth in the tree (FST.shouldExpandNodeWithFixedLengthArcs()).

so even something like a 10% increase could translate to hundreds of megabytes.

This Jira issue is "reduce size of FST due to use of direct-addressing". I thought we were trying here to reduce the memory increase, not necessarily remove it completely. That said I understand the concern. If we ensure by default we finally don't increase at all the memory by controlling direct-addressing (so the perf goal is somewhat reduced), how easy will it be for anyone to use BlockTree posting format with a custom setting for FST ? Will it be an additional BlockTree posting format with different name and settings?

I'll implement the memory reduction/oversizing credit to have more precise control. But the final decision for the default memory/perf balance should be taken based on some measures. @jpountz would you be able to have these measures?

asfimport commented 4 years ago

Bruno Roustant (@bruno-roustant) (migrated from JIRA)

I added the expansion credit to the PR#980. This indeed gives more control on the oversizing factor and the result is better than I anticipated.

All the previous measures I did were for the worst case for direct addressing. This time I also measured the FST size when building from an English dictionary of 479K words (including alphanumeric).

I used direct addressing oversizing factor = 1 (ensures no oversizing on average)

For English words I measured 217K nodes, only 3.27% nodes (top nodes in the FST) are encoded with fixed length arcs, and 99.99% of them (of the 3.27%) with direct addressing. Overall FST memory reduced by 1.67%.

For worst case I measured 168K nodes, 50% nodes are encoded with fixed length arcs, and 14% of them (of the 50%) with direct addressing. Overall FST memory reduced by 0.8%.

I’m confident that with this last PR we will not increase the FST memory (compared to without direct addressing). At the same time the top nodes with many arcs will most often be encoded with direct addressing. And the worst case is controlled so we still ensure no memory increase (while still trying to use direct addressing when favorable).

asfimport commented 4 years ago

Michael Sokolov (@msokolov) (migrated from JIRA)

Something that was brought up in the past was that we should be able to get rid of FST.cachedRootArcs which is trying to do the same thing (encode the first 128 Arcs as a directly-addressable array). This could be a nice little follow-up issue.

asfimport commented 4 years ago

Adrien Grand (@jpountz) (migrated from JIRA)

But the final decision for the default memory/perf balance should be taken based on some measures. Adrien Grand would you be able to have these measures?

Maybe the PKLookup task of the nightly benchmarks would be a good place to start? http://people.apache.org/\~mikemccand/lucenebench/PKLookup.html

we should be able to get rid of FST.cachedRootArcs which is trying to do the same thing

+1

asfimport commented 4 years ago

Bruno Roustant (@bruno-roustant) (migrated from JIRA)

we should be able to get rid of FST\.cachedRootArcs which is trying to do the same thing

+1 I suppose we will first merge the PR, wait for the nightly benchmarks, and then if the benchmarks are ok, create another Jira issue for this removal?

asfimport commented 4 years ago

Adrien Grand (@jpountz) (migrated from JIRA)

Yes, doing this in a separate JIRA sounds like a good idea to me. I reviewed the change yesterday and it looked good to me. I plan to merge tomorrow if there are no objections.

asfimport commented 4 years ago

Michael Sokolov (@msokolov) (migrated from JIRA)

+1 for merging, and handling cachedRootArcs separately

asfimport commented 4 years ago

ASF subversion and git services (migrated from JIRA)

Commit 068b6babac38998f10a52b1776abca6218d970e1 in lucene-solr's branch refs/heads/master from Bruno Roustant https://gitbox.apache.org/repos/asf?p=lucene-solr.git;h=068b6ba

LUCENE-8920: Reduce the memory used by direct addressing of arcs (#980)

asfimport commented 4 years ago

Adrien Grand (@jpountz) (migrated from JIRA)

This gives a nice bump on the PKLookup task http://people.apache.org/\~mikemccand/lucenebench/PKLookup.html. Almost on par with the previous direct addressing patch.

asfimport commented 4 years ago

ASF subversion and git services (migrated from JIRA)

Commit 4836dfc8b808e7c198673d68df9d4ac0eceab0af in lucene-solr's branch refs/heads/branch_8x from Michael Sokolov https://gitbox.apache.org/repos/asf?p=lucene-solr.git;h=4836dfc

LUCENE-8920: encapsulate FST.Arc data

asfimport commented 4 years ago

ASF subversion and git services (migrated from JIRA)

Commit 0f4dcde4d983d7a0c09480dd849eb26bae443528 in lucene-solr's branch refs/heads/branch_8x from Michael Sokolov https://gitbox.apache.org/repos/asf?p=lucene-solr.git;h=0f4dcde

LUCENE-8920: remove Arc setters, moving implementations into Arc, or copying data into consumers

asfimport commented 4 years ago

ASF subversion and git services (migrated from JIRA)

Commit fe1653b9385cd6518691b9c397fb512249ca5e78 in lucene-solr's branch refs/heads/branch_8x from Michael Sokolov https://gitbox.apache.org/repos/asf?p=lucene-solr.git;h=fe1653b

LUCENE-8920: refactor FST binary search

asfimport commented 4 years ago

ASF subversion and git services (migrated from JIRA)

Commit e97380ad20d5f6af3a90c37383468678cf6bfcc7 in lucene-solr's branch refs/heads/branch_8x from Michael Sokolov https://gitbox.apache.org/repos/asf?p=lucene-solr.git;h=e97380a

LUCENE-8920: Fix bug preventing FST duplicate tails from being shared when encoded as array-with-gaps

asfimport commented 4 years ago

ASF subversion and git services (migrated from JIRA)

Commit 010fb0b9942dc5407d74f1279030805b99a32973 in lucene-solr's branch refs/heads/branch_8x from Noble Paul https://gitbox.apache.org/repos/asf?p=lucene-solr.git;h=010fb0b

LUCENE-8920: precommit errors

asfimport commented 4 years ago

ASF subversion and git services (migrated from JIRA)

Commit 5dd9c4c04bcab9911b9a9f0b092eadc50262ee2c in lucene-solr's branch refs/heads/branch_8x from Bruno Roustant https://gitbox.apache.org/repos/asf?p=lucene-solr.git;h=5dd9c4c

LUCENE-8920: Reduce the memory used by direct addressing of arcs (#980)

asfimport commented 4 years ago

ASF subversion and git services (migrated from JIRA)

Commit 3c9140f24348f8d4e28bcc1a844ea503ed264f78 in lucene-solr's branch refs/heads/branch_8x from Adrien Grand https://gitbox.apache.org/repos/asf?p=lucene-solr.git;h=3c9140f

LUCENE-8920: CHANGES entry.

asfimport commented 4 years ago

ASF subversion and git services (migrated from JIRA)

Commit 07101ed8cc140aafd4b2e1bc00841eb7d0cd037c in lucene-solr's branch refs/heads/master from Adrien Grand https://gitbox.apache.org/repos/asf?p=lucene-solr.git;h=07101ed

LUCENE-8920: CHANGES entry.

asfimport commented 4 years ago

Adrien Grand (@jpountz) (migrated from JIRA)

There were a few conflicts when backporting to branch_8x, so you might want to take a second look @msokolov @bruno-roustant to make sure I did not get anything wrong.

asfimport commented 4 years ago

Michael Sokolov (@msokolov) (migrated from JIRA)

I scanned the diff of all the commits together, and I didn't see any issues, but boy that is a big change now. Thanks for handling the merge, @jpountz, and nice work on saving both space and time, @bruno-roustant!

asfimport commented 4 years ago

Adrien Grand (@jpountz) (migrated from JIRA)

Thanks for checking @msokolov!

asfimport commented 4 years ago

David Smiley (@dsmiley) (migrated from JIRA)

I want to confirm we have back-compat handled. Do we? A very quick look at the code shows we bumped the FST version and I see the FST's constructor accepts the previous version. But will it actually work – will this Lucene 8.4 code read FSTs written in previous indexes correctly? I know we have some back-compat indices but I don't recall when that is validated (on each test or only on release?)

asfimport commented 4 years ago

Bruno Roustant (@bruno-roustant) (migrated from JIRA)

I want to confirm we have back-compat handled. Do we?

I'm pretty sure we are back-compatible. We introduce a new node type based on a new value of the node flags. The new code should read previous FST, and should write new FSTs with new direct-addressing nodes. That said I'm interested to know when it is validated automatically too.

asfimport commented 4 years ago

Bruno Roustant (@bruno-roustant) (migrated from JIRA)

There were a few conflicts when backporting to branch_8x, so you might want to take a second look

I verified also branch_8x, that seems good to me.

I created the follow up item for removing cachedRootArcs #10091.

asfimport commented 4 years ago

Michael Sokolov (@msokolov) (migrated from JIRA)

I had tested with the previous version of this patch, and yes I also believe this preserves the same back-compat since the old arc encoding is read as before, but there is no automated testing to verify. It would be wise to run some manual spot-checking. We could eg build an "old" index with luceneutil and then run its tests with that index after upping the code. Or any test that runs on an existing index should do - is there a more convenient one?

asfimport commented 4 years ago

Michael Sokolov (@msokolov) (migrated from JIRA)

I ran the luceneutil test just to be sure: yes, 8.x can read indexes created by 8.3

asfimport commented 4 years ago

Adrien Grand (@jpountz) (migrated from JIRA)

I don't recall when that is validated (on each test or only on release?)

It is tested on each test run, the test name is TestBackwardsCompatibility in lucene/backward-codecs.

In that case the version bump is not strictly needed since the new format is a superset of the old format, through I think we made the right choice of bumping the version in order to give a better error to users who would try to read 8.4 FSTs with an older release.

asfimport commented 4 years ago

Michael Sokolov (@msokolov) (migrated from JIRA)

> In that case the version bump is not strictly needed since the new format is a superset of the old format, through I think we made the right choice of bumping the version in order to give a better error to users who would try to read 8.4 FSTs with an older release.

For example, this happened to us when trying to revert these changes, yet we did not regenerate kuromoji and nori "dictionaries," which are FSTs.

asfimport commented 4 years ago

Chris M. Hostetter (@hossman) (migrated from JIRA)

TestFstDirectAddressing.testDeDupTails has a 44% failure rate (11/25 runs) on jenkins boxes since it was added by this commit (so far only on Uwe's machine, but on multiple OSes, and both master and 8.x)

asfimport commented 4 years ago

Bruno Roustant (@bruno-roustant) (migrated from JIRA)

I added PR#1012 to fix the flapper test. This test verifies the size of the FST in a special case we care. The flappiness is surprising because there is no random involved. I suppose this is due to the way the JVM encodes or pads the bytes in the FST serialization.

yet we did not regenerate kuromoji and nori "dictionaries," which are FSTs

Actually I generated them (so with direct-addressing nodes). They are part of the merged PR.

asfimport commented 4 years ago

ASF subversion and git services (migrated from JIRA)

Commit 359864c65b9a3a84a3cf3646c92724775f6881b9 in lucene-solr's branch refs/heads/master from Bruno Roustant https://gitbox.apache.org/repos/asf?p=lucene-solr.git;h=359864c

LUCENE-8920: Fix flapping TestFstDirectAddressing.testDeDupTails (#1012)

asfimport commented 4 years ago

Michael Sokolov (@msokolov) (migrated from JIRA)

> Actually I generated them (so with direct-addressing nodes). They are part of the merged PR.

Yes, that's good - I was referring to an earlier incident when I rolled back the first version of this and forgot to regenerate the dictionaries. Having incremented the version number turned out to have been helpful since it made the cause of the resulting failures easier to understand.

asfimport commented 4 years ago

Michael Sokolov (@msokolov) (migrated from JIRA)

I backported to 8x branch and beasted 20 times. I am mystified about the randomness myself, but the test still captures the super bad case it was intended to guard against, so I'm fine to add some slop and leave the mystery for another day.

asfimport commented 4 years ago

ASF subversion and git services (migrated from JIRA)

Commit 5a4600e6d3c0a8ff116f0787ff3636670e95dd7b in lucene-solr's branch refs/heads/branch_8x from Bruno Roustant https://gitbox.apache.org/repos/asf?p=lucene-solr.git;h=5a4600e

LUCENE-8920: Fix flapping TestFstDirectAddressing.testDeDupTails (#1012)

asfimport commented 4 years ago

ASF subversion and git services (migrated from JIRA)

Commit 359864c65b9a3a84a3cf3646c92724775f6881b9 in lucene-solr's branch refs/heads/jira/SOLR-13452_gradle_8 from Bruno Roustant https://gitbox.apache.org/repos/asf?p=lucene-solr.git;h=359864c

LUCENE-8920: Fix flapping TestFstDirectAddressing.testDeDupTails (#1012)

apache / lucene

Reduce size of FSTs due to use of direct-addressing encoding [LUCENE-8920] #9963

10326