
FSTs can be very space-inefficient on array-expanded nodes [LUCENE-8084] #9132

asfimport opened 6 years ago

asfimport commented 6 years ago

We have FSTs which operate on a larger alphabet (keys in int space) and emit character-sequence outputs. I noticed that certain nodes get expanded into fixed-size arrays to accelerate lookups (binary search). This becomes a problem, however, when some of the outputs are large blobs of data (say, one output is very long while all the others are small): the fixed-size array is then heavily overallocated, as evident in the attached picture.
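To make the waste concrete, here is a small standalone sketch (not Lucene's actual FST code; the per-arc sizes are made up) comparing a packed, variable-size encoding of a node's arcs against a fixed-size array encoding in which every slot is padded to the widest arc:

```java
// Hypothetical illustration: with the array (fixed-size) encoding, every arc
// slot is as wide as the widest arc, so a single large output inflates the node.
public class FixedArcWaste {
  public static void main(String[] args) {
    // Assumed serialized per-arc sizes in bytes; one arc carries a long output.
    int[] arcSizes = {4, 3, 4, 120, 5, 4, 3, 4};

    int widest = 0;
    int packed = 0;
    for (int size : arcSizes) {
      widest = Math.max(widest, size);
      packed += size;
    }
    int fixed = widest * arcSizes.length;   // fixed-size array encoding

    System.out.println("packed (variable-size) bytes: " + packed);
    System.out.println("fixed (array) bytes:          " + fixed);
    System.out.println("waste: " + (fixed - packed) + " bytes ("
        + (100 * (fixed - packed) / fixed) + "%)");
  }
}
```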

I wonder if it'd be better to encode the array as fixed-size, but without the outputs, and use a local fixed-size pointer into a "value pool" stored next to the node's arcs. Theoretically this "value pool" could even be shared by all of the automaton's nodes (saved once at the end or flushed periodically).
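A rough, purely illustrative sketch of that layout (the class, slot layout and 12-byte slot width are invented here, not Lucene's actual FST format): arcs stay fixed-size so binary search still works, but each slot stores only the label plus a fixed-width (offset, length) pointer into a pool of variable-length outputs.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.nio.ByteBuffer;
import java.util.Arrays;

// Hypothetical sketch: fixed-size arc slots for binary search, with the
// variable-length outputs moved out into a shared "value pool".
class PooledArcNode {
  private static final int SLOT_BYTES = 12;   // 4-byte label + 4-byte offset + 4-byte length
  private final ByteBuffer slots;
  private final ByteArrayOutputStream valuePool = new ByteArrayOutputStream();
  private int numArcs = 0;

  PooledArcNode(int maxArcs) {
    slots = ByteBuffer.allocate(maxArcs * SLOT_BYTES);
  }

  // Arcs must be added in increasing label order so binary search works.
  void addArc(int label, byte[] output) throws IOException {
    slots.putInt(label);
    slots.putInt(valuePool.size());
    slots.putInt(output.length);
    valuePool.write(output);                  // variable-length output stored once
    numArcs++;
  }

  // Binary search over the fixed-size slots, then one indirection into the pool.
  byte[] findOutput(int label) {
    byte[] pool = valuePool.toByteArray();
    int lo = 0, hi = numArcs - 1;
    while (lo <= hi) {
      int mid = (lo + hi) >>> 1;
      int midLabel = slots.getInt(mid * SLOT_BYTES);
      if (midLabel < label) lo = mid + 1;
      else if (midLabel > label) hi = mid - 1;
      else {
        int off = slots.getInt(mid * SLOT_BYTES + 4);
        int len = slots.getInt(mid * SLOT_BYTES + 8);
        return Arrays.copyOfRange(pool, off, off + len);
      }
    }
    return null;                              // no arc with this label
  }
}
```

The pool here is per-node, but as noted above it could just as well be shared by the whole automaton and flushed at the end.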

Just a thought.

(attachment: capture-4.png)


Migrated from LUCENE-8084 by Dawid Weiss (@dweiss), updated Dec 21 2017. Attachments: capture-4.png

asfimport commented 6 years ago

Michael McCandless (@mikemccand) (migrated from JIRA)

I like that idea!

Maybe in the meantime we should improve the logic that decides if a node should use the dense encoding, to take into account waste due to large outputs?
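Something along these lines, maybe (a hedged sketch only; the method name, inputs and waste threshold are invented and are not the existing expansion logic):

```java
// Hypothetical heuristic: only keep the dense (array) encoding if padding
// every slot to the widest arc stays within a waste budget.
public class DenseEncodingHeuristic {

  static boolean shouldUseDenseEncoding(int[] arcByteSizes, float maxWasteRatio) {
    int widest = 0;
    int total = 0;
    for (int size : arcByteSizes) {
      widest = Math.max(widest, size);
      total += size;
    }
    int dense = widest * arcByteSizes.length;
    float waste = (dense - total) / (float) dense;
    return waste <= maxWasteRatio;
  }

  public static void main(String[] args) {
    // With one 120-byte output among small arcs, ~85% of the dense node is
    // padding, so a 50% budget would fall back to the packed encoding.
    int[] arcSizes = {4, 3, 4, 120, 5, 4, 3, 4};
    System.out.println(shouldUseDenseEncoding(arcSizes, 0.5f)); // prints false
  }
}
```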

asfimport commented 6 years ago

Dawid Weiss (@dweiss) (migrated from JIRA)

Hi Mike! I did see the TODOs in the logic that decides whether to expand a node or not, but I honestly didn't have a good idea for an improvement. We sometimes do need those branches expanded for fast lookups, especially close to the root (where the cost of a linear scan is huge when you consider millions of lookups). In fact, even the limit on the root arc cache is something we had to bypass manually (the root-level fanout is huge if you have varied int keys), and we expanded all of those arcs.

Perhaps our use case is specific, I don't know. I'll experiment a bit and see; I just wanted to note that the current storage of expanded nodes can be fairly inefficient.