apache / lucene

Apache Lucene open-source search software
https://lucene.apache.org/
Apache License 2.0
2.58k stars 1.01k forks source link

FST has hard limit max size of 2.1 GB [LUCENE-3298] #4371

Closed asfimport closed 11 years ago

asfimport commented 13 years ago

The FST uses a single contiguous byte[] under the hood, which in java is indexed by int so we cannot grow this over Integer.MAX_VALUE. It also internally encodes references to this array as vInt.

We could switch this to a paged byte[] and make the far larger.

But I think this is low priority... I'm not going to work on it any time soon.


Migrated from LUCENE-3298 by Michael McCandless (@mikemccand), resolved Feb 08 2013 Attachments: LUCENE-3298.patch (versions: 4) Linked issues:

asfimport commented 13 years ago

James Dyer (@jdyer1) (migrated from JIRA)

Here's a patch for this. Not fully optimized, but possibly a start.

asfimport commented 13 years ago

Dawid Weiss (@dweiss) (migrated from JIRA)

This looks good, James. The thing is: a 2GB fst is actually quite huge; until we can realistically hit such sizes it makes little sense to replace the code to operate on longs instead of ints and add an intermediate layer between the byte[] that stores fst data. I believe this will affect performance, even on 64-bit systems and on 32-bit systems performance degradation will be significant.

This said, your patch can reside in jira and wait until people hit the 2gb barier – when this happens, I'm sure it'll be useful.

asfimport commented 13 years ago

James Dyer (@jdyer1) (migrated from JIRA)

Dawid, thanks for looking at this and for your comment. I realize completing this issue might only have theoretical value, but working on this was a nice way for me to learn a little about this fst implementation. Informally, I was seeing \~20% performance loss on my 32-bit windows machine. I'm sure what I have here can be tightened, but should there ever be a use case for this (beyond Test2BPostings!) I would think we'd probably want a flag users can set for whether or not it should use the "long" address space, etc, or something along those lines.

asfimport commented 13 years ago

Dawid Weiss (@dweiss) (migrated from JIRA)

Absolutely, your code is valuable and I'm sure it'll be handy at some point. The best solution would be to change the Java spec at some point and allow larger arrays :)

asfimport commented 12 years ago

Carlos González-Cadenas (migrated from JIRA)

Dawid, I wanted to let you know that we've reached the 2GB barrier.

We're using a heavily modified version of FSTLookup to create an autocomplete system over 2.3 billion queries (and growing, it will be more than 10-15B when we add the data for infix matches).

In order to circumvent the 2.1GB limitation, we changed the code so that every bucket uses a different FST (as per Robert Muir's recommendation), but still we're having problems in the individual buckets because our dataset is huge.

We'll give a try to this patch and will let you know.

Thanks Carlos

asfimport commented 12 years ago

Carlos González-Cadenas (migrated from JIRA)

BTW, the exception we get when reaching the size barrier is: Exception in thread "main" java.lang.AssertionError: size must be positive (got -2147483560): likely integer overflow?

asfimport commented 12 years ago

Dawid Weiss (@dweiss) (migrated from JIRA)

Yes, this looks like integer overflow. Can you tell anything more about the nature of your data or provide a sample? How much input for the FST to exceed 2GB? Perhaps there is something we can do to optimize the input to make the automaton smaller?

The rule of thumb is: try to get as many shared prefixes and suffixes as possible because these compress nearly ideally. A full FST is not a good answer for compressing data with unique prefixes or suffixes.

asfimport commented 12 years ago

Carlos González-Cadenas (migrated from JIRA)

Hello Dawid:

First of all, for the FSTLookup we're using a FST and not a FSA, that means that we've replaced the NoOutputs with ByteSequenceOutputs. The reason for this is that our autocomplete system requires to get some metadata about the sentence being matched, because we do some post-match reranking of the results in order to get quality suggestions (the buckets themselves are quite big and the results are quite poor without reranking, specially for short prefixes)

The queries we use with the FSTLookup are generated using templates. Our domain is hotels, simplifying things that means that we have K templates (i.e. an example template could be hotels + amenity + city), then we have I individuals (i.e. all the cities in the world ), and we generate queries like:

hotels with jacuzzi in barcelona

The output for this key contains some metadata that is useful for reranking matches or displaying the matches to the user. We encode it in as few bytes as possible, and we make sure that things that are common to many sentences are put first so that they can get compacted appropriately. The metadata in the output is the following (in order):

1) Display version of the sentence. The match sentence described above is normalized and transliterated to ASCII, i.e. it converts acutes to normal characters, and stuff like this. We have the "display" version to show the correctly formatted version to the user. We apply simple transliteration rules so that the differnece between the display and the match is not very big (i.e. , we don't apply stuff like pinyin transliteration for chinese). 2) The ID of the city (an int that will help us to later execute the query and give the right results) 3) The "score" of the sentence (float, used to rerank the matches) 4) A byte to represent the number of alternate names used in the query (i.e. "barcelona" is also referred as "la ciudad condal", if the prefix is "hotels in" we don't want to produce both matches, just the most common one. This value is used for reranking. 5) The number of skipped tokens (we use this for infix matches). This is another byte. This is used for reranking, to give priority to prefix queries (skipped tokens = 0) over infix queries, and if all the matches are infix, to give the ones where the matched word is closer to the beginning of the sentence).

The sentences are quite repetitive, but we don't seem to achieve an optimal compression. A GZ-compressed file of 250MB is generating a FST of around 1.9GB.

We've applied the patch to trunk, but I cannot get it to compile. Maybe the version that is uploaded to JIRA is out of date. Do you have a newer version?

Thanks a lot, Carlos

asfimport commented 12 years ago

Carlos González-Cadenas (migrated from JIRA)

We've also tried to remove some Outputs to see how the outputs were affecting the total automaton size, but the difference is not too much. So it seems that the size is mostly related to the huge number of sentences.

I said before that the sentences are quite repetitive, but to be more precise, some prefixes of the sentence are quite repetitive.

hotels with jacuzzi in barcelona hotels with jacuzzi in madrid hotels with jacuzzi in berlin ...

We tested at the beginning with the FST visualization tool and it seemed to do a good job (i.e. placing the outputs in the right nodes and leveraging shared prefixes).

asfimport commented 12 years ago

Carlos González-Cadenas (migrated from JIRA)

I forgot to tell you numbers about our input. We have 2 billion sentences without infixes and 6.7 billion sentences with infixes. The 2 billion sentences are 32GB compressed (350GB uncompressed) and the 6.7 billion are 132 GB compressed (1.4TB uncompressed)

asfimport commented 12 years ago

Dawid Weiss (@dweiss) (migrated from JIRA)

The first thing that comes to my mind is to use an int or long file offset as the output and store all other outputs in another file. This will help you keep the FST smaller and thus store more arcs inside.

So, I don't count the outputs for now. The example you've given is too simple to say anything – this should conflate prefixes nicely. Although if you have lots of combinations of different prefixes (say, "motels with jacuzzi in [barcelona|madrid|berlin]" then I can see how the automaton can explode there on internal nodes (if your input "suggestions" are very long and have lots of infix variants). I don't see any direct solution to this.

There is a data structure called LZTrie which does infix compression (or rather: it attempts to store similar subtrees in the automaton only once, replacing their duplicated occurrences with a pointer). That data structure is quite difficult to implement efficiently (construction) and the traversals are not that straightforward either. I don't have a working implementation unfortunately but if you have time (and would like to contribute!) then its description is here:

http://www.irb.hr/hr/home/ristov/papers/RistovLZtrieRevision1.pdf

That patch wasn't mine, by the way, so I don't have any newer one either.

asfimport commented 12 years ago

Carlos González-Cadenas (migrated from JIRA)

Hello Dawid,

The sentences have variants at different levels. The first is the one you mention, different prefixes for different accomodation types. The second one is different positions of the prepositional phrases of the query (i.e. "hotels in barcelona with jacuzzi" and "hotels with jacuzzi in barcelona"). The third one we have is sentences with and without prepositions ("hotels barcelona jacuzzi").

W.r.t the patch, sorry, I got confused. James, do you have a version of this patch that works with trunk?

Thanks a lot.

asfimport commented 12 years ago

Dawid Weiss (@dweiss) (migrated from JIRA)

If you have so many permutations then they become different paths in the FST and it will grow exponentially to the number of input words/ combinations. To be honest, this looks more suitable for a regular inverted index search.

asfimport commented 12 years ago

Carlos González-Cadenas (migrated from JIRA)

Yeap, at the beginning of this project we tried to implement this autocomplete system using regular inverted indexes, but the response time required for autocomplete to work from a user perspective is very low (<50ms), and it would be quite hard to achieve such a performance with inverted indexes.

I still think this is the way to go, but as you say we have to be careful with the data generation part. Most of the work should be put in making sure that the data is well distributed and organized in order to avoid combinatorial explosion.

Let me go in detail with the sources of data permutations and the reasoning behind them:

1) With regards to infix matches, if a user types "barcelona" we want to match "hotels in barcelona". In order to achieve this, we generate:

hotels in barcelona => hotels in barcelona in barcelona => hotels in barcelona barcelona => hotels in barcelona

The FST should be able to conflate these prefixes nicely in just one path, right?. Therefore this part shouldn't be a problem.

2) In addition, another feature we want to achieve is to be able to match inputs without prepositions. That means that if the user types "hotels barcelona jacuzzi", we should be able to match "hoteles in barcelona with jacuzzi". Now the only way we envision of doing it properly is to generate this permutation within the data:

hotels barcelona jacuzzi => hotels in barcelona with jacuzzi

I can see how this can explode the FST by creating different branches. Theoretically this could be done at runtime without the need of generating the data, but we don't see a way to do it in a clean way. To make things more complicated :) we've implemented fuzzy matching at query time (we use a levenshtein automata generated with the user input + an edit distance and then we intersect with the FST), and this is making very complicated to do preposition handling at query time.

3) PP permutations (i.e. hoteles in barcelona with jacuzzi and hoteles with jacuzzi in barcelona). I don't really see a way to work around this. Probably we need to be careful and only generate these permutations for the top-K cities, in order to limit the potential size.

Summarizing, I believe that we can reduce the set of "bad permutations" a lot if we can figure out how to implement the prepositions at runtime. If you have any ideas, let me know. Thanks! :)

asfimport commented 12 years ago

Dawid Weiss (@dweiss) (migrated from JIRA)

time required for autocomplete to work from a user perspective is very low (<50ms),

I've seen a presentation by Greg Donovan at Lucene Revolution San Francisco, here: http://www.lucidimagination.com/devzone/events/conferences/revolution/2011/solr-and-lucene-etsy

they seem to have a lot of traffic and still use a shard of Solr servers to do contextual suggestions. Maybe it'd be easier to buy more hardware than try to squeeze something into an FST that it doesn't handle well (permutations). Just a thought.

hotels in barcelona => hotels in barcelona, The FST should be able to conflate these prefixes nicely in just one path, right?.

The FST will be able to conflate suffixes if you have no outputs. If you do have different outputs these need to be stored somewhere too; for outputs with a common part, the common part is pushed towards the root of the FST, but for byte sequences this is unlikely so the output will actually have to differentiate these paths somehow and create at least a single separate node/arc in the FST. I tell this without checking, but this is my intuition – verify in practice if you want to.

asfimport commented 12 years ago

Carlos González-Cadenas (migrated from JIRA)

Thanks for the presentation. It's very interesting.

Now that we've invested very significant time with this approach, we'd like to stick a little bit more with it and see where we can get to. The FST approach, given that is way more low level, will give us more control of the functionality down the road, which definitely will prove benefitial mid-term. If needed due to space requirements, we can think of replacing FST by LZTrie if we need more infix compression for the permutations.

Re: next steps, you commented above that you may consider including this patch into the codebase when you have people that have the need. We obviously would be very interested in this patch getting into trunk.

In terms of performance, James is speaking about a 20% performance loss in a 32-bit machine, we're seeing less performance degradation in a 64-bit machine, something around 10-15% depending on the specific FST and query. If you or James envision any way to optimize it, let me know, we can give a hand here if you tell us the potential paths to make it more efficient.

asfimport commented 12 years ago

Yonik Seeley (@yonik) (migrated from JIRA)

Perhaps we should just have two implementations (a 32 bit one, and a 64 bit one)?

asfimport commented 12 years ago

James Dyer (@jdyer1) (migrated from JIRA)

Carlos,

I'm not sure how much help this is, but you might be able to eke a little bit of performance if you can tighten RewritablePagedBytes.copyBytes(). You'll note it currently moves the From-Bytes into a temp array then writes that back to the fst an the To-Bytes location. Note also, the one place this gets called, it used to be a simple "System.ArrayCopy". So if you can make it copy in-place that might claw back the performance loss a little. Beyond this, a different pair of eyes might find more ways to optimize. In the end though you will likely never make it perform quite as well as the simple array.

Also, it sounds as if you've maybe done work to sync this with the current trunk. If so, would you mind uploading the updated patch?

Also if you end up using this, be sure to test thoroughly. I implemented this one just to gain a little familiarity with the code and I do not claim any sort of expertise in this area, so beware! But all of the regular unit tests did pass for me. I was meaning to try to run test2bpostings against this but wasn't able to get it set up. If I remember this issue came up originally because someone wanted to run test2bpostings with memorycodec and it was going passed the limit.

asfimport commented 12 years ago

Carlos González-Cadenas (migrated from JIRA)

Hello James,

Now we're using it and for the moment we haven't noticed any problems (although I must say that we haven't done extensive testing). I'll let you know if we find any.

I haven't updated the patch to sync with the current trunk, I've just reverted to the appropriate version of Lucene identified in the patch and then I've applied it there. If you have some time, it would be great if you can sync the patch with the current trunk.

As you suggest, I'll also take a look at the sections you mention to see if we can make it more efficient.

Thanks Carlos

asfimport commented 12 years ago

Sudarshan Gaikaiwari (migrated from JIRA)

Hi Carlos

I am interested in your implementation of FSTLookup where you are using a FST with ByteSequenceOutputs. Would it be possible for you to share your implementation.

Thanks Sudarshan

asfimport commented 12 years ago

Carlos González-Cadenas (migrated from JIRA)

Hi Sudarshan,

I don't believe that my implementation is gonna be of much practical value for the general public. Note that, as described above, in my implementation I store custom data that is useful for my application, but it almost certainly won't make any sense for the rest of applications.

I'm happy to tell you how to modify the code to store your own outputs, it's quite easy: 1) First you have to enable it at the code level, you just need to change NoOutputs by ByteSequenceOutputs and then in all the references of Arc<Object> or FST<Object> you need to change them by Arc<BytesRef> and FST<BytesRef>. 2) At build time, you need to store something in the output. You can do it by creating the appropriate BytesRef and including it in the builder.add() call instead of the placeholder value that is present now. 3) At query time, you need to collect the output while traversing the FST (note that the output may be scattered through the whole arc chain) and then you can process it in the way specific to your app. Probably you want to do it in the collect() method (when the LookupResults are created).

I believe that's all. If you have any questions, let me know.

Thanks Carlos

asfimport commented 12 years ago

Dawid Weiss (@dweiss) (migrated from JIRA)

Sudarshan,

If you take a look at the trunk version of FSTLookup it uses FSTCompletion underneath and that class in turn stores arbitrary byte sequences (text is converted to UTF8). Not byte outputs, but you could create your "suggestions" by concatenating input with output, divided with a marker or something. This will bloat the automaton, but if your data is relatively small, it's not a problem and you can still extract your "outputs" after suggestions are retrieved from the FST. Take a look at FSTCompletion and FSTCompletionBuilder (and tests), they'll be helpful.

asfimport commented 11 years ago

Michael McCandless (@mikemccand) (migrated from JIRA)

I'm linking the two precursor issues here ... once we resolve those issues I think all that remains here is a the cutover from int -> long in various places.

asfimport commented 11 years ago

Michael McCandless (@mikemccand) (migrated from JIRA)

Initial test to confirm FSTs can grow beyond 2GB (it fails today!).

asfimport commented 11 years ago

Michael McCandless (@mikemccand) (migrated from JIRA)

Initial patch with int -> long in lots of places ... the Test2BFST is still running ...

asfimport commented 11 years ago

Michael McCandless (@mikemccand) (migrated from JIRA)

New patch, beefing up the test (it passes: takes 10 GB heap and \~ 2 hours on my machine, making various 3 GB / 3 B node FST and test them), removing nocommits. I think it's ready.

asfimport commented 11 years ago

Dawid Weiss (@dweiss) (migrated from JIRA)

+  void finish(long startNode) throws IOException {
+    if (this.startNode != -1) {
+      throw new IllegalStateException("already finished");
+    }
     if (startNode == FINAL_END_NODE && emptyOutput != null) {
       startNode = 0;
     }
-    if (this.startNode != -1) {
-      throw new IllegalStateException("already finished");
-    }

Doesn't this change the logic of how it works? I also wonder about the perf. penalty this patch brings (on 64 systems mostly, but 32-bit JVMs will be most affected).

asfimport commented 11 years ago

Michael McCandless (@mikemccand) (migrated from JIRA)

Doesn't this change the logic of how it works?

I don't think it does? I just move the "you called me twice check" to the front of the method.

I also wonder about the perf. penalty this patch brings (on 64 systems mostly, but 32-bit JVMs will be most affected).

I'll test on a 64 bit CPU ...

asfimport commented 11 years ago

Dawid Weiss (@dweiss) (migrated from JIRA)

duh... sorry i missed the fact that method arg is called the same as field.

asfimport commented 11 years ago

Michael McCandless (@mikemccand) (migrated from JIRA)

Search perf looks fine ... maybe a bit slower for the terms dict/FST heavy queries (PKLookup, Fuzzy1/2, Respell):

                    Task    QPS base      StdDev    QPS comp      StdDev                Pct diff
              AndHighMed       66.76      (1.8%)       64.53      (0.8%)   -3.3% (  -5% -    0%)
                PKLookup      300.07      (1.1%)      295.77      (2.3%)   -1.4% (  -4% -    2%)
                 Respell       71.30      (3.0%)       70.35      (3.2%)   -1.3% (  -7% -    4%)
                  Fuzzy2       78.05      (2.6%)       77.14      (2.3%)   -1.2% (  -5% -    3%)
        HighSloppyPhrase       35.17      (4.6%)       34.82      (4.4%)   -1.0% (  -9% -    8%)
                  Fuzzy1       87.15      (3.2%)       86.36      (2.2%)   -0.9% (  -6% -    4%)
         LowSloppyPhrase      198.02      (4.5%)      196.62      (4.4%)   -0.7% (  -9% -    8%)
              AndHighLow     2344.92      (4.0%)     2328.77      (5.0%)   -0.7% (  -9% -    8%)
                 Prefix3      146.38      (1.6%)      145.83      (1.7%)   -0.4% (  -3% -    2%)
             MedSpanNear      125.96      (4.3%)      125.65      (4.4%)   -0.2% (  -8% -    8%)
             LowSpanNear       88.16      (2.2%)       87.97      (2.0%)   -0.2% (  -4% -    4%)
                  IntNRQ       15.10      (2.5%)       15.07      (2.3%)   -0.2% (  -4% -    4%)
              HighPhrase       17.05      (4.5%)       17.03      (5.4%)   -0.1% (  -9% -   10%)
            HighSpanNear       11.97      (4.3%)       11.96      (4.0%)   -0.1% (  -8% -    8%)
             AndHighHigh       71.79      (2.0%)       71.80      (0.9%)    0.0% (  -2% -    2%)
                Wildcard       41.93      (1.5%)       41.98      (1.3%)    0.1% (  -2% -    2%)
               MedPhrase       41.43      (1.7%)       41.48      (1.8%)    0.1% (  -3% -    3%)
                 MedTerm      199.42      (6.6%)      200.15      (6.5%)    0.4% ( -11% -   14%)
                HighTerm      142.32      (6.9%)      142.89      (6.6%)    0.4% ( -12% -   14%)
         MedSloppyPhrase       25.56      (5.9%)       25.67      (6.4%)    0.4% ( -11% -   13%)
                 LowTerm     1016.02      (3.3%)     1021.04      (3.2%)    0.5% (  -5% -    7%)
               LowPhrase       67.43      (2.1%)       67.80      (2.6%)    0.5% (  -4% -    5%)
              OrHighHigh       22.58      (5.0%)       22.89      (5.3%)    1.4% (  -8% -   12%)
               OrHighMed       52.47      (5.2%)       53.31      (5.2%)    1.6% (  -8% -   12%)
               OrHighLow       24.74      (5.4%)       25.18      (5.3%)    1.8% (  -8% -   13%)

I also tested building FST from all Wikipedia terms:

I also tested tokenizing first 100K Japanese Wikipedia docs w/ Kuromoji:

Only a wee bit slower (could easily be noise).

asfimport commented 11 years ago

Robert Muir (@rmuir) (migrated from JIRA)

This all looks like noise.

+1 to commit.

asfimport commented 11 years ago

Dawid Weiss (@dweiss) (migrated from JIRA)

The impact will show on 32-bit systems I'm pretty sure of that. We don't care about hardware archaeology, do we? :) +1.

asfimport commented 11 years ago

Michael McCandless (@mikemccand) (migrated from JIRA)

The impact will show on 32-bit systems I'm pretty sure of that.

Yeah I think it will too ...

We don't care about hardware archaeology, do we?

I think Lucene should continue to run on 32 bit hardware, but I don't think performance on 32 bit is important, ie we should optimize for 64 bit performance.

asfimport commented 11 years ago

Commit Tag Bot (migrated from JIRA)

[trunk commit] Michael McCandless http://svn.apache.org/viewvc?view=revision&revision=1433026

LUCENE-3298: FSTs can now be larger than 2GB, have more than 2B nodes

asfimport commented 11 years ago

Commit Tag Bot (migrated from JIRA)

[branch_4x commit] Robert Muir http://svn.apache.org/viewvc?view=revision&revision=1435141

5742, #5747, #5743, LUCENE-3298: Merged /lucene/dev/trunk:r1432459,1432466,1432472,1432474,1432522,1432646,1433026,1433109

asfimport commented 11 years ago

Steven Rowe (@sarowe) (migrated from JIRA)

Looks like this can be resolved?

asfimport commented 11 years ago

Uwe Schindler (@uschindler) (migrated from JIRA)

Closed after release.