apache / lucene

Apache Lucene open-source search software
https://lucene.apache.org/
Apache License 2.0

FST should allow controlling how hard builder tries to share suffixes [LUCENE-3289] #4362

Closed 13 years ago

asfimport commented 13 years ago

Today we have a boolean option on the FST builder telling it whether it should share suffixes.

If you turn this off, building is much faster and uses much less RAM, but the resulting FST is a prefix trie, larger than it needs to be. When it's on, the builder maintains a node hash holding every node seen so far in the FST; this uses up RAM and slows building down.
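The node-hash idea can be sketched roughly like this (a simplified, hypothetical illustration in plain Java, not Lucene's actual Builder API): build a trie, then canonicalize nodes bottom-up through a registry keyed on each node's outgoing arcs, so identical suffixes collapse into one node:

```java
import java.util.*;

// Simplified sketch of suffix sharing (hypothetical, not Lucene's Builder):
// build a trie, then minimize bottom-up by registering each node's
// signature in a hash map so identical suffixes collapse into one node.
public class SuffixShareDemo {
  static final class Node {
    final TreeMap<Character, Node> arcs = new TreeMap<>();
    boolean isFinal;
  }

  // Stable ids for canonical nodes, used to build child signatures.
  static final Map<Node, Integer> ids = new IdentityHashMap<>();

  static int id(Node n) {
    return ids.computeIfAbsent(n, k -> ids.size());
  }

  static Node buildTrie(List<String> words) {
    Node root = new Node();
    for (String w : words) {
      Node cur = root;
      for (char c : w.toCharArray()) {
        cur = cur.arcs.computeIfAbsent(c, k -> new Node());
      }
      cur.isFinal = true;
    }
    return root;
  }

  // Signature = final flag + (label, canonical child id) pairs.
  static String signature(Node n) {
    StringBuilder sb = new StringBuilder(n.isFinal ? "F" : "f");
    for (Map.Entry<Character, Node> e : n.arcs.entrySet()) {
      sb.append('|').append(e.getKey()).append('#').append(id(e.getValue()));
    }
    return sb.toString();
  }

  // Bottom-up minimization: canonicalize children first, then look this
  // node up in the registry (the "node hash"), registering it if unseen.
  static Node minimize(Node n, Map<String, Node> registry) {
    for (Map.Entry<Character, Node> e : n.arcs.entrySet()) {
      e.setValue(minimize(e.getValue(), registry));
    }
    return registry.computeIfAbsent(signature(n), k -> n);
  }

  static int countNodes(Node n, Set<Node> seen) {
    if (!seen.add(n)) return 0;
    int c = 1;
    for (Node child : n.arcs.values()) c += countNodes(child, seen);
    return c;
  }

  public static void main(String[] args) {
    List<String> words = Arrays.asList("station", "nation", "ration");
    int before = countNodes(buildTrie(words), new HashSet<>());
    Node dawg = minimize(buildTrie(words), new HashMap<>());
    int after = countNodes(dawg, new HashSet<>());
    System.out.println(before + " -> " + after); // 20 -> 8
  }
}
```

With no shared prefixes among "station", "nation", "ration", the trie has 20 nodes; after the common "ation" suffix is shared, the automaton has 8. The registry is what costs heap: it holds every node ever frozen.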

On a dataset that Elmer (see java-user thread "Autocompletion on large index" on Jul 6 2011) provided (thank you!), which is 1.32 M titles avg 67.3 chars per title, building with suffix sharing on took 22.5 seconds, required 1.25 GB heap, and produced 91.6 MB FST. With suffix sharing off, it was 8.2 seconds, 450 MB heap and 129 MB FST.

I think we should allow this boolean to be a shade of gray instead: usually, how well a suffix can be shared is a function of how far it is from the end of the string, so by adding a tunable N and only sharing suffixes whose length is < N, we can let the caller make reasonable tradeoffs.
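The tunable could be wired in along these lines (again a hypothetical sketch, not the actual patch): compute each node's height, i.e. the length of the longest suffix below it, and only consult the node hash when that height is below N, so deep nodes skip the hash entirely:

```java
import java.util.*;

// Hypothetical sketch of the proposed knob (not the actual patch): consult
// the node hash only for nodes whose height (longest suffix below them) is
// less than maxShareLen; deeper nodes are kept unshared, trading a larger
// FST for a smaller node hash (less heap, faster building).
public class BoundedShareDemo {
  static final class Node {
    final TreeMap<Character, Node> arcs = new TreeMap<>();
    boolean isFinal;
  }

  static final Map<Node, Integer> ids = new IdentityHashMap<>();

  static int id(Node n) {
    return ids.computeIfAbsent(n, k -> ids.size());
  }

  static Node buildTrie(List<String> words) {
    Node root = new Node();
    for (String w : words) {
      Node cur = root;
      for (char c : w.toCharArray()) {
        cur = cur.arcs.computeIfAbsent(c, k -> new Node());
      }
      cur.isFinal = true;
    }
    return root;
  }

  static String signature(Node n) {
    StringBuilder sb = new StringBuilder(n.isFinal ? "F" : "f");
    for (Map.Entry<Character, Node> e : n.arcs.entrySet()) {
      sb.append('|').append(e.getKey()).append('#').append(id(e.getValue()));
    }
    return sb.toString();
  }

  // Memoized height: length of the longest suffix below this node.
  static int height(Node n, Map<Node, Integer> heights) {
    Integer h = heights.get(n);
    if (h != null) return h;
    int max = 0;
    for (Node c : n.arcs.values()) max = Math.max(max, 1 + height(c, heights));
    heights.put(n, max);
    return max;
  }

  // Canonicalize bottom-up, but only register (and thus share) nodes whose
  // height is < maxShareLen; deeper nodes stay private copies.
  static Node minimize(Node n, int maxShareLen, Map<String, Node> registry,
                       Map<Node, Integer> heights) {
    for (Map.Entry<Character, Node> e : n.arcs.entrySet()) {
      e.setValue(minimize(e.getValue(), maxShareLen, registry, heights));
    }
    if (height(n, heights) >= maxShareLen) return n; // too deep to share
    return registry.computeIfAbsent(signature(n), k -> n);
  }

  static int countNodes(Node n, Set<Node> seen) {
    if (!seen.add(n)) return 0;
    int c = 1;
    for (Node child : n.arcs.values()) c += countNodes(child, seen);
    return c;
  }

  public static void main(String[] args) {
    List<String> words = Arrays.asList("station", "nation", "ration");
    for (int n : new int[] {1, 3, 100}) {
      Node fst = minimize(buildTrie(words), n, new HashMap<>(),
                          new IdentityHashMap<>());
      System.out.println("N=" + n + " nodes=" + countNodes(fst, new HashSet<>()));
    }
  }
}
```

On the toy input above, raising N walks the node count down from the pure prefix trie toward the fully minimal automaton, mirroring the FST-size curve measured below.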


Migrated from LUCENE-3289 by Michael McCandless (@mikemccand), 1 vote, resolved Jul 11 2011 Attachments: LUCENE-3289.patch (versions: 2)

asfimport commented 13 years ago

Michael McCandless (@mikemccand) (migrated from JIRA)

Initial rough patch showing the idea.

asfimport commented 13 years ago

Michael McCandless (@mikemccand) (migrated from JIRA)

NOTE: patch applies to 3.x.

I ran the patch on the titles, varying the max prefix sharing length:

Len  FST size (bytes)  Seconds
1 135446807 8.2
2 137632702 8.5
3 135177994 8.3
4 132782016 8.3
5 130415331 8.4
6 128086200 8.0
7 125797396 8.2
8 123552157 8.5
9 121358375 8.4
10 119228942 8.1
11 117181180 8.8
12 115229788 8.7
13 113388260 9.5
14 111664442 9.0
15 110059167 9.2
16 108572519 9.7
17 107201905 9.8
18 105942576 10.3
19 104791497 10.1
20 103745678 11.1
21 102801693 10.8
22 101957797 11.4
23 101206564 11.1
24 100541849 11.0
25 99956443 11.1
26 99443232 12.9
27 98995194 13.2
28 98604680 13.9
29 98264184 13.5
30 97969241 13.6
31 97714049 13.8
32 97494104 14.3
33 97304045 14.0
34 97140033 14.3
35 96998942 14.6
36 96877590 16.5
37 96773039 16.9
38 96682961 16.6
39 96605160 17.8
40 96537687 18.3
41 96479286 17.8
42 96428710 17.5
43 96384659 18.9
44 96346174 17.0
45 96312826 19.3
46 96283545 17.8
47 96257708 19.4
48 96235159 19.0
49 96215220 18.7
50 96197450 19.6
51 96181539 17.3
52 96167235 16.9
53 96154490 17.7
54 96143081 18.8
55 96132905 17.4
56 96123776 17.5
57 96115462 20.7
58 96108051 19.2
59 96101249 19.1
60 96095107 18.7
ALL 96020343 22.5

Very very odd that the FST size first goes up at N=2... not yet sure why. But from this curve it looks like there is a sweet spot around maybe N=24. I didn't measure required heap here, but it will also go down as N goes down.

asfimport commented 13 years ago

Michael McCandless (@mikemccand) (migrated from JIRA)

Patch, I think it's ready to commit!

Separately, we should think about how the suggest module should set these... I left it at "costly but perfect minimization".

asfimport commented 13 years ago

Robert Muir (@rmuir) (migrated from JIRA)

I think that's probably good for most cases?

In the example you gave, it seems that FST might not be the best algorithm? The strings are extremely long (more like short documents) and probably need to be "compressed" in some different data structure, e.g. a word-based one?

asfimport commented 13 years ago

Michael McCandless (@mikemccand) (migrated from JIRA)

Yeah I think "costly but perfect minimization" is the right default.

asfimport commented 13 years ago

Dawid Weiss (@dweiss) (migrated from JIRA)

Exactly. This is a very specific use case (long suggestions).

asfimport commented 13 years ago

Eks Dev (migrated from JIRA)

> The strings are extremely long (more like short documents) and probably need to be "compressed" in some different data structure, e.g. a word-based one?

That would indeed be cool, e.g. an FST with words (ngrams?) as symbols. Ages ago we used one trie for all unique terms, to get prefix/edit distance on words, and one word-trie (whose symbols were words, via a symbol table) for "documents". I am sure this would cut memory requirements significantly for multiword cases compared to a char-level FST. E.g. a TermDictionary that supports ord() could be used as a symbol table.
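The symbol-table idea might look like this (a hypothetical sketch; the class and method names are illustrative, with ord() standing in for whatever real term dictionary supplies the ords): intern each unique word to an int ord, so multiword suggestions become short int sequences that a word-level FST could share suffixes over:

```java
import java.util.*;

// Hypothetical sketch of the word-as-symbol idea: a symbol table maps each
// unique word to an int ord, so multiword suggestions become short int[]
// sequences that a word-level automaton could share suffixes over.
public class WordSymbolDemo {
  private final Map<String, Integer> ords = new LinkedHashMap<>();
  private final List<String> words = new ArrayList<>();

  // Intern a word, assigning the next ord on first sight.
  int ord(String word) {
    Integer o = ords.get(word);
    if (o == null) {
      o = words.size();
      words.add(word);
      ords.put(word, o);
    }
    return o;
  }

  // Encode a whitespace-separated phrase as a sequence of word symbols.
  int[] encode(String phrase) {
    String[] toks = phrase.split(" ");
    int[] syms = new int[toks.length];
    for (int i = 0; i < toks.length; i++) syms[i] = ord(toks[i]);
    return syms;
  }

  // Decode a symbol sequence back to the phrase.
  String decode(int[] syms) {
    StringBuilder sb = new StringBuilder();
    for (int s : syms) {
      if (sb.length() > 0) sb.append(' ');
      sb.append(words.get(s));
    }
    return sb.toString();
  }

  public static void main(String[] args) {
    WordSymbolDemo table = new WordSymbolDemo();
    int[] a = table.encode("the quick brown fox");
    int[] b = table.encode("the quick red fox");
    // Both phrases end in the same word symbol ("fox"), so a word-level
    // FST could share that one-symbol suffix even though the phrases
    // diverge at the character level.
    System.out.println(Arrays.toString(a) + " " + Arrays.toString(b));
  }
}
```

A phrase of ~67 chars (the average in the dataset above) becomes roughly a dozen symbols, so the automaton has far fewer nodes; the cost is the symbol table itself, which an ord()-capable term dictionary would provide for free.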