Closed asfimport closed 12 years ago
Michael McCandless (@mikemccand) (migrated from JIRA)
New patch, folding in Rob's suggestions above (thanks!).
OK the super-large FST size was a false alarm: we were using RAMUsageEstimator, which then went and included the RAM usage of MockAnalyzer; I changed LookupBenchmarkTest to use FST.sizeInBytes instead:
.-- RAM consumption
JaspellLookup              size[B]: 9,815,152
TSTLookup                  size[B]: 9,858,792
FSTCompletionLookup        size[B]:   466,520
WFSTCompletionLookup       size[B]:   507,640
AnalyzingCompletionLookup  size[B]:   889,138
So we are still larger ... but not insanely so. I do think we could shrink the FST if we didn't add 2 bytes in the non-dup case ... I put a TODO to do this, but it'd make the exactFirst logic even hairier ...
I also put a TODO to use the end offset as a heuristic to "guess" whether final token was a partial token or not ...
Robert Muir (@rmuir) (migrated from JIRA)
Can we split Analyzer into indexAnalyzer and queryAnalyzer?
Can we also add 1 or 2 sugar ctors that use default values?
I'm thinking:
ctor(Analyzer analyzer) {
  this(analyzer, analyzer);
}

ctor(Analyzer indexAnalyzer, Analyzer queryAnalyzer) {
  this(indexAnalyzer, queryAnalyzer, default, default, default);
}

ctor(Analyzer indexAnalyzer, Analyzer queryAnalyzer, int option1, int option2, int option3) {
  // this is the full ctor!
}
Michael McCandless (@mikemccand) (migrated from JIRA)
Good ideas! New patch, separating indexAnalyzer and queryAnalyzer, w/ the sugar ctors.
I also renamed to AnalyzingSuggester.
Robert Muir (@rmuir) (migrated from JIRA)
+1, thanks Mike. Let's get it in!
Sudarshan Gaikaiwari (migrated from JIRA)
+1. This is awesome. It would be great to get this in trunk.
Michael McCandless (@mikemccand) (migrated from JIRA)
Thanks Sudarshan! It's actually already committed (will be in 4.1) ... I just forgot to resolve ...
David Smiley (@dsmiley) (migrated from JIRA)
That TokenStreamToAutomaton is cool Mike! I can put that to use in my FST text tagger work.
Commit Tag Bot (migrated from JIRA)
[branch_4x commit] Michael McCandless http://svn.apache.org/viewvc?view=revision&revision=1391704
LUCENE-3842: refactor: don't make spooky State methods public
Commit Tag Bot (migrated from JIRA)
[branch_4x commit] Michael McCandless http://svn.apache.org/viewvc?view=revision&revision=1391686
LUCENE-3842: add AnalyzingSuggester
Alexey Kudinov (migrated from JIRA)
I tried building the analyzing suggester model from an external file containing 1M short phrases taken from Wikipedia titles. 2 GB of memory is apparently not enough: it runs for a very long time and then dies with an OOM. What is the expected dictionary size? How does the benchmark behave?
Thanks!
Michael McCandless (@mikemccand) (migrated from JIRA)
The building process is unfortunately RAM intensive, but there are settings/knobs in the FST Builder API to tradeoff RAM required during building vs how small the resulting FST is. Maybe we need to expose control for these in AnalyzingSuggester ...
Can you share those 1M short phrases? What is the total number of characters across them?
Alexey Kudinov (migrated from JIRA)
Setting maxGraphExpansions to some value > 0 (say, 30) ends with a NullPointerException; paths is null here: maxAnalyzedPathsForOneInput = Math.max(maxAnalyzedPathsForOneInput, paths.size()). After fixing that, the model loads after a while. With maxGraphExpansions < 0 it doesn't load regardless of the dictionary size. I'm using the wordnet synonyms, so I guess this causes a lot of paths; I suspect loops. The total dictionary file size is about 20 MB, but this doesn't really matter, as I get similar behavior for an even smaller one (2 MB). The dataset is from here: http://wiki.dbpedia.org/Downloads32, Titles in English. I took the values only and tried different sizes (10M, 1M, 0.1M).
Michael McCandless (@mikemccand) (migrated from JIRA)
I'm using the wordnet synonyms, so I guess this causes a lot of paths, I suspect loops.
Ahhhh :) Yes this will cause lots of expansions / RAM used.
But NPE because paths is null sounds like a real bug.
OK, I see why it's happening: when we try to enumerate all finite strings from the expanded graph and the count exceeds the limit (maxGraphExpansions), SpecialOperations.getFiniteStrings returns null, but the calling code assumes it will return the N finite strings found so far. Can you open a new issue for this? We should fix it separately.
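The bug pattern described above can be sketched with stdlib-only code; `enumerateUpTo` and `safeMaxPaths` are hypothetical stand-ins for `SpecialOperations.getFiniteStrings` and the suggester's build loop, not the actual Lucene code:

```java
import java.util.ArrayList;
import java.util.List;

public class FiniteStringsBug {
    // Stand-in for getFiniteStrings: returns null when the expansion
    // limit is exceeded, rather than the strings found so far.
    static List<String> enumerateUpTo(List<String> expansions, int limit) {
        List<String> result = new ArrayList<>();
        for (String s : expansions) {
            if (result.size() >= limit) {
                return null; // callers that skip this check hit an NPE
            }
            result.add(s);
        }
        return result;
    }

    // The defensive fix on the caller side: check for null before size().
    static int safeMaxPaths(int currentMax, List<String> paths) {
        if (paths == null) {
            return currentMax; // limit exceeded; keep the previous maximum
        }
        return Math.max(currentMax, paths.size());
    }

    public static void main(String[] args) {
        // Under the limit: all strings come back.
        System.out.println(enumerateUpTo(List.of("a", "b"), 2).size());
        // Over the limit: null, which safeMaxPaths tolerates.
        System.out.println(safeMaxPaths(5, enumerateUpTo(List.of("a", "b", "c"), 2)));
    }
}
```

The null check mirrors the fix eventually needed in the suggester; an alternative design would be for the enumeration to return the partial result instead of null.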
Alexey Kudinov (migrated from JIRA)
I opened an issue for the NPE: #6035
Michael McCandless (@mikemccand) (migrated from JIRA)
Thank you Alexey!
Uwe Schindler (@uschindler) (migrated from JIRA)
Closed after release.
Since we added shortest-path wFSA search in #4788, and generified the comparator in #4874, I think we should look at implementing suggesters that have more capabilities than just basic prefix matching.
In particular I think the most flexible approach is to integrate with Analyzer at both build and query time, such that we build a wFST with:
input: analyzed text such as ghost0christmas0past <-- byte 0 here is an optional token separator
output: surface form such as "the ghost of christmas past"
weight: the weight of the suggestion
We make an FST with PairOutputs<weight,output>, but only do the shortest-path operation on the weight side (like the test in LUCENE-3801), at the same time accumulating the output (surface form), which will be the actual suggestion.
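The key encoding described above can be illustrated with a stdlib-only sketch; `encode` is a hypothetical helper (the real suggester builds these byte keys internally and feeds them into an FST whose outputs are (weight, surface form) pairs via Lucene's PairOutputs):

```java
import java.io.ByteArrayOutputStream;
import java.nio.charset.StandardCharsets;
import java.util.List;

public class AnalyzedKey {
    // Join analyzed tokens with a 0 byte, producing a key like
    // "ghost" 0x00 "christmas" 0x00 "past". The separator keeps token
    // boundaries distinguishable in the FST input.
    static byte[] encode(List<String> analyzedTokens) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        boolean first = true;
        for (String token : analyzedTokens) {
            if (!first) {
                out.write(0); // optional token separator byte
            }
            byte[] bytes = token.getBytes(StandardCharsets.UTF_8);
            out.write(bytes, 0, bytes.length);
            first = false;
        }
        return out.toByteArray();
    }

    public static void main(String[] args) {
        byte[] key = encode(List.of("ghost", "christmas", "past"));
        // 5 + 1 + 9 + 1 + 4 bytes
        System.out.println(key.length); // prints 20
    }
}
```

At lookup time the analyzed query prefix is matched against such keys, while the shortest-path search over the weight side of the pair outputs ranks the completions.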
This allows a lot of flexibility.
According to my benchmarks, suggestions are still very fast with the prototype (e.g. ~100,000 QPS), and the FST size does not explode (it's just short of twice that of a regular wFST, but this is still far smaller than TST or JaSpell, etc.).
Migrated from LUCENE-3842 by Robert Muir (@rmuir), 2 votes, resolved Oct 04 2012 Attachments: LUCENE-3842.patch (versions: 18), LUCENE-3842-TokenStream_to_Automaton.patch Linked issues:
5516