apache / lucene

Apache Lucene open-source search software
https://lucene.apache.org/
Apache License 2.0

Analyzing Suggester [LUCENE-3842] #4915

Closed asfimport closed 12 years ago

asfimport commented 12 years ago

Since we added shortest-path wFSA search in #4788, and generified the comparator in #4874, I think we should look at implementing suggesters that have more capabilities than just basic prefix matching.

In particular I think the most flexible approach is to integrate with Analyzer at both build and query time, such that we build a wFST with:

input: analyzed text such as ghost0christmas0past <-- byte 0 here is an optional token separator
output: surface form such as "the ghost of christmas past"
weight: the weight of the suggestion

we make an FST with PairOutputs<weight,output>, but only do the shortest path operation on the weight side (like the test in LUCENE-3801), at the same time accumulating the output (surface form), which will be the actual suggestion.

This allows a lot of flexibility:

According to my benchmarks, suggestions are still very fast with the prototype (e.g. ~100,000 QPS), and the FST size does not explode (it's just short of twice that of a regular wFST, but this is still far smaller than TST or JaSpell, etc.).
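To make the input/output/weight triple described above concrete, here is a minimal sketch. It is not the prototype from the patch: a sorted map stands in for the wFST, a tiny hand-written `analyze` method stands in for a real Analyzer, and the class name, stopword list, and "higher weight ranks first" rule are all assumptions made for illustration.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;

// Toy illustration only: a TreeMap replaces the FST, and analyze() replaces a real Analyzer.
class ToyAnalyzingSuggester {
    // Hypothetical stopword list; a real analyzer chain would define this.
    private static final Set<String> STOPWORDS = new HashSet<>(Arrays.asList("the", "of", "a"));
    // Character 0 plays the role of the optional token separator.
    private static final char SEP = '\u0000';

    private static final class Suggestion {
        final String surface;
        final long weight;
        Suggestion(String surface, long weight) { this.surface = surface; this.weight = weight; }
    }

    // Key: analyzed form. Value: the surface forms (with weights) that analyze to that key.
    private final TreeMap<String, List<Suggestion>> index = new TreeMap<>();

    // Lowercase, drop stopwords, join the remaining tokens with the separator.
    public static String analyze(String text) {
        StringBuilder sb = new StringBuilder();
        for (String token : text.toLowerCase(Locale.ROOT).trim().split("\\s+")) {
            if (token.isEmpty() || STOPWORDS.contains(token)) continue;
            if (sb.length() > 0) sb.append(SEP);
            sb.append(token);
        }
        return sb.toString();
    }

    public void add(String surface, long weight) {
        index.computeIfAbsent(analyze(surface), k -> new ArrayList<>())
             .add(new Suggestion(surface, weight));
    }

    // Analyze the query, collect every entry whose analyzed key starts with it,
    // then return the topN surface forms, highest weight first (an assumption of this sketch).
    public List<String> lookup(String query, int topN) {
        String prefix = analyze(query);
        List<Suggestion> hits = new ArrayList<>();
        // Keys sharing a prefix are contiguous in sorted order, so stop at the first mismatch.
        for (Map.Entry<String, List<Suggestion>> e : index.tailMap(prefix, true).entrySet()) {
            if (!e.getKey().startsWith(prefix)) break;
            hits.addAll(e.getValue());
        }
        hits.sort((x, y) -> Long.compare(y.weight, x.weight));
        List<String> out = new ArrayList<>();
        for (Suggestion s : hits) {
            if (out.size() == topN) break;
            out.add(s.surface);
        }
        return out;
    }

    public static void main(String[] args) {
        ToyAnalyzingSuggester s = new ToyAnalyzingSuggester();
        s.add("the ghost of christmas past", 50);
        s.add("Ghost Stories", 10);
        System.out.println(s.lookup("The Ghost of Chr", 5));
    }
}
```

Because matching happens on the analyzed form, a query that differs from the stored text in case or stopwords can still reach the original surface form, which is the behavior plain prefix matching cannot provide.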


Migrated from LUCENE-3842 by Robert Muir (@rmuir), 2 votes, resolved Oct 04 2012
Attachments: LUCENE-3842.patch (versions: 18), LUCENE-3842-TokenStream_to_Automaton.patch

asfimport commented 12 years ago

Michael McCandless (@mikemccand) (migrated from JIRA)

New patch, folding in Rob's suggestions above (thanks!).

OK the super-large FST size was a false alarm: we were using RAMUsageEstimator, which then went and included the RAM usage of MockAnalyzer; I changed LookupBenchmarkTest to use FST.sizeInBytes instead:

.-- RAM consumption
JaspellLookup             size[B]:    9,815,152
TSTLookup                 size[B]:    9,858,792
FSTCompletionLookup       size[B]:      466,520
WFSTCompletionLookup      size[B]:      507,640
AnalyzingCompletionLookup size[B]:      889,138

So we are still larger ... but not insanely so. I do think we could shrink the FST if we didn't add 2 bytes in the non-dup case ... I put a TODO to do this, but it'd make the exactFirst logic even hairier ...

I also put a TODO to use the end offset as a heuristic to "guess" whether final token was a partial token or not ...

asfimport commented 12 years ago

Robert Muir (@rmuir) (migrated from JIRA)

Can we split Analyzer into indexAnalyzer and queryAnalyzer?

Can we also add 1 or 2 sugar ctors that use default values?

I'm thinking:

ctor(Analyzer analyzer) {
  this(analyzer, analyzer);
}

ctor(Analyzer index, Analyzer query) {
  this(index, query, default, default, default);
}

ctor(Analyzer index, Analyzer query, int option, int option, int option) {
  // this is the full ctor!
}

asfimport commented 12 years ago

Michael McCandless (@mikemccand) (migrated from JIRA)

Good ideas! New patch, separating indexAnalyzer and queryAnalyzer, w/ the sugar ctors.

I also renamed to AnalyzingSuggester.
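For readers unfamiliar with the "sugar ctor" pattern discussed here, the following is a minimal, self-contained sketch of the constructor chaining. It is not the committed AnalyzingSuggester code: `Function<String, String>` is a stand-in for Analyzer, and the class name, the three option names, and their default values are hypothetical, since the thread does not say what the options are.

```java
import java.util.function.Function;

// Sketch of telescoping constructors; every name other than indexAnalyzer and
// queryAnalyzer is a placeholder invented for this illustration.
class SuggesterCtorSketch {
    // Hypothetical defaults for the three unnamed int options.
    public static final int DEFAULT_OPTION_A = 1;
    public static final int DEFAULT_OPTION_B = 2;
    public static final int DEFAULT_OPTION_C = 3;

    public final Function<String, String> indexAnalyzer;
    public final Function<String, String> queryAnalyzer;
    public final int optionA;
    public final int optionB;
    public final int optionC;

    // Sugar: one analyzer used at both build time and query time.
    public SuggesterCtorSketch(Function<String, String> analyzer) {
        this(analyzer, analyzer);
    }

    // Sugar: separate analyzers, default options.
    public SuggesterCtorSketch(Function<String, String> indexAnalyzer,
                               Function<String, String> queryAnalyzer) {
        this(indexAnalyzer, queryAnalyzer,
             DEFAULT_OPTION_A, DEFAULT_OPTION_B, DEFAULT_OPTION_C);
    }

    // Full constructor: the only place where fields are assigned.
    public SuggesterCtorSketch(Function<String, String> indexAnalyzer,
                               Function<String, String> queryAnalyzer,
                               int optionA, int optionB, int optionC) {
        this.indexAnalyzer = indexAnalyzer;
        this.queryAnalyzer = queryAnalyzer;
        this.optionA = optionA;
        this.optionB = optionB;
        this.optionC = optionC;
    }
}
```

Routing every shorter constructor through the full one keeps the defaults in a single place, so changing a default later touches only one line.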

asfimport commented 12 years ago

Robert Muir (@rmuir) (migrated from JIRA)

+1, thanks Mike. Let's get it in!

asfimport commented 12 years ago

Sudarshan Gaikaiwari (migrated from JIRA)

+1. This is awesome. It would be great to get this in trunk.

asfimport commented 12 years ago

Michael McCandless (@mikemccand) (migrated from JIRA)

Thanks Sudarshan! It's actually already committed (will be in 4.1) ... I just forgot to resolve ...

asfimport commented 11 years ago

David Smiley (@dsmiley) (migrated from JIRA)

That TokenStreamToAutomaton is cool Mike! I can put that to use in my FST text tagger work.

asfimport commented 11 years ago

Commit Tag Bot (migrated from JIRA)

[branch_4x commit] Michael McCandless http://svn.apache.org/viewvc?view=revision&revision=1391704

LUCENE-3842: refactor: don't make spooky State methods public

asfimport commented 11 years ago

Commit Tag Bot (migrated from JIRA)

[branch_4x commit] Michael McCandless http://svn.apache.org/viewvc?view=revision&revision=1391686

LUCENE-3842: add AnalyzingSuggester

asfimport commented 11 years ago

Alexey Kudinov (migrated from JIRA)

I tried building the analyzing suggester model from an external file containing 1 million short phrases taken from Wikipedia titles. 2 GB of memory does not seem to be enough: it runs for a very long time and then dies with an OOM. What is the expected dictionary size? What is the benchmark behavior?

Thanks!

asfimport commented 11 years ago

Michael McCandless (@mikemccand) (migrated from JIRA)

The building process is unfortunately RAM intensive, but there are settings/knobs in the FST Builder API to trade off the RAM required during building against how small the resulting FST is. Maybe we need to expose control for these in AnalyzingSuggester ...

Can you share those 1M short phrases? What is the total number of characters across them?

asfimport commented 11 years ago

Alexey Kudinov (migrated from JIRA)

Setting maxGraphExpansions to some value > 0 (say, 30) ends with a NullPointerException. paths is null here:

maxAnalyzedPathsForOneInput = Math.max(maxAnalyzedPathsForOneInput, paths.size());

After fixing this, the model loads after a while. With maxGraphExpansions < 0 it doesn't load, regardless of the dictionary size. I'm using the wordnet synonyms, so I guess this causes a lot of paths, I suspect loops.

The total dictionary file size is about 20 MB, but this doesn't really matter, as I get similar behavior even for a smaller one (2 MB). The dataset is from here: http://wiki.dbpedia.org/Downloads32, Titles in English. I took the values only and tried different sizes (10 million, 1 million, and 0.1 million entries).

asfimport commented 11 years ago

Michael McCandless (@mikemccand) (migrated from JIRA)

"I'm using the wordnet synonyms, so I guess this causes a lot of paths, I suspect loops."

Ahhhh :) Yes this will cause lots of expansions / RAM used.

But NPE because paths is null sounds like a real bug.

OK I see why it's happening ... when we try to enumerate all finite strings from the expanded graph, if it exceeds the limit (maxGraphExpansions), SpecialOperations.getFiniteStrings returns null but the code assumes it will return the N finite strings it had found "so far". Can you open a new issue for this? We should separately fix it.
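The failure mode described above can be reproduced in miniature. The sketch below is not Lucene code and not the eventual fix: `enumerateOrNull` is a hypothetical stand-in for an enumeration that returns null once the limit is exceeded, a negative limit is assumed to mean "no limit", and skipping the input on null is just one possible policy for the caller.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Miniature of the null-on-limit contract and a caller that guards against it.
class FiniteStringsGuardSketch {
    // Hypothetical stand-in: returns null (not a partial list) when the limit is exceeded.
    // A negative limit is treated as "no limit" in this sketch.
    public static List<String> enumerateOrNull(List<String> allPaths, int limit) {
        if (limit >= 0 && allPaths.size() > limit) {
            return null;
        }
        return new ArrayList<>(allPaths);
    }

    // The unguarded form, Math.max(currentMax, paths.size()), throws a
    // NullPointerException when paths is null. This version checks first.
    public static int updateMaxPaths(int currentMax, List<String> paths) {
        if (paths == null) {
            // One possible policy (an assumption): leave the maximum unchanged
            // and let the caller decide whether to skip this input.
            return currentMax;
        }
        return Math.max(currentMax, paths.size());
    }

    public static void main(String[] args) {
        List<String> expansions = Arrays.asList("car", "auto", "automobile");
        List<String> paths = enumerateOrNull(expansions, 2); // limit exceeded
        System.out.println(updateMaxPaths(1, paths));
    }
}
```

Whether the right behavior is to skip the input, truncate to the first N strings, or fail loudly is a design decision for the follow-up issue; the sketch only shows why the null return has to be handled explicitly.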

asfimport commented 11 years ago

Alexey Kudinov (migrated from JIRA)

I opened an issue for NPE - #6035

asfimport commented 11 years ago

Michael McCandless (@mikemccand) (migrated from JIRA)

Thank you Alexey!

asfimport commented 11 years ago

Uwe Schindler (@uschindler) (migrated from JIRA)

Closed after release.