Analyzing Suggester [LUCENE-3842]

asfimport commented 12 years ago

Since we added shortest-path wFSA search in #4788, and generified the comparator in #4874, I think we should look at implementing suggesters that have more capabilities than just basic prefix matching.

In particular I think the most flexible approach is to integrate with Analyzer at both build and query time, such that we build a wFST with:

- input: analyzed text such as ghost0christmas0past (byte 0 here is an optional token separator)
- output: surface form such as "the ghost of christmas past"
- weight: the weight of the suggestion

We make an FST with PairOutputs<weight,output>, but only do the shortest-path operation on the weight side (like the test in LUCENE-3801), while accumulating the output (surface form), which becomes the actual suggestion.

This allows a lot of flexibility:

- Even using StandardAnalyzer means you can offer suggestions that ignore stopwords, e.g. if you type in "ghost of chr...", it will suggest "the ghost of christmas past".
- We can add support for synonyms/WordDelimiterFilter/etc. at both index and query time (there are tradeoffs here, and this is not implemented!).
- This is a basis for more complicated suggesters such as Japanese suggesters, where the analyzed form is in fact the reading, so we would add a TokenFilter that copies ReadingAttribute into term text to support that...
- Other general things, like offering suggestions that are more "fuzzy", e.g. by using a plural stemmer or ignoring accents.
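
To make the stopword case concrete, here is a minimal sketch of the lookup idea. It is not the prototype and not Lucene's API: a sorted map stands in for the wFST, the class name, the `analyze` helper and the stopword list are all made up for illustration, and higher weight simply ranks first (a real wFST would run shortest-path over an encoded cost instead).

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;

/** Toy stand-in for the idea above. Assumption: a TreeMap replaces the wFST. */
class AnalyzingSuggestSketch {
    // Stands in for the optional "byte 0" token separator.
    static final char SEP = '\0';
    // Hypothetical stopword list; a real Analyzer would supply its own.
    static final Set<String> STOPWORDS =
        new HashSet<>(Arrays.asList("the", "of", "a", "an"));

    static final class Entry {
        final String surface;
        final long weight;
        Entry(String surface, long weight) {
            this.surface = surface;
            this.weight = weight;
        }
    }

    // Analyzed form -> all surface forms that analyze to it.
    private final TreeMap<String, List<Entry>> index = new TreeMap<>();

    /** Hypothetical analysis chain: lowercase, split on whitespace, drop stopwords. */
    static String analyze(String text) {
        StringBuilder sb = new StringBuilder();
        for (String tok : text.toLowerCase(Locale.ROOT).split("\\s+")) {
            if (tok.isEmpty() || STOPWORDS.contains(tok)) continue;
            if (sb.length() > 0) sb.append(SEP);
            sb.append(tok);
        }
        return sb.toString();
    }

    /** Build time: key the entry by its analyzed form, keep the surface form as output. */
    void add(String surface, long weight) {
        index.computeIfAbsent(analyze(surface), k -> new ArrayList<>())
             .add(new Entry(surface, weight));
    }

    /** Query time: analyze the query the same way, then prefix-match on the analyzed keys. */
    List<String> lookup(String query, int topN) {
        String prefix = analyze(query);
        List<Entry> hits = new ArrayList<>();
        // Keys sharing a prefix are contiguous in sorted order, so scan from the prefix.
        for (Map.Entry<String, List<Entry>> e : index.tailMap(prefix, true).entrySet()) {
            if (!e.getKey().startsWith(prefix)) break;
            hits.addAll(e.getValue());
        }
        hits.sort(Comparator.comparingLong((Entry e) -> e.weight).reversed());
        List<String> out = new ArrayList<>();
        for (Entry h : hits.subList(0, Math.min(topN, hits.size()))) {
            out.add(h.surface);
        }
        return out;
    }

    public static void main(String[] args) {
        AnalyzingSuggestSketch s = new AnalyzingSuggestSketch();
        s.add("the ghost of christmas past", 50);
        s.add("ghost town", 10);
        // prints [the ghost of christmas past]
        System.out.println(s.lookup("ghost of chr", 5));
    }
}
```

The query "ghost of chr" analyzes to ghost0chr, which is a prefix of ghost0christmas0past, so the surface form with its stopwords comes back even though the user never typed "the".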

According to my benchmarks, suggestions are still very fast with the prototype (e.g. ~100,000 QPS), and the FST size does not explode (it's just short of twice that of a regular wFST, but this is still far smaller than TST or JaSpell, etc).

Migrated from LUCENE-3842 by Robert Muir (@rmuir), 2 votes, resolved Oct 04 2012

Attachments: LUCENE-3842.patch (versions: 18), LUCENE-3842-TokenStream_to_Automaton.patch

Linked issues:

- SOLR-2479
- 5516

asfimport commented 12 years ago

Michael McCandless (@mikemccand) (migrated from JIRA)

New patch, folding in Rob's suggestions above (thanks!).

OK the super-large FST size was a false alarm: we were using RAMUsageEstimator, which then went and included the RAM usage of MockAnalyzer; I changed LookupBenchmarkTest to use FST.sizeInBytes instead:

-- RAM consumption
JaspellLookup              size[B]:  9,815,152
TSTLookup                  size[B]:  9,858,792
FSTCompletionLookup        size[B]:    466,520
WFSTCompletionLookup       size[B]:    507,640
AnalyzingCompletionLookup  size[B]:    889,138

So we are still larger ... but not insanely so. I do think we could shrink the FST if we didn't add 2 bytes in the non-dup case ... I put a TODO to do this, but it'd make the exactFirst logic even hairier ...

I also put a TODO to use the end offset as a heuristic to "guess" whether final token was a partial token or not ...

asfimport commented 12 years ago

Robert Muir (@rmuir) (migrated from JIRA)

Can we split Analyzer into indexAnalyzer and queryAnalyzer?

Can we also add 1 or 2 sugar ctors that use default values?

I'm thinking:

ctor(Analyzer analyzer) {
  this(analyzer, analyzer);
}

ctor(Analyzer index, Analyzer query) {
  this(index, query, default, default, default);
}

ctor(Analyzer index, Analyzer query, int option, int option, int option) {
  // this is the full ctor!
}
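
For reference, the chaining above could be spelled out in compilable Java roughly as follows. This is only a sketch: `AnalyzerStub` is a placeholder for Lucene's Analyzer so the snippet compiles on its own, and the class name, the three option names and their default values are invented, since the thread does not say what the options are.

```java
/** Placeholder for Lucene's Analyzer, used only so this sketch is self-contained. */
interface AnalyzerStub { }

class SuggesterCtorSketch {
    // Hypothetical defaults; the real options and their values are not given in the thread.
    static final int DEFAULT_OPTION_A = 0;
    static final int DEFAULT_OPTION_B = 0;
    static final int DEFAULT_OPTION_C = 0;

    final AnalyzerStub indexAnalyzer;
    final AnalyzerStub queryAnalyzer;
    final int optionA;
    final int optionB;
    final int optionC;

    /** Sugar: use the same analyzer at index and query time. */
    SuggesterCtorSketch(AnalyzerStub analyzer) {
        this(analyzer, analyzer);
    }

    /** Sugar: separate analyzers, default values for everything else. */
    SuggesterCtorSketch(AnalyzerStub index, AnalyzerStub query) {
        this(index, query, DEFAULT_OPTION_A, DEFAULT_OPTION_B, DEFAULT_OPTION_C);
    }

    /** The full ctor: every other ctor delegates here. */
    SuggesterCtorSketch(AnalyzerStub index, AnalyzerStub query,
                        int optionA, int optionB, int optionC) {
        this.indexAnalyzer = index;
        this.queryAnalyzer = query;
        this.optionA = optionA;
        this.optionB = optionB;
        this.optionC = optionC;
    }
}
```

Funnelling everything through the full ctor keeps the defaults in one place, so a later change to a default only touches the constants.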

asfimport commented 12 years ago

Michael McCandless (@mikemccand) (migrated from JIRA)

Good ideas! New patch, separating indexAnalyzer and queryAnalyzer, w/ the sugar ctors.

I also renamed it to AnalyzingSuggester.

asfimport commented 12 years ago

Robert Muir (@rmuir) (migrated from JIRA)

+1, thanks Mike. Let's get it in!

asfimport commented 12 years ago

Sudarshan Gaikaiwari (migrated from JIRA)

+1. This is awesome. It would be great to get this in trunk.

asfimport commented 12 years ago

Michael McCandless (@mikemccand) (migrated from JIRA)

Thanks Sudarshan! It's actually already committed (will be in 4.1) ... I just forgot to resolve ...