apache / lucene

Apache Lucene open-source search software
https://lucene.apache.org/
Apache License 2.0

Fuzzy suggester [LUCENE-3846] #4919

Closed - asfimport closed this 11 years ago

asfimport commented 12 years ago

Would be nice to have a suggester that can handle some fuzziness (like spell correction) so that it's able to suggest completions that are "near" what you typed.

As a first go at this, I implemented "1T" (i.e., up to 1 edit, including a transposition), except that the first letter must be correct.

But there is a penalty: the "corrected" suggestion needs to have a much higher freq than the "exact match" suggestion before it can compete.
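
To make these two rules concrete, here is a small self-contained sketch. This is illustrative Java, not the patch's actual code, and the penalty factor is an invented constant:

    // Illustrative sketch only, not the patch's code.
    class OneEditSketch {
      /** "1T": cand is within one edit (insert/delete/substitute/adjacent
       *  transposition) of query, and the first character must match. */
      static boolean matches1T(String query, String cand) {
        if (query.isEmpty() || cand.isEmpty() || query.charAt(0) != cand.charAt(0)) {
          return false;
        }
        int n = query.length(), m = cand.length();
        if (Math.abs(n - m) > 1) return false;
        if (query.equals(cand)) return true;
        int i = 0; // position of the first mismatch
        while (i < Math.min(n, m) && query.charAt(i) == cand.charAt(i)) i++;
        if (n == m) {
          // one substitution: everything after the mismatch must match
          if (query.regionMatches(i + 1, cand, i + 1, n - i - 1)) return true;
          // one transposition of the adjacent characters at i and i+1
          return i + 1 < n
              && query.charAt(i) == cand.charAt(i + 1)
              && query.charAt(i + 1) == cand.charAt(i)
              && query.regionMatches(i + 2, cand, i + 2, n - i - 2);
        }
        // one insertion/deletion: the longer string skips one char at i
        String longer = n > m ? query : cand;
        String shorter = n > m ? cand : query;
        return longer.regionMatches(i + 1, shorter, i, shorter.length() - i);
      }

      /** The penalty: a corrected (fuzzy) suggestion competes with an
       *  exact match only if its frequency clears a multiplicative bar. */
      static boolean fuzzyCompetes(long fuzzyFreq, long exactFreq) {
        final long PENALTY = 4; // hypothetical factor, not from the patch
        return fuzzyFreq >= PENALTY * exactFreq;
      }
    }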

Still tons of nocommits, and somehow we should merge this / make it work with analyzing suggester too (#4915).


Migrated from LUCENE-3846 by Michael McCandless (@mikemccand), 1 vote, resolved Oct 30 2012. Attachments: LUCENE-3846_fuzzy_analyzing.patch, LUCENE-3846.patch (versions: 7)

asfimport commented 12 years ago

Michael McCandless (@mikemccand) (migrated from JIRA)

Patch, very much work in progress... there are some hard things to work out still.

asfimport commented 12 years ago

Eks Dev (migrated from JIRA)

Awesome! FST/A went a long way.

Just a few random thoughts, triggered by "... "corrected" suggestion needs to have a much higher freq than the "exact match"..."

Frequency influence is normally slightly more complicated than "only more popular", depending on the search task the user is facing. "Only more popular" helps if we assume the user types it wrong and our suggestion dictionary is always right. But in cases where the user types it correctly and the collection contains errors, you would cut off all documents with "fuzzy".

What I found works pretty well is treating this as a nearest-neighbor problem: the task is to find the closest matches to the query, some more and some less popular. Take for example a case where the user types "black dog" and our collection contains the document "blaKC dog"; since the frequency of "blakc" is much lower than that of "black", "only more popular" would miss this document.

What works pretty well out of the box is comparing the frequency of the query word and the "candidate" against some reasonable cut-off, classifying them as "HF"/"LF" (high/low frequency) terms. This is based on the fact that typos are normally very rare (if not, they should be treated as synonyms!). So if the user types an LF token, the fuzzy candidate is probably HF, and the other way around.
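
A minimal sketch of that heuristic (the cutoff, names, and acceptance rule are illustrative, not from any Lucene code):

    // Illustrative HF/LF heuristic: a typo and its correction usually
    // sit on opposite sides of a frequency cutoff.
    class FreqClassHeuristic {
      enum FreqClass { HF, LF }

      static FreqClass classify(long freq, long cutoff) {
        return freq >= cutoff ? FreqClass.HF : FreqClass.LF;
      }

      /** Accept a fuzzy candidate only when the query term and the
       *  candidate fall into different frequency classes. */
      static boolean acceptFuzzyCandidate(long queryFreq, long candFreq, long cutoff) {
        return classify(queryFreq, cutoff) != classify(candFreq, cutoff);
      }
    }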

But as said, it depends on what the task is.

The next level for "fuzzy *" in Lucene would be specifying separate costs for inserts/deletes, swaps and transpositions at the character (byte) level, and optionally considering the position of the edit. This brings precision++ if used properly.

Apart from that, I see absolutely nothing more anyone on earth could do better :)

Sorry again for just firing "wish lists" at you guys; my schedule really does not permit any serious work in the form of patches.

asfimport commented 12 years ago

Robert Muir (@rmuir) (migrated from JIRA)

The next level for "fuzzy *" in Lucene would be specifying separate costs for inserts/deletes, swaps and transpositions at the character (byte) level, and optionally considering the position of the edit. This brings precision++ if used properly.

It's probably "good enough" for this suggester to let someone re-rank their top-N with a StringDistance: http://svn.apache.org/repos/asf/lucene/dev/trunk/modules/suggest/src/java/org/apache/lucene/search/spell/StringDistance.java

Such things are language- and domain-specific, and I think just adding this pluggability will work pretty well, rather than trying to complicate the actual intersection algorithm (which will ultimately never satisfy everyone anyway).

The default can be "internal metric", which means no re-ranking at all. This is how DirectSpellChecker works, for example.
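
A sketch of that pluggability, assuming the StringDistance interface linked above (whose getDistance returns higher values for closer strings) and the suggest module's LookupResult with a key text and a value weight; over-fetching the top-N before re-ranking is an assumed usage pattern, not anything prescribed here:

    import java.util.Collections;
    import java.util.Comparator;
    import java.util.List;
    import org.apache.lucene.search.spell.StringDistance;
    import org.apache.lucene.search.suggest.Lookup.LookupResult;

    class Reranker {
      /** Re-rank an over-fetched top-N by a pluggable StringDistance,
       *  breaking ties by the suggester's own weight. Sketch only. */
      static void rerank(final String query, List<LookupResult> results,
                         final StringDistance distance) {
        Collections.sort(results, new Comparator<LookupResult>() {
          @Override
          public int compare(LookupResult a, LookupResult b) {
            // closer string first (getDistance: higher = more similar)
            int cmp = Float.compare(distance.getDistance(query, b.key.toString()),
                                    distance.getDistance(query, a.key.toString()));
            if (cmp != 0) return cmp;
            return Long.compare(b.value, a.value); // then higher weight first
          }
        });
      }
    }

For example, the distance could be the spell package's LevensteinDistance, applied after fetching a few times more results than will actually be shown.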

asfimport commented 12 years ago

Eks Dev (migrated from JIRA)

Sure as hell, re-ranking covers most of the cases. If you are not saturating the top-N depth it works just fine, but if you are saturating top-N, you have to increase the depth / number of allowed edits, which in turn hurts performance...

"rather than trying to complicate the actual intersection algorithm"

The logic in the intersection algorithm would not have to know anything about the language specifics; those would be defined in the cost matrix. But supporting a cost matrix per edit operation deep down can be complex. You would simply reduce language/domain parametrization to the configuration of costs in the matrix.
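
For illustration, here is a generic weighted variant of the optimal-string-alignment edit-distance DP, where the per-operation costs come from a pluggable model. The EditCosts interface and everything in it are assumptions for this sketch, not anything in the patch:

    // Sketch: weighted edit distance with pluggable per-operation costs.
    interface EditCosts {
      double insert(char c);
      double delete(char c);
      double substitute(char from, char to); // should return 0 when from == to
      double transpose(char a, char b);
    }

    class WeightedEditDistance {
      static double distance(String s, String t, EditCosts costs) {
        int n = s.length(), m = t.length();
        double[][] d = new double[n + 1][m + 1];
        for (int i = 1; i <= n; i++) {
          d[i][0] = d[i - 1][0] + costs.delete(s.charAt(i - 1));
        }
        for (int j = 1; j <= m; j++) {
          d[0][j] = d[0][j - 1] + costs.insert(t.charAt(j - 1));
        }
        for (int i = 1; i <= n; i++) {
          for (int j = 1; j <= m; j++) {
            double best =
                d[i - 1][j - 1] + costs.substitute(s.charAt(i - 1), t.charAt(j - 1));
            best = Math.min(best, d[i - 1][j] + costs.delete(s.charAt(i - 1)));
            best = Math.min(best, d[i][j - 1] + costs.insert(t.charAt(j - 1)));
            // adjacent transposition (optimal string alignment variant)
            if (i > 1 && j > 1
                && s.charAt(i - 1) == t.charAt(j - 2)
                && s.charAt(i - 2) == t.charAt(j - 1)) {
              best = Math.min(best,
                  d[i - 2][j - 2] + costs.transpose(s.charAt(i - 2), s.charAt(i - 1)));
            }
            d[i][j] = best;
          }
        }
        return d[n][m];
      }
    }

With every cost fixed at 1 (and 0 for identical characters) this reduces to plain Damerau-Levenshtein distance, so the language/domain knowledge lives entirely in the EditCosts implementation.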

asfimport commented 12 years ago

Robert Muir (@rmuir) (migrated from JIRA)

The logic in the intersection algorithm would not have to know anything about the language specifics; those would be defined in the cost matrix. But supporting a cost matrix per edit operation deep down can be complex. You would simply reduce language/domain parametrization to the configuration of costs in the matrix.

Like I said, it won't satisfy everyone. Lots of people are going to want ranking that's way more complex than just a cost matrix anyway, e.g. one that works on context, or phonemes, or other things.

So I think it's good to just plug in the re-ranking so people can write whatever StringDistance they want and call it a day.

Personally I don't think context-free single-character cost matrices really help; feel free to show me evidence they do :)

asfimport commented 12 years ago

Michael McCandless (@mikemccand) (migrated from JIRA)

This is a nice benefit of the path-based best-first search (in the patch): it's easy to use a custom cost matrix. The cost can be context-dependent too (based on past matched characters, though not [easily] future ones).

We don't need to explore that now, before committing this, but it's nice that we'll have the freedom to do so later.

Re-ranking definitely adds cost since you'll have to pull a bigger N. I think we'll likely have to somehow combine the cost and "isFuzzy" into a single cost during the search, not after (re-ranking). Not sure how to do that yet...
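
One way to picture that combination (purely a sketch of the idea being described here, not the patch): give every path in a single best-first queue a score that blends the FST output (popularity cost) with the edits spent so far, so exact and fuzzy paths compete directly. The constant is invented:

    // Hypothetical single-queue path scoring: lower is better.
    // EDIT_PENALTY trades popularity against fuzziness.
    static long pathScore(long fstCostSoFar, int editsSoFar) {
      final long EDIT_PENALTY = 1000000L; // invented for illustration
      return fstCostSoFar + editsSoFar * EDIT_PENALTY;
    }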

asfimport commented 12 years ago

Eks Dev (migrated from JIRA)

feel free to show me evidence they do

Even here they help a lot; do not underestimate the error model! (As in the noisy channel; see http://norvig.com/spell-correct.html for a nice overview.)

Examples, off the top of my head: if you search for Carin in a set {Karin, Marin, Darin} (all valid names, at edit distance one), you would prefer to see Karin as the highest-ranked (or even the only) fuzzy suggestion, since C and K are close consonants.

Or discount swap(vowel, vowel) versus swap(vowel/consonant, consonant): mistaking one vowel for another is more probable than mistaking two consonants, or a consonant and a vowel (as long as humans are doing the typing).
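
Tying these examples to the cost-matrix sketch above, a toy substitution cost could look like this (the character sets and discounts are invented for illustration):

    // Toy substitution costs: discount vowel-for-vowel mix-ups and the
    // close consonants from the Carin/Karin example. Numbers are made up.
    static final String VOWELS = "aeiou";

    static double substituteCost(char from, char to) {
      if (from == to) return 0.0;
      if (VOWELS.indexOf(from) >= 0 && VOWELS.indexOf(to) >= 0) {
        return 0.5; // vowel <-> vowel: a common human typo
      }
      if ((from == 'c' && to == 'k') || (from == 'k' && to == 'c')) {
        return 0.4; // close consonants: Carin vs. Karin
      }
      return 1.0;   // everything else: full cost
    }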

Books scanned using OCR have no problems with phonetics, but other error sources do...

Context is important: in-word context as part of the "error model" (character-level context, like the previous character), but even more important is the context from the "language model", which normally dominates.

I could look for some interesting papers in my archives if you are not convinced yet :) This one is worth reading (http://acl.ldc.upenn.edu/P/P00/P00-1037.pdf); it tackles, among other things, exactly this topic.

it's easy to use a custom cost matrix. The cost can be context-dependent too (based on past matched characters, though not [easily] future ones).

Great to hear that!
Prefix-based context is the only sub-word-level context I have ever used; I doubt lookahead brings much.

asfimport commented 12 years ago

Robert Muir (@rmuir) (migrated from JIRA)

I could look for some interesting papers in my archives if you are not convinced yet

None of that is evidence; those are just examples. I could list examples where a cost table hurts, too.

Evidence is an actual experiment with results showing that some cost table gives better results.

asfimport commented 12 years ago

Eks Dev (migrated from JIRA)

Sure, give me enough annotated data and I can give you a "close to optimal" cost matrix. There are (rather simple) ways to estimate these costs.

Or are you arguing that there is no cost table better than the one filled with ones?

asfimport commented 12 years ago

Robert Muir (@rmuir) (migrated from JIRA)

Or are you arguing that there is no cost table better than the one filled with ones?

Yes. There is no evidence to the contrary, so why make things complicated?

asfimport commented 12 years ago

Eks Dev (migrated from JIRA)

Robert, I am not talking from some abstract, theoretical point of view; I gained my own experience on nontrivial Lucene datasets that are unfortunately not for sharing. Having the possibility to train cost matrices per edit operation brings a lot, but you may have had a different experience (different problems, different data...).

Without specifying a concrete task (annotated data), there is no notion of "better", so this argument simply does not help ("show me it is better", "no, you show me the all-ones matrix is better than any other", "no, no..."). It simply comes down to the experience we have each made in the past; different opinions.

I personally would not try this argument with molecular biology teams and tell them their PAM and BLOSUM matrices are worthless, or with someone in the record linkage community (Lucene has been used in this context a lot), or ...

asfimport commented 12 years ago

Robert Muir (@rmuir) (migrated from JIRA)

I personally would not try this argument with molecular biology teams and tell them their PAM and BLOSUM matrices are worthless, or with someone in the record linkage community (Lucene has been used in this context a lot), or ...

That's OK, I would :)

I don't think we should complicate already-complicated things unless there is some clear benefit.

I'm not worried about offending anyone.

asfimport commented 12 years ago

Simon Willnauer (@s1monw) (migrated from JIRA)

I applied this and fixed some compile errors while reading through the code. I also simplified the build() method and removed some minor things.

Basically, updated to trunk.

asfimport commented 12 years ago

Michael McCandless (@mikemccand) (migrated from JIRA)

Thanks Simon!

asfimport commented 12 years ago

Robert Muir (@rmuir) (migrated from JIRA)

Here's my hacky Fuzzy+Analyzing prototype.

But we need to fix intersectPrefixPaths to be able to efficiently intersect transition ranges (e.g. findTargetArc + readNextArc through the range?).

Anyway, we should see how slow this is compared to Mike's: the advantage would be that you still get all the stuff AnalyzingSuggester has...

asfimport commented 12 years ago

Simon Willnauer (@s1monw) (migrated from JIRA)

Here is a patch that adds the missing intersect method and several tests derived from the AnalyzingSuggesterTest. The tests all pass at this point, but I do get a weird failure if I run the benchmarks: somehow the TopNSearcher runs into a bad state which I can't really figure out.

This patch has several refactorings in AnalyzingSuggester, mainly to make testing easier in the fuzzy case (encapsulated some stuff into package-private methods, etc.). There are still tons of nocommits, but at least we have something working.

Regarding the failure, I see a NoSuchElementException from the "queue" in the top-N searcher, which somehow removed the bottom and then tries to pull the last element, which doesn't exist (stacktrace below). The funky thing is that this doesn't happen if I run with exactFirst=false, yet the problem seems to be in the non-exactFirst part (see stacktrace). I use a direct intersection for exactFirst in the fuzzy case, so that code is identical to the analyzing suggester, since the intersection of the LD automaton doesn't return enough information to tell what is an exact match.

here is the stacktrace:

java.util.NoSuchElementException
    at java.util.TreeMap.key(TreeMap.java:1206)
    at java.util.TreeMap.lastKey(TreeMap.java:274)
    at java.util.TreeSet.last(TreeSet.java:384)
    at org.apache.lucene.util.fst.Util$TopNSearcher.addIfCompetitive(Util.java:339)
    at org.apache.lucene.util.fst.Util$TopNSearcher.search(Util.java:453)
    at org.apache.lucene.search.suggest.analyzing.AnalyzingSuggester.lookup(AnalyzingSuggester.java:581)
    at org.apache.lucene.search.suggest.LookupBenchmarkTest$2.call(LookupBenchmarkTest.java:228)
    at org.apache.lucene.search.suggest.LookupBenchmarkTest$2.call(LookupBenchmarkTest.java:1)
    at org.apache.lucene.search.suggest.LookupBenchmarkTest.measure(LookupBenchmarkTest.java:253)
    at org.apache.lucene.search.suggest.LookupBenchmarkTest.runPerformanceTest(LookupBenchmarkTest.java:224)
    at org.apache.lucene.search.suggest.LookupBenchmarkTest.testPerformanceOnPrefixes6_9(LookupBenchmarkTest.java:192)
NOTE: reproduce with: ant test  -Dtestcase=LookupBenchmarkTest -Dtests.method=testPerformanceOnPrefixes6_9 -Dtests.seed=B5BAF2A9592263BC -Dtests.locale=fi_FI -Dtests.timezone=Africa/Lagos -Dtests.file.encoding=UTF-8
NOTE: test params are: codec=Lucene40: {}, sim=DefaultSimilarity, locale=fi_FI, timezone=Africa/Lagos
NOTE: Linux 2.6.38-16-generic amd64/Sun Microsystems Inc. 1.6.0_26 (64-bit)/cpus=12,threads=1,free=578809008,total=1539571712
NOTE: All tests run in this JVM: [LookupBenchmarkTest]

mike if you get a chance it would be great if you could look into that one?!

asfimport commented 12 years ago

Simon Willnauer (@s1monw) (migrated from JIRA)

Updated patch - Mike's commit conflicted with mine along the way :)

asfimport commented 12 years ago

Michael McCandless (@mikemccand) (migrated from JIRA)

New patch fixing that exception in the benchmark: it's a pre-existing bug in how the bottom is set... if the queue has become empty, we just have to set the bottom to null.

I think we should fix this separately... I'll commit it (not sure why WFST/AnalyzingSuggester haven't hit this already). It only happens w/ exactFirst because that removes one of the competing top-N paths from the queue, and then if there aren't enough suggestions remaining, the queue empties before we find the top-N results... I'll work up a test.
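
The shape of that fix, sketched outside the real Util.TopNSearcher code (queue and bottom are stand-ins for its internal fields):

    // After removing a competing path from the queue: the old code assumed
    // the queue was never empty here, which is what triggered the
    // TreeSet.last() NoSuchElementException in the stacktrace above.
    if (queue.isEmpty()) {
      bottom = null;          // nothing left to compare against
    } else {
      bottom = queue.last();  // recompute the worst competitive path
    }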

asfimport commented 12 years ago

Michael McCandless (@mikemccand) (migrated from JIRA)

Argh, I forgot to svn add the files... I'll make a branch and commit there.

asfimport commented 12 years ago

Michael McCandless (@mikemccand) (migrated from JIRA)

OK branch here: https://svn.apache.org/repos/asf/lucene/dev/branches/lucene3846

I committed the last patch.

asfimport commented 12 years ago

Simon Willnauer (@s1monw) (migrated from JIRA)

Just for the record, here are my benchmark numbers for the latest branch code:

Test class requires enabled assertions, enable globally (-ea) or for Solr/Lucene subpackages only: org.apache.lucene.search.suggest.LookupBenchmarkTest
-- prefixes: 6-9, num: 7, onlyMorePopular: true
FuzzySuggester  queries: 50001, time[ms]: 4650 [+- 12.56], ~kQPS: 11
AnalyzingSuggester queries: 50001, time[ms]: 444 [+- 1.89], ~kQPS: 113
JaspellLookup   queries: 50001, time[ms]: 181 [+- 0.96], ~kQPS: 275
TSTLookup       queries: 50001, time[ms]: 229 [+- 2.35], ~kQPS: 218
FSTCompletionLookup queries: 50001, time[ms]: 245 [+- 3.54], ~kQPS: 204
WFSTCompletionLookup queries: 50001, time[ms]: 121 [+- 1.72], ~kQPS: 413
-- prefixes: 100-200, num: 7, onlyMorePopular: true
FuzzySuggester  queries: 50001, time[ms]: 5432 [+- 20.86], ~kQPS: 9
AnalyzingSuggester queries: 50001, time[ms]: 403 [+- 1.47], ~kQPS: 124
JaspellLookup   queries: 50001, time[ms]: 129 [+- 1.24], ~kQPS: 389
TSTLookup       queries: 50001, time[ms]: 68 [+- 4.03], ~kQPS: 739
FSTCompletionLookup queries: 50001, time[ms]: 254 [+- 2.60], ~kQPS: 197
WFSTCompletionLookup queries: 50001, time[ms]: 82 [+- 1.03], ~kQPS: 610
-- construction time
FuzzySuggester  input: 50001, time[ms]: 450 [+- 1.86]
AnalyzingSuggester input: 50001, time[ms]: 449 [+- 1.82]
JaspellLookup   input: 50001, time[ms]: 40 [+- 3.80]
TSTLookup       input: 50001, time[ms]: 111 [+- 3.33]
FSTCompletionLookup input: 50001, time[ms]: 213 [+- 4.36]
WFSTCompletionLookup input: 50001, time[ms]: 156 [+- 2.08]
-- prefixes: 2-4, num: 7, onlyMorePopular: true
FuzzySuggester  queries: 50001, time[ms]: 3571 [+- 12.15], ~kQPS: 14
AnalyzingSuggester queries: 50001, time[ms]: 997 [+- 5.73], ~kQPS: 50
JaspellLookup   queries: 50001, time[ms]: 494 [+- 2.25], ~kQPS: 101
TSTLookup       queries: 50001, time[ms]: 1846 [+- 9.67], ~kQPS: 27
FSTCompletionLookup queries: 50001, time[ms]: 221 [+- 1.57], ~kQPS: 227
WFSTCompletionLookup queries: 50001, time[ms]: 457 [+- 9.05], ~kQPS: 109
-- RAM consumption
FuzzySuggester  size[B]:      889,138
AnalyzingSuggester size[B]:      889,138
JaspellLookup   size[B]:    9,815,128
TSTLookup       size[B]:    9,858,792
FSTCompletionLookup size[B]:      466,520
WFSTCompletionLookup size[B]:      507,640

asfimport commented 12 years ago

Simon Willnauer (@s1monw) (migrated from JIRA)

Here is another benchmark with minPrefix=3 instead of 1; this looks much better:

-- prefixes: 6-9, num: 7, onlyMorePopular: true
FuzzySuggester  queries: 50001, time[ms]: 2125 [+- 6.38], ~kQPS: 24
AnalyzingSuggester queries: 50001, time[ms]: 452 [+- 3.61], ~kQPS: 111
JaspellLookup   queries: 50001, time[ms]: 187 [+- 1.02], ~kQPS: 267
TSTLookup       queries: 50001, time[ms]: 263 [+- 1.78], ~kQPS: 190
FSTCompletionLookup queries: 50001, time[ms]: 269 [+- 1.59], ~kQPS: 186
WFSTCompletionLookup queries: 50001, time[ms]: 121 [+- 0.75], ~kQPS: 414
-- prefixes: 100-200, num: 7, onlyMorePopular: true
FuzzySuggester  queries: 50001, time[ms]: 2778 [+- 16.56], ~kQPS: 18
AnalyzingSuggester queries: 50001, time[ms]: 414 [+- 1.70], ~kQPS: 121
JaspellLookup   queries: 50001, time[ms]: 133 [+- 1.85], ~kQPS: 376
TSTLookup       queries: 50001, time[ms]: 69 [+- 3.41], ~kQPS: 724
FSTCompletionLookup queries: 50001, time[ms]: 257 [+- 1.79], ~kQPS: 194
WFSTCompletionLookup queries: 50001, time[ms]: 83 [+- 3.31], ~kQPS: 605
-- prefixes: 2-4, num: 7, onlyMorePopular: true
FuzzySuggester  queries: 50001, time[ms]: 1310 [+- 3.30], ~kQPS: 38
AnalyzingSuggester queries: 50001, time[ms]: 995 [+- 8.03], ~kQPS: 50
JaspellLookup   queries: 50001, time[ms]: 507 [+- 4.19], ~kQPS: 99
TSTLookup       queries: 50001, time[ms]: 2148 [+- 16.63], ~kQPS: 23
FSTCompletionLookup queries: 50001, time[ms]: 223 [+- 2.14], ~kQPS: 224
WFSTCompletionLookup queries: 50001, time[ms]: 414 [+- 28.44], ~kQPS: 121

asfimport commented 12 years ago

Robert Muir (@rmuir) (migrated from JIRA)

This looks really good (especially if it holds up for a larger FST!). Thanks for hacking at intersectPrefixPaths to get us going.

One caveat: the whole thing is optimized for the case where the "query analyzer" is very simple, particularly where there are no positionIncrement=0 tokens.

So if you use a query analyzer with, say, a SynonymFilter, then you hit the "nocommit: how slow can this be?". We should maybe benchmark that. If it's really horrible, we could just throw an exception and document that you cannot use synonyms etc. in your query analyzer (use them at index time instead) to prevent performance traps...

asfimport commented 12 years ago

Simon Willnauer (@s1monw) (migrated from JIRA)

FYI - I cleaned up the code & tests, synced with trunk, removed nocommits, enabled assertions again, and documented the expense of complex query analyzers. From my perspective this is good to go, but we should run another round of reviews.

asfimport commented 11 years ago

Michael McCandless (@mikemccand) (migrated from JIRA)

I think this is ready! I'm attaching the reviewable patch w/ diffs on the branch vs trunk ...

asfimport commented 11 years ago

Simon Willnauer (@s1monw) (migrated from JIRA)

Cool man, it looks good. We need a CHANGES entry, but otherwise this is good to go from my side. We can tackle the TODOs on trunk.

asfimport commented 11 years ago

Robert Muir (@rmuir) (migrated from JIRA)

I think we might want to beast the tests on this before merging!

asfimport commented 11 years ago

Michael McCandless (@mikemccand) (migrated from JIRA)

New patch, w/ CHANGES entry, fixing a couple issues Robert spotted (thanks!).

asfimport commented 11 years ago

Commit Tag Bot (migrated from JIRA)

[branch_4x commit] Michael McCandless http://svn.apache.org/viewvc?view=revision&revision=1403780

LUCENE-3846: add new FuzzySuggester

asfimport commented 11 years ago

Uwe Schindler (@uschindler) (migrated from JIRA)

Closed after release.