apache / lucene

Apache Lucene open-source search software
https://lucene.apache.org/
Apache License 2.0

Loosen up DirectSpellChecker's minPrefix requirements [LUCENE-4500] #5566

Open asfimport opened 12 years ago

asfimport commented 12 years ago

DirectSpellChecker currently mandates a minPrefix of 1 when editDistance=2. This prevents a query of "nusglasses" from matching the indexed term "sunglasses".

Granted, there can be performance issues with using a minPrefix of 0, but it's a risk that a user should be allowed to take if needed.
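
To make the example concrete, here is a minimal standalone sketch in Java. It is not Lucene code: the class `PrefixDemo` and its methods are hypothetical, and it assumes that a required prefix of length n means the first n characters must match exactly, with the edit distance measured on the remainder. Under those assumptions, "nusglasses" is two substitutions away from "sunglasses" but shares no leading character with it, so a required prefix of 1 rules the match out.

```java
// Hypothetical illustration only; none of these names are Lucene APIs.
class PrefixDemo {

    // Plain Levenshtein distance (insert, delete, substitute) using a two-row table.
    static int editDistance(String a, String b) {
        int[] prev = new int[b.length() + 1];
        int[] curr = new int[b.length() + 1];
        for (int j = 0; j <= b.length(); j++) {
            prev[j] = j;
        }
        for (int i = 1; i <= a.length(); i++) {
            curr[0] = i;
            for (int j = 1; j <= b.length(); j++) {
                int cost = a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1;
                curr[j] = Math.min(Math.min(curr[j - 1] + 1, prev[j] + 1), prev[j - 1] + cost);
            }
            // Swap rows so that prev always holds the most recently completed row.
            int[] tmp = prev;
            prev = curr;
            curr = tmp;
        }
        return prev[b.length()];
    }

    // Assumed semantics: the first prefixLength characters must match exactly,
    // and only the remainder may differ by up to maxEdits edits.
    static boolean isCandidate(String query, String indexed, int maxEdits, int prefixLength) {
        if (query.length() < prefixLength || indexed.length() < prefixLength) {
            return false;
        }
        if (!query.regionMatches(0, indexed, 0, prefixLength)) {
            return false;
        }
        return editDistance(query.substring(prefixLength), indexed.substring(prefixLength)) <= maxEdits;
    }

    public static void main(String[] args) {
        System.out.println(editDistance("nusglasses", "sunglasses"));      // 2
        System.out.println(isCandidate("nusglasses", "sunglasses", 2, 1)); // false
        System.out.println(isCandidate("nusglasses", "sunglasses", 2, 0)); // true
    }
}
```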


Migrated from LUCENE-4500 by Erik Hatcher (@erikhatcher), updated Feb 28 2019

asfimport commented 12 years ago

Erik Hatcher (@erikhatcher) (migrated from JIRA)

This patch to DirectSpellChecker does the trick (using accuracy=0.8 or less in the description example):

-    FuzzyTermsEnum e = new FuzzyTermsEnum(terms, atts, term, editDistance, Math.max(minPrefix, editDistance-1), true);
+    FuzzyTermsEnum e = new FuzzyTermsEnum(terms, atts, term, editDistance, minPrefix, true);

In a conversation with Robert Muir, we agreed that the default should stay as it is (minPrefix restricted to at least 1 when editDistance=2), but that the restriction should be made optional so that a minPrefix of 0 can be used.
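
One way the agreed behavior could look is sketched below in standalone Java. The class `PrefixHeuristic`, the method `effectivePrefix`, and the `enforceHeuristic` flag are all hypothetical names chosen for illustration, not existing DirectSpellChecker API; only the `Math.max(minPrefix, editDistance - 1)` expression comes from the current code shown in the patch.

```java
// Hypothetical sketch of an opt-out for the prefix heuristic; not Lucene API.
class PrefixHeuristic {

    // With the heuristic on (the assumed default), the prefix is raised to at least
    // editDistance - 1. With it off, the user's minPrefix is passed through unchanged.
    static int effectivePrefix(int minPrefix, int editDistance, boolean enforceHeuristic) {
        return enforceHeuristic ? Math.max(minPrefix, editDistance - 1) : minPrefix;
    }

    public static void main(String[] args) {
        System.out.println(effectivePrefix(0, 2, true));  // 1: current behavior
        System.out.println(effectivePrefix(0, 1, true));  // 0: heuristic has no effect at one edit
        System.out.println(effectivePrefix(0, 2, false)); // 0: user opted out
        System.out.println(effectivePrefix(3, 2, true));  // 3: larger user value is kept
    }
}
```

Keeping the heuristic on by default preserves today's results for existing users, while the opt-out leaves the performance and relevance trade-off to callers who explicitly ask for it.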

asfimport commented 12 years ago

Robert Muir (@rmuir) (migrated from JIRA)

yeah i think we should add an option to disable this heuristic.

It was basically a perf/relevance thing (in general edits of 2, esp considering a transposition is a single edit, along with minPrefix of 0 can yield surprisingly irrelevant stuff).

But if someone wants that... let them do it.

asfimport commented 5 years ago

Sascha Szott (@saschaszott) (migrated from JIRA)

Should we at least add a short note to the reference guide explaining that minPrefix is effectively raised to 1 (even if the user set it to 0) when no suggestions with an edit distance of 1 are available in the term dictionary?