apache / lucene

Apache Lucene open-source search software
https://lucene.apache.org/
Apache License 2.0
2.67k stars 1.03k forks source link

Should FuzzyQuery match short terms too? [LUCENE-7439] #8491

Closed asfimport closed 8 years ago

asfimport commented 8 years ago

Today, if you ask FuzzyQuery to match abcd with edit distance 2, it will fail to match the term ab even though it's 2 edits away.

Its javadocs explain this:

 * <p>NOTE: terms of length 1 or 2 will sometimes not match because of how the scaled
 * distance between two terms is computed.  For a term to match, the edit distance between
 * the terms must be less than the minimum length term (either the input term, or
 * the candidate term).  For example, FuzzyQuery on term "abcd" with maxEdits=2 will
 * not match an indexed term "ab", and FuzzyQuery on term "a" with maxEdits=2 will not
 * match an indexed term "abc".

On the one hand, I can see that this behavior is sort of justified in that 50% of the characters are different and so this is a very "weak" match, but on the other hand, it's quite unexpected since edit distance is such an exact measure so the terms should have matched.

It seems like the behavior is caused by internal implementation details about how the relative (floating point) score is computed. I think we should fix it, so that edit distance 2 does in fact match all terms with edit distance <= 2.


Migrated from LUCENE-7439 by Michael McCandless (@mikemccand), resolved Sep 15 2016 Attachments: LUCENE-7439.patch (versions: 3) Linked issues:

asfimport commented 8 years ago

Tim Allison (@tballison) (migrated from JIRA)

Added link to #6270. +1 on fixing this. Thank you!

asfimport commented 8 years ago

Michael McCandless (@mikemccand) (migrated from JIRA)

I was struggling to understand the control flow in FuzzyQuery/FuzzyTermsEnum, MultiTermQuery, TopTermsRewrite, etc., so as a first step here I cleaned up deprecated code and tried to simplify FuzzyTermsEnum somewhat.

The attached patch is just this cleanup; it doesn't change the behavior on short terms.

All tests pass and I confirmed performance (on Wikipedia) is unchanged. I plan to first commit this cleanup (master only, removing deprecations), and then separately tackle the short terms.

asfimport commented 8 years ago

Michael McCandless (@mikemccand) (migrated from JIRA)

Another iteration, just adding a randomized test, which seems to be passing.

asfimport commented 8 years ago

ASF subversion and git services (migrated from JIRA)

Commit faf3bc3134c6e5ba3e2caa15762524872e083152 in lucene-solr's branch refs/heads/master from Mike McCandless https://git-wip-us.apache.org/repos/asf?p=lucene-solr.git;h=faf3bc3

LUCENE-7439: clean up FuzzyQuery/FuzzyTermsEnum sources

asfimport commented 8 years ago

ASF subversion and git services (migrated from JIRA)

Commit 4db3e7b8a7ca818002af9041bf10660c25905915 in lucene-solr's branch refs/heads/master from Mike McCandless https://git-wip-us.apache.org/repos/asf?p=lucene-solr.git;h=4db3e7b

LUCENE-7439: improve test case

asfimport commented 8 years ago

ASF subversion and git services (migrated from JIRA)

Commit ba085d4c8487d43089295c7b145c77eba19497b4 in lucene-solr's branch refs/heads/branch_6x from Mike McCandless https://git-wip-us.apache.org/repos/asf?p=lucene-solr.git;h=ba085d4

LUCENE-7439: back port test case to 6.x

asfimport commented 8 years ago

Michael McCandless (@mikemccand) (migrated from JIRA)

Here's a patch fixing FuzzyQuery to also accept small terms.

With the simplified FuzzyTermsEnum it was quite simple to fix it (remove the while loop), and to fix the test case to verify any term within the specified edit distance does match.

The one wrinkle is that such matches get a boost of 0.0, because the formula we use to compute the boost for a matched term (1.0 - editDistance / minTermLength) can be <= 0. I think this is fair: such matches are poor quality compared to longer term matches.

asfimport commented 8 years ago

ASF subversion and git services (migrated from JIRA)

Commit c7fb49d7b50d171c6d787253b9ab575218fef7fe in lucene-solr's branch refs/heads/master from Mike McCandless https://git-wip-us.apache.org/repos/asf?p=lucene-solr.git;h=c7fb49d

LUCENE-7439: FuzzyQuery now matches all terms within the specified edit distance, even if they are short

asfimport commented 8 years ago

ASF subversion and git services (migrated from JIRA)

Commit 471f90cf825ee3106fef1fa4c1094d0ca461e7fb in lucene-solr's branch refs/heads/branch_6x from Mike McCandless https://git-wip-us.apache.org/repos/asf?p=lucene-solr.git;h=471f90c

LUCENE-7439: FuzzyQuery now matches all terms within the specified edit distance, even if they are short

asfimport commented 8 years ago

ASF subversion and git services (migrated from JIRA)

Commit c79d44f82814d6d798450a422f73f42891cb1ef5 in lucene-solr's branch refs/heads/master from Mike McCandless https://git-wip-us.apache.org/repos/asf?p=lucene-solr.git;h=c79d44f

LUCENE-7439: move CHANGES entry

asfimport commented 7 years ago

Shalin Shekhar Mangar (@shalinmangar) (migrated from JIRA)

Closing after 6.3.0 release.

asfimport commented 7 years ago

Tim Allison (@tballison) (migrated from JIRA)

Talk about slow on the uptake (no pun intended)... faf3bc3 removed SlowFuzzyQuery, and I realize it has been deprecated for a long, long time. Is there any way to increase the maxEdits/MAXIMUM_SUPPORTED_DISTANCE to > 2?