Closed asfimport closed 8 years ago
Tim Allison (@tballison) (migrated from JIRA)
Added link to #6270. +1 on fixing this. Thank you!
Michael McCandless (@mikemccand) (migrated from JIRA)
I was struggling to understand the control flow in FuzzyQuery/FuzzyTermsEnum, MultiTermQuery, TopTermsRewrite, etc., so as a first step here I cleaned up deprecated code and tried to simplify FuzzyTermsEnum somewhat.
The attached patch is just this cleanup; it doesn't change the behavior on short terms.
All tests pass and I confirmed performance (on Wikipedia) is unchanged. I plan to first commit this cleanup (master only, removing deprecations), and then separately tackle the short terms.
Michael McCandless (@mikemccand) (migrated from JIRA)
Another iteration, just adding a randomized test, which seems to be passing.
ASF subversion and git services (migrated from JIRA)
Commit faf3bc3134c6e5ba3e2caa15762524872e083152 in lucene-solr's branch refs/heads/master from Mike McCandless https://git-wip-us.apache.org/repos/asf?p=lucene-solr.git;h=faf3bc3
LUCENE-7439: clean up FuzzyQuery/FuzzyTermsEnum sources
ASF subversion and git services (migrated from JIRA)
Commit 4db3e7b8a7ca818002af9041bf10660c25905915 in lucene-solr's branch refs/heads/master from Mike McCandless https://git-wip-us.apache.org/repos/asf?p=lucene-solr.git;h=4db3e7b
LUCENE-7439: improve test case
ASF subversion and git services (migrated from JIRA)
Commit ba085d4c8487d43089295c7b145c77eba19497b4 in lucene-solr's branch refs/heads/branch_6x from Mike McCandless https://git-wip-us.apache.org/repos/asf?p=lucene-solr.git;h=ba085d4
LUCENE-7439: back port test case to 6.x
Michael McCandless (@mikemccand) (migrated from JIRA)
Here's a patch fixing FuzzyQuery
to also accept small terms.
With the simplified FuzzyTermsEnum
it was quite simple to fix it (remove the while
loop), and to fix the test case to verify any term within the specified edit distance does match.
The one wrinkle is that such matches get a boost of 0.0, because the formula we use to compute the boost for a matched term (1.0 - editDistance / minTermLength
) can be <= 0. I think this is fair: such matches are poor quality compared to longer term matches.
ASF subversion and git services (migrated from JIRA)
Commit c7fb49d7b50d171c6d787253b9ab575218fef7fe in lucene-solr's branch refs/heads/master from Mike McCandless https://git-wip-us.apache.org/repos/asf?p=lucene-solr.git;h=c7fb49d
LUCENE-7439: FuzzyQuery now matches all terms within the specified edit distance, even if they are short
ASF subversion and git services (migrated from JIRA)
Commit 471f90cf825ee3106fef1fa4c1094d0ca461e7fb in lucene-solr's branch refs/heads/branch_6x from Mike McCandless https://git-wip-us.apache.org/repos/asf?p=lucene-solr.git;h=471f90c
LUCENE-7439: FuzzyQuery now matches all terms within the specified edit distance, even if they are short
ASF subversion and git services (migrated from JIRA)
Commit c79d44f82814d6d798450a422f73f42891cb1ef5 in lucene-solr's branch refs/heads/master from Mike McCandless https://git-wip-us.apache.org/repos/asf?p=lucene-solr.git;h=c79d44f
LUCENE-7439: move CHANGES entry
Shalin Shekhar Mangar (@shalinmangar) (migrated from JIRA)
Closing after 6.3.0 release.
Tim Allison (@tballison) (migrated from JIRA)
Talk about slow on the uptake (no pun intended)... faf3bc3 removed SlowFuzzyQuery, and I realize it has been deprecated for a long, long time. Is there any way to increase the maxEdits/MAXIMUM_SUPPORTED_DISTANCE to > 2?
Today, if you ask
FuzzyQuery
to matchabcd
with edit distance 2, it will fail to match the termab
even though it's 2 edits away.Its javadocs explain this:
On the one hand, I can see that this behavior is sort of justified in that 50% of the characters are different and so this is a very "weak" match, but on the other hand, it's quite unexpected since edit distance is such an exact measure so the terms should have matched.
It seems like the behavior is caused by internal implementation details about how the relative (floating point) score is computed. I think we should fix it, so that edit distance 2 does in fact match all terms with edit distance <= 2.
Migrated from LUCENE-7439 by Michael McCandless (@mikemccand), resolved Sep 15 2016 Attachments: LUCENE-7439.patch (versions: 3) Linked issues:
6270