Open kberry opened 2 years ago
Yes, I agree with you that "foobar" and "float" looks quite different for me as well, but to calculate the Levenshtein distance, it is just three (1. replace "o" with "l," 2. delete "b," 3. replace "r" with "t").
The distance of three is not so significant for long strings, but it is already too large for short strings, like in this case.
One of the solutions for this is using normalized Levenshtein distance. Let $q$ be a query and $c$ be a candidate match for $q$. There are several ways to normalize the distance, but one of the simplest ones can be calculated as follows:
$$d = \frac{\operatorname{Levenshtein}(q, c)}{\max(\operatorname{len}(q), \operatorname{len}(c))}$$
We can use a value for these types of Levenshtein distance variants for the threshold.
We have to consider only a small thing: we already have a config item of fuzzy_level
which initially was expected to have a value for Levenshtein distance itself and can only take an integer value. We might want to maintain backward compatibility for this.
We need an excellent calculation formula to convert an integer value of fuzzy_level
to the threshold for a normalized Levenshtein distance. We want to make fuzzy_level
roughly work the same for the keywords whose lengths are around six.
Perhaps just using fuzzy_level / 10
as the threshold is OK.
texdoc foobar finds float.pdf and other float and hep-float docs.
I see there are three characters in common, but it seems like too fuzzy of a match. But maybe I think that only because I'm a human (at least, I think I am; sometimes it seems a fuzzy question :). Just thought I'd mention ...