TeX-Live / texdoc

Find and view documentation in TeX Live
https://tug.org/texdoc/
GNU General Public License v3.0
47 stars 7 forks source link

Better threshold for fuzzy search? #80

Open kberry opened 2 years ago

kberry commented 2 years ago

texdoc foobar finds float.pdf and other float and hep-float docs.

I see there are three characters in common, but it seems like too fuzzy of a match. But maybe I think that only because I'm a human (at least, I think I am; sometimes it seems a fuzzy question :). Just thought I'd mention ...

wtsnjp commented 2 years ago

Yes, I agree with you that "foobar" and "float" looks quite different for me as well, but to calculate the Levenshtein distance, it is just three (1. replace "o" with "l," 2. delete "b," 3. replace "r" with "t").

The distance of three is not so significant for long strings, but it is already too large for short strings, like in this case.

One of the solutions for this is using normalized Levenshtein distance. Let $q$ be a query and $c$ be a candidate match for $q$. There are several ways to normalize the distance, but one of the simplest ones can be calculated as follows:

$$d = \frac{\operatorname{Levenshtein}(q, c)}{\max(\operatorname{len}(q), \operatorname{len}(c))}$$

We can use a value for these types of Levenshtein distance variants for the threshold.

We have to consider only a small thing: we already have a config item of fuzzy_level which initially was expected to have a value for Levenshtein distance itself and can only take an integer value. We might want to maintain backward compatibility for this.

wtsnjp commented 1 year ago

We need an excellent calculation formula to convert an integer value of fuzzy_level to the threshold for a normalized Levenshtein distance. We want to make fuzzy_level roughly work the same for the keywords whose lengths are around six.

wtsnjp commented 1 year ago

Perhaps just using fuzzy_level / 10 as the threshold is OK.