Some questions on `fuzzy-match` and relevant functions

guojing0 commented 2 years ago

I was thinking about writing functions/macros to help make write documentation and (user) manual easier, and then I found a TODO in fuzzy.lisp, which was about reverse fuzzy matching. I started to read prompter/filter.lisp and prompter/prompter-source.lisp, and had the following questions:

score-suggestion-string function in Line 43 of filter.lisp: Currently we use Damerau-Levensthein for scoring. The comment and TODO suggests the Jaccard metric or fzf's algorithm (I found out it used some modified version of Smith-Waterman algorithm). Its use cases seem to be only in annotate.lisp and bookmark.lisp. Would the input usually be very large, if not, maybe we could use the Jaccard metric to obtain better results?

The Smith-Waterman algorithm seems to be implemented in Common Lisp.

score-threshold variable in Line 48 of filter.lisp: It seems to be a good threshold to filter unnecessary elements, but currently it's not used anywhere.
substring-norm function in Line 6 of filter.lisp: After reading the code and some testing, I think it may be a good idea to remove duplicates from substrings, so that each element shares equal weight?

Ambrevar commented 2 years ago

The annotations and bookmarks can indeed come in large numbers, say more than 10000 elements. In my experience, performance was an issue, which is why I reverted to Levenshtein. More on that in my next comment.
Indeed, a comment says we should "learn" a good value. To do this we would need relevant samples and tests. How do we get such samples? Not easy, because we may not want to share our bookmarks for privacy reasons :) Maybe there exists some good public samples?
Good point! Wanna send a patch?

Ambrevar commented 2 years ago

About performance and the general question of input matching: I've recently played with Montezuma (https://github.com/sharplispers/montezuma) for Demeter and got excellent results at doing substring matching.

I have hope that Montezuma would advantageously replace all the code in filter.lisp, except for the (mk-string-metrics:norm-damerau-levenshtein suggestion-string input) line since Montezuma does not perform fuzzy-matching.

Performance also seems much better with Montezuma than with our ad-hoc implementation.

Fuzzy-matching is cool though, and Montezuma is extensible, so maybe it's possible to hook an mk-string-metrics function (Levenshtein or Jaccard) into Montezuma and enjoy excellent quality input matching.

All this with hopefully maximal performance.

guojing0 commented 2 years ago

Thank you for the detailed answer, @Ambrevar. If that is the case, indeed Damerau-Levensthein would be a better fit.

Of course. I am currently writing tests for submatches function and will submit this change in the same PR.

After creating said PR, I will look into Montezuma and how it could be integrated into our codebase to increase performance.

Ambrevar commented 2 years ago

Montezuma is unevenly documented and has many idiosyncrasies, let me know if you need help.

atlas-engineer / nyxt

Some questions on `fuzzy-match` and relevant functions #1911