Closed guojing0 closed 2 years ago
The annotations and bookmarks can indeed come in large numbers, say more than 10000 elements. In my experience, performance was an issue, which is why I reverted to Levenshtein. More on that in my next comment.
Indeed, a comment says we should "learn" a good value. To do this we would need relevant samples and tests. How do we get such samples? Not easy, because we may not want to share our bookmarks for privacy reasons :) Maybe some good public samples exist?
Good point! Wanna send a patch?
About performance and the general question of input matching: I've recently played with Montezuma (https://github.com/sharplispers/montezuma) for Demeter and got excellent results at doing substring matching.
I have hope that Montezuma would advantageously replace all the code in filter.lisp, except for the (mk-string-metrics:norm-damerau-levenshtein suggestion-string input) line, since Montezuma does not perform fuzzy-matching.
Performance also seems much better with Montezuma than with our ad-hoc implementation.
Fuzzy-matching is cool though, and Montezuma is extensible, so maybe it's possible to hook an mk-string-metrics function (Levenshtein or Jaccard) into Montezuma and enjoy excellent quality input matching. All this with hopefully maximal performance.
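To make the "index first, fuzzy metric second" idea concrete, here is a minimal language-neutral sketch in Python. It is not the filter.lisp code and not Montezuma's API: the `candidates` list merely stands in for whatever an index lookup would return, and `osa_distance`, `norm_similarity` and `rerank` are hypothetical names. The normalization (1 minus distance divided by the longer length) is an assumption about how a normalized Damerau-Levenshtein score is typically defined, and the distance is the restricted "optimal string alignment" variant.

```python
def osa_distance(a: str, b: str) -> int:
    """Optimal string alignment distance (restricted Damerau-Levenshtein)."""
    rows, cols = len(a) + 1, len(b) + 1
    d = [[0] * cols for _ in range(rows)]
    for i in range(rows):
        d[i][0] = i  # delete all of a[:i]
    for j in range(cols):
        d[0][j] = j  # insert all of b[:j]
    for i in range(1, rows):
        for j in range(1, cols):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            d[i][j] = min(
                d[i - 1][j] + 1,         # deletion
                d[i][j - 1] + 1,         # insertion
                d[i - 1][j - 1] + cost,  # substitution or match
            )
            # Adjacent transposition, e.g. "er" <-> "re".
            if i > 1 and j > 1 and a[i - 1] == b[j - 2] and a[i - 2] == b[j - 1]:
                d[i][j] = min(d[i][j], d[i - 2][j - 2] + 1)
    return d[-1][-1]


def norm_similarity(a: str, b: str) -> float:
    """Score in [0, 1]; 1.0 means identical strings (assumed normalization)."""
    longest = max(len(a), len(b))
    if longest == 0:
        return 1.0
    return 1.0 - osa_distance(a, b) / longest


def rerank(candidates: list[str], query: str) -> list[str]:
    """Order a pre-filtered candidate list by fuzzy similarity to the query."""
    return sorted(candidates, key=lambda c: norm_similarity(c, query), reverse=True)
```

The point of the split is that the quadratic-time metric only runs on the small candidate set, not on all 10000+ bookmarks.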
Thank you for the detailed answer, @Ambrevar. If that is the case, Damerau-Levenshtein would indeed be a better fit.
Of course. I am currently writing tests for the submatches function and will submit this change in the same PR.
After creating said PR, I will look into Montezuma and how it could be integrated into our codebase to increase performance.
Montezuma is unevenly documented and has many idiosyncrasies, let me know if you need help.
I was thinking about writing functions/macros to make writing the documentation and the (user) manual easier, and then I found a TODO in fuzzy.lisp about reverse fuzzy matching. I started to read prompter/filter.lisp and prompter/prompter-source.lisp, and had the following questions:
The Smith-Waterman algorithm seems to be implemented in Common Lisp.
score-threshold variable in Line 48 of filter.lisp: It seems to be a good threshold to filter unnecessary elements, but currently it's not used anywhere.
substring-norm function in Line 6 of filter.lisp: After reading the code and some testing, I think it may be a good idea to remove duplicates from substrings, so that each element shares equal weight?
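The last two questions can be illustrated with a small Python sketch. It is a toy model, not the actual substring-norm: `substring_score` and `keep_above` are hypothetical names, the scoring rule (average of matched-substring length over string length) is an assumption chosen only to show the effect, and the threshold value in the usage note is arbitrary.

```python
def substring_score(substrings: list[str], string: str, dedupe: bool = True) -> float:
    """Average, over the substrings, of the fraction of `string` each one covers."""
    if dedupe:
        # dict.fromkeys drops repeats while keeping first-seen order.
        substrings = list(dict.fromkeys(substrings))
    if not substrings or not string:
        return 0.0
    # A substring that does not occur in `string` contributes 0.
    total = sum(len(s) / len(string) for s in substrings if s in string)
    return total / len(substrings)


def keep_above(suggestions: list[str], substrings: list[str], threshold: float) -> list[str]:
    """Drop suggestions whose score falls below the threshold."""
    return [s for s in suggestions if substring_score(substrings, s) >= threshold]
```

In this toy model, scoring "nyxt browser" against ["nyxt", "nyxt", "zzz"] gives 2/9 when the repeat is kept and 1/6 once duplicates are removed, so a repeated substring does pull the average toward itself. Whether that is a bug or a feature (repetition as emphasis) is the open question. `keep_above(["nyxt browser", "emacs"], ["nyxt"], 0.1)` keeps only "nyxt browser", which is the kind of pruning a score threshold would provide if it were wired in.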