walles opened 9 years ago
Interesting idea. Found the algo for CoffeeScript: http://rosettacode.org/wiki/Longest_common_subsequence#CoffeeScript
Will try to implement it in my PR
LCS alone will not help with the order of items in the screenshot.
We have LCS("git push","git push") = "git push"
But also LCS("git push","git plus hunk") = "git push"
What you need is to normalize matches against the length of both strings. I myself have a project using LCS and use something like this:
m = Length of LCS(a,b)
lena = a.length
lenb = b.length
score = m*(m/lena+ m/lenb)
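For concreteness, that normalization can be sketched with the classic dynamic-programming LCS length plus the formula above (a sketch to illustrate the idea, not the project's actual code):

```javascript
// Length of the longest common subsequence via the classic DP table.
function lcsLength(a, b) {
  const dp = Array.from({ length: a.length + 1 }, () => new Array(b.length + 1).fill(0));
  for (let i = 1; i <= a.length; i++) {
    for (let j = 1; j <= b.length; j++) {
      dp[i][j] = a[i - 1] === b[j - 1]
        ? dp[i - 1][j - 1] + 1                    // characters match: extend the subsequence
        : Math.max(dp[i - 1][j], dp[i][j - 1]);   // otherwise carry the best so far
    }
  }
  return dp[a.length][b.length];
}

// Normalize the raw LCS length against both string lengths, as described above.
function score(a, b) {
  const m = lcsLength(a, b);
  return m * (m / a.length + m / b.length);
}

// Both candidates contain "git push" as a subsequence (m = 8), but
// normalizing by length ranks the exact match first:
score('git push', 'git push');      // 16
score('git push', 'git plus hunk'); // ≈ 12.92, i.e. 8 * (1 + 8/13)
```

This makes the ordering problem from the screenshot go away: the exact candidate wins because `lenb` no longer dwarfs `m`.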
Note that you can probably use the current algorithm to calculate m too. This is because the current algorithm looks like a greedy approximation of LCS, with a special score depending on the nature of the character.
By greedy I mean that score("hzello", "helloz") will match the z, because it's in both strings, but will miss the ello part in doing so. LCS will skip the z to match ello.
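A minimal sketch of that greedy "first occurrence after the previous match" strategy (hypothetical code written to illustrate the failure mode, with unmatched characters simply skipped; not fuzzaldrin's actual implementation):

```javascript
// Greedy subsequence matching: for each query character, take its first
// occurrence in the subject after the previous match position.
function greedyMatch(query, subject) {
  const matched = [];
  let pos = 0;
  for (const ch of query) {
    const i = subject.indexOf(ch, pos);
    if (i === -1) continue;       // no occurrence further right: skip this char
    matched.push(ch);
    pos = i + 1;                  // never look backwards again
  }
  return matched.join('');
}

// Matching 'h' then greedily jumping to the final 'z' makes 'ello'
// unreachable, so only "hz" survives; LCS would find "hello" instead.
greedyMatch('hzello', 'helloz'); // "hz"
```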
If you want to see the behavior of the above scoring, among other things, I have a demo page.
Tried the LCS, it's way too slow.
Filtering 66672 entries for 'index' took 6557ms for 6168 results
Matching 66672 entries for 'index' took 237ms for 6168 results
Instead of
Filtering 66672 entries for 'index' took 267ms for 6168 results
Matching 66672 entries for 'index' took 218ms for 6168 results
Maybe I'm missing something...
I also added a test in my PR for the use case above. The issue is with using 2 words: "push" retrieves the correct result (see test) but "git push" doesn't :disappointed:
Well spotted @jeancroy, I see now that LCS won't handle my use case.
Your scoring function can be greatly simplified into "prefer shorter hits" if used to compare already-filtered hits. Since the hits are already filtered, we actually already have the LCS, and:
- `m` = the length of the LCS = constant between all options
- `lena` = the length of the query = `m` (the hits are already filtered, so the whole query matches) = constant between all options
- `lenb` = the only thing varying = the length of the candidate string

So if we replace `lena` with `m`, we get...
score = m * (m/m + m/lenb) = m * (1 + m / lenb) = m + m^2/lenb
... and since `m` is the same between all candidates, its value won't affect comparisons, so we can just as well replace it with a constant of our choosing (I choose 1)...
score = 1 * (1 + 1^2/lenb) = 1 + 1/lenb
... and since adding the same constant to all values we compare doesn't change anything we can remove that too, giving us...
score = 1 / lenb
... or in other words prefer shorter hits.
Which I think sounds just fine actually :)
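Under that reasoning, the ranking step for already-filtered hits collapses to a one-line comparator (a sketch of the conclusion, not actual project code):

```javascript
// score = 1 / lenb is monotone in candidate length, so sorting by score
// descending is the same as sorting by length ascending.
const rank = hits => [...hits].sort((a, b) => a.length - b.length);

rank(['git plus hunk', 'git push']); // ['git push', 'git plus hunk']
```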
@plrenaudin, thanks for trying this and benchmarking it!
After the above discussion however I'm not sure this is actually the way to go any more, and I'm not convinced about pure substring matches either.
Real-world example, searching for install
using cmd-shift-p gives me:
A substring match would put "Find And Replace: Select All" at the bottom (good), but it would also consider "uninstall", "install" and "installed" as equally good (bad).
So I'm starting to think that the way to go about this is to:
That's probably better than starting with a solution like I suggested here initially.
You're right. People would also want a match at the start of the string to score higher...
I'd like to have suggestions similar to IntelliJ or Sublime Text, but it would probably require rewriting the whole scoring system.
We really need improvement on this: for projects with lots of files, the relevance is just so bad that it's rare to have what we search for as the first result.
@walles the part you simplified out is actually important for accuracy should you use LCS.
match("uni","university")
should be better than match("unicorn","university")
Now, because your scoring function has a no-typo policy, I'd agree with your simplification.
I have put something together that I believe accounts for some of the issues mentioned:

- `push` will not fully match `plus husk`. Same for directories: `SomeDir` will not match `Awsome/David_R`.
- `install` will match the prefix of the second token of `git install`, but will not match a prefix of `git uninstall`.
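A rough sketch of what that token-aware prefix matching could look like (the helper name and split pattern are assumptions for illustration, not the gist's actual code):

```javascript
// Split the subject on common token separators and ask whether the query
// is a prefix of any token, case-insensitively.
const prefixesAnyToken = (query, subject) =>
  subject
    .toLowerCase()
    .split(/[\s\/_-]+/)                       // spaces, slashes, underscores, dashes
    .some(tok => tok.startsWith(query.toLowerCase()));

prefixesAnyToken('install', 'git install');   // true: prefixes the second token
prefixesAnyToken('install', 'git uninstall'); // false: "uninstall" doesn't start with it
```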
@jeancroy Thanks for the idea! I like the first point (match the tokens instead of the entire query). I just tried to change the current score function to the one in your Gist, but lots of tests are failing and the benchmark result is:
Filtering 66672 entries for 'index' took 1322ms for 2040 results
Matching 66672 entries for 'index' took 250ms for 2040 results
Results count changed! 2040 instead of 6168
It's missing some results, so even if it worked it would still be quite slow compared to the current implementation.
But there is definitely a potential improvement in the tokenizing of the query / subject.
@plrenaudin can I play with your implementation of that benchmark? It should not miss an "index", so I'm curious to debug that. Also, the query is not meant to be re-computed each time.
Also, an open question: why does this project score special characters higher? Any character of the query not in the subject squashes the score to 0. Isn't not being squashed to 0 a good enough outcome? Also, every non-zero result will contain every char, including every special char and its bonus, so it doesn't help the sort very much.
@jeancroy My "implementation" is just swapping the score function for your gist, nothing more, nothing less. This is where the magic should happen: the score puts the relevant results in front of the rest.
I'm sure you could improve this, but it looks like the algo has much more complexity than the one in place, so a slower benchmark is not that surprising. I personally wouldn't mind a slower fuzzy algo with more relevance, though.
@plrenaudin sorry, I didn't mean that you changed something in a wrong way. It's just that I had never played with CoffeeScript / Node before today, so I didn't know where to start. I corrected something in my code.
Filtering 66672 entries for 'index' took 877ms for 2040 results
Matching 66672 entries for 'index' took 307ms for 2040 results
Results count changed! 2040 instead of 6168
I'm also OK with matches that are missed. I believe any token implementation (idea 1) will miss them.
index vs ./node_modules/npm/node_modules/path-is-inside/LICENSE.txt
and some more
@jeancroy You didn't change anything in a wrong way, it just requires more changes to match the expected results of the tests. I'm eager to see it fully compliant with them. When you look at the specs, most of them make sense, so they're a good way to see if your algo is compatible. They just miss a test for the "git push" query situation (and maybe also a more complete test for CamelCase queries).
It bothers me that, for such an important feature, so few people are interested in improving it.
@plrenaudin This issue has many "lurkers", I assure you. But I guess many, like myself, are very busy with other stuff. Wish I had time to invest in improving this...
I applaud the effort put into this so far by the way. Love it when the community (even if only a handful of people) come together like this :heart:
I just got off of a work bender ... pushing for release after my vacation time the last couple weeks. I'm still trying to catch up on all the work you all are doing. One thing I'd like to note though ...
I'd like to see more testing of "fuzzy" matches, just to make sure that all bases are covered, such as:
This is the way a lot of people search for things, not when they don't know what they're looking for ... but when they do. As another example, when I need to set the grammar of a file by hand, I use my set-syntax package which allows the Command Palette to mimic how I did things in Sublime:
The question then is: should ssrb match "Set Syntax: Ruby" or a class called "SsrbController" for example? I think at some point there will have to be some controversy about which result should score better than the other.
Definitely there will be. I just didn't want things getting completely optimized for whole or partial words when scattered single-character matches are also an important use case.
Indeed, I completely agree with the use case as for the one with CamelCaseFiles
Ok a new day, a new algorithm :)
This one is based on the idea of splitting the speed/accuracy trade off in two.
For 1, I simplified the current algorithm with all scores = 1. Output is match yes/no. This means the new algorithm reports the same positive results as the original.
For 2, I used a local sequence alignment algorithm with affine gap penalties (Smith-Waterman / Gotoh). I also used the measure of similarity between characters to implement lower/upper case bonuses, CamelCase, and "This Is An Acronym" matching.
What this means in layman's terms:

a. We get a lot of knobs to play with.
b. The algorithm will try to find the combination of matches that gives the best score.
c. I'm currently using that algorithm to highlight matches in another project; if we decide to use this as a scorer, I know it's relatively easy to make the matcher.
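For reference, here is a minimal sketch of a Smith-Waterman / Gotoh local alignment scorer with affine gap penalties. All the weights and the similarity function are made-up placeholders for illustration; the real scorer adds CamelCase, acronym and case bonuses on top of this skeleton:

```javascript
// Local alignment score (Smith-Waterman) with affine gaps (Gotoh):
// opening a gap costs gapOpen, each further skipped character costs gapExtend.
function alignScore(query, subject, opts = {}) {
  const { match = 2, mismatch = -2, gapOpen = 2, gapExtend = 0.5 } = opts;
  const n = query.length, m = subject.length;
  let H = new Array(m + 1).fill(0);          // previous row of best scores
  let F = new Array(m + 1).fill(-Infinity);  // previous row: gap skipping query chars
  let best = 0;
  for (let i = 1; i <= n; i++) {
    const Hrow = new Array(m + 1).fill(0);
    const Frow = new Array(m + 1).fill(-Infinity);
    let E = -Infinity;                       // gap skipping subject chars, this row
    for (let j = 1; j <= m; j++) {
      E = Math.max(Hrow[j - 1] - gapOpen, E - gapExtend);
      Frow[j] = Math.max(H[j] - gapOpen, F[j] - gapExtend);
      const sim =
        query[i - 1].toLowerCase() === subject[j - 1].toLowerCase() ? match : mismatch;
      Hrow[j] = Math.max(0, H[j - 1] + sim, E, Frow[j]); // 0 = restart (local alignment)
      if (Hrow[j] > best) best = Hrow[j];
    }
    H = Hrow;
    F = Frow;
  }
  return best;
}

alignScore('push', 'push');      // 8: four exact matches, no gaps
alignScore('push', 'plus husk'); // lower: every gap opening is penalized
```

Setting `gapOpen` and `gapExtend` to 0 and `match` to 1 recovers plain LCS length, which is why this is a generalization of the LCS idea discussed earlier.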
@plrenaudin I've downloaded your updated filter.spec. The current proposal passes all tests but 4 "base path" ones. It's also faster on the benchmark than the last proposal, mostly due to quick exits.
Nicely done @jeancroy. Glad we are moving in a good direction! Can you test the ordering of the results? Something like: https://gist.github.com/plrenaudin/c8dace4edeb5ea58c949
Added a test, it passes:
0: "install"
1: "Application: Install Update"
2: "Settings View: Uninstall Packages"
3: "Find And Replace: Selet All"
OK, this version passes the whole suite: https://gist.github.com/jeancroy/a37b701659658f2529bf
Does anyone have any other test case?
Once we have cleared some other test/ideas I might do a PR.
@jeancroy, I'd like to have "Settings View: View Installed Themes" added to the "install" test.
@jeancroy congrats! I think it covers already a good bunch of them now.
@walles it now ranks like so, does that make sense?
0: "install"
1: "Application: Install Update"
2: "Settings View: Uninstall Packages"
3: "Settings View: View Installed Themes"
4: "Find And Replace: Selet All"
I believe the rules that come into play are both "rank start of match higher" and "prefer a smaller subject". There will be some points for "installed" because its matching letter "i" follows a token separator (space).
That's a question of exact-case substring score vs start-of-word score. Personally I would prefer a better score for the former, but it's already better than what we currently have. Would it be possible to have some benchmarks to compare?
Note that because the install test is written as result = filter(candidates, 'install'), "Uninstall" contains the exact case but "Installed" doesn't.
Filtering 66672 entries for 'index' took 658ms for 6168 results
Matching 66672 entries for 'index' took 253ms for 6168 results
Yes, what I mean is that the "exact-case substring" score should not beat the "case-insensitive substring at start of word" score. But that's totally personal; I'm already fine with the result :smiley:
Great stuff for the benchmark!
@jeancroy I think your last list seems splendid!
Also I think that a PR wouldn't need to solve everything, so as long as the "installed" test is in the PR I think it's a good (new) starting point for making more refinements from.
PR!
What do I do with the credit from https://github.com/joshaven/string_score/?
I'm not using any of this anymore. Basically the whole scorer.coffee is rewritten
I assume it's safe to remove it, as well as the line (https://github.com/atom/fuzzaldrin/blob/master/src/fuzzaldrin.coffee#L12), if tests pass afterward.
@lee-dohm I considered your examples. They are particular in the sense that the "take the first occurrence of the matching char after the previous match" algorithm magically aligns with word boundaries. By magically I mean that I'm pretty sure someone with experience can craft a query that matches, but Sublime, Chrome dev tools and other software set a higher expectation of user-friendliness.
Consider this example
match 'acon' in 'application':
<a>ppli<c>ati<o><n>
Scattered letters are the best we can do.
match 'acon' in 'application_control':
<a>pplication_<c><o><n>trol
Here we have to undo 3 greedy matches to get the intended result. And it's in such cases, where greedy doesn't align with the bonus, that the previous algorithm was giving non-intuitive results.
For that I provide two mechanisms:

Let | be a match, # be a gap opening, - be an (in-match) gap extension, and . be a haystack gap.
```
a#---c#--on
|    |   ||
application
```

so we have 2 openings and 5 extensions
```
a#----------con....
|           |||
application_control
```

so we have 10 extensions but only 1 opening
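With hypothetical gap costs (the numbers below are made up for illustration), the arithmetic shows why an affine scheme prefers the second alignment: openings are expensive, extensions cheap:

```javascript
// Affine gap penalty: each opening costs gapOpen, each extension gapExtend.
const gapOpen = 3, gapExtend = 0.25;
const penalty = (openings, extensions) => openings * gapOpen + extensions * gapExtend;

penalty(2, 5);  // 7.25 -- a#---c#--on (scattered matches, two openings)
penalty(1, 10); // 5.5  -- a#----------con (contiguous "con", one opening)
```

The contiguous alignment is penalized less despite its much longer gap, which is exactly the behavior greedy matching could not produce.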
So basically:
So for me that is the main contribution of PR #22: getting extra information about the match structure. Once we have that information, it's easier to fine-tune for particular cases.
But that information has a computing cost, so we provide quick exits for low-quality matches (the same ==0 rule as before) as well as for trivial matches (for now, exact substrings and 100% CamelCase initialisms).
To that I'll add that @plrenaudin wanted to nail the exact-substring case and provided plenty of real-life test cases for the spec; his contribution became the basis for the "exact substring" part. In addition to exact substrings, CamelCase got a special handler to ensure we nail those too.
Also @walles was very useful in asking the right questions, scouring issues for test cases, and providing a patch for a substantial speedup.
Lastly, how does this relate to LCS? If we set the gap penalty to 0 and the match weight to 1, we fall back to LCS, so it's really a more flexible generalization.
I'll also add that, for the purpose of CamelCase detection, case-invariant characters count as lowercase (" " == " ".toLowerCase()). So PR #22 should handle @lee-dohm's screenshot without any problem.
I suggest using LCS to rank matches: https://en.wikipedia.org/wiki/Longest_common_subsequence_problem#Solution_for_two_sequences
Matches with a lower number of substrings would come before matches with a higher number of substrings.
This would be a lot more generic than the substring() based approach suggested in some PRs.
If you want to play with this, here's an online thing demonstrating the concept (I didn't write this, just found it): http://lcs-demo.sourceforge.net/
Start by upping the "Max Size" a bit, then type in two strings and press "Execute LCS Lengths" to see how it works.
One real-world use case where this would come in handy is searching for "git push": note how it's in third place even though "git" and "push" are exact matches. Single-word matches would be caught by this as well.