atom / fuzzaldrin

Fuzzy filtering and string scoring
MIT License
319 stars 28 forks source link

Rewrite scoring algorithm to support run of consecutive character, fix acronyms and add optimal selection of character. #22

Closed jeancroy closed 9 years ago

jeancroy commented 9 years ago

This will be available for testing as an option in fuzzy-finder

Screen

Please address problems, comments, and suggestions here: https://github.com/jeancroy/fuzzaldrin-plus

We start a tuning phase, so anything not quite right is welcome.

plrenaudin commented 9 years ago

Don't forget that Travis is running on Linux and AppVeyor on Windows -> Different path separators. (hopefully it's the problem here)

walles commented 9 years ago

Maybe include the test case from #18 in the spec?

walles commented 9 years ago

And the test case from https://github.com/atom/atom/issues/7783 perhaps?

plrenaudin commented 9 years ago

Brilliant! I hope this will get through!

jeancroy commented 9 years ago

One thing I miss by a long shot is this spec: https://github.com/atom/fuzzaldrin/blob/master/spec/score-spec.coffee#L6

It only pass because that bypass https://github.com/atom/fuzzaldrin/blob/master/src/fuzzaldrin.coffee#L18

Which with the new scoring scheme would be considered a bug. I'd ask permission to change the perfect==2 spec if possible.

Reason is that the various bonus, make it very hard to compute a meaninfull upperbound, especially things that act in the middle of the string like CamelCase, position in string bonus etc.

Also I believe in this project usage a accurate relative score ordering matter more than saying I have a 85% match.

jeancroy commented 9 years ago

I'd also ask permission to remove this line https://github.com/atom/fuzzaldrin/blob/master/src/fuzzaldrin.coffee#L12

So if one search for "git push", i get token information and be able to do exact match for "git" then for "push". Or even simply for "git push" (because "gitpush" will never exact match "git push")

mnquintana commented 9 years ago

/cc @kevinsawicki

walles commented 9 years ago

@jeancroy, regarding score-spec.coffee, I think the only thing that makes sense is to compare relative scores, and possibly 0 scores (if that's the minimum).

What we want tests for is sort order, and absolute comparisons like the one at https://github.com/atom/fuzzaldrin/blob/master/spec/score-spec.coffee#L6 doesn't really do that, and IMO you should just remove that test together with the bypass at https://github.com/atom/fuzzaldrin/blob/master/src/fuzzaldrin.coffee#L18.

And I'd apply similar reasoning for https://github.com/atom/fuzzaldrin/blob/master/src/fuzzaldrin.coffee#L12; if you can remove it and honestly say that any specs that start failing aren't relevant any more, go for it!

Those lines are there to solve a problem, and if that problem isn't there any more neither should those lines.

plrenaudin commented 9 years ago

+1 on what @walles said

jeancroy commented 9 years ago

@walles I really appreciate what you do. Despite contributing a considerable amount of code, I'm still learning etiquette of open source.

walles commented 9 years ago

@jeancroy :)

OTOH, all these things are things I'd have said at work as well.

And I do appreciate you both coding and listening. Thanks @jeancroy!

The thing is also that I thought about diving into this myself, but since you're already doing such a great job with it I think my time is better spent reviewing.

plrenaudin commented 9 years ago

Great job guys, I'm currently testing the fuzzy-finder using this PR. As expected, it's quite slow but OK IMHO. Just noticed that the highlight of the character is off when there is space in the query. Edit: also when there is almost no match, I have some files with highlights and others without. Edit2: also, the CamelCase didn't work completely. I type MCC (upper case) for MyCamelCase and the results somethingmccsomething (lower case) scores higher. MyCamelCase result is just after though.

walles commented 9 years ago

This patch (avoids many calls to toLowerCase()) improves the benchmark results for this PR by about 2x:

diff --git a/src/scorer.coffee b/src/scorer.coffee
index bbbc1f2..bf54696 100644
--- a/src/scorer.coffee
+++ b/src/scorer.coffee
@@ -49,6 +49,9 @@ opt_char_re = /[\ \-\_\xE3\xE0\xE1\xE4\xE2\xE6\u1EBD\xE8\xE9\xEB\xEA\xEC\xED\xEF
 #

 exports.score = score = (subject, query, ignore) ->
+  subject_l = subject.toLowerCase()
+  query_l = query.toLowerCase()
+
   m = query.length + 1
   n = subject.length + 1

@@ -78,7 +81,7 @@ exports.score = score = (subject, query, ignore) ->
       # Score the options
       gapA = gapArow[j] = Math.max(gapArow[j] + we, vrow[j] + wo)
       gapB = Math.max(gapB + we, vrow[j - 1] + wo)
-      align = vd + scoreChar(query, subject, i - 1, j - 1)
+      align = vd + scoreChar(query, query_l, subject, subject_l, i - 1, j - 1)
       vd = vrow[j]

       #Get the best option
@@ -95,7 +98,7 @@ exports.score = score = (subject, query, ignore) ->
   lpos = m - n - 1

   #sustring bonus, start of string bonus
-  if ( p = subject.toLowerCase().indexOf(query.toLowerCase())) > -1
+  if ( p = subject_l.indexOf(query_l)) > -1
     vmax += wex * m * (1.0 + 5.0 / (5.0 + p))

     #sustring happens right after a separator (prefix)
@@ -108,11 +111,13 @@ exports.score = score = (subject, query, ignore) ->

   return vmax

-scoreChar = (query, subject, i, j) ->
+scoreChar = (query, query_l, subject, subject_l, i, j) ->
   qi = query[i]
+  qi_l = query_l[i]
   sj = subject[j]
+  sj_l = subject_l[j]

-  if qi.toLowerCase() == sj.toLowerCase()
+  if qi_l == sj_l

     #Proper casing bonus
     bonus = if qi == sj then wc else 0
@@ -129,13 +134,14 @@ scoreChar = (query, subject, i, j) ->

     #get previous char
     prev_s = subject[j - 1]
+    prev_s_l = subject_l[j - 1]
     prev_q = query[i - 1]

     #match FOLLOW a separator
     return wa + bonus if ( prev_s of sep_map) or ( prev_q of sep_map )

     #match IS Capital in camelCase (preceded by lowercase)
-    return wa + bonus if (sj == sj.toUpperCase() and prev_s == prev_s.toLowerCase())
+    return wa + bonus if (prev_s == prev_s_l and sj == sj.toUpperCase())

     #normal Match, add proper case bonus
     return wm + bonus
jeancroy commented 9 years ago

Hi @walles I like your idea, but I moved the lowercase test out of scoreChar because it was getting a bit ugly with all those arguments, and the whole function was basically an large if lowercase match. Now we gate the call to this function with the lowercase test.

walles commented 9 years ago

I agree your version is more readable. Nice!

plrenaudin commented 9 years ago

One last cherry on the top of the cake would be that an uppercase query would match the camelcase instead of the starting word. For example:

diff --git a/spec/filter-spec.coffee b/spec/filter-spec.coffee
index cabb87e..545dc1a 100644
--- a/spec/filter-spec.coffee
+++ b/spec/filter-spec.coffee
@@ -151,10 +151,11 @@ describe "filtering", ->
       'FilterFactors.styl',
       'FilterFactors.html',
       'FilterFactorTests.html',
-      'SpecFilterFactors.js'
+      'SpecFilterFactors.js',
+      'fftOtherClass.js'
     ]
     expect(bestMatch(candidates, 'FFT')).toBe 'FilterFactorTests.html'
-    expect(bestMatch(candidates, 'fft')).toBe 'FilterFactorTests.html'
+    expect(bestMatch(candidates, 'fft')).toBe 'fftOtherClass.js'

this test should pass.

But again, I'm fine with the current PR, it would be even better with this :+1:

jeancroy commented 9 years ago

@plrenaudin I do not understand your proposal. You speak of uppercase query, then you put some lowecase test.

Personally i like behavior of fft->FilterFactorTests because i seldom capitalize my search.

can you find another test for your request?

plrenaudin commented 9 years ago

yeah sorry it can be misleading I see. Here is a failing test illustrating my idea:

candidates = [
      'CamelCaseClass.js',
      'cccManager.java'
    ]
    expect(bestMatch(candidates, 'CCC')).toBe candidates[0]
    expect(bestMatch(candidates, 'ccc')).toBe candidates[1]

Second test is passing but not the first

jeancroy commented 9 years ago

Done :) added your test changed a 2 for a 3. As I said there will be some weight tuning as we get more real life test.

plrenaudin commented 9 years ago

@jeancroy you're amazing :star:

walles commented 9 years ago

@plrenaudin said that "the highlight of the character is off when there is space in the query".

@plrenaudin do you have a concrete example for that?

If so, that should be added to the specs (and the problem fixed of course).

plrenaudin commented 9 years ago

That was for my previous PR, it seems to be fine for this one. search for abc was returning anotherStringabc

jeancroy commented 9 years ago

yes that line of original https://github.com/atom/fuzzaldrin/blob/master/src/fuzzaldrin.coffee#L32

It remove characters, so index point to the wrong ones. I removed it from this build.

Rigth now however highlight is off vs scorer (ie scorer algorithm will identify match of better quality than highlight). Once this PR pass I might work on the highlight (which really is just storing which of gapA, gapB, align was the best for each [i,j] and doing a traceback from the end.)

Soleone commented 9 years ago

would love to see progress on this issue, see also https://github.com/atom/fuzzy-finder/issues/57#issuecomment-133531653

jesseleite commented 9 years ago

@jeancroy: I just cloned your fork and ran apm link. It symlinked into my packages folder as I'd expect, and I see your commits in there. I tried restarting Atom, but I'm still seeing Git Plus: Stage Hunk instead of Git Plus: Push. What am I doing wrong?

50Wliu commented 9 years ago

@JesseLeite This isn't an Atom package (it's actually an npm package that Atom installs), so apm linking it will have no effect.

jesseleite commented 9 years ago

Ah I see.

jeancroy commented 9 years ago

I understand this is hard to review because it's closer to an complete rewrite than a single line fix. To my defense the previous package was a thin wrapper around string_score, and changing that algorithm was kinda necessary to consider other scoring than the leftmost one. And there was almost 70 commit to get current state of things.

Anyways I'll be happy to work with someone from atom to help this or something similar pass.

plrenaudin commented 9 years ago

It will soon be 2 months... But there is time until "around the beginning of Sept." (https://github.com/atom/atom/issues/7995) so we can hope to see it merged next week or so?

@JesseLeite If you want to try this PR, you can by running the package you want to try it on (fuzzy-finder, command-palette, autocomplete etc...) in dev mode (apm dev <package name> and link their fuzzaldrin implementation to your fork (change the node_module/fuzzaldrin folder to redirect to yours).

off-by-some commented 9 years ago

:+1: Would love to have this merged in

jesseleite commented 9 years ago

Looks incredible @jeancroy! I'm looking forward to Fuzzy Finder improvements in Atom. It's the only feature I really miss from Sublime Text.

jeancroy commented 9 years ago

Honestly there is a lot of work done on this PR. But I feel it's a double edged sword. A substantial update like this might scare a would-be reviewer. I'm trying to help convince this is worth it.

Will bump this thread from time to time with technical documentation to help future reviewer. Whit the idea that such documentation might be moved to the project wiki or something.

Soleone commented 9 years ago

:heart: thanks from me as well for moving this forward, it's the top priority imho, everything else about my switch from sublime feels amazing!

if you can document some steps to easily test this out locally to provide general feedback about the behavior then that might go a long way in getting this reviewed/shipped .

Soleone commented 9 years ago

@jeancroy on another note, maybe add everything relevant to reviewing/testing this PR to the description at the top so that reviewers don't have to scroll to the whole comment history

jesseleite commented 9 years ago

thanks from me as for moving this forward, it's the top priority imho, everything else about my switch from sublime feels amazing!

I agree 100%. Maybe we are biased Sublime users, but the scoring in Sublime Text's Fuzzy File Finder (CMD-P) feels so good. Even the fuzziest queries seem to return a very smart result set. In Atom, I find myself narrowing the results by typing out the path and filename, just to push my desired result into view or up the list. It seems the "fuzzy" part is very hit and miss right now.

jeancroy commented 9 years ago

Sublime text definitively is the mark to beat :) I'll be happy for any example where this is lower quality than Sublime, not because it's a bug but because we still need have some tuning to do.

I've moved my explanations to the first post, also posting a step-by-step to test this PR

Soleone commented 9 years ago

thanks for that! i just did a brief test in a huge 25,000+ file rails project and wow is the experience so much better at the moment, e.g.:

trying to open app/models/notifications/notification.rb:

benogle commented 9 years ago

A substantial update like this might scare a would-be reviewer. I'm trying to help convince this is worth it.

This is the key right now. Someone from the core team would need to dedicate real time and review this in a pretty in depth way. Right this minute, people are focused on other performance related work. I do agree that the fuzzy matching should be improved, but this might take us a bit to get to.

The enormity of this change could also trigger a swath of new issues as we would be changing behavior that has been there from the beginning. We've seen this several times where a bunch of people want a fix, we change the behavior, and new people come out of the woodwork describing the pain the change caused (an example). So we need to be careful about something like this.

The best way to handle this from a user perspective would be a setting, and maybe just scoped to fuzzy finder at first. Then get core people and the maintainers on it to make sure it's the right direction and to flesh out edge cases + other weirdness. This is suuuuuper nuanced, and it's tough to account for all the cases people in the wild will encounter + with this many code changes, there will be bugs.

jeancroy commented 9 years ago

A counter argument of that is that is for everything that is different the overall effect is limited to ranking, and possibly speed.

You can also mathematically proof that the current left-to-right approach cannot support most the behavior as seen in other popular fuzzy finder. Add to this that this library has seen no update but cosmetic one since last year. At one point some type of solution reach their peak.

benogle commented 9 years ago

I'm not saying the current behavior is the way, or that the current implementation is the way. I just want to be clear that we need to be really careful with changes of this nature. We would be changing a core behavior in several places, and the magnitude of code change is very large. There will be bugs and unhappy people, so we need to minimize pain as much as we can. It just takes time, and our main focus is currently on other things.

jeancroy commented 9 years ago

Would it be the proper way to have a kind of fuzzaldrin-plus package then have option to switch from one to the other like people change provider from fuzzy/symbol in auto-complete ?

benogle commented 9 years ago

Would it be the proper way to have a kind of fuzzaldrin-plus package

Yes, maybe make a new node module, and make a PR on fuzzy-finder to add an option to use the new module. Then we can try it out for a while. I say we just do fuzzy-finder for now, and try it for a week or two.

jesseleite commented 9 years ago

It just takes time, and our main focus is currently on other things.

I can't say much regarding the details of this fuzzy finder rewrite, but I can say this. The current fuzzy finder indexing on panel open is distractingly slow, and the current scoring algorithm has me missing Sublime more and more. My +1 for making fuzzy finder a bigger priority :'(

All that said, I don't want to be a downer. Atom is mostly incredible, so thank you all for your work on Atom.

kshnurov commented 9 years ago

"We don't have time for this" and "it's not our priority" after 2 months of waiting and so much people asking. That's why I hate pseudo-opensource - it doesn't matter that you can see the code if you can't fix it.

jesseleite commented 9 years ago

@kshnurov Hey now... Even open source needs direction. Atom shouldn't blindly merge changes in, especially overhauls like this. If you care about the future of the editor, a core team reviewing PRs is smart. The fact that @jeancroy is doing all this work is an amazing community contribution, and the fact that Atom's core team is honest about their stance is also a good thing.

All that said, I hope fuzzy finder doesn't get ignored for too long! It's one of Atom's biggest flaws IMO.

jeancroy commented 9 years ago

Honestly 2 weeks of test as an optional feature is not that bad of a compromise. Personally I was going for at least bumping to a major version.

kshnurov commented 9 years ago

@JesseLeite users feedback should be taken into account when the direction is chosen, not ignored. Core team reviewing PRs is smart, core team saying "we won't even review this" on the ready PR for one of the biggest flaw (and you're 100% right on that) is not smart at all.

I've seen that a dozens of times - it starts with "it's not our priority" on a feature/PR everyone asks for and ends up with a dead software no one uses.

jesseleite commented 9 years ago

@JesseLeite users feedback should be taken into account when the direction is chosen, not ignored.

@kshnurov: Absolutely, but I don't think they are saying "we won't even review this".

Yes, maybe make a new node module, and make a PR on fuzzy-finder to add an option to use the new module. Then we can try it out for a while. I say we just do fuzzy-finder for now, and try it for a week or two.

^ Seems to me that they are open to the PR, but not ready to skim and merge without thorough review.

This PR is a behemoth. My 1st beef with Atom is the fuzzy finder. BUT I also want any solution(s) to be well tested and properly implemented, otherwise we're just layering bandaids on top of bandaids. Be a bit more positive and trust the process :)

benogle commented 9 years ago

but I don't think they are saying "we won't even review this".

Definitely not. I think getting it in as an option is the first step to changing this.

jeancroy commented 9 years ago

@benogle I'm a bit stuck at the step where fuzzy finder don't use fuzzaldrin for filtering/scoring Instead it rely on atom-space-pen-views which has non optional call to the current version.

https://github.com/atom/fuzzy-finder/blob/master/lib/fuzzy-finder-view.coffee#L135 https://github.com/atom/atom-space-pen-views/blob/master/src/select-list-view.coffee#L161

Should I bypass the super ?