Should we keep the new boundary-aware scoring algorithm?

garybernhardt commented 9 years ago

I've made the scoring algorithm smarter about sequential matching characters and word boundaries (to improve results when querying for acronyms). It's merged to master, along with some other changes, in d874c99dd7f0f94225a95da06fc487b0fa5b9edc. The README contains a summary (search it for "algorithm").

I'd love to hear feedback from actual Selecta users, especially after you've used it on actual projects.

rschmitt commented 9 years ago

I'm not really happy with the way Selecta handles case currently. There are two distinct issues. The first issue is this: screen shot 2015-03-02 at 4 53 04 pm (Selecta will correctly output "ASDF" if you hit enter here, but the displayed case is incorrect.) The second issue is that in some languages, ignoring case completely actually throws out a lot of valuable information that could trivially be used for matching. This is best demonstrated by an example. Without case sensitivity: screen shot 2015-03-02 at 4 57 03 pm

With case sensitivity: screen shot 2015-03-02 at 4 56 18 pm

I think that a good solution here might be to have lowercase query characters match both uppercase and lowercase characters, but uppercase query characters only match uppercase characters in matches. There must be a term for this--partial case sensitivity? case covariance?--but I don't know what it is.

jwhitley commented 9 years ago

@rschmitt That's usually called "smart case", and it's pretty common in interactive search systems these days. E.g. built-in interactive search for both Emacs and Vim has had this for ages.

rschmitt commented 9 years ago

@jwhitley That rings a bell. I've gone ahead and implemented it in Heatseeker (https://github.com/rschmitt/heatseeker/commit/7a3aa4b67d03c12a070024b155adf0a5c20bb65a). So far it seems like an awesome improvement for languages that conventionally use CamelCase filenames--Java, Scala, Haskell, some C++, etc.

wpp commented 9 years ago

(Selecta will correctly output "ASDF" if you hit enter here, but the displayed case is incorrect.)

Have to agree with @rschmitt here. So far the only issue I ran into.

airblade commented 9 years ago

I prefer the new scoring.

I noticed that a query of amuser will score 3 against banjo/app/models/user.rb instead of 2, because the score count starts at the a in banjo instead of the a in app.

Most of the time I imagine selecta is used to match file paths. File paths aren't uniformly weighted; the tail of the path is more specific, in a way, than the head (big-endian?). Therefore I was wondering about matching from the more specific to the less specific, i.e. from right to left.

Clearly you can't just reverse the query and the choices and pass those to the scoring algorithm. I can't quite tell at the moment how to change the algorithm, and of course benchmarking might well rule it out. But I thought I'd mention it.

mavcunha commented 9 years ago

I see that the algorithm favor directory matches instead of file matches in certain conditions, here's an example of a chef project I'm working on:

screen region 2015-03-03 at 11 12 49

Note that default.rb is a closer (in terms of directory depth) than the other files and the input matches partially the file default.rb but not the other ones, which led me to believe that it should take precedence.

PS: There are no more files in this example, all of them show up in this screenshot.

gshutler commented 9 years ago

I think it's a general improvment, I'm still getting acquainted to the new behaviour, learning new "first hits", etc.

The boundary-aware matching hasn't worked as I expect in a few cases:

> selecta
gshutler/goselecta
garybernhardt/selecta

I would expect garybernhardt/selecta to rank higher as selecta starts after a boundary.

> core
./rspec-core
./cronofy/core

I would expect ./cronofy/core to rank higher as core starts after a harder boundary. I think of -_ as softer than /\.

> ccore
./rspec-core
./cronofy/core

A variant of the above, but I would definitely expect ./cronofy/core to rank higher as the first c matches the leading c of cronofy and the trailing c of rspec.

> vepres
app/presenters/event_presenter.rb
app/presenters/v_event_presenter.rb
app/presenters/api_event_presenter.rb

I think this is similar to the case @airblade mentioned. I'm expecting [v]_[e]vent_[pres]enter.rb to be chosen but it's using v_e[ve]nt_[pres]enter.rb. I think that's because it's the shorter substring. The only way to avoid this would be to evaluate all possible matches to find the best score which would be slower.

If a primary use case of selecta is selecting files, then I think that matches "further" into the strings should have more weight, as the "deeper" you go the more specific the match is to that string.

It might help if I give an example of where this approach definitely works.

Imagine you've got a Rails project-like structure:

app/controllers/application_controller.rb
app/controllers/special_controller.rb
spec/controllers/application_controller_specs.rb
spec/controllers/special_controller_specs.rb

This splits on boundaries into something like:

[app, controllers, application, controller, rb]
[app, controllers, special, controller, rb]
[spec, controllers, application, controller, specs rb]
[spec, controllers, special, controller, specs rb]

When I search for something like appcon I would expect the results:

app/controllers/[app]lication_[con]troller.rb
spec/controllers/[app]lication_[con]troller_specs.rb
[app]/[con]trollers/special_controller.rb

Currently we get:

[app]/[con]trollers/special_controller.rb
[app]/[con]trollers/application_controller.rb
spec/controllers/[app]li[c]ati[on]_controller_specs.rb

If I refine the search to appcons I would expect the results:

spec/controllers/[app]lication_[con]troller_[s]pecs.rb
[app]/[con]trollers/[s]pecial_controller.rb
[app]/[con]troller[s]/application_controller.rb

Currently we get:

[app]/[con]troller[s]/special_controller.rb
[app]/[con]troller[s]/application_controller.rb
spec/controllers/[app]li[c]ati[on]_controller_[s]pecs.rb

I hope that's in some way useful.

garybernhardt commented 9 years ago

The UI now prints paths with the correct case; that was a silly little bug.

I think that smart case seems like a good idea, but it sounds hairy and I want to put it off for a bit since it should be independent of these recent algorithm changes.

Comments on left-vs-right in a moment.

garybernhardt commented 9 years ago

I see two possible adjustments for left vs. right matching:

Favoring strings where the match occurs farther to the right (whether we judge by start point, end point, median, whatever). This is a sorting issue; the scoring stay the same. This is easy to add.
Actually matching from right-to-left: start with the rightmost character of the query, then move left through the candidate string. The algorithm is greedy and not fully general; otherwise it would be O(2^n). Matching from right to left is doable, and I just implemented it, but it does add quite a bit of complication.

I think that (1) should definitely be done, but (2) may not be worth it.

Comments on specific matching examples in yet another moment...

garybernhardt commented 9 years ago

In @airblade's example of querying "banjo/app/models/user.rb" for "amuser", the score is 3 because the first character isn't considered for purposes of the boundary and sequential character bonuses. It definitely should be, but I didn't see an obvious way to implement it that way, so I cowardly punted on it.

For @gshutler's examples, in order:

"garybernhardt/selecta" should rank higher than "gshutler/goselecta" for "selecta"; this will happen with the above change.
Favoring "./cronofy/core" over "./rspec-core" for "core" due to a "/" before "core": this is specific to file paths. In other input text, this difference may not matter may even be reversed, so I'd like to avoid this to keep Select input-agnostic.
Favoring "./cronofy/core" over "./rspec-core" for "ccore" due to the "c" starting a word: like (1), this will be fixed with the above change.
I think that you're right: this would require solving the general case, which is O(2^n). There are some middle-ground solutions where you limit how far you'll look. They're quite a bit more complex than the greedy algorithm because you have to keep track of all of that stuff.

I think that we should:

Make the boundary/sequential special cases apply to the first matching character as well, which will fix three of the five cases mentioned above.
Depending on how that goes, maybe also add a sub-sort that favors matches to the right of the string. (This has some problems. It requires that the scoring algorithm choose the rightmost matching substring if there are multiple substrings tied for a score. It's also somewhat specific to file paths, which I don't like, but it doesn't affect the actual scores, so the impact will be small unless there are many results with identical scores.)

garybernhardt / selecta

Should we keep the new boundary-aware scoring algorithm? #80