Open garybernhardt opened 9 years ago
I'm not really happy with the way Selecta handles case currently. There are two distinct issues. The first issue is this: (Selecta will correctly output "ASDF" if you hit enter here, but the displayed case is incorrect.) The second issue is that in some languages, ignoring case completely actually throws out a lot of valuable information that could trivially be used for matching. This is best demonstrated by an example. Without case sensitivity:
With case sensitivity:
I think that a good solution here might be to have lowercase query characters match both uppercase and lowercase characters, but uppercase query characters only match uppercase characters in matches. There must be a term for this--partial case sensitivity? case covariance?--but I don't know what it is.
@rschmitt That's usually called "smart case", and it's pretty common in interactive search systems these days. E.g. built-in interactive search for both Emacs and Vim has had this for ages.
@jwhitley That rings a bell. I've gone ahead and implemented it in Heatseeker (https://github.com/rschmitt/heatseeker/commit/7a3aa4b67d03c12a070024b155adf0a5c20bb65a). So far it seems like an awesome improvement for languages that conventionally use CamelCase filenames--Java, Scala, Haskell, some C++, etc.
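To make the smart-case rule above concrete, here is a minimal Ruby sketch. It is not Selecta's or Heatseeker's actual code; the method names `smart_case_char_match?` and `smart_case_match?` are made up for illustration, and it only checks whether a choice matches at all, without scoring.

```ruby
# Hypothetical helper (not from Selecta or Heatseeker): a lowercase query
# character matches either case, an uppercase one matches only uppercase.
def smart_case_char_match?(query_char, choice_char)
  if query_char.match?(/[[:upper:]]/)
    query_char == choice_char
  else
    query_char == choice_char.downcase
  end
end

# True if every query character appears in the choice, in order,
# under the smart-case rule above. No scoring is attempted here.
def smart_case_match?(query, choice)
  position = 0
  query.each_char do |q|
    index = (position...choice.length).find do |i|
      smart_case_char_match?(q, choice[i])
    end
    return false if index.nil?
    position = index + 1
  end
  true
end
```

With this rule, "ub" still matches both "UserBuilder.java" and "Rubber.java", while "uB" no longer matches "Rubber.java" because the uppercase "B" has to line up with an uppercase character.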
(Selecta will correctly output "ASDF" if you hit enter here, but the displayed case is incorrect.)
Have to agree with @rschmitt here. So far the only issue I ran into.
I prefer the new scoring.
I noticed that a query of "amuser" will score 3 against "banjo/app/models/user.rb" instead of 2, because the score count starts at the "a" in "banjo" instead of the "a" in "app".
Most of the time I imagine selecta is used to match file paths. File paths aren't uniformly weighted; the tail of the path is more specific, in a way, than the head (big-endian?). Therefore I was wondering about matching from the more specific to the less specific, i.e. from right to left.
Clearly you can't just reverse the query and the choices and pass those to the scoring algorithm. I can't quite tell at the moment how to change the algorithm, and of course benchmarking might well rule it out. But I thought I'd mention it.
I see that the algorithm favors directory matches over file matches in certain conditions. Here's an example from a Chef project I'm working on:
Note that "default.rb" is closer (in terms of directory depth) than the other files, and the input partially matches "default.rb" but not the other ones, which led me to believe that it should take precedence.
PS: There are no more files in this example, all of them show up in this screenshot.
I think it's a general improvement. I'm still getting acquainted with the new behaviour, learning new "first hits", etc.
The boundary-aware matching hasn't worked as I expect in a few cases:
> selecta
gshutler/goselecta
garybernhardt/selecta
I would expect "garybernhardt/selecta" to rank higher, as "selecta" starts after a boundary.
> core
./rspec-core
./cronofy/core
I would expect "./cronofy/core" to rank higher, as "core" starts after a harder boundary. I think of "-" and "_" as softer than "/" and "\".
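The hard-versus-soft boundary idea could be sketched like this in Ruby. Everything here is an assumption for illustration: the weights 2 and 1, the method names, and the simplification that the query is found as one contiguous substring. It is not how Selecta scores matches.

```ruby
# Assumed grouping, taken from the comment above: path separators are
# "hard" boundaries, dashes and underscores are "soft" ones.
HARD_BOUNDARIES = ["/", "\\"].freeze
SOFT_BOUNDARIES = ["-", "_"].freeze

# Hypothetical weights: 2 for a hard boundary (or the start of the
# string), 1 for a soft boundary, 0 for the middle of a word.
def boundary_strength(choice, index)
  return 2 if index.zero?
  previous = choice[index - 1]
  return 2 if HARD_BOUNDARIES.include?(previous)
  return 1 if SOFT_BOUNDARIES.include?(previous)
  0
end

# Simplified ranking: only looks at where the query first occurs as a
# contiguous substring, and puts stronger boundaries first.
def rank_by_boundary(query, choices)
  choices.sort_by do |choice|
    index = choice.index(query)
    index ? -boundary_strength(choice, index) : 1
  end
end
```

Under these assumed weights, "core" ranks "./cronofy/core" above "./rspec-core", and "selecta" ranks "garybernhardt/selecta" above "gshutler/goselecta".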
> ccore
./rspec-core
./cronofy/core
A variant of the above, but I would definitely expect "./cronofy/core" to rank higher, as the first "c" matches the leading "c" of "cronofy" but only the trailing "c" of "rspec".
> vepres
app/presenters/event_presenter.rb
app/presenters/v_event_presenter.rb
app/presenters/api_event_presenter.rb
I think this is similar to the case @airblade mentioned. I'm expecting "[v]_[e]vent_[pres]enter.rb" to be chosen but it's using "v_e[ve]nt_[pres]enter.rb". I think that's because it's the shorter substring. The only way to avoid this would be to evaluate all possible matches to find the best score, which would be slower.
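As a rough illustration of what "evaluate all possible matches" would mean, here is a Ruby sketch that enumerates every in-order placement of the query characters and keeps the one with the most characters starting right after a boundary. The scoring rule and the names `all_matches`, `boundary_start?` and `best_match` are assumptions made for this example, not Selecta's algorithm, and the enumeration is exponential in the worst case, which is exactly the slowness mentioned above.

```ruby
# Assumed definition of a boundary: the start of the string, or any
# position preceded by a non-alphanumeric character.
def boundary_start?(choice, index)
  index.zero? || !choice[index - 1].match?(/[[:alnum:]]/)
end

# Every list of positions at which the query characters can be matched
# in order (case-insensitively). Exponential in the worst case.
def all_matches(query, choice, start = 0)
  return [[]] if query.empty?
  matches = []
  (start...choice.length).each do |i|
    next unless choice[i].downcase == query[0].downcase
    all_matches(query[1..-1], choice, i + 1).each do |rest|
      matches << [i, *rest]
    end
  end
  matches
end

# Hypothetical scoring: prefer the placement with the most characters
# that begin right after a boundary. Returns nil when nothing matches.
def best_match(query, choice)
  all_matches(query, choice).max_by do |positions|
    positions.count { |i| boundary_start?(choice, i) }
  end
end
```

For the query "vepres" against "v_event_presenter.rb", this picks positions 0, 2 and 8 through 11, which corresponds to "[v]_[e]vent_[pres]enter.rb" rather than "v_e[ve]nt_[pres]enter.rb".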
If a primary use case of selecta is selecting files, then I think that matches "further" into the strings should have more weight, as the "deeper" you go the more specific the match is to that string.
It might help if I give an example of where this approach definitely works.
Imagine you've got a Rails project-like structure:
app/controllers/application_controller.rb
app/controllers/special_controller.rb
spec/controllers/application_controller_specs.rb
spec/controllers/special_controller_specs.rb
This splits on boundaries into something like:
[app, controllers, application, controller, rb]
[app, controllers, special, controller, rb]
[spec, controllers, application, controller, specs, rb]
[spec, controllers, special, controller, specs, rb]
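The split shown above can be reproduced with a short Ruby sketch. The set of boundary characters ("/", "\", "_", "." and "-") and the name `boundary_tokens` are assumptions chosen to fit these examples, not something Selecta defines.

```ruby
# Hypothetical helper: split a path into tokens at runs of boundary
# characters, dropping the empty token a leading "./" would produce.
def boundary_tokens(path)
  path.split(%r{[/\\_.\-]+}).reject(&:empty?)
end

boundary_tokens("app/controllers/application_controller.rb")
# => ["app", "controllers", "application", "controller", "rb"]
```

Tokens further to the right come from deeper in the path, so a scorer built on this split could weight matches by token index.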
When I search for something like "appcon", I would expect the results:
app/controllers/[app]lication_[con]troller.rb
spec/controllers/[app]lication_[con]troller_specs.rb
[app]/[con]trollers/special_controller.rb
Currently we get:
[app]/[con]trollers/special_controller.rb
[app]/[con]trollers/application_controller.rb
spec/controllers/[app]li[c]ati[on]_controller_specs.rb
If I refine the search to "appcons", I would expect the results:
spec/controllers/[app]lication_[con]troller_[s]pecs.rb
[app]/[con]trollers/[s]pecial_controller.rb
[app]/[con]troller[s]/application_controller.rb
Currently we get:
[app]/[con]troller[s]/special_controller.rb
[app]/[con]troller[s]/application_controller.rb
spec/controllers/[app]li[c]ati[on]_controller_[s]pecs.rb
I hope that's in some way useful.
The UI now prints paths with the correct case; that was a silly little bug.
I think that smart case seems like a good idea, but it sounds hairy and I want to put it off for a bit since it should be independent of these recent algorithm changes.
Comments on left-vs-right in a moment.
I see two possible adjustments for left vs. right matching:
I think that (1) should definitely be done, but (2) may not be worth it.
Comments on specific matching examples in yet another moment...
In @airblade's example of querying "banjo/app/models/user.rb" for "amuser", the score is 3 because the first character isn't considered for purposes of the boundary and sequential character bonuses. It definitely should be, but I didn't see an obvious way to implement it that way, so I cowardly punted on it.
For @gshutler's examples, in order:
I think that we should:
I've made the scoring algorithm smarter about sequential matching characters and word boundaries (to improve results when querying for acronyms). It's merged to master, along with some other changes, in d874c99dd7f0f94225a95da06fc487b0fa5b9edc. The README contains a summary (search it for "algorithm").
I'd love to hear feedback from actual Selecta users, especially after you've used it on actual projects.