cantino / mcfly

Fly through your shell history. Great Scott!
MIT License
6.87k stars 177 forks source link

Fuzzy searching, all-in-SQLite version #103

Closed dmfay closed 3 years ago

dmfay commented 3 years ago

A quarter-second faster than the skim version at ~100k entries; the one thing this needs is highlight detection, which still only looks for a contiguous match.

cantino commented 3 years ago

Thanks! I'll play with this shortly :)

cantino commented 3 years ago

Sorry for the delay :)

This seems plenty fast enough. It'd be really nice if it could prioritize longer matches somehow, though. ...I'm not sure how best to do that using the % approach.

dmfay commented 3 years ago

No worries, there's a lot happening right now 😬

I'll look at fixing up the highlighting soon but here's how you'd prioritize longer matches in general, for results having the same rank:

diff --git a/src/history/history.rs b/src/history/history.rs
index 6c2eca1..cea1d2f 100644
--- a/src/history/history.rs
+++ b/src/history/history.rs
@@ -247,7 +247,8 @@ impl History {
                                   selected_occurrences_factor, occurrences_factor
                            FROM contextual_commands
                            WHERE cmd LIKE (:like)
-                           ORDER BY rank DESC LIMIT :limit";
+                           ORDER BY rank DESC, length(cmd) DESC
+                           LIMIT :limit";
         let mut statement = self
             .connection
             .prepare(query)

However, I don't think it's a good idea. Especially with fuzzy searching, shorter matches get filtered out naturally as you continue to type, so prioritizing longer matches makes the shorter all but unreachable by fuzzy searching alone.

$ cruns

cargo run -- search --fuzzy
cargo run -- search

If anything, I'd want the opposite, so cargo run -- search would be the top result until you entered an f or a z.

cantino commented 3 years ago

Agreed, that shorter commands (or longer contiguous substring matches?) would be better. Since the rank from Mcfly is a float, though, I think the length would almost never be used in your implementation.

On Thu, Oct 15, 2020 at 6:11 AM Dian Fay notifications@github.com wrote:

No worries, there's a lot happening right now 😬

I'll look at fixing up the highlighting soon but here's how you'd prioritize longer matches in general, for results having the same rank:

diff --git a/src/history/history.rs b/src/history/history.rs

index 6c2eca1..cea1d2f 100644

--- a/src/history/history.rs

+++ b/src/history/history.rs

@@ -247,7 +247,8 @@ impl History {

                               selected_occurrences_factor, occurrences_factor

                        FROM contextual_commands

                        WHERE cmd LIKE (:like)
  • ORDER BY rank DESC LIMIT :limit";

  • ORDER BY rank DESC, length(cmd) DESC

  • LIMIT :limit";

     let mut statement = self
    
         .connection
    
         .prepare(query)

However, I don't think it's a good idea. Especially with fuzzy searching, shorter matches get filtered out naturally as you continue to type, so prioritizing longer matches makes the shorter all but unreachable by fuzzy searching alone.

$ cruns

cargo run -- search --fuzzy

cargo run -- search

If anything, I'd want the opposite, so cargo run -- search would be the top result until you entered an f or a z.

— You are receiving this because you commented.

Reply to this email directly, view it on GitHub https://github.com/cantino/mcfly/pull/103#issuecomment-709314702, or unsubscribe https://github.com/notifications/unsubscribe-auth/AAAUO623Y5MWFJIGV64YPC3SK3YJHANCNFSM4R5DKHVA .

dmfay commented 3 years ago

Serves me right for not checking; you could order by rank rounded to int, then length (ascending), then the full rank, but that's getting a bit kludgy.

Weighting contiguous characters higher would be a good next step after basic fuzzy searching. That's beyond the capabilities of SQLite so it'd have to happen after results are returned. Results near the bottom could have worse quality than those that didn't get in under the LIMIT too, but that's not much of a problem in practice.

cantino commented 3 years ago

I'd be fine merging this as-is if we could fix highlighting, and then we can always work on improving the sorting.

dmfay commented 3 years ago

This implementation could be smartened up a little bit -- right now clr matches a[cceler]ate instead of ac[celer]ate -- but it works, it's correct, and if moved a bit earlier end - start is exactly what we need to weight closer/contiguous matches.

dmfay commented 3 years ago

Fuzzy matches are now weighted by length, so a lower-ranked but shorter match has a chance to come in above a higher-ranked but longer match. For example, with the search text sshmyk and these two matches:

[ssh -i ~/.ssh/my_k]ey user@host
[Some gigantically long String wHich has already Matched 'ssh' but will not complete the match until a Y and a K] are found

The difference in the two match lengths (18 vs 111 characters) is factored into the weight calculation. Here the real ssh command gets a rank bump of +0.86 while the longer string's rank is increased only by +0.14, so the ssh command can overcome an unweighted rank disparity of up to 0.72.

If the longer match were instead, say, 30 characters, the ssh command would be ranked at +0.63 while the longer string would add +0.38. The ssh command's unweighted rank would have to be within 0.25 of the longer string to jump ahead.

cantino commented 3 years ago

I think this is a very sensible approach. Ideally, it'd be nice to empirically derive some sort of weighting for fuzzy matches that takes into account how long the character runs are, how many different sections it's matching, etc., but to do this, I think we'd need a bunch of training data that we don't have. For the time being, this seems fine if it's working well for you in practice.

cantino commented 3 years ago

I'll install this locally and start using it in fuzzy mode as well.

The code looks good. Would you mind running cargo clippy and cargo fmt on it, though, I think there are a couple style differences.

dmfay commented 3 years ago

Updated with formatting. I've been using it locally and am pretty happy with it; there's definitely room for further improvements but it's already quite useful.

cantino commented 3 years ago

Been working for me! Thanks for the contribution :)

cantino commented 3 years ago

Released in v0.5.1 :)