Add option to use `ripgrep` for crawling the list of files

rafeca commented 5 years ago

Summary

This PR contains some minimal changes to demonstrate the usage of ripgrep on the fuzzy finder, so we can discuss the tradeoffs and steps needed to be able to ship this.

This work comes as a follow-up from https://github.com/atom/fuzzy-finder/pull/367 after the suggestions of @smashwilson .

Benefits

Using ripgrep speeds up drastically the time to crawl medium and large repositories (we're taking about up to 14X faster times):

Type	Num Files	Time (master)	Time (This PR)	Improvements
Small	2K	1.3s	0.9s	30% less time (or 1.4X faster) 🎉
Medium	30K	6.2s	1.2s	80% less time (or 5X faster) 🎉
Large	270K	57.5s	4.1s	93% less time (or 14X faster) 🎉

(the measurement has been done the same way as in https://github.com/atom/fuzzy-finder/pull/366).

Possible Drawbacks

The crawling behaviour with ripgrep is slightly different than the one currently implemented. I don't think that the changes are important enough (or even bad), but it's important to list them:

ripgrep also returns symlink destination files (e.g if there's a symlink ./foo.js which points to ./bar.js, with ripgrep foo.js can also be opened. I think this is an improvement.
The ordering of the returned results is slightly different: right now results are returned in a DFS ordering (then alphabetically), while ripgrep returns all results alphabetically ordered.
Currently, if Atom gets opened from a non-git folder which has git repositories in some sub-folders, the .gitignore files from the sub-folders are ignored. With ripgrep they are taken into account. I think that this is an improvement.

Regarding 2., this change is quite noticeable, since the fuzzy finder currently displays the first 10 files returned by the crawler when it gets opened. This means that the first shown files will be slightly different than before (IMO we should change the default list of files that are shown when opening the fuzzy finder to show e.g recently opened files or currently opened files).

Alternate Designs

Other potential designs were presented here, and seems like ripgrep is the best tool to use at this moment.

Regarding the current implementation, I've decided to use vscode-ripgrep, which is just a package that handles the download of the ripgrep binaries for several platforms. This has saved me a ton of time, but I'm not familiar enough with the process to install bundled extensions in Atom to know if this package is suitable for the job.

Next steps

[x] Write some automated tests to verify the ripgrep crawling.
[x] Manually check that ripgrep works well on different scenarios.
- [x] Different types of file system (FAT, NTFS, ...).
- [x] Different OSes (Windows).
- [x] Check that the ripgrep binary can be used correctly from a production build (e.g does the asar packaging cause any issue?).
- [x] OSX
- [x] Windows
[x] Check that the behaviour is correct.
- [x] Files/folders with utf8 names.
- [x] List of returned files is the same in a few different repos.

Applicable Issues

https://github.com/atom/fuzzy-finder/issues/271

rafeca commented 5 years ago

Small update: I just realized that when using ripgrep we don't need to initialize the Git native code (see commit) 😃

This is quite slow (specially for large repos) so it gives us even greater performance improvements (we can reduce the times on large repos by another 50% 🚀). I'm gonna update the results table with the new data

rafeca commented 5 years ago

I think it's good to share a couple of screencasts to compare the previous experience on very large repos (gecko-dev) with the one after this PR:

What happens now in master after opening Atom (or focusing Atom back):

master

The same flow with this PR:

ripgrep

jasonrudolph commented 5 years ago

Currently, if Atom gets opened from a non-git folder which has git repositories in some sub-folders, the .gitignore files from the sub-folders are ignored. With ripgrep they are taken into account. I think that this is an improvement.

I consider this a huge improvement. 😍

I think it's good to share a couple of screencasts to compare the previous experience on very large repos (gecko-dev) with the one after this PR...

Wow! 🏎💨

Next steps...

Thanks for listing these out. This seems promising and very much worth pursuing. :zap:

rafeca commented 5 years ago

The testing so far seems to be going well:

I've modified the test suite to run all the fuzzy finder tests both with and without ripgrep, this way we have good coverage for ripgrep.
The functionality works fine on Windows and on different file systems.
I've been able to test a prod build on OSX and in fact I had to do a small fix to support asar archives.
I haven't been able to build Atom on Windows yet... This is the last bit of remaining work.

A couple of drawbacks that I've discovered today:

The ripgrep binary is quite big (~5MB for OSX, a little bit less for Windows). Once it gets gzipped its size gets reduced to ~1.8MB, which means that the install files of Atom will grow by ~1.25% (based on the current size of ~144MB).
Currently, vscode-ripgrep downloads the needed binary from GitHub during installation (code). This causes failures on CI systems due to rate limits. I've fixed that for now by putting setting a GitHub token of mine as an env var on the travis config for fuzzy-finder. We should either do the same in atom/atom or stop using vscode-ripgrep and build (and maintain) our own package.

@jasonrudolph any thoughts about these drawbacks?

jasonrudolph commented 5 years ago

Nice work, @rafeca!

The ripgrep binary is quite big (~5MB for OSX, a little bit less for Windows). Once it gets gzipped its size gets reduced to ~1.8MB, which means that the install files of Atom will grow by ~1.25% (based on the current size of ~144MB).

This seems like a worthwhile tradeoff to me. We're trading a relatively tiny bit of additional disk space and bandwidth, and it turn we spend less time waiting for fuzzy-finder results. 👍

I've fixed that for now by putting setting a GitHub token of mine as an env var on the travis config for fuzzy-finder. We should either do the same in atom/atom or stop using vscode-ripgrep and build (and maintain) our own package.

We can generate a scopeless API token for our build account and use that token for atom/fuzzy-finder builds and atom/atom builds. Let's coordinate via Slack to make this happen.

rafeca commented 5 years ago

Awesome! I agree it's a worthwhile tradeoff 😃

I've been able to verify that Windows builds are packaged correctly with ripgrep, which was the last item from my list of things to consider... So this PR is ready for review 😄

rafeca commented 5 years ago

Oh there's something else, this feature is behind a config param which is off by default. Since it's a fuzzy-finder config param, the setting is quite hidden and really hard to discover (you have to go to settings -> packages -> search fuzzy finder and click on its settings.

This is how this setting looks like:

I have mixed feelings about this:

The benefit is that by having the setting we can find early issues from people that opt in before we turn it on by default.
The problem is that the setting is so hidden that I don't expect many people to turn it on, so we're delaying the impact of it until we make it on by default (which takes at least one month) without getting much more testing in exchange.

Couple of things that we could do:

Turn it on by default (and remove the "experimental" wording from the setting). If under some situations something goes wrong we can tell people to turn it off as a workaround.
Leave it as it is: one month is not that much time and even a little bit of extra testing is appreciated.
Leave it as off but implement a popup that prompts users to turn it on if we detect that they open a medium or large repository. This is my favorite one, but if this takes much effort to implement then I don't think it's worth it.

50Wliu commented 5 years ago

Another time we implemented an experimental setting was when we started using fuzzaldrin-plus as an alternative to fuzzaldrin. The setting was off by default - you'll actually notice it right above your new setting in the last screenshot you posted.

Whenever people would open issues noting odd fuzzy finding behavior, we would ask them to try turning on the setting and reporting back. I think we can do the same thing here - keep the setting off by default, but advertise that the setting exists and will become the default in the near future (as we did for tree-sitter).

Looking back, what I think we could have done better is set a timeline for when to enable the new scoring method by default. As you can see, it's been three or four years, and I'm still not sure if it's enabled by default yet on all the packages that have it as an option. That was something we handled much better with tree-sitter.

I would opt towards making it off by default, advertising it when we can, and determining a target version to have it on by default. If by that target version it seems like it's still not ready for prime time, then we can re-evaluate.

rafeca commented 5 years ago

@50Wliu that sounds good to me.

If I understand correctly our release cadence, if this PR lands now it'll get included in v1.37.0 which will be released around May. We could aim to enable it by default by v1.38.0 or v1.39.0 depending on the feedback/issues reported by users (this can be decided later though).

jasonrudolph commented 5 years ago

Oh there's something else, this feature is behind a config param which is off by default. Since it's a fuzzy-finder config param, the setting is quite hidden and really hard to discover...

Couple of things that we could do:

...

...

Leave it as off but implement a popup that prompts users to turn it on if we detect that they open a medium or large repository. This is my favorite one, but if this takes much effort to implement then I don't think it's worth it.

I love this idea! It doesn't need to be part of this PR, but I think it's worth trying in a follow-up PR. I'd be happy to help. :smile:

For what it's worth, I think it will be fairly straightforward to implement using the notification API.

rafeca commented 5 years ago

I love this idea! It doesn't need to be part of this PR, but I think it's worth trying in a follow-up PR. I'd be happy to help. 😄

For what it's worth, I think it will be fairly straightforward to implement using the notification API.

Awesome! We can implement it as a followup 😄

black-snow commented 5 years ago

Read the release notes, followed the path to this PR, followed your guide where to find the new setting and hit the checkbox. I think there will be more people going down this route :)

atom / fuzzy-finder