RFC: use git ls to crawl all files in the project

rafeca commented 5 years ago

Summary

This is some code to demonstrate the usage of git ls-files on the fuzzy finder, so we can discuss if it makes sense to implement this solution.

In terms of performance, this PR provides huge benefits for large repositories: for example it makes the crawling 5X faster when opening the gecko-dev repository (it goes from 57s in the current implementation after merging #366 to 11s).

Tradeoffs

Using git ls-files has to important tradeoffs:

It does not support symlinks (as stated in #301). To overcome this tradeoff I've decided to only enable this codepath whenever traverseSymlinkDirectories is false (which unfortunately would not happen often since it's enabled by default).
It does not support git submodules. This means that on a repo that has submodules the files inside these won't be accessible by the fuzzy finder.

The second tradeoff is quite a big deal, so if we ever want to enable this crawler we would want to have it under some kind of feature flag that could be enabled on the settings (e.g "Enable fast mode in fuzzy finder", showing some messaging around the tradeoffs of this mode.

Additionally, in order to make this mode more discoverable we could show some kind of prompt on the Atom UI whenever somebody opens a large project: if on the first crawling we detect that the repo has more than e.g 20K files we show a message suggesting enabling the fast mode.

Alternative solutions

This is a very ad-hoc solution for a very specific problem: It does not fix the current architectural issues on the Fuzzy Finder around file watching, recrawling, etc.

If we want to invest some significant time into making an even greater Fuzzy Finder it would make more sense to address the architectural issues by creating some sort of global data structure that holds in memory all the project files and watches them for changes.

That "virtual filesystem" could probably be built on top of @atom/watcher, and could be used by other packages like the Tree View.

It's important to note that this alternative is way more complex to implement and would take much longer to ship.

Next potential steps

This is just a WIP code to get some signal about whether we want to invest more on this path. if we want to move forward, there are a few things (some of them not trivial) that we'd need to do to be able to ship this:

[ ] Implement the missing APIs on git-utils and libgit2 to avoid spawning a git process.
[ ] Add some kind of setting to enable this fast mode.
[ ] Make the setting discoverable by adding a prompt on large projects.

lee-dohm commented 5 years ago

If I understand this correctly, it would also not improve things if the project is not being tracked by Git?

rafeca commented 5 years ago

If I understand this correctly, it would also not improve things if the project is not being tracked by Git?

Exactly, but currently the biggest perf bottleneck on the fuzzy finder happens only on Git projects due to the checks to see if a file is ignored by Git, which takes more than 80% of the time (more info here).

smashwilson commented 5 years ago

Another possibility: one of the "holy grails" for find-and-replace has been to integrate with ripgrep as a backend. ripgrep is blazing fast and solves a lot of long-standing annoyances around find-and-replace. The downside is that using ripgrep as a backend would probably involve a near-total rewrite of find-and-replace. It's likely still less effort than designing a virtual filesystem model, though?

rafeca commented 5 years ago

Thanks for the comments @smashwilson !

Interesting! @nathansobo and @as-cii started to take an approach similar to this over in xray. You're correct that it takes a pretty substantial amount of engineering to get "right". Filesystems are bags of holding for edge cases. That's a good argument for the benefits of centralizing it, but I'd guess it would take a solid few months to design and put in place.

Yup, completely agree. I would implement the virtual filesystem as an abstraction layer on the actual fs, with just a subset of the available APIs. Still, designing something like that takes a substantial amount of time and the user benefits (apart from the architectural ones) are not clear.

I wonder if there's some intermediate space we could explore that might address some of the tradeoffs you've identified without dedicating that amount of effort. What if we built our own native Node module that implemented .gitignore-compatible path filtering, but traversed symlinks and into submodules?

I'd try to avoid doing this: reimplementing a file tree traverser that takes care of .gitignore rules would be quite a bit of effort as well (we would have to replicate git's behaviour around nested .gitignore rules, user-defined rules, etc), and we won't really know what perf gains will this give us until it's done (probably we would have to put a lot of effort on optimizations to reach git perf).

Another possibility: one of the "holy grails" for find-and-replace has been to integrate with ripgrep as a backend. ripgrep is blazing fast and solves a lot of long-standing annoyances around find-and-replace. The downside is that using ripgrep as a backend would probably involve a near-total rewrite of find-and-replace. It's likely still less effort than designing a virtual filesystem model, though?

I really like this solution! ripgrep should give us a lot of performance out of the box, and it should not be hard to hack on a prototype to get actual improvement values (so then we can decide if it's worth it).

rafeca commented 5 years ago

fwiw, git ls-files has an option to traverse submodules, but it's currently incompatible with the option that returns untracked files.

maxbrunsfeld commented 5 years ago

Super excited about this. I like the ripgrep approach because it would help both find-and-replace and fuzzy-finder. You can also still use it if the user turns off gitignore mode and wants to see all files (I forget what the setting is called; it’s different in the two packages). VS code also already has code for parsing ripgrep’s output I believe.

rafeca commented 5 years ago

I've created https://github.com/atom/fuzzy-finder/pull/369 as a follow-up of this discussion, thanks everyone for the suggestions! ✨

eeejay commented 5 years ago

I proposed this exact change 2 years ago in pull request #301. It was rejected because it didn't handle symlinks. I don't think this change does that either.

eeejay commented 5 years ago

Ah, I just saw that in the tradeoffs. nm!

atom / fuzzy-finder