atom / fuzzy-finder

Find and open files quickly
MIT License
275 stars 138 forks source link

Use git ls-files for loading paths when available. #301

Closed eeejay closed 5 years ago

eeejay commented 7 years ago

Description of the Change

Leverage git to index a git-controlled project quickly by using git-ls-files where appropriate

Background

Walking though a large codebase with node fs is very slow. Gecko has almost 200k files, and it takes 17 seconds on a fast machine. The project is constantly being reindexed after each editor focus. There needs to be a better way.

Solution

Allow a native binary to walk the tree. A simple find or git ls-files is 40x faster than the current nodejs task. I used dugite here to wrap the git commands instead of using child_process because it is more portable and Atom already pulls it in as a dependency elsewhere.

EDIT: This is faster because the non-git solution needs to query each file if it is ignored or not. This means reading, at least, .gitignore + .git/info/exclude on each traversed node.

Alternate Designs

  1. find on POSIX systems performs just as well, I decided to go with git-ls-files because it is more portable, and should be available on Windows, and because it has a simple flag for excluding VCS ignored files.
  2. Use a node, libgit2 based, library.git-utils doesn't expose an API that would allow this. I tried replicating git-ls-files with nodegit to see if it is possible, and it is 4x slower. So I think the git command line solution is the best.
  3. Use child_process.spawn. This is twice as fast as this patch, because it would allow streaming the stdout. dugite currently does not allow that and needs a fixed buffer. I went with dugite because it seemed the most portable, even though it is pretty certain Atom will have access to git in its PATH.

Benefits

A 40x speedup. Time to index Gecko before patch: 16.5s. After patch: 0.4s

Possible Drawbacks

Following symlink subdirs is broken in this solution.

Applicable Issues

What does this mean??

eeejay commented 7 years ago

I guess this could be a fix for issue #271

50Wliu commented 7 years ago

This needs tests before it can be considered.

Following symlink subdirs is broken in this solution.

Do any of the alternatives handle symlinks correctly?

What does this mean??

You figured it out in your next comment :wink:

even though it is pretty certain Atom will have access to git in its PATH.

At least on Windows, it is not required to have Git installed when using Atom.

eeejay commented 7 years ago

I can't figure out what kind of test would be useful here. Ultimately, this change is great because all the regression tests pass, which proves that this change is functionally identical where it matters.

Do you have any ideas as to what kind of test would be good here?

As for the symlink issue, we can potentially use find for that. It wouldn't work on Windows, which is probably fine because symlinks don't work on Windows anyway, afaik. The strategy would look like this:

  1. Run something like find -type l -xtype d to get all the directory symbolic links.
  2. Use the output from the previous step to exclude all those matches from git ls-files (-x)
  3. Run another find to traverse the symbolic linked directories and append output to results from previous step. This step would only be needed if there are actual folders.

I don't think that is ideal, but I also see a conflict with core.followSymlinks and core.excludeVcsIgnoredPaths. Symlinks are never followed in git, it sees them as a leaf node, so their contents are implicitly ignored. So, if you want to exclude VCS ignored paths, you would never follow a version controlled symlink. The only time you would follow a symlink in a VCS folder is if it were untracked. With that in mind, I would suggest the equivalent of these steps:

  1. Capture files in git cache: git ls-files -c --exclude-standard
  2. Capture untracked (other files) git ls-files -o --exclude-standard pipe output to find -L
  3. Append both lists from previous steps as result.
50Wliu commented 7 years ago

Do you have any ideas as to what kind of test would be good here?

Disclaimer: I don't know what the code coverage for fuzzy-finder looks like. It could be that the test I'm suggesting already exists.

I'd like to see a test asserting that we don't attempt to use the Git method when loading paths for a non-Git project.

eeejay commented 7 years ago

I'd like to see a test asserting that we don't attempt to use the Git method when loading paths for a non-Git project.

This is already tested for!

eeejay commented 7 years ago

Updated! (don't know if you get notified when another commit is pushed)

50Wliu commented 7 years ago

:+1: Sorry for the late reply; was on vacation. I've brought this PR up to the other team members for consideration.

ungb commented 7 years ago

sorry for the late arrival to this PR. Here are the following test scenario I plan to run through:

I'll be spending time this week to look into these, please let me know if you feel like anything is missing from this list. I'll be testing first before the changes and reporting back if anything changed or is the same.

One thing I see is:

Following symlink subdirs is broken in this solution.

@eeejay do you know if this has been fixed with a more recent commit?

ungb commented 6 years ago

Thanks for the work you've put into this PR.

Unfortunately, we won't be able to accept this as it is currently written because it causes a change in behavior with regard to symlinks. We are also looking at making indexing better by incorporating some of the work we've invested in our file watcher infrastructure. If you can update this PR to address the symlinks issue, we'll reevaluate it after the indexing changes.

Thanks again for your help!

eeejay commented 5 years ago

Pull request #367 does this now.