Gilnaa / globwalk

MIT License
47 stars 14 forks source link

globwalk looks at all files even when it doesn't need to #29

Open jyn514 opened 4 years ago

jyn514 commented 4 years ago

https://docs.rs/globwalk/0.8.0/src/globwalk/lib.rs.html#355

If the top-level directory is not a match, no subdirectories or files in the directory will be a match either. However globwalk will still look at them. If there are many files in the ignored directory, this can cause enormous slowdowns. In https://github.com/rust-lang/docs.rs/pull/861/files#diff-ee6431d852ce8913514eece9e3982d32R96-R99, we have several hundred thousand files in subdirectories, but only ~10 in the matched directory, so this causes a slowdown of several orders of magnitude.

jyn514 commented 4 years ago

I'm willing to work on this.

robinfriedli commented 3 years ago

Same here. In my case I'm using Tera, which uses globwalk to handle globs and I noticed that my application is practically unusable when running in a Docker container where the working directory is the root directory of the container because globwalk spends hours walking directories like /sys, I've never even let it finish so who knows how much longer it would have spent scanning my entire file system when it really should just find two html files in src/resources/templates/*.html. I haven't noticed it when running it locally but even then it walks through the entire target/ directory.

Luckily, the fix appears to be rather simple. The iterator does not handle the case where ignore::overrides::Override::matched returns Match::None and only skips the directory if it returns Match::Ignore. So adding the following after line 410 should suffice:

Match::None if is_dir => {
    skip_dir = true;
    continue 'skipper;
}

Then globwalk seems to skip irrelevant directories as expected.

Gilnaa commented 3 years ago

@robinfriedli thank you for working on it! I'll set a reminder to tix this in the upcoming days

Gilnaa commented 3 years ago

After a brief check, it sadly doesn't work exactly like that, and filters out correct files. After applying the patch, the following tests fail after not finding any files:

failures:
    tests::test_blacklist
    tests::test_case_insensitive_matching
    tests::test_from_patterns

It looks like Match::None means that the path (the directory) did not match any of the paths, but it still can be a prefix of a correct match.

Can you supply an example where this worked for you?

robinfriedli commented 3 years ago

Yeah I jumped the gun a bit here. It is not quite as simple unfortunately as the directory being part of the pattern also results in a Match::None. Maybe it would be enough to check whether the directory is a prefix of the glob?

Gilnaa commented 3 years ago

We can construct a second glob for dirs that contains the prefixes of the original positive patterns. Not sure this cover all cases; especially WRT to **, but it's a good start

robinfriedli commented 3 years ago

I thought of that but do we actually want to yield those directories? I guess we could keep those globs separate and only use them to decide on whether to skip. I'm not 100% sure either but creating a glob for each child path (using the path components) might work. For example, for the glob **/templates/*.html we would create the globs **/ and **/templates/, which would match any directory.

robinfriedli commented 3 years ago

That does seem to work to some extent, at least all tests pass (except for the readme test) and it does seem to skip irrelevant directories. I can open a PR if you like.

Gilnaa commented 3 years ago

Gladly. I'll try to attack it with tests over the weekend to see if we missed something.

Thanks again!

robinfriedli commented 3 years ago

No problem 😄Pull request opened

JohnAZoidberg commented 2 years ago

Same here. In my case I'm using Tera, which uses globwalk to handle globs and I noticed that my application is practically unusable when running in a Docker container where the working directory is the root directory of the container because globwalk spends hours walking directories like /sys

Wow, I just ran into this exact same issue. It probably would never finish due to recursive symlinks.