Complex filesystem loops cause multiple-counting, runaway recursion and CPU usage

dividedmind commented 1 year ago

Walkdir has simple filesystem loop detection, but more complex filesystem loop structures still lead to very deep, looped recursion with runaway CPU and I/O usage. This is not only an academic point, as it can happen in realistic scenarios. Here's a simple example to reproduce the problem (using ripgrep; the example needs relatively recent yarn, whose pnpm linking strategy relies heavily on symlinks):

$ echo 'nodeLinker: pnpm' > .yarnrc.yml
$ echo '{}' > package.json
$ yarn add jest
$ time (rg --files -g '**' | wc)  # works fine
   5204    5204  521743

real    0m0,026s
user    0m0,027s
sys 0m0,042s
$ time (rg --files -g '**' --follow | wc)  # saturates the CPU for a long while and prints thousands of symlink loop warnings
28773809 28773809 11025122280

real    4m26,497s
user    3m42,070s
sys 12m14,423s

Of course one could simply not use --follow in this case, but sometimes there are other symlinks in the tree that you do want to follow. Also some tools use ripgrep internally not giving you a choice of whether to follow or not (VSCode's findFiles mechanism for instance).

dividedmind commented 1 year ago

One idea to avoid the issue would be to not follow any directory symlinks that point into the forest being examined. The forest would initially only contain the tree rooted in the initial search root, but any symlinks pointing outside would add another tree to the forest (to avoid having the same problem if there's a symlink pointing to a problematic structure that's located outside the initial search root). This can still be inaccurate (if there's an external symlink to an ancestor of a previously encountered external symlink) but should fix runaway recursion and will only cause double-counting in rare circumstances.

A more accurate solution would be to keep a complete hash set of realpaths of visited directories, but that's much more CPU and memory overhead.

BurntSushi commented 1 year ago

It's not clear to me that the commands you've provided use this crate at all.

Can you please provide a reproduction using a simple Rust program and a simple bash script to setup the input?

I'm also not sure I buy your characterization here. If there's a loop, then directory traversal wouldn't terminate. The fact that it does terminate for you seems to imply that loop detection is working as designed.

Otherwise, symlinks can create very deeply inter-connected trees. If you don't want to spend the resources required to traverse them, then I don't think there is any solid solution other than not traversing them. If you use tools that don't let you configure ripgrep to skip paths or disable following symlinks, then I think that's an issue with the tool right?

dividedmind commented 1 year ago

It's not clear to me that the commands you've provided use this crate at all.

Can you please provide a reproduction using a simple Rust program and a simple bash script to setup the input?

I'm not a Rust developer but I found the issue when using VSCode and followed by reading the code through ripgrep down to this crate. I figured I'd report it directly where it could be fixed instead of going through the chain, but if you prefer I can bring it up with downstreams.

I'm also not sure I buy your characterization here. If there's a loop, then directory traversal wouldn't terminate. The fact that it does terminate for you seems to imply that loop detection is working as designed.

I suppose you're right, I might have mischaracterized the problem; it is only tangentially related to loop detection.

Otherwise, symlinks can create very deeply inter-connected trees. If you don't want to spend the resources required to traverse them, then I don't think there is any solid solution other than not traversing them. If you use tools that don't let you configure ripgrep to skip paths or disable following symlinks, then I think that's an issue with the tool right?

The problem is that I'd expect the files not to be double (or in this example 5000x) counted. Enabling symlink following shouldn't simply break walking trees with complex structures, leading to unexpected and surprising resource usage. If you have a tree that bona fide has 25 M leaves then you can expect this to take long, but as it is you have a tree with 5k entries that generates 11 GB of walk data, counting each file 5000 times. This can also be potentially a security problem if something uses walkdir/ripgrep on an untrusted input leading to a denial of service.

Even if a tool allows one to configure ripgrep to not follow or skip, I believe this is still still an issue because you need to know beforehand that a part of the tree is problematic. And sometimes you really do want to examine a tree that has those deeply interconnected substructures while following symlinks.

Maybe such pruning does not need to be the default in walkdir, but it's IMO definitely something that defensively written software should be able to opt into.

BurntSushi commented 1 year ago

I maintain both this and ripgrep. So I suppose it's not too much of a difference, but I'd say this is overall more in ripgrep's domain. This is just a crate for walking the filesystem with a minimal guarantee of termination. With that said, I don't want you to go an open an issue on ripgrep only for me to just close it as wontfix (since I suspect that's what I'd do). So we can hash it out a little more here...

The problem is that I'd expect the files not to be double (or in this example 5000x) counted.

So this is a much clearer statement of the problem. That is, I think you're asking for ripgrep to de-duplicate its output regardless of symbolic links. That is, if a file or directory has already been visited, then don't visit it again. The problem with this feature is that it's a complicated engineering task to implement. You can't just add a HashSet and call it a day. You also need to specify how and which duplicates are searched. And also how to determine whether any given two files are duplicates. I'm pretty sure it's a solvable problem, but it's non-trivial. And it would absolutely need to be opt-in. So even if this existed, you'd have to convince VS Code to pass a flag enabling it (or let you pass the flag to enable it).

This can also be potentially a security problem if something uses walkdir/ripgrep on an untrusted input leading to a denial of service.

ripgrep does not guard against such things. There are a whole mess of ways you can make ripgrep spend a lot of time searching a directory tree today. There is no real way, and never will be a way, to run ripgrep run on arbitrary untrusted input and get a guarantee about its running time. So this is not a compelling argument to me.

Maybe such pruning does not need to be the default in walkdir, but it's IMO definitely something that defensively written software should be able to opt into.

That's already true today. Following symbolic links is an opt-in behavior. Neither ripgrep nor walkdir has this problem by default. It's other tools, like VS Code, that are seemingly opting you into this by default. That's their issue.

I think that if you wanted to file an issue asking for a new opt-in feature in ripgrep that does some kind of search result de-duplication based on detecting whether a particular file has already been searched, then I'd be open to at least discussing that. But it absolutely has to be opt-in. That's non-negotiable. I can't promise I'll accept it, and even if I do, that it will be implemented in any reasonable time frame.

BurntSushi / walkdir

Complex filesystem loops cause multiple-counting, runaway recursion and CPU usage #191