BurntSushi / walkdir

Rust library for walking directories recursively.
The Unlicense
1.21k stars 106 forks source link

Inconsistent behaviour of `skip_current_dir` when combined with `same_file_system (false)` and symlinks to other file-systems #166

Open cipriancraciun opened 1 year ago

cipriancraciun commented 1 year ago

When one uses follow_links (true) but also same_file_system (true), and walkdir encounters a symlink to a folder on a different file-system (which will thus not be recursed-into), it will report it as a file or folder (assuming the pointed-to exists), and thus calling skip_current_dir will in fact skip iterating the parent folder of that symlink.

I don't know if this is actually an issue within walkdir itself, or the code that uses walkdir in such a manner.

Perhaps the best solution is to add a new method DirEntry::is_recursed that states if the current entry is or not actually the subject of recursion, and thus if calling skip_current_dir will affect this entry and not its parent.

Another solution, perhaps best if viewed from a security point-of-view, is to treat symlinks to entities outside the current file-system as broken links, thus the DirEntry::file_type and DirEntry::metadata would actually belong to the symlink and not to the pointed-to file or folder.

BurntSushi commented 1 year ago

I unfortunately can't make sense of your first paragraph. You mention the symlink is not recursed into, but also talk about skip_current_dir... So I'm not sure what you mean.

I think examples will help, or else I'll have to repro it when I get time. (Which might be a while.)

BurntSushi commented 1 year ago

To be clear, I'm unsure what you're actually calling inconsistent.

cipriancraciun commented 1 year ago

I'll try to summarize the issue:

When iterating, walkdir first returns a which .file_type().is_dir() (based on the .metadata() of the pointed-to folder according to follow_links (true)), then it returns b which also is .file_type().is_dir(), then it returns children from b, then it returns c. As expected it skips over a's children because it's not on the same file-system although follow_symlinks (true) was set.

However, if one tries to asses both a and b based only on the .metadata() returned by walkdir::DirEntry, one doesn't see any difference between them, both are .path_is_symlink() and both are folders according to .file_type().is_dir(). Thus if (somewhere down the line) some filter code decides that one of these folders should be not recursed-into and calls .skip_current_dir() the behaviour differs in the two cases:

The problem is that walkdir treats both a and b as symlinks to folders (thus the .metadata() is that of the pointed to folder) when it comes for DirEntry purposes, but then it treats them differently when it comes to recursion and the behaviour of .skip_current_dir().

(For example I had tool that uses walkdir and it worked just fine by ignoring two folders that happened to be symlinks to folders on the same file-system, but then when one of the folders was moved on a different file-system, the tool started to misbehave by stopping the iteration of the parent folder.)

My proposal was to fix this corner-case by either making .skip_current_dir() take into account this particularity, or better, by making the DirEntry that represents a in our case be something akin to a "broken symlink", i.e. the .metadata() should return the lstat of the symlink.

To summarize: when it comes to symlinks, .same_file_system (true) should take priority over .follow_links (true), and the code should behave (only when presented with a symlink pointing to another file-system and same_file_system (true)) just as if walkdir was configured with .follow_links (false).

For example in handle_entry one should also check if .opts.same_file_system was set and in that case only self.follow(dent) only if the symlink points inside the same file-system as root.

https://github.com/BurntSushi/walkdir/blob/master/src/lib.rs#L821-L823

BurntSushi commented 1 year ago

Thank you for that thorough explanation! That makes things much clearer.

by making the DirEntry that represents a in our case be something akin to a "broken symlink", i.e. the .metadata() should return the lstat of the symlink.

So I do like this answer, but I also worry that it might be problematic to represent a non-broken symlink as if it were a broken symlink. It's difficult to come up with concrete problems here, but I am worried. On the other hand, if one thinks of "same file system" as a conceptual model where anything outside the file system the root directory is on is considered to basically not exist, then I think it does make sense...

Otherwise, I think it seems like the least worst option. I'd prefer not to add any new public APIs to fix this for example, because it's not reasonable to expect folks to deeply reason about this and correctly use some new API to work around it. The other option, making skip_current_dir aware of this peculiarity, also seems less clean to me.

PRs for this are welcome, as I'm not sure when I'll have time to get around to this myself.

cipriancraciun commented 1 year ago

PRs for this are welcome, as I'm not sure when I'll have time to get around to this myself.

So, you would accept a PR which implements the "broken symlink" workaround as described above?

(I'm not promissing much, but if I get some time I'll try to implement it. At the moment I've made a workaround in my code that doesn't rely on .skip_current_dir().)

BurntSushi commented 1 year ago

I'm not 100% certain I would take it, but I'm pretty sure. I can't really think of a better path here.

On Sat, Aug 20, 2022, 13:52 Ciprian Dorin Craciun @.***> wrote:

PRs for this are welcome, as I'm not sure when I'll have time to get around to this myself.

So, you would accept a PR which implements the "broken symlink" workaround as described above?

(I'm not promissing much, but if I get some time I'll try to implement it. At the moment I've made a workaround in my code that doesn't rely on .skip_current_dir().)

— Reply to this email directly, view it on GitHub https://github.com/BurntSushi/walkdir/issues/166#issuecomment-1221376188, or unsubscribe https://github.com/notifications/unsubscribe-auth/AADPPYSMHXWPXFYZTUUO6XDV2ELMTANCNFSM57C7Y4WA . You are receiving this because you commented.Message ID: @.***>