BurntSushi / walkdir

Rust library for walking directories recursively.
The Unlicense
1.21k stars 106 forks source link

Help needed: filter_entry returns after first entry #142

Closed ololduck closed 3 years ago

ololduck commented 3 years ago

Hello,

Sorry to bother you with this, but I'm having a weird behavior with filter_entry. It may be due to the incompleteness of my rust understanding.

I'm trying to filter a file tree to return all regular files (as in: not directories) matching at least one element in a set of regular expressions.

So if i had a simple rust project, it would return src/main.rs and cargo.toml, if the regexps were .*\.rs & .*\.toml.

Here's some code supposed to produced this behavior:

src/main.rs:

use regex::Regex;
use std::fmt::Display;
use std::path::Path;
use walkdir::WalkDir;

fn _matches<T: AsRef<str> + Display>(e: &Path, regexps: &[T]) -> bool {
    let dot_git_re = Regex::new(".git/*").expect("invalid git regex");
    for regex in regexps {
        let r = Regex::new(regex.as_ref()).expect(&format!("invalid regex: {}", regex));
        if r.is_match(&e.display().to_string())
            && !(dot_git_re.is_match(&e.display().to_string()) || e.is_dir())
        {
            println!("Found matching file {}", e.display());
            return true;
        }
    }
    println!("File {} didn't match", e.display());
    false
}

pub fn get_files<T: AsRef<str> + Display>(
    base_dir: &str,
    regexps: &[T],
) -> anyhow::Result<Vec<String>> {
    // this is to confirm walkdir indeed sees the files i want to check, and not some OS issue
    // it produces all the Entry: Ok(Direntry("…")) lines
    WalkDir::new(base_dir)
        .into_iter()
        .map(|entry| {
            println!("Entry: {:?}", entry);
        })
        .for_each(drop);
    // this is the real code
    let flist = WalkDir::new(base_dir)
        .into_iter()
        .filter_entry(|e| {
            println!("trying file {}", e.path().display());
            _matches(e.path(), regexps)
        })
        .map(|e| {
            println!("Adding file {:?}", e);
            let e = e.unwrap();
            e.path().display().to_string()
        })
        .collect();
    println!("final list: {:?}", flist);
    Ok(flist)
}

fn main() -> anyhow::Result<()> {
    let regexps = vec![".*\\.rs", ".*\\.toml"];
    let files = get_files(".", &regexps)?;
    println!("{:#?}", files);
    Ok(())
}

cargo.toml:

[package]
name = "walkdir-test"
version = "0.1.0"
authors = ["me <me@me.me>"]
edition = "2018"

# See more keys and their definitions at https://doc.rust-lang.org/cargo/reference/manifest.html

[dependencies]
walkdir="2"
regex="1"
anyhow ="1"

Here's it's output:

$ cargo run | grep -v ./target | grep -v ./.git
   Compiling walkdir-test v0.1.0 (/tmp/walkdir-test)
    Finished dev [unoptimized + debuginfo] target(s) in 0.58s
     Running `target/debug/walkdir-test`
Entry: Ok(DirEntry("."))
Entry: Ok(DirEntry("./Cargo.lock"))
Entry: Ok(DirEntry("./src"))
Entry: Ok(DirEntry("./src/main.rs"))
Entry: Ok(DirEntry("./Cargo.toml"))
trying file .
File . didn't match
final list: []
[]

I'm still in the design phase, so the code is not properly optimized, I hope to change it in the future.

If the previous code worked as intended, i would have at least:

  1. trying file {file}
  2. File {file} didn't match OR Found matching file {file}
  3. Adding file {file} if matched in previous step.

As we can see in the output, it only considers the first directory (the base_dir parameter) and stops its evaluation.

Do you have any pointer that would help me understand what i did wrong?

paul

BurntSushi commented 3 years ago

Thanks for asking a question. I definitely encourage that. However, when asking questions, please do your best to at least provide the following things:

  1. A complete program that others can compile and run. Where possible, reducing the size of this program is preferred.
  2. Provide a sample input to the program. (In this case, I'd expect that to be commands to produce a directory tree.)
  3. Provide the actual output you get from running the program.
  4. Provide the expected or desired output you want.
  5. Where possible, describe the problem you're trying to solve at a high level.

When you don't provide these things, you make it more difficult for people to answer your questions. For example, in this case:

  1. I cannot easily compile the code you've given me. Maybe I could fiddle with it, but it might take several minutes. And in doing so, I may need to guess at some things that could potentially impact the behavior in a way that differs from what you experience.
  2. You've shown the actual output of the program but have not shown the output that you want/expected.
  3. It's not actually clear to me what it is you're trying to do. From what I can tell, you're using filter_entry and it's not behaving as you would expect, but I don't know what you're expectations are. I could guess, but maybe I'd be wrong and we'd wind up going in circles.
  4. I don't know what problem you're trying to solve at a high level.

Please consider doing these things next time you ask a question. Not doing these things puts an additional burden on the folks you're asking to help you.

In the interest of helping you, from what I can see, my guess is that your regexes don't match base_dir and thus filter_entry excludes it and your program won't descend into that directory.

ololduck commented 3 years ago

Thank you for your input, i edited the original post to (hopefully) make it more understandable.

From your last sentence, I think i understand what you mean. since my _match function returns false on dirs, filter_entry won't go down any directory. It makes sense!

I guess i'll make the dir evaluation in a second pass, once all the files have been collected.

ololduck commented 3 years ago

Hello!

I ditched filter_entry, and used plain old Iter.filter, and everything works as expected. I'm closing this issue, and appologize for my incomplete comprehension of filter_entry...

Here's the "final" get_files:

pub fn get_files<T: AsRef<str> + Display>(
    base_dir: &str,
    regexps: &[T],
) -> anyhow::Result<Vec<String>> {
    let final_list = WalkDir::new(base_dir)
        .into_iter()
        .filter_map(Result::ok)
        .filter(|e| {
            debug!("trying file {}", e.path().display());
            _matches(e.path(), regexps)
        })
        .map(|e| {
            debug!("Adding file {:?}", e);
            e.path().display().to_string()
        })
        .collect();
    debug!("final list: {:?}", final_list);
    Ok(final_list)
}

Thanks, paul

BurntSushi commented 3 years ago

Right. From the docs:

Yields only entries which satisfy the given predicate and skips descending into directories that do not satisfy the given predicate.

So if you're returning false for a directory, then walkdir won't descend into it.

If you only care about files and only want to apply a filter to files, then using filter is fine. That is, there is no advantage to using filter_entry.

Thank you for following up and sharing your solution and also for filling in the original issue! I appreciate it.