BurntSushi / walkdir

Rust library for walking directories recursively.
The Unlicense

Add DirEntry::into_path method #100

Closed ruuda closed 5 years ago

ruuda commented 6 years ago

This pull request adds fn into_path(self) -> PathBuf to DirEntry.

Motivation

The use case for this is that I have an API like this:

fn from_paths<P: AsRef<Path>, I: Iterator<Item = P>>(paths: I)

I would like to be able to pass in something like this:

WalkDir::new(dir).into_iter().map(|e| e.unwrap().path())

Unfortunately that does not work: path() returns a &Path borrowed from the DirEntry, which is dropped at the end of the closure. So I have to make a copy:

WalkDir::new(dir).into_iter().map(|e| PathBuf::from(e.unwrap().path()))

This is wasteful: we allocate a PathBuf and copy the path into it, only to destroy the original immediately afterwards. With the proposed into_path, the same call works without the extra allocation and copy.
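
A minimal, stdlib-only sketch of the difference between the two approaches. Entry is a hypothetical stand-in for walkdir's DirEntry (assumed here to own its path as a PathBuf), and from_paths and buf_ptr are illustrative helpers, not part of any library.

```rust
use std::ffi::OsStr;
use std::path::{Path, PathBuf};

// Hypothetical stand-in for a directory entry that owns its path.
struct Entry {
    path: PathBuf,
}

impl Entry {
    // Borrowing accessor: the returned &Path cannot outlive the Entry.
    fn path(&self) -> &Path {
        &self.path
    }

    // Consuming accessor: moves the owned PathBuf out of the entry.
    fn into_path(self) -> PathBuf {
        self.path
    }
}

// Hypothetical consumer that accepts any iterator of path-like items.
fn from_paths<P: AsRef<Path>, I: Iterator<Item = P>>(paths: I) -> Vec<String> {
    paths.map(|p| p.as_ref().display().to_string()).collect()
}

// Address of the bytes backing a path, used to check whether a buffer was reused.
fn buf_ptr(p: &Path) -> *const u8 {
    p.as_os_str() as *const OsStr as *const u8
}

fn main() {
    let entry = Entry { path: PathBuf::from("a/b.foo") };
    let original = buf_ptr(entry.path());

    // Copying: allocates a second buffer and copies the bytes into it.
    let copied = PathBuf::from(entry.path());
    println!("copy reuses buffer: {}", buf_ptr(&copied) == original);

    // Moving: hands over the buffer the entry already owned.
    let moved = entry.into_path();
    println!("move reuses buffer: {}", buf_ptr(&moved) == original);

    println!("{:?}", from_paths(vec![moved].into_iter()));
}
```

Because into_path takes self by value, the closure can return an owned PathBuf directly, which satisfies an AsRef<Path> bound without borrowing from the dropped entry.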

Performance

I have a small program that basically does this:

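use std::env;
use std::ffi::OsStr;
use std::path::PathBuf;

fn main() {
    // Directory to walk, taken from the first command-line argument.
    let dir = env::args().nth(1).unwrap();
    let ext = OsStr::new("foo");
    let wd = walkdir::WalkDir::new(&dir)
        .follow_links(true)
        .max_open(128);
    let paths: Vec<PathBuf> = wd
        .into_iter()
        .map(|e| e.unwrap())
        .filter(|e| e.file_type().is_file())
        // The copy under test; the noncopy variant uses e.into_path() here.
        .map(|e| PathBuf::from(e.path()))
        .filter(|p| p.extension() == Some(ext))
        .collect();
    // Exit immediately without running destructors.
    std::process::abort();
}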

The actual code is part of a larger program and also prints to stdout every 64 iterations. I ran it on a directory where the iterator yields 12408 paths, with a warm page cache on Linux. Times were recorded by running the program under perf stat, 16 times for each configuration. The raw data is below: copy is PathBuf::from(e.path()) and noncopy is e.into_path().

copy    <- c(0.023223363, 0.022365082, 0.022318216, 0.022584837,
             0.020660742, 0.023839308, 0.022084252, 0.021812114,
             0.022180668, 0.019982074, 0.020979151, 0.023186709,
             0.024758619, 0.022889618, 0.024148854, 0.024708654)
noncopy <- c(0.022403112, 0.021863389, 0.019650964, 0.020984869,
             0.021901483, 0.021376926, 0.021668108, 0.021504715,
             0.023730031, 0.021861766, 0.021060567, 0.021986531,
             0.022680138, 0.019719019, 0.020053399, 0.021137137)
t.test(copy, noncopy)
    Welch Two Sample t-test

data:  copy and noncopy
t = 2.6055, df = 28.297, p-value = 0.01447
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
 0.000242829 0.002024684
sample estimates:
 mean of x  mean of y 
0.02260764 0.02147388 

That’s about a 5% speedup.
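
The 5% figure is the relative difference between the two sample means. As a cross-check, here is a minimal Rust sketch that recomputes the means, the relative difference, and the Welch t statistic from the raw data above. The function names are illustrative, and the sketch does not compute the degrees of freedom or the p-value that R reports.

```rust
const COPY: [f64; 16] = [
    0.023223363, 0.022365082, 0.022318216, 0.022584837,
    0.020660742, 0.023839308, 0.022084252, 0.021812114,
    0.022180668, 0.019982074, 0.020979151, 0.023186709,
    0.024758619, 0.022889618, 0.024148854, 0.024708654,
];
const NONCOPY: [f64; 16] = [
    0.022403112, 0.021863389, 0.019650964, 0.020984869,
    0.021901483, 0.021376926, 0.021668108, 0.021504715,
    0.023730031, 0.021861766, 0.021060567, 0.021986531,
    0.022680138, 0.019719019, 0.020053399, 0.021137137,
];

// Arithmetic mean of the samples.
fn mean(xs: &[f64]) -> f64 {
    xs.iter().sum::<f64>() / xs.len() as f64
}

// Unbiased sample variance (divides by n - 1).
fn sample_var(xs: &[f64]) -> f64 {
    let m = mean(xs);
    xs.iter().map(|x| (x - m).powi(2)).sum::<f64>() / (xs.len() as f64 - 1.0)
}

// Welch's t statistic: difference in means over the unpooled standard error.
fn welch_t(a: &[f64], b: &[f64]) -> f64 {
    let se = (sample_var(a) / a.len() as f64 + sample_var(b) / b.len() as f64).sqrt();
    (mean(a) - mean(b)) / se
}

// Fraction of the baseline mean saved by the candidate.
fn relative_speedup(baseline: &[f64], candidate: &[f64]) -> f64 {
    (mean(baseline) - mean(candidate)) / mean(baseline)
}

fn main() {
    println!("mean copy    = {:.8}", mean(&COPY));
    println!("mean noncopy = {:.8}", mean(&NONCOPY));
    println!("welch t      = {:.4}", welch_t(&COPY, &NONCOPY));
    println!("speedup      = {:.1}%", 100.0 * relative_speedup(&COPY, &NONCOPY));
}
```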

ruuda commented 5 years ago

Thanks!