BurntSushi / walkdir

Rust library for walking directories recursively.
The Unlicense

readdir() performance on large directories #108

Closed phiresky closed 5 years ago

phiresky commented 5 years ago

See this article: http://be-n.com/spw/you-can-list-a-million-files-in-a-directory-but-not-with-ls.html

ls and practically every other method of listing a directory (including Python's os.listdir and find .) rely on libc's readdir(). However, readdir() only reads 32 KiB of directory entries at a time, which means that if you have a lot of files in the same directory (e.g. 500 MiB worth of directory entries), it is going to take an insanely long time to read them all, especially on a slow disk. For directories containing a very large number of files, you need to dig deeper than tools that rely on readdir() and call the getdents() syscall directly, with a much larger buffer, instead of going through the libc helpers.

I'm just mentioning this since you have a comparison to find, and using a buffer size larger than 32 KiB (~400 files per syscall) might improve walking performance further.
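
For concreteness, here's a rough sketch of what I mean (not walkdir code, just an illustration assuming the libc crate on Linux): read a directory by calling getdents64 directly with a big buffer. The 5 MiB buffer size is an arbitrary choice for illustration.

use std::ffi::{CStr, CString};
use std::os::raw::c_char;

// 5 MiB buffer, far larger than glibc's ~32 KiB readdir() buffer (arbitrary choice).
const BUF_SIZE: usize = 5 * 1024 * 1024;

// Layout of the records returned by getdents64 (see `man 2 getdents`).
#[repr(C)]
struct LinuxDirent64 {
    d_ino: u64,
    d_off: i64,
    d_reclen: u16,
    d_type: u8,
    d_name: [u8; 0], // NUL-terminated name bytes start here
}

fn main() {
    let path = std::env::args().nth(1).expect("usage: list <dir>");
    let cpath = CString::new(path).unwrap();
    unsafe {
        let fd = libc::open(cpath.as_ptr(), libc::O_RDONLY | libc::O_DIRECTORY);
        assert!(fd >= 0, "open failed");
        // Use a Vec<u64> so the buffer is 8-byte aligned; the kernel keeps
        // each record 8-byte aligned within it.
        let mut buf = vec![0u64; BUF_SIZE / 8];
        loop {
            let nread = libc::syscall(libc::SYS_getdents64, fd, buf.as_mut_ptr(), BUF_SIZE);
            if nread <= 0 {
                break; // 0 = end of directory, < 0 = error
            }
            let base = buf.as_ptr() as *const u8;
            let mut pos = 0usize;
            while pos < nread as usize {
                let d = base.add(pos) as *const LinuxDirent64;
                let name = CStr::from_ptr((*d).d_name.as_ptr() as *const c_char);
                // Prints every entry, including "." and "..".
                println!("{}", name.to_string_lossy());
                pos += (*d).d_reclen as usize;
            }
        }
        libc::close(fd);
    }
}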

BurntSushi commented 5 years ago

This crate doesn't deal with syscalls directly. We can do it if it's necessary and well motivated, but I'm not getting that from this issue. What specific action are you requesting here?

phiresky commented 5 years ago

I just saw that the syscall is handled by fs::read_dir, which in the standard library is specialized to call into libc's readdir on Unix systems. Since this seems to be specific to Linux, it would probably need a special case. I guess you'd need some benchmarks beforehand to see whether the extra implementation would be worth it. In any case, I guess this is out of scope for this project, so closing this.

BurntSushi commented 5 years ago

Benchmarks, sure... But I would also want to know why all of the standard C libraries for traversing directories on Linux have gotten away without special handling for large directories, and why, specifically, walkdir should diverge from them.

phiresky commented 5 years ago

I actually just did a benchmark. The authors of the article probably did something wrong, since I'm getting mostly the same result regardless of the buffer size.

Still, it seems like rg is a lot slower than both find and ls:


 Performance counter stats for 'rg -j1 --files ./b' (100 runs):
       1,224842372 seconds time elapsed                                          ( +-  0,25% )
 Performance counter stats for 'rg --files ./b' (100 runs):
       0,989319179 seconds time elapsed                                          ( +-  0,52% )
 Performance counter stats for 'ls -f ./b' (100 runs):
       0,444006570 seconds time elapsed                                          ( +-  0,08% )
 Performance counter stats for 'find ./b' (100 runs):
       0,413734409 seconds time elapsed                                          ( +-  0,05% )
 Performance counter stats for './test-readdir ./b' (100 runs):
       0,213617179 seconds time elapsed                                          ( +-  0,06% )
 Performance counter stats for './test-30k ./b' (100 runs):
       0,202921894 seconds time elapsed                                          ( +-  0,06% )
 Performance counter stats for './test-5m ./b' (100 runs):
       0,199983002 seconds time elapsed                                          ( +-  0,08% )

perfresults.txt

# create dir with 1M files
testdir=b
mkdir $testdir
seq 1 1000000 | while read i; do touch $testdir/fiiiiiiiiiiiiiiiiiiiiiiiiiiile$i; done

# ensure warm cache
rg --files $testdir >/dev/null

perf stat -r 100 rg -j1 --files $testdir >/dev/null
perf stat -r 100 rg --files $testdir >/dev/null
perf stat -r 100 ls -f $testdir >/dev/null
perf stat -r 100 find $testdir >/dev/null
perf stat -r 100 ./test-readdir $testdir >/dev/null
perf stat -r 100 ./test-30k $testdir >/dev/null
perf stat -r 100 ./test-5m $testdir >/dev/null

(The ./test-* binaries are based on the article's code with different buffer sizes. I can upload the code if you want to see it.)

@BurntSushi

BurntSushi commented 5 years ago

I'm not sure what ripgrep has to do with this, but if you want to upload that code and create an issue on the ripgrep tracker, then I'd be happy to investigate. It might reveal some constant overhead in ripgrep that matters less on smaller directories. But in general, ripgrep is not necessarily doing the same work, because you did not disable the gitignore logic.
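
For a more apples-to-apples comparison at the library level, a rough sketch (assuming the ignore crate, which ripgrep uses for directory traversal) would be to switch off the standard filters explicitly:

use ignore::WalkBuilder;

fn main() {
    // Walk with the `ignore` crate but with the standard filters
    // (gitignore rules, hidden files, parent ignores, etc.) disabled, so the
    // traversal work is closer to what walkdir/find do.
    let walker = WalkBuilder::new("./b") // "./b" is just the test directory from above
        .standard_filters(false)
        .build();
    for result in walker {
        match result {
            Ok(entry) => println!("{}", entry.path().display()),
            Err(err) => eprintln!("error: {}", err),
        }
    }
}

On the command line, rg -uu --files is roughly the equivalent.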

phiresky commented 5 years ago

Looks like it's mostly solved by disabling ignores:

 Performance counter stats for 'rg -uu -j1 --files ./b' (100 runs):
       0,771240256 seconds time elapsed                                          ( +-  0,15% )

 Performance counter stats for 'rg -uu --files ./b' (100 runs):
       0,536708845 seconds time elapsed                                          ( +-  0,28% )

 Performance counter stats for '/tmp/walkdir/target/release/examples/walkdir ./b' (100 runs):
       0,466638350 seconds time elapsed                                          ( +-  0,15% )

Still slower than find, but not by that much. Not sure why the ignore handling reduces performance so much even when there is no ignore file and the directory is flat.

Also not sure why find takes more than double the time of a pure readdir() loop.

Anyways, this is probably not worth a separate issue for ripgrep.

BurntSushi commented 5 years ago

Just because there isn't an obvious ignore file doesn't mean one isn't being used. You need to confirm by looking at the output of --debug.

phiresky commented 5 years ago

Yep, never mind. There's an ignore file in my home directory.