use `os.scandir()` for improved performance

akaihola commented 6 years ago

The os.scandir() function introduced in Python 3.5 offers improved performance when the file type and attributes are needed as well.

We could keep using os.listdir on Python <3.5 and wrap file names in os.DirEntry-like objects which would support just the functionality we need in hardlinkpy.
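A minimal sketch of what such a wrapper could look like. `FallbackDirEntry` and `iter_entries` are hypothetical names that do not exist in hardlinkpy, and only a small subset of the `os.DirEntry` interface is mimicked:

```python
import os
import stat


class FallbackDirEntry(object):
    """Hypothetical stand-in for os.DirEntry, built from an os.listdir() name."""

    def __init__(self, dirpath, name):
        self.name = name
        self.path = os.path.join(dirpath, name)
        self._stat_cache = {}  # keyed by follow_symlinks, filled lazily

    def stat(self, follow_symlinks=True):
        # Cache results so repeated queries cost one system call each way.
        if follow_symlinks not in self._stat_cache:
            stat_func = os.stat if follow_symlinks else os.lstat
            self._stat_cache[follow_symlinks] = stat_func(self.path)
        return self._stat_cache[follow_symlinks]

    def is_symlink(self):
        return stat.S_ISLNK(self.stat(follow_symlinks=False).st_mode)

    def is_dir(self, follow_symlinks=True):
        try:
            st = self.stat(follow_symlinks=follow_symlinks)
        except OSError:
            return False  # vanished or unreadable entries are not directories
        return stat.S_ISDIR(st.st_mode)

    def is_file(self, follow_symlinks=True):
        try:
            st = self.stat(follow_symlinks=follow_symlinks)
        except OSError:
            return False
        return stat.S_ISREG(st.st_mode)


def iter_entries(dirpath):
    """Yield DirEntry-like objects, using os.scandir() when it is available."""
    if hasattr(os, "scandir"):
        for entry in os.scandir(dirpath):
            yield entry
    else:
        for name in os.listdir(dirpath):
            yield FallbackDirEntry(dirpath, name)
```

Callers would then only touch `name`, `path`, `stat()`, `is_dir()`, `is_file()` and `is_symlink()`, regardless of the Python version.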

Of course, we should first be compatible with Python 3... see #4.

chadnetzer commented 6 years ago

I have a branch that replaces the current directory walking with os.walk(), which uses os.scandir() on recent Python 3 (which can greatly improve performance of os.walk()). The benefit (for now) of moving to os.walk(), rather than directly to os.scandir() would be to maintain compatibility for both Python 2.7 and 3, while getting some of the performance benefits. Also, it structures the code better (imo) than the current walk, because it allows all the exclude and match logic to be in the same place (rather than spread over two functions like it is now).
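For illustration, a rough sketch of the `os.walk()` approach with the exclude logic in one place. This is not the code from the branch; `walk_files` and its `exclude` parameter (a list of regular expressions matched against names) are assumptions, and hardlinkpy's real option semantics may differ:

```python
import os
import re


def walk_files(top, exclude=()):
    """Yield file paths under top, skipping names that match any exclude regex."""
    patterns = [re.compile(p) for p in exclude]

    def excluded(name):
        return any(p.search(name) for p in patterns)

    for dirpath, dirnames, filenames in os.walk(top):
        # Assigning to dirnames[:] prunes in place, so os.walk() never
        # descends into excluded directories.
        dirnames[:] = [d for d in dirnames if not excluded(d)]
        for name in filenames:
            if not excluded(name):
                yield os.path.join(dirpath, name)
```

The same function runs on Python 2.7 and Python 3; only on Python 3.5+ does `os.walk()` use `os.scandir()` internally.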

The downside of os.walk() is that we still have to do the os.stat() call on each file, whereas os.scandir() would do that for us. So on Python 2 at least, the os.walk() method may be even slower than the current tree walk. I think, for the short term, it'd be worth maintaining both Python 2.7 and Python 3 compatibility (w/ the same code base), at least for one well tested release w/ the currently proposed new features and fixes. Then if/when the decision is made to drop Python 2 support, there is a version that Python 2 users can resort to in a pinch.

chadnetzer commented 6 years ago

Hmmm, with some refactoring of main() and hardlink_identical_files(), it appears possible to avoid using os.walk() while supporting its directory pre-culling semantics (with the exclude option). This should also allow directly using os.scandir() for Python 3 (and falling back to os.listdir() on Python 2). I'll put both approaches up for review in a day or two.
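A sketch of how a direct `os.scandir()` walk might keep the pre-culling behaviour. `iter_files` and the `is_excluded` callback are hypothetical names, not the refactored hardlinkpy code, and the `os.listdir()` fallback for Python 2 is left out to keep the example short:

```python
import os


def iter_files(top, is_excluded=lambda name: False):
    """Yield regular-file paths under top using os.scandir() (Python 3.5+)."""
    pending = [top]  # explicit stack instead of recursion
    while pending:
        dirpath = pending.pop()
        for entry in os.scandir(dirpath):
            if is_excluded(entry.name):
                # Culled before descending: an excluded directory is never
                # pushed, so nothing beneath it is scanned.
                continue
            if entry.is_dir(follow_symlinks=False):
                pending.append(entry.path)
            elif entry.is_file(follow_symlinks=False):
                yield entry.path
```

Symlinks are skipped here because both type checks pass `follow_symlinks=False`; whether that matches the desired behaviour is a separate decision.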

chadnetzer commented 6 years ago

I did some testing with a tree walking implementation that uses os.walk(), and one which uses os.scandir() (running on Python 3). The performance for both is about equal on Python 3, at least with a shallow and wide directory tree. One reason scandir() doesn't particularly benefit us is that we can't avoid doing an os.stat(). We need the st_size data for each file, as well as st_mtime sometimes, neither of which is supplied directly by os.scandir(). We also require st_dev, although that could at least be queried per directory instead of per file (with some additional code complexity).
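To make the point concrete, a small sketch (not the tested implementation) of scanning one directory while still needing a stat per file. `scan_with_sizes` is a hypothetical helper, and reusing the directory's `st_dev` for its files is an assumption that would not hold for an individually bind-mounted file:

```python
import os


def scan_with_sizes(dirpath):
    """Return (dir_dev, [(path, size, mtime), ...]) for regular files in dirpath."""
    # st_dev queried once per directory rather than once per file (assumption:
    # every regular file in the directory lives on the same device).
    dir_dev = os.stat(dirpath).st_dev
    files = []
    for entry in os.scandir(dirpath):
        if not entry.is_file(follow_symlinks=False):
            continue
        # The file type came from the directory scan, but size and mtime
        # still require a stat of the entry itself.
        st = entry.stat(follow_symlinks=False)
        files.append((entry.path, st.st_size, st.st_mtime))
    return dir_dev, files
```

On POSIX systems `entry.stat()` costs one system call per file (the result is cached on the entry afterwards), which is consistent with the two implementations timing about the same.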

akaihola / hardlinkpy

use `os.scandir()` for improved performance #15