dsoprea / PyInotify

An efficient and elegant inotify (Linux filesystem activity monitor) library for Python. Python 2 and 3 compatible.
GNU General Public License v2.0

[opt] use scandir to shorten initialization time of Inotify #45

Closed NoneGG closed 6 years ago

NoneGG commented 6 years ago

I want to use inotify to monitor video files on a CDN server, but initializing InotifyTrees takes too long when I run the demo script (the tree holds about 1,061,070 files). I noticed that os.listdir is used in the code. Could we use scandir instead to speed up initialization? (scandir is available as a package on PyPI and has been part of the standard library as os.scandir since Python 3.5.)

NoneGG commented 6 years ago

If you're open to it, I'd be glad to make a pull request~

xlotlu commented 6 years ago

@NoneGG I implemented the tree handling in #48 using os.walk() instead, which is 4 times faster on rotational media, and about twice as fast on an SSD. That is, if you are using python >= 3.5.

It would be interesting to see a comparison with raw os.scandir(), but I think you can't squeeze much more out of it. os.walk() only does a few extra operations that aren't of interest to InotifyTree, and none of them is I/O-bound. Unless you need support for python < 3.5, of course; then you could depend on scandir, and do the suggested:

try:
    from os import scandir
except ImportError:
    from scandir import scandir
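As a rough illustration of how that import could be used for tree handling, here is a minimal sketch that collects every directory under a root. The name iter_directories is hypothetical; this is not the code from #48 or from PyInotify itself, just one way to walk a tree with scandir without an extra stat() per entry on most platforms.

```python
import os

try:
    from os import scandir  # standard library on Python >= 3.5
except ImportError:
    from scandir import scandir  # backport package from PyPI


def iter_directories(root):
    """Yield root and every directory below it (hypothetical helper).

    Symlinked directories are not followed, so each real directory
    is visited once.
    """
    stack = [root]
    while stack:
        current = stack.pop()
        yield current
        try:
            entries = list(scandir(current))
        except OSError:
            # The directory vanished or is unreadable; skip its children.
            continue
        for entry in entries:
            # is_dir() usually reuses the type information returned by the
            # directory listing instead of issuing a separate stat() call.
            if entry.is_dir(follow_symlinks=False):
                stack.append(entry.path)
```

Each yielded path would then be a candidate for an inotify watch; how the watch is actually registered is left out here on purpose.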

If I may make a suggestion though, I think your approach is not the best for your situation, architecturally speaking. Given such a huge tree of files, it's preferable to hook into the code creating / modifying the video files, and call back some handler on the other side -- maybe some API exposed by the code interested in change events. If there are multiple parties interested in changes, then a message queue / fanout system would simplify things greatly.

NoneGG commented 6 years ago

@xlotlu Thank you for your response~ I read your commit and it seems nice.

As far as I know, os.walk in Python versions below 3.5 still calls stat on each entry in its implementation, which generates a lot of disk I/O. As I said before, scandir was merged into Python 3.5 and later, so the scandir package is a good fit for Python < 3.5.

Actually, the inotify-based monitor is meant to catch both human operating mistakes and code mistakes, and it is still in development. Your suggestion sounds reasonable, and we do have an API and a subscription mechanism. But to monitor human operations we need a hook at the filesystem level; that's why we chose inotify.

Could you tell me why a huge tree of files is not recommended? According to what I found, inotify improves on the file descriptor limit, unlike dnotify.

xlotlu commented 6 years ago

@NoneGG yes, on python < 3.5 it is just as slow as before. I made some benchmarks which you can find attached to the PR.

If you need < 3.5 support, then you need to depend on scandir and do the loop just like in the old code. If you create a pull request that does this I'll close mine.
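For anyone who wants to reproduce the comparison locally, here is a rough sketch that counts directories with a listdir-style loop and with a scandir loop, and times each. It is not the benchmark attached to the PR and not the library's actual loop; the function names are made up for illustration, it assumes Python >= 3.5, and it has no error handling for unreadable directories.

```python
import os
import time


def count_dirs_listdir(root):
    """Count directories using os.listdir plus a stat-based check per entry."""
    count = 0
    stack = [root]
    while stack:
        current = stack.pop()
        count += 1
        for name in os.listdir(current):
            path = os.path.join(current, name)
            # isdir() and islink() each hit the filesystem for every entry.
            if os.path.isdir(path) and not os.path.islink(path):
                stack.append(path)
    return count


def count_dirs_scandir(root):
    """Count directories using os.scandir, which exposes the entry type."""
    count = 0
    stack = [root]
    while stack:
        current = stack.pop()
        count += 1
        for entry in os.scandir(current):
            if entry.is_dir(follow_symlinks=False):
                stack.append(entry.path)
    return count


def time_call(func, root):
    """Return (result, elapsed_seconds) for func(root)."""
    start = time.perf_counter()
    result = func(root)
    return result, time.perf_counter() - start


# Example use (the path is a placeholder):
#   print(time_call(count_dirs_listdir, "/path/to/tree"))
#   print(time_call(count_dirs_scandir, "/path/to/tree"))
```

Note that the second run over the same tree benefits from the kernel's cache, so the order of the two calls affects the numbers unless caches are dropped in between.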

> But to monitor human operations we need a hook at the filesystem level; that's why we chose inotify.

I see. I didn't imagine you'd have arbitrary, human-driven modifications. If so you have no other option, short of making sure all those modifications go through a custom application.

> Could you tell me why a huge tree of files is not recommended? According to what I found, inotify improves on the file descriptor limit, unlike dnotify.

I didn't say that - a huge tree of files is probably the best way to handle your storage needs. It's inotify that I think is not the right tool for the job, because it's meant to monitor individual files / directories, while what you want is to monitor "everything". Because of its design you first have to visit every existing inode, which will take a lot of time, no matter what. Then you have to set up watchers that will consume memory. And then those watchers will consume CPU cycles at every event. It's true that it's efficient, but you're still dealing with a million-entry hash table.
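To get a feel for the scale before committing to this, one could estimate how many watches a tree would need and compare that with the per-user kernel limit. The sketch below assumes one watch per directory (which is how recursive inotify monitoring generally works) and reads the limit from /proc/sys/fs/inotify/max_user_watches on Linux; the helper names are hypothetical and it targets Python 3.

```python
import os

MAX_WATCHES_PATH = "/proc/sys/fs/inotify/max_user_watches"


def count_directories(root):
    """Count root plus all directories below it (symlinks not followed)."""
    # os.walk yields exactly one tuple per directory it visits.
    return sum(1 for _ in os.walk(root))


def read_max_user_watches(path=MAX_WATCHES_PATH):
    """Return the per-user watch limit, or None if it cannot be read."""
    try:
        with open(path) as handle:
            return int(handle.read().strip())
    except (OSError, ValueError):
        return None


def watches_fit(root):
    """Return (needed, limit, fits); fits is None when the limit is unknown."""
    needed = count_directories(root)
    limit = read_max_user_watches()
    fits = None if limit is None else needed <= limit
    return needed, limit, fits
```

The count itself requires the same full traversal described above, so this only tells you whether the watches would fit, not how to avoid the initial cost.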

Maybe you could approach this from the other direction: monitor for "everything", and filter out the events that you're interested in? The kernel's audit system comes to mind, and it can monitor specific paths. There's also fanotify, but I don't think it fits your requirements.

NoneGG commented 6 years ago

@xlotlu Thanks for your advice, I will take the audit system into consideration~

In our experiments on the CDN server, initializing the inotify tree does take a long time (that's why I opened this issue), but CPU and memory usage are indeed low. (I am not so sure, though; I only used the top and free commands to monitor these two metrics.)