con / fscacher

Caching results of operations on heavy file trees

Statement of the problem and review of possible existing solutions #1

Open yarikoptic opened 4 years ago

yarikoptic commented 4 years ago

High level overview

An increasing number of projects need to operate on large and growing collections of files. Having (hundreds of) thousands of files within a file tree is no longer atypical, e.g. in neuroimaging. Many supplementary tools/functions need to "walk" the file tree and return the result of some function. If nothing in the file tree changes between invocations (no file added/removed/changed), the output of the function should remain the same, so the result could be cached and reused.

The situation scales all the way down to individual files: a function could operate on a specific file path and return a result which should not change as long as the file does not change.
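
For concreteness, a minimal sketch (with made-up names, not an existing API) of what such per-file caching keyed on a stat fingerprint could look like -- the cached result is reused as long as the file's mtime, size and inode are unchanged:

import os

def fingerprint(path):
    # cheap proxy for "has the file changed": mtime, size, inode
    s = os.stat(path)
    return (s.st_mtime_ns, s.st_size, s.st_ino)

def cached_on_file(func):
    # re-run func(path) only when the file's fingerprint changes
    cache = {}  # path -> (fingerprint, result)

    def wrapper(path):
        fp = fingerprint(path)
        hit = cache.get(path)
        if hit is not None and hit[0] == fp:
            return hit[1]
        result = func(path)
        cache[path] = (fp, result)
        return result

    return wrapper

@cached_on_file
def count_lines(path):
    with open(path, "rb") as f:
        return sum(1 for _ in f)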

"Fulfilled" use-case

Implementation

Target use-cases

If anyone (attn @kyleam @mih @vsoch @con @tyarkoni @effigies @satra) knows an already existing solution -- would be great!

vsoch commented 4 years ago

Have you looked at robinhood? We use it for one of our storage offerings. https://github.com/cea-hpc/robinhood/wiki.

satra commented 4 years ago

the ones i know are proprietary, but more importantly almost all of them rely on kernel-level and daemon processes. you can do a few things locally as a user, but they are relatively slow. it also depends on what the exact use cases are. nina's husband don is the cto of https://starfishstorage.com/ - one of those proprietary offerings.

a lot of this will depend on scale, filesystem, etc.

so i think it would be good to describe what your immediate scope is (number of inodes, userspace constraints, etc.).

yarikoptic commented 4 years ago

Thank you @vsoch and @satra for the feedback. I would say that each case would be up to some hundred thousand files, spread across roughly sqrt() of that many directories -- whatever git can manage, more or less ;)
But even in the case of up to a thousand files such a solution could provide benefits, if only by providing a convenient API for "file system time stamp based caching". I do not think data size should matter per se at all. In many scenarios the actual operation to be performed on a file could take longer than stat'ing it, so the convenience of per-file caching like we have in dandi could already help; I just want to generalize it a bit. What I am aiming at first is a high-level convenience API, which would allow the underlying implementation to be quite dumb and not as efficient as it could be, perhaps only taking advantage of async operations etc. while traversing the tree. If things could later be improved with some heavier tricks (filesystem-dependent monitors etc.) in some cases -- great, but if not, that is ok as well ;-)
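
To illustrate the "dumb but convenient" direction -- a sketch with made-up names, not dandi's actual helper: the per-file idea generalizes to a tree by aggregating stat fingerprints of everything under it, and re-running the cached function only when that aggregate changes:

import os

def tree_fingerprint(top):
    # aggregate stat fingerprint of a whole tree: changes if any file
    # under `top` is added, removed, or modified
    entries = []
    for dirpath, _dirnames, filenames in os.walk(top):
        for name in filenames:
            full = os.path.join(dirpath, name)
            try:
                s = os.stat(full)
            except FileNotFoundError:
                continue  # file vanished between listing and stat
            entries.append((full, s.st_mtime_ns, s.st_size))
    return tuple(sorted(entries))

A decorator analogous to cached_on_file above could then just compare tree_fingerprint(top) between invocations, however dumb the traversal itself is.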

vsoch commented 4 years ago

If you need a chonker repository to test on, I can offer https://github.com/vsoch/dockerfiles - it doesn't have large files, but it has ~100K files (version 1 had ~130K) and they are Dockerfiles, so it's a consistent thing to search (or generally run functions) over. I think it would be worth poking robinhood to ask for a pointer to a sqlite implementation - that seems to be a "scaled production" solution that might be modified to work with a little, local filesystem. But the tradeoff for such a solution is that it requires adding more layers of dependencies - using such a helper would probably require more than a pip install datalad.

satra commented 4 years ago

py3's os.walk uses os.scandir underneath, which significantly speeds up traversal.

that's not too bad for 143K folders containing files

In [27]: len(dfile)                                                                                        
Out[27]: 143486

In [28]: %time dfile = [files for _, _, files in os.walk('dockerfiles')]                                   
CPU times: user 2.1 s, sys: 4.94 s, total: 7.04 s
Wall time: 7.95 s

yes, one would need to build up a few more things, but it seems even just a stat is relatively quick:

In [36]: %time dfiles = [[os.stat(os.path.join(dp, val)) for val in files] for dp, _, files in os.walk('dockerfiles')]
CPU times: user 2.75 s, sys: 6.87 s, total: 9.62 s
Wall time: 10.3 s
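
for comparison (a sketch only, not timed here): iterating with os.scandir directly instead of os.walk + os.stat avoids re-joining paths, entry.is_dir() can typically use the d_type already returned by the directory listing, and entry.stat() caches its result on the entry (and on windows often needs no extra system call at all):

import os

def scan_stats(top):
    # recursive scandir traversal collecting (path, stat_result) for files
    stack = [top]
    out = []
    while stack:
        current = stack.pop()
        with os.scandir(current) as it:
            for entry in it:
                if entry.is_dir(follow_symlinks=False):
                    stack.append(entry.path)
                else:
                    out.append((entry.path, entry.stat(follow_symlinks=False)))
    return out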

yarikoptic commented 4 years ago

Thank you @satra ! I have tried it out on a number of cases of interest and it all looks super promising and possibly sufficient for what I wanted! FWIW, your nice one-liner is now a very ugly script which I composed to quickly compare different possible tune-ups (it was late and I thought it would be just a quick hack, so I did it in plain vim ;)): https://gist.github.com/yarikoptic/7284b634d8ab12277a3d316a117cc0db . Here is a sample output from a tricky case which, depending on the use case, could be made notably faster (by ignoring all .git/annex/objects and .git/objects), which I have not tried yet:

(git)smaug:/mnt/btrfs/datasets/datalad/crawl/openneuro[master]git
$> (set -x; python3.7 ~datalad/trash/walk.py statdir; time git status; time git status; time datalad status; )
+/bin/zsh:1808> python3.7 /home/yoh/proj/datalad/trash/walk.py statdir
Using statdir
Total 106247:212494 took 43.66118 cold 5.34998 warm
  8.71s user 11.54s system 40% cpu 49.553 total

+/bin/zsh:1808> git status
On branch master
nothing to commit, working tree clean
git status  1.99s user 3.99s system 50% cpu 11.839 total

+/bin/zsh:1808> git status
On branch master
nothing to commit, working tree clean
git status  1.75s user 3.25s system 133% cpu 3.745 total

+/bin/zsh:1808> datalad status
nothing to save, working tree clean
datalad status  43.56s user 48.72s system 101% cpu 1:31.11 total

The cold run was quite long, but a) it is 106247 directories with 212494 files and b) the warm traversal was faster than the first warmish git status! (and git status doesn't care about all those objects, while walk.py did)

Because os.walk is just a pure Python wrapper around os.scandir, it might allow for an even slightly faster targeted implementation, e.g. by reusing the same dir fd for os.stat, allowing for "hierarchical" handling (define boundaries in the hierarchy and act on each separately), possibly faster ignoring of paths, and possibly more efficient early termination (if/when paired with actual fingerprinting/comparison to detect a change; that probably would make most sense paired with hierarchical handling, to avoid too frequent comparisons and having to carry prior fingerprints for the full hierarchy).
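
To sketch two of those points (the helper names are made up here, not the eventual implementation): pruning dirnames in place so that os.walk never descends into .git/objects or .git/annex/objects, and comparing against a previously recorded {path: (mtime, size)} mapping with termination at the first difference:

import os

def _pruned(dirpath, dirnames):
    # drop .git/objects and .git/annex/objects wherever they appear
    # (e.g. inside nested subdatasets)
    parent = os.path.basename(dirpath)
    grandparent = os.path.basename(os.path.dirname(dirpath))
    if parent == ".git" or (parent == "annex" and grandparent == ".git"):
        return [d for d in dirnames if d != "objects"]
    return dirnames

def walk_pruned(top):
    for dirpath, dirnames, filenames in os.walk(top):
        # assigning to dirnames[:] tells os.walk not to descend into dropped dirs
        dirnames[:] = _pruned(dirpath, dirnames)
        yield dirpath, dirnames, filenames

def tree_changed(top, prior):
    # prior: {path: (st_mtime_ns, st_size)} recorded by an earlier traversal
    seen = 0
    for dirpath, _dirnames, filenames in walk_pruned(top):
        for name in filenames:
            full = os.path.join(dirpath, name)
            s = os.stat(full)
            if prior.get(full) != (s.st_mtime_ns, s.st_size):
                return True  # early termination: something is new or modified
            seen += 1
    return seen != len(prior)  # catches deletions as well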

I hope to get to it some time soon by RFing dandi-cli's helper to be able to handle directory paths and/or explicit lists of paths; then I could try it on PyBIDS and/or DataLad for .config reload, which currently has its own dedicated implementation which imho could be replaced with this more "modular" one.

edit 1: this exercise made me into a believer that not all Python code is damn slow ;)