con / fscacher

Caching results of operations on heavy file trees
MIT License

memoized_path_copy helper to complement @memoize_path #57

Open yarikoptic opened 2 years ago

yarikoptic commented 2 years ago

In light of the discussion in https://github.com/dandi/dandi-cli/issues/848 about more efficient caching of digests, I wondered whether it would be feasible to provide something like memoized_path_copy, which would copy all (?) memoized invocations of a specific decorated function for one path as if they had been invoked for another, "new" path.

ATM, looking at the code, I really do not see how we could even do that, since we rely on joblib memoization and otherwise do not track which specific parametrizations of the function were used. But maybe you, @jwodder, see some way to provide such functionality?

jwodder commented 2 years ago

@yarikoptic I'm not entirely clear on the behavior you're describing. Do you mean that a thus-decorated function should detect copies (How?) and memoize them as though they were the original path?

yarikoptic commented 2 years ago

I was thinking about something like if we have

@cache.memoize_path
def decorated_func(path, ...):
   .... whatever ...

and decide to copy a file from src_path to dest_path (well, it could also be a "move" instead of a "copy"), we could do

copy(src_path, dest_path)
cache.memoized_path_copy(decorated_func, src_path, dest_path)

which would then copy all memoized/cached invocations of decorated_func for src_path, so that they would also be known for dest_path.
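To make the intended semantics concrete, here is a minimal sketch. It assumes a toy in-memory cache that, unlike the joblib-backed one, records which parametrizations were used for each path. ToyPathCache, its fingerprint (size plus mtime), and the return value of memoized_path_copy are all hypothetical and only illustrate the idea; this is not fscacher's implementation.

```python
import os
import shutil
import tempfile
from functools import wraps


class ToyPathCache:
    """Hypothetical in-memory stand-in for a path-aware cache (not fscacher's API)."""

    def __init__(self):
        # {(function name, absolute path): {(fingerprint, args, kwargs): result}}
        self._entries = {}

    @staticmethod
    def _fingerprint(path):
        # Assumption: size and mtime are enough to detect a changed file.
        st = os.stat(path)
        return (st.st_size, st.st_mtime_ns)

    def memoize_path(self, func):
        @wraps(func)
        def wrapper(path, *args, **kwargs):
            path = os.path.abspath(path)
            key = (self._fingerprint(path), args, tuple(sorted(kwargs.items())))
            bucket = self._entries.setdefault((func.__name__, path), {})
            if key not in bucket:
                bucket[key] = func(path, *args, **kwargs)
            return bucket[key]

        return wrapper

    def memoized_path_copy(self, func, src_path, dest_path):
        """Re-register still-valid entries of src_path under dest_path.

        Returns the number of entries copied. Both files must exist.
        """
        src = os.path.abspath(src_path)
        dest = os.path.abspath(dest_path)
        src_fp = self._fingerprint(src)
        dest_fp = self._fingerprint(dest)
        src_bucket = self._entries.get((func.__name__, src), {})
        dest_bucket = self._entries.setdefault((func.__name__, dest), {})
        copied = 0
        for (fp, args, kwargs), result in src_bucket.items():
            if fp == src_fp:  # skip entries that are stale for the source
                dest_bucket[(dest_fp, args, kwargs)] = result
                copied += 1
        return copied


if __name__ == "__main__":
    cache = ToyPathCache()
    calls = []

    @cache.memoize_path
    def file_size(path):
        calls.append(path)  # record real (non-cached) invocations
        return os.path.getsize(path)

    with tempfile.TemporaryDirectory() as tmp:
        src = os.path.join(tmp, "src.bin")
        dest = os.path.join(tmp, "dest.bin")
        with open(src, "wb") as fh:
            fh.write(b"12345")
        file_size(src)
        shutil.copy2(src, dest)
        cache.memoized_path_copy(file_size, src, dest)
        file_size(dest)  # served from the copied entry
        print("real invocations:", len(calls))
```

The key point is the per-path bookkeeping in `_entries`: without knowing which argument combinations were used for src_path, there is nothing to copy, which is exactly the information the joblib-based cache does not track.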

jwodder commented 2 years ago

@yarikoptic This might be possible depending on the underlying functionalities of joblib; I've brought this possibility up in a related issue there.

yarikoptic commented 2 years ago

Not sure we would see the desired development done/accepted in joblib in the near future... maybe only if we send a PR for an alternative backend (probably based on FileSystemStoreBackend) that provides the desired interfaces/functionality. Meanwhile, I tried the already existing interface for getting information about all entries in the cache:

(dandisets) dandi@drogon:/mnt/backup/dandi/dandisets$ time python3 -c 'from dandi.support.digests import checksums; c = checksums._memory.store_backend.get_items(); print(len(c)); print(c[0]);'
55341
CacheItemInfo(path='/home/dandi/.cache/fscacher/dandi-checksums/joblib/dandi/support/digests/get_dandietag/75ce6b526d6e61faac02b4164ac645c5', size=641, last_access=datetime.datetime(2021, 6, 29, 21, 55, 31, 925287))

real    0m3.325s
user    0m2.399s
sys     0m1.141s

And that was a "warm" run; the original one probably took twice as long. But this is on drogon, which "saw too much" (over 50k entries); for a typical user, and with mv probably not being that common, this should be OK. So we can easily identify cache entries associated with a path through an existing interface. The question is whether it would be possible to copy them into a new entry (with adjusted path and last_access).
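For the "identify entries associated with a path" step, a rough sketch follows. It rests on an assumption about joblib's on-disk layout, which is an implementation detail and may differ between versions: that each entry directory holds a metadata.json whose input_args mapping contains the repr of the call arguments. The helper name find_entries_for_path is hypothetical.

```python
import json
import os


def find_entries_for_path(cache_dir, target_path):
    """Return cache entry directories whose recorded input args mention target_path.

    Assumes (not guaranteed by joblib) a metadata.json with an "input_args"
    mapping in every entry directory. Matching is a plain substring test, so
    "/data/a" would also match "/data/a.bak".
    """
    matches = []
    for dirpath, _dirnames, filenames in os.walk(cache_dir):
        if "metadata.json" not in filenames:
            continue
        try:
            with open(os.path.join(dirpath, "metadata.json")) as fh:
                metadata = json.load(fh)
        except (OSError, ValueError):
            continue  # unreadable or corrupt metadata: skip the entry
        if not isinstance(metadata, dict):
            continue
        input_args = metadata.get("input_args")
        if not isinstance(input_args, dict):
            continue
        if any(target_path in str(value) for value in input_args.values()):
            matches.append(dirpath)
    return sorted(matches)
```

This only covers finding the entries; writing adjusted copies of them would still require touching the stored results and metadata directly.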

jwodder commented 2 years ago

@yarikoptic Copying modified entries depends on too many implementation details of joblib.Memory which, at best, are managed via functions with no public documentation whose names start with underscores. If we want to be able to do this reliably, we need cooperation with joblib; see https://github.com/joblib/joblib/issues/1237 or start a new issue.