dask / cachey

Caching based on computation time and storage space
BSD 3-Clause "New" or "Revised" License

Persistent cache with Zarr #7

Open alimanfoo opened 7 years ago

alimanfoo commented 7 years ago

This is somewhat related to both #5 and #3 but slightly different. Basically I would like a persistent cache for use when working in a Jupyter notebook. The motivation is very similar to https://github.com/rossant/ipycache, i.e., if I restart a notebook I don't want to have to repeat any computations that previously finished. However, I would like to use Zarr rather than pickle to store cached results, because compression will save disk space. Also, I would like to use a memoize function decorator rather than a cell magic, i.e., something more like the cachey memoize decorator and the joblib Memory.cache decorator.

No problem if this is beyond scope for cachey, but I thought I'd mention it in case there were any synergies with other requirements. On the technical side there are two main points to consider: one is how to generate a key from function arguments that is stable across Python sessions (i.e., doesn't rely on Python's built-in hash function); the second is how to integrate with Zarr (or similar) for storage.

mrocklin commented 7 years ago

Zict may also be useful here

mrocklin commented 7 years ago

It seems like the outputs of a Jupyter notebook would be more general than what Zarr typically consumes. Is this correct?

alimanfoo commented 7 years ago

Yes, Zarr currently could only cache arrays, although you could hack around it to also store scalar values.

FWIW I just hacked something up for my own use, code here: https://gist.github.com/alimanfoo/ed724d207d859ac507696c3f6735fdf9

mrocklin commented 7 years ago

Any thoughts on http://zict.readthedocs.io/en/latest/

alimanfoo commented 7 years ago

The composability of zict is very nice. But for my use case I mostly want to cache results which are numpy arrays, and so I wondered about using Zarr because serialization/deserialization of arrays is very efficient. Also, zict doesn't provide a memoize decorator, although it could be adapted to do so. Maybe we're talking at cross purposes though; I'm not sure where you were thinking zict would fit in.

mrocklin commented 7 years ago

Generally I think the way to go about this is to let cachey consume a MutableMapping rather than construct its own dict. Then all of the choices of compression/serialization etc. fall on the user. Zict composes nicely here because it could be used to construct MutableMappings, even from zarr functions.

One could do this now.

from cachey import Cache
c = Cache(...)
c.data = my_mutable_mapping
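
For illustration, a hedged sketch of how zict could compose such a MutableMapping (assuming pickle for serialization; note that zict.File requires string keys, so cachey's keys would need to be stringified or hashed first):

import pickle
import zict
from cachey import Cache

# zict.File stores bytes values as files under a directory;
# zict.Func applies dump/load functions around an inner mapping.
raw = zict.File('/tmp/cachey-data')                 # str -> bytes on disk
data = zict.Func(pickle.dumps, pickle.loads, raw)   # objects <-> bytes

c = Cache(1e9)   # assuming available bytes as the first argument
c.data = data    # keys must be strings for zict.File
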
alimanfoo commented 7 years ago

OK, that makes sense.

The other issue is about making persistent keys, i.e., keys that are stable across different Python sessions. In particular, for the memoize decorator, this means making a stable hash of positional and keyword arguments. Would you see this as in scope for cachey?

In joblib.Memory the general approach is to pickle arguments and then use hashlib (i.e., cryptographic hashing), although there's lots of extra logic for handling arrays, sets, and normalizing positional and keyword args, which I think is a bit over-complicated. In the zarr_cache.py gist I just linearized positional and keyword arguments, then pickled them and passed the result through md5, but it's designed so that other methods of producing keys can be plugged in.
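
A minimal sketch of that pickle-then-md5 approach (the gist's actual logic differs, and pickle output is not guaranteed byte-stable for every type):

import hashlib
import pickle

def stable_key(args, kwargs):
    # Linearize positional and keyword arguments deterministically,
    # pickle them, and hash the bytes, so the key is stable across
    # Python sessions (unlike the process-salted built-in hash()).
    payload = pickle.dumps((args, sorted(kwargs.items())))
    return hashlib.md5(payload).hexdigest()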

mrocklin commented 7 years ago

Is it possible to keep key generation within the choice of MutableMapping? At the moment cachey consumes python objects for keys (args, kwargs). Presumably it asks the MutableMapping for the value at a key like this and the MutableMapping transforms those python objects into the sort of key that it needs in order to access its internal data structure. If its internal data structure is just a dict then this is trivial. If it is an on-disk file then it has to be more complex as you suggest. At the end of the day though, I think that cachey might be able to say "I don't care. It's up to the MutableMapping how it wants to handle this."

alimanfoo commented 7 years ago

Yes I guess much of this could be pushed behind the MutableMapping interface (it hides a multitude of sins :-). The only thing then to standardize on the cachey side is how to form a key from args and kwargs. Did you mean to just use a tuple of (args, kwargs)?

mrocklin commented 7 years ago

Yeah, it looks like currently the memoize decorator takes in a key function and that key function takes in args and kwargs. Here is the default:

def memo_key(args, kwargs):
    # Use the arguments directly as a hashable key when possible.
    result = (args, frozenset(kwargs.items()))
    try:
        hash(result)
    except TypeError:
        # Fall back to object identities, which are not stable
        # across Python sessions.
        result = tuple(map(id, args)), str(kwargs)
    return result

This is not exactly the best possible answer. Using dask.base.tokenize here might be a good idea:

In [1]: from dask.base import tokenize

In [2]: tokenize([1, 2, 3], {1: 2, 3: {4}})
Out[2]: 'a46c1482d53199982e0211802af8dee6'

In [3]: tokenize([1, 2, 3], {1: 2, 3: {4}})
Out[3]: 'a46c1482d53199982e0211802af8dee6'
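
A hedged sketch of plugging tokenize in as the key function (assuming memoize accepts a key callable as described above; slow_fn is a hypothetical example):

from cachey import Cache
from dask.base import tokenize

def tokenize_key(args, kwargs):
    # Deterministic across sessions, unlike the id()-based fallback.
    return tokenize(*args, **kwargs)

c = Cache(1e9)

def slow_fn(x):
    ...

slow_cached = c.memoize(slow_fn, key=tokenize_key)
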
alimanfoo commented 7 years ago

The tokenize function from dask looks like a nice solution to the key generation problem.

One question: how does a token get generated for an arbitrary Python object? I couldn't make sense of this:

@normalize_token.register(object)
def normalize_object(o):
    if callable(o):
        return normalize_function(o)
    else:
        return uuid.uuid4().hex

mrocklin commented 7 years ago

We serialize functions with cloudpickle and then hash them down. For objects that we don't know how to hash, we generate a random string. In the context of caching this doesn't make sense; we would just want to not cache such objects.
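
For context, that uuid4 fallback can be avoided per type by registering a normalizer with dask's dispatch mechanism (Interval is a hypothetical example class):

from dask.base import normalize_token, tokenize

class Interval:
    def __init__(self, lo, hi):
        self.lo, self.hi = lo, hi

@normalize_token.register(Interval)
def normalize_interval(obj):
    # Return deterministic ingredients so tokenize() no longer
    # falls back to a random uuid4 string for this type.
    return 'Interval', obj.lo, obj.hi

assert tokenize(Interval(0, 1)) == tokenize(Interval(0, 1))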

alimanfoo commented 7 years ago

Would it not be an option to fall back to pickling objects then hashing the pickle string?

mrocklin commented 7 years ago

That would certainly be an option. We probably won't do it in Dask for speed and robustness reasons. We're pretty comfortable with false negatives.

alimanfoo commented 7 years ago

OK, so a possible architecture...

Cachey provides a Cache class.

You can set any MutableMapping as the value of the data property on an instance of Cache.

A Cache instance provides a memoize decorator, which uses (args, kwargs) as the key when setting and getting items from the cache. It's up to the data mapping how to transform (args, kwargs) into a key used internally.

To get caching which works across Python sessions, implement a zict-style composable MutableMapping class (call it Hash for now) which turns the (args, kwargs) key into some hashed key and then passes get/set operations through to an inner MutableMapping which handles storage.

To get persistent storage of cached values, implement a storage MutableMapping class (call it Store for now) which uses whatever strategy to save values to disk. This could use Zarr to save only array-like values and barf on any non-array-like values, or it could use Zarr to save array-like values and some other strategy for other types of values. The point is that details of persistence are hidden behind this layer.

So then you could do something like:

cache = cachey.Cache()
cache.data = Hash(Store('/path/to/cache'))

@cache.memoize
def myfunc(...):
   ...

The Hash and Store mappings could be combined into a single class, but then you would lose flexibility to combine different hashing and storage implementations.
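
A minimal sketch of what these two layers might look like (assuming pickle+md5 for key hashing and a Zarr group for storage; handling of non-array values is omitted):

import hashlib
import pickle
from collections.abc import MutableMapping

import zarr

class Hash(MutableMapping):
    """Hash (args, kwargs) keys into stable strings, then delegate."""

    def __init__(self, inner):
        self.inner = inner

    def _h(self, key):
        return hashlib.md5(pickle.dumps(key)).hexdigest()

    def __getitem__(self, key):
        return self.inner[self._h(key)]

    def __setitem__(self, key, value):
        self.inner[self._h(key)] = value

    def __delitem__(self, key):
        del self.inner[self._h(key)]

    def __iter__(self):
        return iter(self.inner)

    def __len__(self):
        return len(self.inner)

class Store(MutableMapping):
    """Persist array-like values in a Zarr group on disk."""

    def __init__(self, path):
        self.group = zarr.open_group(path, mode='a')

    def __getitem__(self, key):
        return self.group[key][...]   # read back as a numpy array

    def __setitem__(self, key, value):
        self.group.array(key, value, overwrite=True)

    def __delitem__(self, key):
        del self.group[key]

    def __iter__(self):
        return iter(self.group.keys())

    def __len__(self):
        return len(self.group)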

To generalize, there is a three-layer Cache/Hash/Store architecture. The top layer Cache class could be cachey.Cache or zict.LRU or something simple that caches everything. I'm assuming the top layer Cache class also provides the memoize decorator; maybe that could be decoupled, but it needs to live somewhere.

Is this sounding like something worth pursuing? In Cachey? In Zict?

What I need for my own use case is a simple, flexible way to memoize a function that efficiently stores array-like and scalar return values and works across Python sessions. The additional logic of choosing when to evict cached values based on computational cost is actually a bit unnecessary when storing cached values on disk, although maybe not if the amount of stored data is large.

mrocklin commented 7 years ago

In general I like deferring choices to mutable mappings (as you well know). One question I have about the above procedure is whether .data can just be a plain dict. In particular, we probably cannot send just (args, kwargs), but some transformation of (args, kwargs) that is definitely hashable.

Perhaps we just punt on any non-hashable input? This would remove our ability to cache mutable data structures (like lists or numpy arrays) by default.

So, I'm +1 on deferring custom decisions to other user defined objects, but -1 on requiring users to construct MutableMappings in the common case.

alimanfoo commented 7 years ago

Yes, definitely in the common case the user should not have to construct any mutable mappings. This should be easy to achieve, as you suggest, via some minimal transformation on (args, kwargs) in the memoize decorator.

nbren12 commented 7 years ago

I just wanted to comment that this would be a very useful feature for me. Currently, I automate my computational workflows using Snakemake and make, which then call Python scripts that use dask. The advantage of this is that I can "checkpoint" the analysis at certain locations in case the workflow fails, or because the output of these intermediate steps is interesting. The disadvantage is that I am basically reimplementing the dask graph in my Makefiles, which is very tedious since I have to write file IO boilerplate for every step.

Solving this issue could make dask a suitable replacement for tools like Snakemake, nextflow, make, etc.

mrocklin commented 7 years ago

Noah, have you taken a look at the caching that Dask already provides? http://dask.pydata.org/en/latest/caching.html
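
For reference, the pattern from that page looks roughly like this (in-memory only):

from dask.cache import Cache

# Opportunistic caching: dask keeps intermediate results that are
# expensive to compute and cheap to store, up to the given byte limit.
cache = Cache(2e9)   # use up to ~2 GB of RAM
cache.register()     # apply globally to all schedulers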

nbren12 commented 7 years ago

I have glanced at that page, but it seemed to me that it covers in-memory caching only.

I basically want to be able to manually specify the nodes of a dask graph which should be saved to disk. Ideally, my desired syntax would look something like this:

A = ...
B = really_expensive_or_interesting_computation(A).cached_to_disk("B")
C = moderately_expensive(B) 
C.to_hdf5("final_output.h5")

Can the link you sent me be used for that?

mrocklin commented 7 years ago

Perhaps with an on-disk MutableMapping like shelve or chest. Regardless, I don't think this is related to Zarr, so we should probably move the discussion to a different issue. I recommend starting an issue in dask/dask, perhaps with a simple example of what you would like to achieve.
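
A hypothetical sketch of the shelve suggestion (note that a Shelf requires string keys, so a key-hashing layer like the Hash wrapper sketched above would still be needed):

import shelve
from cachey import Cache

c = Cache(1e9)
# A Shelf is a MutableMapping persisted to disk across sessions.
c.data = shelve.open('/tmp/dask-node-cache')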

nbren12 commented 7 years ago

Okay, I might start another issue, but the memoization syntax that @alimanfoo proposes above would also probably handle my use-case pretty nicely.

nbren12 commented 7 years ago

Yah, I would be more than happy with

B = cache.memoize(really_expensive_or_interesting_computation)(A)

majidaldo commented 3 years ago

I'm a fan of joblib.Memory. You can implement storage backends with it. But I'd like something 'smart' that automatically chooses which computations to persist.

There's graphchain to look at, but I'm not sure what it adds over dask's caching except the ability to persist. I also don't know how current it is.
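
For reference, the joblib pattern mentioned above looks like this (expensive is a hypothetical example):

from joblib import Memory

# Results are persisted under the given directory and reused
# across Python sessions.
memory = Memory('/tmp/joblib-cache', verbose=0)

@memory.cache
def expensive(x):
    ...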