EntilZha / PyFunctional

Python library for creating data pipelines with chain functional programming
http://pyfunctional.pedro.ai
MIT License
2.4k stars · 132 forks

Cache to file and auto-load from cache #169

Open · hXtreme opened this issue 2 years ago

hXtreme commented 2 years ago

It would be awesome if there was an easy way to cache results to a file and if the cache was found then load from the cache without recomputing.

Basically, to support the following use case/API:

# main.py

def expensive_function(x):
    """An expensive-to-compute function."""
    print(x, end=" ")
    return x**2

s = seq([x for x in range(4)], cache_dir="/path/to/cache/dir")

expensive_result = s.map(expensive_function)\
    .with_caching(name="expensive_result")

expensive_result.for_each(lambda x: print(f"\n", x, end=""))

First run:

$ python main.py
0 1 2 3
0
1
4
9

Second run (the first run should have created the cache file, which will now be used instead of recomputing the sequence expensive_result):

$ python main.py

0
1
4
9

Please let me know if you have any questions or would like some clarifications.


Loving the project so far, thanks for your effort!

EntilZha commented 2 years ago

Hi @hXtreme, thanks for the great idea! I'm generally open to PR contributions for good ideas, just don't have much time to implement new things myself.

On your idea, I have a few questions about how this might work. At a high level, the challenge is to (efficiently) detect when a result is cached versus not. You could do this, I think, either at the level of a pipeline stage (e.g., map(expensive_function)) or at the level of each individual element (so basically ergonomics on top of memoization). The challenge in the first case is knowing when the entire input has already been cached, without scanning it again. In the second case, does this amount to some ergonomics on top of memoization that is serialized to disk? The second one could very well be nice to have, but it would be great to get a bit more clarification on how the decision to compute versus use the cache would be made.
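For the second option, a minimal sketch of per-element memoization serialized to disk might look like this. `disk_memoize` and the cache path are hypothetical names for illustration only, not part of PyFunctional's API, and a real version would also need to key on the function, not just the element:

```python
import hashlib
import pickle
from pathlib import Path


def disk_memoize(func, cache_dir="~/.cache/pyfunctional-memo"):
    """Memoize func per element, persisting each result to disk."""
    cache_path = Path(cache_dir).expanduser()
    cache_path.mkdir(parents=True, exist_ok=True)

    def wrapper(x):
        # Key each element by a hash of its pickled value.
        key = hashlib.sha256(pickle.dumps(x)).hexdigest()
        entry = cache_path / key
        if entry.exists():
            # Cache hit: skip recomputation entirely.
            return pickle.loads(entry.read_bytes())
        result = func(x)
        entry.write_bytes(pickle.dumps(result))
        return result

    return wrapper
```

It would plug into a pipeline as `seq(range(4)).map(disk_memoize(expensive_function))`; on a second run, only elements missing from the cache directory would be recomputed.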

hXtreme commented 2 years ago

This is roughly how I'd imagine it working:

# Untested Code

from pathlib import Path
from typing import TypeVar

import dill
import functional as funct
from functional.pipeline import Sequence

T = TypeVar("T")

def with_caching(seq: Sequence[T], file: str, path: str = "~/.cache") -> Sequence[T]:
    cache_dir = Path(path).expanduser()
    cache_dir.mkdir(parents=True, exist_ok=True)
    cache_file = cache_dir / file
    if cache_file.exists():
        # Cache hit: load the previously computed results.
        with cache_file.open("rb") as f:
            cache = dill.load(f)
        return funct.seq(cache)
    else:
        # Cache miss: materialize the sequence and persist it.
        results = seq.to_list()
        with cache_file.open("wb") as f:
            dill.dump(results, f)
        return funct.seq(results)

This is written as an external function because I don't know enough about PyFunctional's internals to modify the library directly.

It is used as follows:

expensive_result = with_caching(s.map(expensive_function), file="expensive_result", path="./experiment-cache")

Hope this helps.

bmsan commented 2 years ago

I also like the idea, but I think there are some things to take into account: a cached result is only valid if neither the input nor the function has changed since it was written. So the hashing should take into account the input that is processed at the caching point, but also the code of expensive_function itself.
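As a concrete sketch of such a key (`stage_cache_key` is a hypothetical name, not an existing API), one could hash the function together with its input. Note that hashing bytecode plus literal constants is a simplification: it misses changes in closed-over variables or in helper functions the stage calls.

```python
import hashlib


def stage_cache_key(func, input_data):
    """Hypothetical cache key: changes whenever the function or input changes."""
    h = hashlib.sha256()
    h.update(func.__code__.co_code)                   # function body (bytecode)
    h.update(repr(func.__code__.co_consts).encode())  # literal constants
    h.update(repr(input_data).encode())               # input at this stage
    return h.hexdigest()
```

Editing the function body or changing the input would then invalidate the cache entry, while an identical rerun would hit it.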

I think joblib does exactly what you are after, taking all of the above into account: https://joblib.readthedocs.io/en/latest/memory.html

>>> cachedir = 'your_cache_location_directory'
>>> from joblib import Memory
>>> memory = Memory(cachedir, verbose=0)

>>> @memory.cache
... def f(x):
...     print('Running f(%s)' % x)
...     return x

stale[bot] commented 2 years ago

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.