Open hXtreme opened 2 years ago
Hi @hXtreme, thanks for the great idea! I'm generally open to PR contributions for good ideas, just don't have much time to implement new things myself.
On your idea, I have a few questions about how this might work. At a high level, the challenge is to (efficiently) detect when a result is cached versus not cached. You could do this, I think, either at the level of a pipeline stage (e.g., map(expensive_function))
or for each individual element (so basically ergonomics on top of memoization). The challenge in the first case is knowing when the entire input has already been cached, without scanning it again. In the second case, does this amount to some ergonomics on top of memoization that is serialized to disk? The second one could very well be nice to have, but it would be great to get a bit more clarification on how the decision to compute or use the cache would be made.
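For the second (per-element) option, one possible shape is a decorator that memoizes each element's result to its own file on disk. This is a minimal, hypothetical sketch, not a proposed PyFunctional API; the name `disk_memoize` and the cache layout are illustrative, and it assumes inputs and results are pickle-able:

```python
import hashlib
import pickle
from pathlib import Path

def disk_memoize(func, cache_dir="./memo-cache"):
    """Cache each element's result on disk, keyed by a hash of the pickled input."""
    cache = Path(cache_dir)
    cache.mkdir(parents=True, exist_ok=True)

    def wrapper(x):
        # The compute-or-use-cache decision is made per element:
        # if a file for this input's hash exists, load it; otherwise compute and store.
        key = hashlib.sha256(pickle.dumps(x)).hexdigest()
        entry = cache / key
        if entry.exists():
            return pickle.loads(entry.read_bytes())
        result = func(x)
        entry.write_bytes(pickle.dumps(result))
        return result

    return wrapper
```

A function wrapped this way could then be passed straight to `map(...)`, which keeps the pipeline API untouched at the cost of one small file per distinct input.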
This is what I'd think it works like:
```python
# Untested code
from pathlib import Path
from typing import TypeVar

import dill
import functional as funct
from functional.pipeline import Sequence

T = TypeVar("T")

def with_caching(seq: Sequence[T], file: str, path: str = "~/.cache") -> Sequence[T]:
    cache_dir = Path(path).expanduser()
    cache_dir.mkdir(parents=True, exist_ok=True)
    cache_file = cache_dir / file
    if cache_file.exists():
        # Cache hit: load the pickled results and wrap them in a new Sequence.
        with cache_file.open("rb") as f:
            cache = dill.load(f)
        return funct.seq(cache)
    else:
        # Cache miss: force evaluation, persist the results, and return the sequence.
        results = seq.to_list()
        with cache_file.open("wb") as f:
            dill.dump(results, f)
        return seq
```
This is an external function (because I don't know PyFunctional's internals well enough to modify the library itself).
It is used as follows:
```python
expensive_result = with_caching(
    s.map(expensive_function), file="expensive_result", path="./experiment-cache"
)
```
Hope this helps.
I also like the idea, but I think there are some things to take into account:

- the input could change;
- assuming all of the above would be taken care of, your expensive_function could get an updated implementation.

So the hashing should take into account the input that is processed at the caching point, but also the code of the expensive_function itself.
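To make that concrete, here is a hedged sketch of a cache key that changes when either the input or the function's implementation changes. Hashing the function's compiled bytecode and constants is a crude heuristic (it misses changes in helpers the function calls, which is something joblib handles more robustly); `cache_key` is an illustrative name, not an existing API:

```python
import hashlib
import pickle

def cache_key(func, data):
    """Hash both the function's code and the input at the caching point."""
    code = func.__code__
    h = hashlib.sha256()
    h.update(code.co_code)                   # compiled body of the function
    h.update(repr(code.co_consts).encode())  # constants the function uses
    h.update(pickle.dumps(data))             # the input being processed
    return h.hexdigest()
```

If the key matches one stored alongside the cache file, the cache is used; otherwise the pipeline is recomputed and the cache rewritten.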
I think joblib does exactly what you are after, taking into account all of the above: https://joblib.readthedocs.io/en/latest/memory.html
```python
>>> cachedir = 'your_cache_location_directory'
>>> from joblib import Memory
>>> memory = Memory(cachedir, verbose=0)
>>> @memory.cache
... def f(x):
...     print('Running f(%s)' % x)
...     return x
```
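Since a joblib-cached function is an ordinary callable, it can be dropped straight into a PyFunctional pipeline, e.g. `seq(data).map(expensive_function)`. A small self-contained sketch (the temporary cache directory and the `calls` counter are just for demonstration):

```python
import tempfile
from joblib import Memory

# Illustrative cache location; in practice this would be a persistent directory.
memory = Memory(tempfile.mkdtemp(), verbose=0)

calls = []

@memory.cache
def expensive_function(x):
    calls.append(x)  # track how many times the body actually runs
    return x * x

first = [expensive_function(i) for i in range(3)]   # computes and caches
second = [expensive_function(i) for i in range(3)]  # served from the on-disk cache
```

On the second pass the function body does not run again; joblib keys the cache on both the arguments and the function's code, which addresses the invalidation concerns above.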
It would be awesome if there was an easy way to cache results to a file and if the cache was found then load from the cache without recomputing.
To basically support the following use case/API:
First run:
Second run (the first run should have created the cache file, which will now be used instead of recomputing the sequence `expensive_result`):
Please let me know if you have any questions or would like some clarifications.
Loving the project so far, thanks for your effort!