NNPDF / reportengine

A framework for declarative data analysis
https://data.nnpdf.science/validphys-docs/guide.html
GNU General Public License v2.0
1 stars 2 forks source link

Adding ability to cache general function returns #25

Closed siranipour closed 4 years ago

siranipour commented 4 years ago

So I found it really annoying where I had vp functions that would constantly do the exact same thing but take absolutely ages.

I've written a wrapper that basically caches function returns as pickle objects. I'm happy to delete this if people deem it not to be unnecessary.

Basic usage is as follows:

In [1]: from reportengine.caching import cache_to_file                             

In [2]: @cache_to_file() 
   ...: def square(iterable): 
   ...:     return [i**2 for i in iterable] 
   ...:                                                                            

In [3]: square([1,2,3,4])                                                          
Out[3]: [1, 4, 9, 16]

And if you check /tmp (default) you should have a file 'square(([1, 2, 3, 4],), {})' that cache'd the result.

If we want to keep it I'll write proper doc strings for merging.

Zaharid commented 4 years ago

I've had several ideas on how to do this. I think the best one is the "resource system" discussed here https://github.com/NNPDF/nnpdf/issues/224. A simple minded version of it was used for the alpha_s fits. I have been arguing for a while that it is a high priority for development. The idea is that you write the work in two stages, the one that does the hard computation and the one that does the analysis. The compute intensive stuff would be serialized properly, versioned properly and uploaded to the server, in a way that can be suitably tracked.

Overall the problem I have with pickling everything is that the result is most likely going to be wrong in a way that it is difficult to debug. What happens if you change what square does and inadvertently load the wrong result from some tmp folder? Or if some input resource changes in any way.

Currently an alternative to do development, is to keep the annoying to compute thing on a live session (like a notebook), using the API.

I would implement this using shelve rather than pickle. We had something like this in an older code, see e.g. https://github.com/scarrazza/smpdf/blob/f74efb7cab034114c33ca695f5c305254493afe2/src/smpdflib/lhio.py#L108 and one reason it was not ported was that there were hard to debug errors using it. Of course most things were much less stable and mature then, so it might not be that bad in practice, but then again we can as well implement the resources idea with not that much more work.

siranipour commented 4 years ago

Overall the problem I have with pickling everything is that the result is most likely going to be wrong in a way that it is difficult to debug. What happens if you change what square does and inadvertently load the wrong result from some tmp folder? Or if some input resource changes in any way.

Addressing the point on if the function changes, why not pickle the function too and check they are the same before returning the cached result?

Will look into shelve and NNPDF/nnpdf#224

Zaharid commented 4 years ago

Overall the problem I have with pickling everything is that the result is most likely going to be wrong in a way that it is difficult to debug. What happens if you change what square does and inadvertently load the wrong result from some tmp folder? Or if some input resource changes in any way.

Addressing the point on if the function changes, why not pickle the function too and check they are the same before returning the cached result?

I guess one problem is that you would need to save the entire graph of dependencies. Also the stdlib pickle module saves the name and module of the function rather than the bytecode. One would have to use something like cloudpickle for that.

siranipour commented 4 years ago

Sure, but these problems seem tackle-able.

Zaharid commented 4 years ago

Note also that str(args) is not necessarily a good enough key. Things might have very large string representations, or string representations that are not unique enough. However you could use spec_to_nice_name as in e.g. https://github.com/NNPDF/reportengine/blob/4defa85fa36a7ad664925b5927c7962ed69a0d80/src/reportengine/figure.py which has the purpose of giving an unique key per action (it is more of a purpose than a reality, but still).

siranipour commented 4 years ago

Note this fails for things like TheoryIDSpec which have no __eq__ method

Zaharid commented 4 years ago

Why is that

from validphys.loader import Loader                                                                           

l = Loader()                                                                                                  

t53 = l.check_theoryID(53)                                                                                    

t52 = l.check_theoryID(52)                                                                                    

t52 == t53                                                                                                    
False

t52 == t52                                                                                                    
True
siranipour commented 4 years ago

Hmmm maybe there is some other reason it fails.

siranipour commented 4 years ago

Ok, I'm confused now. According to this python objects should by default return False if __eq__ is not implemented. Indeed this is what happens when reading in the pickled version. If I have a function like

@cache_to_file()
def foo(theory):
    return theory

Then I can't read the cached result because the if cached_args == args fails (despite being the same object).

Btw, shelve seems to be a bit funny about which dbm module it uses and so works in weird ways where it generates 3 distinct files when writing to disk, see here

siranipour commented 4 years ago

Ok I found this library that seems to work nicely for exactly this kind of stuff.

From what I can tell it works a treat. Happy to use this instead of reinventing the proverbial wheel

Zaharid commented 4 years ago

The problem with theoryid is that indeed it doesn't implement __eq__ (I thought it did so my comment above was completely misleading). Equality then defaults to identity (i.e. objects pointing to the same location in memory are identical, but different objects made of the same attributes are different).

We probably should fix theoryid, but now that I recall it probably doesn't have __eq__ on purpose, because it is not so clear how the path should be treated upon equality comparison, and indeed it depends on the context (of course I am sure that other objects happily save the path). Just the i.d. is probably good as a cache key, but you want to know which path you are loading tables from.

One approach I have been thinking on is to map actions to input somehow (e.g. you would somehow hash on {'theoryid': 53}). This is in my plans for reportengine 1.0 (but I have too many plans... cc @wilsonmr ). It is basically the same problem as to decide which node in the graph is unique so you don't need to execute it again. At the moment, this is decided based on namespaces, which is limited in many ways. Of course within reportengine it is simpler because you don't have to worry about code being changed and so on.

Zaharid commented 4 years ago

re shelve, I think it should be fine to treat it as a black box. It is likely to be better that writing a bunch of files.

siranipour commented 4 years ago

What did you think of joblib.Memory? I quite like it tbh

Zaharid commented 4 years ago

I think it mostly saves the decorator you made, but that is really not the problem here. The problems are: