UDST / orca

Python library for task orchestration
https://udst.github.io/orca/
BSD 3-Clause "New" or "Revised" License

dynamic determination of invalidated cache #15

Open Eh2406 opened 7 years ago

Eh2406 commented 7 years ago

Can we keep track of when we generate an orca column and use that to determine when an orca table needs to be recalculated?

fscottfoti commented 7 years ago

You mean like have a dependency graph? I definitely thought that was a promising idea but Matt thought it would be a lot of work at the time...

Eh2406 commented 7 years ago

To be honest, I assumed we had a DG, or at least the data to build one on the fly. I assumed that is what's needed for lazy loading. I'll go back to the code. How do we do lazy loading?

When we add data to orca, store a timestamp (or monotonic counter) in a _CACHE_TIME. Wherever we do lazy loading, rerun that code and find the largest _CACHE_TIME among the dependencies. If any dependency's time is greater than the time recorded for this item's cache, then we are in an invalidated-cache state.
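The counter-based scheme above can be sketched without orca itself. This is a minimal illustration, not orca's API: `_store`, `_cache`, `_deps`, `add`, and `get` are all hypothetical names, and the dependency lists are assumed to be known up front (the very assumption questioned later in this thread).

```python
import itertools

# Monotonic counter: every write gets the next tick (the "_CACHE_TIME").
_clock = itertools.count()

_store = {}   # name -> (value, tick of last write)
_cache = {}   # name -> (cached result, tick at which it was built)
_deps = {}    # name -> list of dependency names (assumed known up front)


def add(name, value):
    """Write a value and stamp it with the current tick."""
    _store[name] = (value, next(_clock))


def get(name, compute):
    """Serve the cache unless a dependency was written more recently."""
    latest_dep = max(_store[d][1] for d in _deps[name])
    if name in _cache:
        result, built_at = _cache[name]
        if built_at > latest_dep:
            return result          # cache still valid
    # Invalidated (or never built): recompute and restamp.
    result = compute(*(_store[d][0] for d in _deps[name]))
    _cache[name] = (result, next(_clock))
    return result


# Usage:
_deps["total"] = ["a", "b"]
add("a", 2)
add("b", 3)
assert get("total", lambda a, b: a + b) == 5    # computed
assert get("total", lambda a, b: a + b) == 5    # served from cache
add("a", 10)                                    # bumps a's tick past the cache's
assert get("total", lambda a, b: a + b) == 13   # recomputed
```

A counter avoids the clock-resolution problems of wall-time timestamps: two writes in the same millisecond still get distinct, ordered ticks.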

fscottfoti commented 7 years ago

No, there's no DG - lazy evaluation happens when an item gets injected into a function. If it's in the cache, it doesn't get re-evaluated. The user has to invalidate the cache manually, which is usually done every iteration, and is why my other issue about the cache is so important. It's not that we didn't think a DG was a good idea, just that Matt never got to it.
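The current behavior described above - inject by name, cache on first use, invalidate everything by hand - can be mimicked in a few lines. This is a toy sketch, not orca's implementation; `injectable`, `evaluate`, and `clear_cache` are stand-in names, and orca's real decorators are richer than this.

```python
import inspect

_registry = {}   # name -> provider function
_cache = {}      # name -> cached result


def injectable(func):
    """Register a provider; its arguments are injected by name."""
    _registry[func.__name__] = func
    return func


def evaluate(name):
    """Lazy evaluation: compute on first use, then serve from the cache."""
    if name not in _cache:
        func = _registry[name]
        args = [evaluate(p) for p in inspect.signature(func).parameters]
        _cache[name] = func(*args)
    return _cache[name]


def clear_cache():
    """Manual, wholesale invalidation: done by hand every iteration today."""
    _cache.clear()


calls = []

@injectable
def households():
    calls.append("households")      # track how often this actually runs
    return [120, 80, 45]

@injectable
def total_households(households):
    return sum(households)

assert evaluate("total_households") == 245
assert evaluate("total_households") == 245
assert calls == ["households"]      # cached: the provider ran once
clear_cache()
evaluate("total_households")
assert calls == ["households", "households"]   # everything recomputed
```

Note that `clear_cache` is all-or-nothing: there is no record of what depends on what, so the only safe move is to throw everything away.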

Eh2406 commented 7 years ago

Does the algorithm I suggested make sense? Do you think it would improve things?

As always just starting a discussion, and willing to help implement any fix.

fscottfoti commented 7 years ago

I definitely agree that building a DG and invalidating the cache appropriately when something gets updated would be awesome. The problem is that a dependency graph is usually built from what gets injected into the function, but not everything that gets injected is necessarily used. So, for example, when you inject the parcels table, which has 20 columns, and you update only one column, you want to invalidate only the things that depend on that column, yet you're injecting the whole table. I see no obvious way to solve this other than requiring the user to say explicitly which columns a function depends on, but then we're almost as prone to error as we were in the first place. Thoughts?

Eh2406 commented 7 years ago

It can also be subverted by using orca.get_table inside the function to create a dependency without telling orca, or by using a different global variable / file to store some state. Generally, Python is permissive about privacy and mutability, for better and worse. (Off topic: I am excited by timely dataflow for Rust. It leverages Rust's strictness about privacy and mutability to ensure a lot of the invariants we are discussing.)

Perhaps this algorithm can be a diagnostic tool? Still thinking...

fscottfoti commented 7 years ago

Right - that's the best I can come up with too... orca knows when it's inside that method, and it can capture any calls to get_table or __getattr__ and know which columns are a dependency. My fear is that you can have different code paths through the method - like an if/else - so you can't know you've got ALL the dependencies just by running the function once. And that's where I get stumped.

bridwell commented 7 years ago

I would say that stating specific column dependencies, while cumbersome, is still preferable to having to manually clear the cache within the steps. But maybe this could be automated in some way, like reading the dependencies from a yaml file or configuration dictionary rather than having them declared in the decorator?
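A config-driven variant of the declarations might look like this. The structure is hypothetical (`CONFIG`, `steps_to_rerun`, and the step/column names are invented); a dict stands in for the yaml file, which in practice would be loaded with something like `yaml.safe_load`.

```python
# Stand-in for a small YAML config of per-step column dependencies:
#
#   price_model:
#     depends: [parcels.land_value, parcels.zone_id]
#   job_allocation:
#     depends: [jobs.sector_id]
#
CONFIG = {
    "price_model": {"depends": ["parcels.land_value", "parcels.zone_id"]},
    "job_allocation": {"depends": ["jobs.sector_id"]},
}


def steps_to_rerun(column, config=CONFIG):
    """List the steps whose declared dependencies include `column`."""
    return sorted(s for s, spec in config.items() if column in spec["depends"])


assert steps_to_rerun("parcels.zone_id") == ["price_model"]
assert steps_to_rerun("jobs.sector_id") == ["job_allocation"]
```

Keeping the declarations in a config file rather than the decorator means they can be reviewed and edited alongside the rest of a model's configuration, though the drift risk is the same.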

waddell commented 7 years ago

@bridwell that is my inclination as well: be explicit about dependencies. It's a bit more work up front, but with a good payoff in being able to determine dynamically when to recompute after a dependency is updated.