mpu-creare opened 1 year ago
@jmilloy @joffreypeters Let me know what you think about the above proposal. No rush on this. I just wanted to get my thoughts down while I was having them.
@mpu-creare I think this looks reasonable.
I think there's no benefit to including a `factory` input to the factory method - that's something someone can just as easily implement if they want to use one. I would also consider calling your `make` method `create`, perhaps?
I do have some questions about how this would affect GeoWATCH - my impression is that it would touch virtually every class in some way - even if we preserved the default usage here, we'd still need to modify most or all constructors in the codebase - not a big lift, but would touch most files.
@joffreypeters `create` is better, I'm terrible with names.

You would only need to modify the code in the places where classes are instantiated. That might actually be a pretty small subset... but it depends how much optimization we want to do, because this change would provide a lot of potential optimizations (i.e. caching the raw data instead of the interpolated version, interpolating as the last step for most of the pipelines, etc.).
I think it would be helpful if you provide a specific problem with the current caching so that it is clear if the proposed change solves that problem.
In theory, I'm strongly in favor of making the interpolation and caching more explicit, which I think the first part of your proposal does. The competing issue is the ease-of-use for simple or common cases, which is what I think the second part of your proposal is. The challenge here is to maintain explicit behavior without overburdening the user for simple use-cases. In general, a slightly more verbose API that requires more boilerplate can actually be less burden on users. For me, the `factory` argument ventures into implicit or confusing behavior, where the `make` method is trying to do two different things in the same method. I think the `make` classmethod without the `factory` arguments is a reasonable API.
What about a `Node` method that returns an Interpolation node and another that returns a Cache node:

```python
class Node(...):
    def cache(self, **kwargs):
        return podpac.Cache(source=self, **kwargs)

    def interpolation(self, **kwargs):
        return podpac.Interpolation(source=self, **kwargs)
```
One-line usage would be:

```python
node = podpac.data.Rasterio(source=<path>).cache().interpolation()
```

Not sure this is any better, especially in python syntax. So just another proposal that might keep the juices flowing or even clarify why the flat `make` version is better.
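To make the wrapping order concrete, here is a minimal, self-contained sketch of the chained proposal using toy stand-in classes (these are illustrations only, not the real podpac classes or API):

```python
# Toy stand-ins to illustrate the chaining proposal. Each call wraps the
# previous node, so interpolation ends up outermost.
class Node:
    def cache(self, **kwargs):
        # return a Cache node wrapping this node
        return Cache(source=self, **kwargs)

    def interpolation(self, **kwargs):
        # return an Interpolation node wrapping this node
        return Interpolation(source=self, **kwargs)

class Rasterio(Node):
    def __init__(self, source):
        self.source = source

class Cache(Node):
    def __init__(self, source, **kwargs):
        self.source = source
        self.kwargs = kwargs

class Interpolation(Node):
    def __init__(self, source, **kwargs):
        self.source = source
        self.kwargs = kwargs

# One-line usage: Interpolation wraps Cache, which wraps the data source.
node = Rasterio(source="data.tif").cache().interpolation()
print(type(node).__name__)                # Interpolation
print(type(node.source).__name__)         # Cache
print(type(node.source.source).__name__)  # Rasterio
```

The chaining reads left-to-right in evaluation order (read, cache, interpolate), which is arguably the main ergonomic win of this shape.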
Last thing. My `podpac.Environment` proposal (https://github.com/creare-com/podpac/issues/451#issuecomment-850514165) comes to mind. The other alternative that I have tickling my brain is to push all caching into this environment. Your `make` classmethod would be replaced with an `interpolate` classmethod that does the same thing but only handles interpolation. The `Environment` would inject caching at the desired node.
```python
with podpac.Environment(...) as env:
    data_node = Rasterio(source=<path>)
    node = Interpolation(source=data_node, <interpolation args>)
    env.cache(data_node, <caching args>)  # inject caching at the data_node before interpolation
    node.eval(...)

# or, with the factory:
with podpac.Environment(...) as env:
    node = Rasterio.interpolation(source=<data>, <interpolation_args>)
    env.cache(node.source)  # inject caching before interpolation
    node.eval(...)
```
This comes as a user-facing spec, with no claims about implementation.
One thing that stands out to me now that I write this (and thinking about representing this spec in the JSON pipeline, for example) is: does caching belong in the pipeline? It doesn't affect the results, only the running behavior, just like threading and other execution controls. A JSON pipeline coupled with an execution environment makes sense to me, with caching included in the execution environment. In a serialized pipeline/environment, the nodes you want to cache could easily be listed by name.
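As a toy illustration of "listing cached nodes by name" (the schema below is entirely hypothetical, not podpac's actual JSON pipeline format), the pipeline would define *what* is computed and the environment *how* it runs:

```python
import json

# Hypothetical serialized pipeline: defines what is computed.
pipeline = {
    "data": {"node": "datalib.Rasterio", "source": "data.tif"},
    "output": {"node": "core.Interpolation", "inputs": {"source": "data"}},
}

# Hypothetical execution environment: defines how it runs, including
# which nodes to cache, referenced by name.
environment = {
    "threads": 4,
    "cache": ["data"],  # cache the raw data node, not the interpolated output
}

doc = json.dumps({"pipeline": pipeline, "environment": environment})
print(json.loads(doc)["environment"]["cache"])  # ['data']
```

Splitting the two documents this way is what would let you ship the pipeline to another system without also dictating its storage and execution policy.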
The more I type this out, the more I'm on the side of keeping caching out of the pipeline, though I expect that there are strong practical reasons that this would become unwieldy that I don't remember. I don't mean that there wouldn't be a `Caching` node in the implementation, but that in normal user-facing API usage, you would access it another way.
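To make the environment idea above concrete, here is one possible toy sketch (all names hypothetical; `podpac.Environment` does not exist as described) of a context manager that records which nodes to cache, keeping caching out of the pipeline definition itself:

```python
# Toy sketch of an execution environment that tags nodes for caching
# without putting Cache nodes into the pipeline. Hypothetical names only.
class Environment:
    def __init__(self, **settings):
        self.settings = settings
        self.cached_nodes = []  # (node, cache_args) pairs tagged for caching

    def __enter__(self):
        return self

    def __exit__(self, exc_type, exc, tb):
        return False  # do not suppress exceptions

    def cache(self, node, **cache_args):
        # Tag a node: an evaluator would consult this list and insert
        # caching at these points, rather than the pipeline declaring it.
        self.cached_nodes.append((node, cache_args))

class DataNode:
    def __init__(self, source):
        self.source = source

with Environment(threads=4) as env:
    data_node = DataNode(source="data.tif")
    env.cache(data_node, ttl=3600)

# Caching policy lives in the environment, not the pipeline.
print(len(env.cached_nodes))  # 1
```

Under this sketch, serializing the pipeline alone ships a clean product definition, while the environment carries the execution policy.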
> it depends how much optimization we want to do because this change would provide a lot of potential optimizations. (i.e. caching the raw data instead of interpolated version, interpolating as the last step for most of the pipelines, etc.)
Optimizations would be new features/updates, so those are just a bonus feature we'd be able to implement, not code we'd have to change to preserve current functionality. In that sense, I think the changes are pretty small, and the potential improvements with caching might be substantial.
@jmilloy I like your proposed interface much better:

```python
node = podpac.data.Rasterio(source=<path>).cache().interpolation()
```

is nice and explicit, and it's clear where you pass arguments to the subsequent nodes. You're basically providing convenient functions that add Interpolation and Caching nodes. Very nice.
> I think it would be helpful if you provide a specific problem with the current caching so that it is clear if the proposed change solves that problem.
See #219
> does caching belong in the pipeline? It doesn't affect the results, only the running behavior, just like threading and other execution controls.
That's a good design philosophy question. We also have a few `manager` nodes that control where and how nodes are computed. You can insert a node that runs on the cloud inside a pipeline that's otherwise completely local. So those would fall under the same category as caching. Note that in some cases you might have a node that HAS to be run on a particular cloud API endpoint because that particular machine has a special implementation -- which makes it sort of a `Datasource` node, but not quite. It gets blurry, is my point.
From a practical standpoint, it's pretty nice to set up a system for a particular application that runs just the way you want it to. In that case building caching, parallel and cloud evaluations into the pipeline is quite convenient. It suits our main use cases very nicely.
From a sharing standpoint it's terrible -- when you ship the JSON off to another system, you're no longer just defining the product, but how it should be executed and stored. That feels like a breaking feature for the purposes of sharing. But, this doesn't match our main use case very often...
I'll have to think about that one.
@jmilloy I think having caching as a node or like that context manager would solve the propagate problem as well (https://github.com/creare-com/podpac/issues/451#issuecomment-850514165). When every action is a Node, you don't need to pass common traits down the chain anymore.

Regardless, getting caching out of the `Node` class would solve a lot of the issues I run into with caching in general. Being more explicit about what's cached when would be a great boon.
I'm still curious about the necessary caching flexibility. Is the `Cache` node necessary? Actually, does it even provide the necessary flexibility?
Currently, every node knows how to cache itself, it's just somewhat too hard to control which nodes are getting cached and how. For example, you often want easy built-in interpolation while optionally caching the source data.
A `Cache` node makes that very explicit, but it's not at all easy to turn on and off (e.g. the SMAP example #219). I'd have to hear a little bit more about how the `Cache` node addresses that case. I vaguely proposed specifying which nodes should be cached more clearly through an execution environment or other tagging mechanism. I think that addresses the issues more flexibly, but the implementation is a bit more uncertain.
Removing the InterpolationMixin really clarifies the control of data retrieval vs interpolation for the user. I'm curious if a single `cache_source` attribute gives you what you need in most situations. That is, these are all the same:
```python
# explicit
data_node = Rasterio(source=path, cache=True)
node = Interpolation(source=data_node)

# use cache_source
data_node = Rasterio(source=path)
node = Interpolation(source=data_node, cache_source=True)

# one-liner proposal
node = Rasterio.interpolation(source=path, cache_source=True)

# two versions of the other one-liner proposal
node = Rasterio(source=path, cache=True).interpolation()
node = Rasterio(source=path).interpolation(cache_source=True)
```
It's more clear why you need this when you are defining actual data sources:

```python
class MyDatasource(Rasterio):
    source = MY_PATH

class MyInterpolatedDatasource(Interpolation):
    source = MyDatasource()

# now you can easily use the data source with or without caching:
data_nocache = MyDatasource()
data_cached = MyDatasource(cache=True)

# and you can use the interpolated datasource with caching at any level
node_nocache = MyInterpolatedDatasource()
node_datacached = MyInterpolatedDatasource(cache_source=True)
node_interpcached = MyInterpolatedDatasource(cache=True)
node_bothcached = MyInterpolatedDatasource(cache=True, cache_source=True)
```
For completeness, the interpolation method proposals allow you to skip defining the interpolated class.

```python
class MyDatasource(Rasterio):
    source = MY_PATH

# classmethod version
node = MyDatasource.interpolation(cache_source=True)

# regular method version, two different ways that are the same
node = MyDatasource(cache=True).interpolation()
node = MyDatasource().interpolation(cache_source=True)
```
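One way the `cache_source` equivalence could be realized (purely illustrative; toy stand-in classes, not podpac's implementation) is for `Interpolation` to wrap its source in a `Cache` node when `cache_source=True`:

```python
# Toy sketch: Interpolation(source, cache_source=True) behaves like
# Interpolation(source=Cache(source)). Hypothetical classes for illustration.
class Cache:
    def __init__(self, source):
        self.source = source

class Interpolation:
    def __init__(self, source, cache_source=False):
        # wrapping the *source* caches the raw data before interpolation
        self.source = Cache(source) if cache_source else source

class Rasterio:
    def __init__(self, source):
        self.source = source

data = Rasterio(source="data.tif")

explicit = Interpolation(source=Cache(data))            # explicit wrapping
shorthand = Interpolation(source=data, cache_source=True)  # attribute shorthand

# both pipelines cache the raw data node before interpolation
print(type(explicit.source).__name__, type(shorthand.source).__name__)  # Cache Cache
```

The attribute is then just sugar over explicit node composition, which keeps the pipeline graph the single source of truth for where caching happens.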
@jmilloy let's take this discussion over to #219, which is cache-specific. I think you're right in that we need to absolutely nail down the use-cases before we figure out the design.
The 3.x release has a few backward-compatible features that we should eliminate for the sake of a better overall design. `InterpolationMixin` is leading to some very tricky bugs, and right now it introduced a multi-threading error. This will give a much cleaner design and the user will have much more control over caching in general.

It comes at the cost of a more cumbersome API. For example, currently:

gives you a node that you can interpolate to any coordinate system with caching enabled -- this is generally what you want. With the new proposal you need:

So, instead, what if we made a new class method on `Node` that acts as a factory, where:

I'm not sure about the `factory` parts... seems like you could just create a function and do:

```python
factory(podpac.data.Rasterio(source=...), kw1=...)
```

Anyway, thoughts welcome!