mpu-creare opened 1 year ago
@jmilloy @joffreypeters Let me know what you think about the above proposal. No rush on this. I just wanted to get my thoughts down while I was having them.
@mpu-creare I think this looks reasonable.
I think there's no benefit to including a `factory` input to the factory method - that's something someone can just as easily implement if they want to use one. I would also consider calling your `make` method `create`, perhaps?
I do have some questions about how this would affect GeoWATCH - my impression is that it would touch virtually every class in some way - even if we preserved the default usage here, we'd still need to modify most or all constructors in the codebase - not a big lift, but would touch most files.
@joffreypeters `create` is better, I'm terrible with names.

You would only need to modify the code in the places where classes are instantiated. That might actually be a pretty small subset... but it depends how much optimization we want to do, because this change would provide a lot of potential optimizations (i.e. caching the raw data instead of the interpolated version, interpolating as the last step for most of the pipelines, etc.).
I think it would be helpful if you provide a specific problem with the current caching so that it is clear if the proposed change solves that problem.
In theory, I'm strongly in favor of making the interpolation and caching more explicit, which I think the first part of your proposal does. The competing issue is the ease-of-use for simple or common cases, which is what I think the second part of your proposal is. The challenge here is to maintain explicit behavior without overburdening the user for simple use-cases. In general, a slightly more verbose API that requires more boilerplate can actually be less burden on users. For me, the `factory` argument ventures into implicit or confusing behavior, where the `make` method is trying to do two different things in the same method. I think the `make` classmethod without the `factory` arguments is a reasonable API.
What about a `Node` method that returns an Interpolation node and another that returns a Cache node:

```python
class Node(...):
    def cache(self, **kwargs):
        return podpac.Cache(source=self, **kwargs)

    def interpolation(self, **kwargs):
        return podpac.Interpolation(source=self, **kwargs)
```
One-line usage would be:

```python
node = podpac.data.Rasterio(source=<path>).cache().interpolation()
```

Not sure this is any better, especially in python syntax. So just another proposal that might keep the juices flowing or even clarify why the flat `make` version is better.
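To make the wrapping order concrete, here is a minimal, self-contained sketch of the chained proposal using toy stand-in classes (these are illustrations only, not the real podpac classes or API):

```python
# Toy stand-ins to illustrate the chaining proposal. Each call wraps the
# previous node, so interpolation ends up outermost.
class Node:
    def cache(self, **kwargs):
        # return a Cache node wrapping this node
        return Cache(source=self, **kwargs)

    def interpolation(self, **kwargs):
        # return an Interpolation node wrapping this node
        return Interpolation(source=self, **kwargs)

class Rasterio(Node):
    def __init__(self, source):
        self.source = source

class Cache(Node):
    def __init__(self, source, **kwargs):
        self.source = source
        self.kwargs = kwargs

class Interpolation(Node):
    def __init__(self, source, **kwargs):
        self.source = source
        self.kwargs = kwargs

# One-line usage: Interpolation wraps Cache, which wraps the data source.
node = Rasterio(source="data.tif").cache().interpolation()
print(type(node).__name__)                # Interpolation
print(type(node.source).__name__)         # Cache
print(type(node.source.source).__name__)  # Rasterio
```

The chaining reads left-to-right in evaluation order (read, cache, interpolate), which is arguably the main ergonomic win of this shape.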
Last thing. My `podpac.Environment` proposal (https://github.com/creare-com/podpac/issues/451#issuecomment-850514165) comes to mind. The other alternative that I have tickling my brain is to push all caching into this environment. Your `make` classmethod would be replaced with an `interpolate` classmethod that does the same thing but only handles interpolation. The `Environment` would inject caching at the desired node.
```python
with podpac.Environment(...) as env:
    data_node = Rasterio(source=<path>)
    node = Interpolation(source=data_node, <interpolation args>)
    env.cache(data_node, <caching args>)  # inject caching at the data_node before interpolation
    node.eval(...)

# or, with the factory:
with podpac.Environment(...) as env:
    node = Rasterio.interpolation(source=<data>, <interpolation_args>)
    env.cache(node.source)  # inject caching before interpolation
    node.eval(...)
```
This comes as a user-facing spec, with no claims about implementation.
One thing that stands out to me now that I write this (and thinking about representing this spec in the JSON pipeline, for example) is: does caching belong in the pipeline? It doesn't affect the results, only the running behavior, just like threading and other execution controls. A JSON pipeline coupled with an execution environment makes sense to me, with caching included in the execution environment. In a serialized pipeline/environment, the nodes you want to cache could easily be listed by name.
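As a toy illustration of "listing cached nodes by name" (the schema below is entirely hypothetical, not podpac's actual JSON pipeline format), the pipeline would define *what* is computed and the environment *how* it runs:

```python
import json

# Hypothetical serialized pipeline: defines what is computed.
pipeline = {
    "data": {"node": "datalib.Rasterio", "source": "data.tif"},
    "output": {"node": "core.Interpolation", "inputs": {"source": "data"}},
}

# Hypothetical execution environment: defines how it runs, including
# which nodes to cache, referenced by name.
environment = {
    "threads": 4,
    "cache": ["data"],  # cache the raw data node, not the interpolated output
}

doc = json.dumps({"pipeline": pipeline, "environment": environment})
print(json.loads(doc)["environment"]["cache"])  # ['data']
```

Splitting the two documents this way is what would let you ship the pipeline to another system without also dictating its storage and execution policy.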
The more I type this out, the more I'm on the side of keeping caching out of the pipeline, though I expect that there are strong practical reasons that this would become unwieldy that I don't remember. I don't mean that there wouldn't be a `Caching` node in the implementation, but that in normal user-facing API usage, you would access it another way.
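To make the environment idea above concrete, here is one possible toy sketch (all names hypothetical; `podpac.Environment` does not exist as described) of a context manager that records which nodes to cache, keeping caching out of the pipeline definition itself:

```python
# Toy sketch of an execution environment that tags nodes for caching
# without putting Cache nodes into the pipeline. Hypothetical names only.
class Environment:
    def __init__(self, **settings):
        self.settings = settings
        self.cached_nodes = []  # (node, cache_args) pairs tagged for caching

    def __enter__(self):
        return self

    def __exit__(self, exc_type, exc, tb):
        return False  # do not suppress exceptions

    def cache(self, node, **cache_args):
        # Tag a node: an evaluator would consult this list and insert
        # caching at these points, rather than the pipeline declaring it.
        self.cached_nodes.append((node, cache_args))

class DataNode:
    def __init__(self, source):
        self.source = source

with Environment(threads=4) as env:
    data_node = DataNode(source="data.tif")
    env.cache(data_node, ttl=3600)

# Caching policy lives in the environment, not the pipeline.
print(len(env.cached_nodes))  # 1
```

Under this sketch, serializing the pipeline alone ships a clean product definition, while the environment carries the execution policy.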
> it depends how much optimization we want to do because this change would provide a lot of potential optimizations. (i.e. caching the raw data instead of interpolated version, interpolating as the last step for most of the pipelines, etc.)
Optimizations would be new features/updates, so those are just a bonus feature we'd be able to implement, not code we'd have to change to preserve current functionality. In that sense, I think the changes are pretty small, and the potential improvements with caching might be substantial.
@jmilloy I like your proposed interface much better:

```python
node = podpac.data.Rasterio(source=<path>).cache().interpolation()
```

is nice and explicit, and it's clear where you pass arguments to the subsequent nodes. You're basically providing convenient functions that add Interpolation and Caching nodes. Very nice.
> I think it would be helpful if you provide a specific problem with the current caching so that it is clear if the proposed change solves that problem.
See #219
> does caching belong in the pipeline? It doesn't affect the results, only the running behavior, just like threading and other execution controls.
That's a good design philosophy question. We also have a few `manager` nodes that control where and how nodes are computed. You can insert a node that runs on the cloud inside a pipeline that's otherwise completely local. So those would fall under the same category as caching. Note that in some cases you might have a node that HAS to be run on a particular cloud API endpoint because that particular machine has a special implementation -- which makes it sort of a `Datasource` node, but not quite. It gets blurry, is my point.
From a practical standpoint, it's pretty nice to set up a system for a particular application that runs just the way you want it to. In that case building caching, parallel and cloud evaluations into the pipeline is quite convenient. It suits our main use cases very nicely.
From a sharing standpoint it's terrible -- when you ship the JSON off to another system, you're no longer just defining the product, but how it should be executed and stored. That feels like a breaking feature for the purposes of sharing. But, this doesn't match our main use case very often...
I'll have to think about that one.
@jmilloy I think having caching as a node or like that context manager would solve the propagate problem as well (https://github.com/creare-com/podpac/issues/451#issuecomment-850514165). When every action is a Node, you don't need to pass common traits down the chain anymore.

Regardless, getting caching out of the `Node` class would solve a lot of the issues I run into with caching in general. Being more explicit about what's cached when would be a great boon.
I'm still curious about the necessary caching flexibility. Is the `Cache` node necessary? Actually, does it even provide the necessary flexibility?
Currently, every node knows how to cache itself, it's just somewhat too hard to control which nodes are getting cached and how. For example, you often want easy built-in interpolation while optionally caching the source data.
A `Cache` node makes that very explicit, but it's not at all easy to turn on and off (e.g. the SMAP example #219). I'd have to hear a little bit more about how the `Cache` node addresses that case. I vaguely proposed specifying which nodes should be cached more clearly through an execution environment or other tagging mechanism. I think that addresses the issues more flexibly, but the implementation is a bit more uncertain.
Removing the InterpolationMixin really clarifies the control of data retrieval vs interpolation for the user. I'm curious if a single `cache_source` attribute gives you what you need in most situations. That is, these are all the same:
```python
# explicit
data_node = Rasterio(source=path, cache=True)
node = Interpolation(source=data_node)

# use cache_source
data_node = Rasterio(source=path)
node = Interpolation(source=data_node, cache_source=True)

# one-liner proposal
node = Rasterio.interpolation(source=path, cache_source=True)

# two versions of the other one-liner proposal
node = Rasterio(source=path, cache=True).interpolation()
node = Rasterio(source=path).interpolation(cache_source=True)
```
It's more clear why you need this when you are defining actual data sources:

```python
class MyDatasource(Rasterio):
    source = MY_PATH

class MyInterpolatedDatasource(Interpolation):
    source = MyDatasource()

# now you can easily use the data source with or without caching:
data_nocache = MyDatasource()
data_cached = MyDatasource(cache=True)

# and you can use the interpolated datasource with caching at any level
node_nocache = MyInterpolatedDatasource()
node_datacached = MyInterpolatedDatasource(cache_source=True)
node_interpcached = MyInterpolatedDatasource(cache=True)
node_bothcached = MyInterpolatedDatasource(cache=True, cache_source=True)
```
For completeness, the interpolation method proposals allow you to skip defining the interpolated class.

```python
class MyDatasource(Rasterio):
    source = MY_PATH

# classmethod version
node = MyDatasource.interpolation(cache_source=True)

# regular method version, two different ways that are the same
node = MyDatasource(cache=True).interpolation()
node = MyDatasource().interpolation(cache_source=True)
```
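One way the `cache_source` equivalence could be realized (purely illustrative; toy stand-in classes, not podpac's implementation) is for `Interpolation` to wrap its source in a `Cache` node when `cache_source=True`:

```python
# Toy sketch: Interpolation(source, cache_source=True) behaves like
# Interpolation(source=Cache(source)). Hypothetical classes for illustration.
class Cache:
    def __init__(self, source):
        self.source = source

class Interpolation:
    def __init__(self, source, cache_source=False):
        # wrapping the *source* caches the raw data before interpolation
        self.source = Cache(source) if cache_source else source

class Rasterio:
    def __init__(self, source):
        self.source = source

data = Rasterio(source="data.tif")

explicit = Interpolation(source=Cache(data))            # explicit wrapping
shorthand = Interpolation(source=data, cache_source=True)  # attribute shorthand

# both pipelines cache the raw data node before interpolation
print(type(explicit.source).__name__, type(shorthand.source).__name__)  # Cache Cache
```

The attribute is then just sugar over explicit node composition, which keeps the pipeline graph the single source of truth for where caching happens.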
@jmilloy let's take this discussion over to #219, which is cache-specific. I think you're right in that we need to absolutely nail down the use-cases before we figure out the design.
The 3.x release has a few backward-compatible features that we should eliminate for the sake of a better overall design. `InterpolationMixin` is leading to some very tricky bugs, and right now it introduced a multi-threading error. This will give a much cleaner design and the user will have much more control over caching in general.

It comes at the cost of a more cumbersome API. For example, currently:

gives you a node that you can interpolate to any coordinate system with caching enabled -- this is generally what you want. With the new proposal you need:

So, instead, what if we made a new class method on `Node` that acts as a factory, where:

I'm not sure about the `factory` parts... seems like you could just create a function and do:

```python
factory(podpac.data.Rasterio(source=...), kw1=...)
```

Anyway, thoughts welcome!