esgf2-us / intake-esgf

Programmatic access to the ESGF holdings
https://intake-esgf.readthedocs.io/
BSD 3-Clause "New" or "Revised" License

Proposal: integrate remote subsetting with rooki #72

Open nocollier opened 3 weeks ago

nocollier commented 3 weeks ago

While working on a tutorial for using rooki and WPS to do subsetting and averaging, it struck me that we could partially integrate this into the intake-esgf interface. I am a bit torn about whether it is a good idea, as I can see both sides of the argument.

As I look at the situation, the WPS services could be made much more useful if we integrated them into the intake-esgf interface in a sensible and completely optional manner. I am thinking of an interface like:

dsd = cat.to_dataset_dict(
    subset=dict(
        time=slice("1990-01-01", "2000-01-01"),
        lat=slice(-10, 35),
        lon=slice(0, 110),
    )
)

Internally, this is some of the logic we would want to implement:
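A rough sketch of that translation layer (the helper name `subset_to_rooki_args` and the roocs `"west,south,east,north"` area ordering are assumptions on my part):

```python
def subset_to_rooki_args(subset: dict) -> dict:
    """Hypothetical helper: translate the xarray-style subset dict into the
    string arguments that rooki's ops.Subset expects."""
    args = {}
    time = subset.get("time")
    if time is not None:
        # slice("1990-01-01", "2000-01-01") -> "1990-01-01/2000-01-01"
        args["time"] = f"{time.start}/{time.stop}"
    lat = subset.get("lat")
    lon = subset.get("lon")
    if lat is not None and lon is not None:
        # two slices -> "west,south,east,north" (assumed roocs convention)
        args["area"] = f"{lon.start},{lat.start},{lon.stop},{lat.stop}"
    return args

print(subset_to_rooki_args(dict(
    time=slice("1990-01-01", "2000-01-01"),
    lat=slice(-10, 35),
    lon=slice(0, 110),
)))
# → {'time': '1990-01-01/2000-01-01', 'area': '0,-10,110,35'}
```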

nocollier commented 3 weeks ago

The above is a clean way to allow for subsetting, but becomes less clean once you consider the other WPS operations. A more flexible interface may be to let users pass a rooki workflow. For example,

workflow = ops.AverageByTime(
    ops.Subset(
        ops.Input("tas", [rooki_id]),
        time="1990-01-01/2000-01-01",
        area="65,0,100,35",
    ),
    freq="year",
)

This won't work exactly as written because the workflow definition includes the id of the dataset (rooki_id) and the variable_id of the dataset, neither of which is known until we iterate over the catalog. We could instead allow a function of the form:

def my_rooki_workflow_function(variable_id: str, rooki_id: str) -> rooki.workflow:
    workflow = ops.AverageByTime(
        ops.Subset(
            ops.Input(variable_id, [rooki_id]),
            time="1990-01-01/2000-01-01",
            area="65,0,100,35",
        ),
        freq="year",
    )
    return workflow

dsd = cat.to_dataset_dict(rooki_workflow=my_rooki_workflow_function)

I would like to pin down a type for the rooki workflow so that we can validate the function the user passes in. The user would need to understand that this function will execute uniformly on all datasets in the catalog. We could cache results in the same way we already do, using a hash of the function's source code together with the dataset_id.
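A minimal sketch of such a cache key, assuming we hash the function source (via `inspect.getsource`) together with the dataset_id; the helper name `workflow_cache_key` is hypothetical:

```python
import hashlib
import inspect
from typing import Callable

def workflow_cache_key(fn: Callable, dataset_id: str) -> str:
    """Hypothetical cache key: hash of the user function's source plus the
    dataset_id, so any edit to the workflow function invalidates the cache."""
    try:
        source = inspect.getsource(fn)
    except OSError:
        # Fallback when source is unavailable (e.g. interactive sessions):
        # hash the compiled bytecode instead.
        source = fn.__code__.co_code.decode("latin-1")
    return hashlib.sha256((source + dataset_id).encode()).hexdigest()

def my_rooki_workflow_function(variable_id, rooki_id):
    ...

key = workflow_cache_key(my_rooki_workflow_function, "some-dataset-id")
```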

The downside to this approach is that, if a dataset in the catalog is not available on a WPS-ready server, we would need to convert the workflow into equivalent xarray syntax so that all datasets are returned consistently. The workflow above, for example, would be:

ds[variable_id].sel(
    time=slice("1990-01-01", "2000-01-01"),
    lat=slice(0, 35),
    lon=slice(65, 100),
).mean(dim="time")

But I am not sure how we could do this in general. Perhaps it isn't so hard, and we could build a rooki_to_xarray() workflow transformation.
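One small piece of such a transformation might simply invert the argument translation, turning rooki's time/area strings back into keyword arguments for `Dataset.sel()`. A speculative sketch (the helper name and the `"west,south,east,north"` area ordering are my assumptions):

```python
from typing import Optional

def rooki_subset_to_sel_kwargs(time: Optional[str] = None,
                               area: Optional[str] = None) -> dict:
    """Hypothetical inverse translation: rooki Subset arguments back into
    keyword arguments suitable for xarray's Dataset.sel()."""
    kwargs = {}
    if time is not None:
        # "1990-01-01/2000-01-01" -> slice("1990-01-01", "2000-01-01")
        start, stop = time.split("/")
        kwargs["time"] = slice(start, stop)
    if area is not None:
        # assumed roocs convention: "west,south,east,north"
        west, south, east, north = (float(x) for x in area.split(","))
        kwargs["lat"] = slice(south, north)
        kwargs["lon"] = slice(west, east)
    return kwargs

# Usage: ds[variable_id].sel(**rooki_subset_to_sel_kwargs(
#     time="1990-01-01/2000-01-01", area="65,0,100,35"))
```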