esgf2-us / intake-esgf

Programmatic access to the ESGF holdings
https://intake-esgf.readthedocs.io/
BSD 3-Clause "New" or "Revised" License

Proposal: integrate remote subsetting with rooki #72

Open nocollier opened 3 weeks ago

nocollier commented 3 weeks ago

While working on a tutorial for using rooki and WPS to do subsetting and averaging, it struck me that we could partially integrate this into the intake-esgf interface. I am a bit torn about whether it is a good idea, as I can see both sides of the argument.

As I look at the situation, the WPS services could be made much more useful if we integrated them into the intake-esgf interface in a sensible and completely optional manner. I am thinking of an interface like:

dsd = cat.to_dataset_dict(
    subset=dict(
        time=slice("1990-01-01", "2000-01-01"),
        lat=slice(-10, 35),
        lon=slice(0, 110),
    )
)

Internally, this is some of the logic we would want to implement:
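A rough sketch of that translation layer (the helper name `subset_to_rooki_args` and the roocs `"west,south,east,north"` area ordering are assumptions on my part):

```python
def subset_to_rooki_args(subset: dict) -> dict:
    """Hypothetical helper: translate the xarray-style subset dict into the
    string arguments that rooki's ops.Subset expects."""
    args = {}
    time = subset.get("time")
    if time is not None:
        # slice("1990-01-01", "2000-01-01") -> "1990-01-01/2000-01-01"
        args["time"] = f"{time.start}/{time.stop}"
    lat = subset.get("lat")
    lon = subset.get("lon")
    if lat is not None and lon is not None:
        # two slices -> "west,south,east,north" (assumed roocs convention)
        args["area"] = f"{lon.start},{lat.start},{lon.stop},{lat.stop}"
    return args

print(subset_to_rooki_args(dict(
    time=slice("1990-01-01", "2000-01-01"),
    lat=slice(-10, 35),
    lon=slice(0, 110),
)))
# → {'time': '1990-01-01/2000-01-01', 'area': '0,-10,110,35'}
```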

nocollier commented 3 weeks ago

The above is a clean way to allow for subsetting, but becomes less clean once you consider the other WPS operations. A more flexible interface may be to let users pass a rooki workflow. For example,

workflow = ops.AverageByTime(
    ops.Subset(
        ops.Input("tas", [rooki_id]),
        time="1990-01-01/2000-01-01",
        area="65,0,100,35",
    ),
    freq="year",
)

This won't work exactly as written because the workflow definition includes the id of the dataset (rooki_id) and the variable_id of the dataset, neither of which is known until we iterate over the catalog. We could instead allow a function of the form:

def my_rooki_workflow_function(variable_id: str, rooki_id: str) -> rooki.workflow:
    workflow = ops.AverageByTime(
        ops.Subset(
            ops.Input(variable_id, [rooki_id]),
            time="1990-01-01/2000-01-01",
            area="65,0,100,35",
        ),
        freq="year",
    )
    return workflow

dsd = cat.to_dataset_dict(rooki_workflow=my_rooki_workflow_function)

I would like to pin down a type for the rooki workflow so that we can validate the function the user passes in. The user would need to understand that this function will execute uniformly on all datasets in the catalog. We could cache results in the same way we already do, using a hash of the function's source code together with the dataset_id.
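A minimal sketch of such a cache key, assuming we hash the function source (via `inspect.getsource`) together with the dataset_id; the helper name `workflow_cache_key` is hypothetical:

```python
import hashlib
import inspect
from typing import Callable

def workflow_cache_key(fn: Callable, dataset_id: str) -> str:
    """Hypothetical cache key: hash of the user function's source plus the
    dataset_id, so any edit to the workflow function invalidates the cache."""
    try:
        source = inspect.getsource(fn)
    except OSError:
        # Fallback when source is unavailable (e.g. interactive sessions):
        # hash the compiled bytecode instead.
        source = fn.__code__.co_code.decode("latin-1")
    return hashlib.sha256((source + dataset_id).encode()).hexdigest()

def my_rooki_workflow_function(variable_id, rooki_id):
    ...

key = workflow_cache_key(my_rooki_workflow_function, "some-dataset-id")
```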

The downside to this approach is that, if a dataset in the catalog is not available on a WPS-ready server, we would need to convert the workflow into equivalent xarray syntax so that all datasets are returned consistently. The workflow above, for example, would be:

ds[variable_id].sel(
    time=slice("1990-01-01", "2000-01-01"),
    lat=slice(0, 35),
    lon=slice(65, 100),
).mean(dim="time")

But I am not sure how we could do this in general. Perhaps it isn't so hard, and we could build a rooki_to_xarray() workflow transformation.
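One small piece of such a transformation might simply invert the argument translation, turning rooki's time/area strings back into keyword arguments for `Dataset.sel()`. A speculative sketch (the helper name and the `"west,south,east,north"` area ordering are my assumptions):

```python
from typing import Optional

def rooki_subset_to_sel_kwargs(time: Optional[str] = None,
                               area: Optional[str] = None) -> dict:
    """Hypothetical inverse translation: rooki Subset arguments back into
    keyword arguments suitable for xarray's Dataset.sel()."""
    kwargs = {}
    if time is not None:
        # "1990-01-01/2000-01-01" -> slice("1990-01-01", "2000-01-01")
        start, stop = time.split("/")
        kwargs["time"] = slice(start, stop)
    if area is not None:
        # assumed roocs convention: "west,south,east,north"
        west, south, east, north = (float(x) for x in area.split(","))
        kwargs["lat"] = slice(south, north)
        kwargs["lon"] = slice(west, east)
    return kwargs

# Usage: ds[variable_id].sel(**rooki_subset_to_sel_kwargs(
#     time="1990-01-01/2000-01-01", area="65,0,100,35"))
```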