Open nocollier opened 3 weeks ago
The above is a clean way to allow for subsetting, but is less clean as you think about the other WPS operations. A more flexible interface may be to provide an option for users to pass a rooki workflow. For example,
workflow = ops.AverageByTime(
    ops.Subset(
        ops.Input("tas", [rooki_id]),
        time="1990-01-01/2000-01-01",
        area="65,0,100,35",
    ),
    freq="year",
)
This won't work exactly in this way because the workflow definition includes an id of the dataset (rooki_id) and the variable_id of the dataset. We could allow a function of the type:
def my_rooki_workflow_function(variable_id: str, rooki_id: str) -> rooki.workflow:
    workflow = ops.AverageByTime(
        ops.Subset(
            ops.Input(variable_id, [rooki_id]),
            time="1990-01-01/2000-01-01",
            area="65,0,100,35",
        ),
        freq="year",
    )
    return workflow

dsd = cat.to_dataset_dict(rooki_workflow=my_rooki_workflow_function)
I would like to get a type for the rooki workflow so we can check the type of function that the user passes in. The user would need to understand that this function will execute uniformly on all datasets in the catalog. We could cache results in the same way, by using a hash of the function source code and dataset_id.
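The caching idea above could be keyed roughly as follows. This is only a sketch under stated assumptions, not intake-esgf code: the helper name workflow_cache_key and the stand-in example_workflow are hypothetical, and the function source is read with inspect.getsource, falling back to the compiled bytecode when the source is not available.

```python
import hashlib
import inspect
from typing import Callable


def workflow_cache_key(func: Callable, dataset_id: str) -> str:
    """Hypothetical helper: hash a function's source together with a dataset id."""
    try:
        # The source text changes whenever the user edits the workflow function.
        payload = inspect.getsource(func).encode()
    except (OSError, TypeError):
        # Source is unavailable (e.g. defined interactively); use the bytecode.
        payload = func.__code__.co_code
    digest = hashlib.sha256()
    digest.update(payload)
    digest.update(b"\0")  # separator between the two hashed fields
    digest.update(dataset_id.encode())
    return digest.hexdigest()


def example_workflow(variable_id: str, rooki_id: str):
    # Stand-in for a function that would build and return a rooki workflow.
    return (variable_id, rooki_id)


key_a = workflow_cache_key(example_workflow, "dataset-a")
key_b = workflow_cache_key(example_workflow, "dataset-b")
```

One caveat of this design: hashing the source does not capture values the function reads from enclosing scopes or globals, so two runs with identical source but different captured values would share a cache entry.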
The downside to this approach is that, if a dataset in the catalog is not available on a WPS-ready server, we can't convert this workflow to commensurate xarray syntax so that all datasets are returned consistently. For example, this one would be:
ds[variable_id].sel(
    time=slice("1990-01-01", "2000-01-01"),
    lat=slice(0, 35),
    lon=slice(65, 100),
).groupby("time.year").mean()  # annual means, to match freq="year"
But in general, I am not sure how we could do this. Perhaps it isn't so hard and we can build a rooki_to_xarray() workflow transformation.
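As a starting point for such a transformation, the subset arguments at least map mechanically onto keyword arguments for xarray's sel. The sketch below is hypothetical (the name subset_to_sel_kwargs is not part of rooki or intake-esgf). It assumes the area string is ordered lon_min, lat_min, lon_max, lat_max, as the example above implies, and that the coordinates are named lat and lon and are stored in ascending order.

```python
def subset_to_sel_kwargs(time=None, area=None):
    """Hypothetical helper: turn rooki-style subset strings into .sel() kwargs."""
    kwargs = {}
    if time is not None:
        # "start/end" becomes an inclusive label-based slice.
        start, end = time.split("/")
        kwargs["time"] = slice(start, end)
    if area is not None:
        # Assumed order: lon_min, lat_min, lon_max, lat_max.
        lon_min, lat_min, lon_max, lat_max = (float(v) for v in area.split(","))
        kwargs["lat"] = slice(lat_min, lat_max)
        kwargs["lon"] = slice(lon_min, lon_max)
    return kwargs


sel_kwargs = subset_to_sel_kwargs(
    time="1990-01-01/2000-01-01",
    area="65,0,100,35",
)
# With xarray, ds[variable_id].sel(**sel_kwargs) would then apply the subset locally.
```

Operations such as the time average would need their own translation rules, and that is where the general case gets harder.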
While working on a tutorial for the use of rooki and WPS to do subsetting and averaging, it struck me that we could partially integrate this into the intake-esgf interface. I am a bit torn about whether it is a good idea and I can see both sides of the argument:
... rooki_id in that tutorial. This is different depending on the service and at ORNL even depends on what data you are looking for. (i.e. CMIP6 has a different prefix than CMIP5 and even some of our CMIP6 isn't in the same place). In my experience this will be a hurdle for many users.
As I look at the situation, I see that the WPS services could be made much more useful if we integrate them into the intake-esgf interface in a sensible and completely optional manner. I am thinking of an interface like:
Internally, this is some of the logic we would want to implement:
- If the subset keyword is given, we would check that rooki is installed. I envision an optional install requirement, a la pip install intake-esgf[wps]. This way not everyone need install the client.
- If the subset keyword is given, then we need to look at the IDs that were in the catalog dataframe and partition out those we can operate on remotely. At the moment, this would be finding those at ORNL or DKRZ. This would need to take 2nd place in the priority list, below locally available and above streaming.
- ... sel functions to achieve the same result, albeit no server-side subsetting occurs.
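The logic above could be sketched roughly as follows. Everything here is an assumption made for illustration, not existing intake-esgf code: the function names, the shape of the datasets mapping, and the use of site labels in place of real data node hostnames. Only importlib.util.find_spec is a real standard-library call.

```python
import importlib.util
from typing import Dict, Iterable, List

# Assumption: sites currently offering WPS, per the discussion above.
WPS_SITES = {"ORNL", "DKRZ"}


def rooki_available() -> bool:
    """Return True if the optional rooki client can be imported."""
    return importlib.util.find_spec("rooki") is not None


def choose_access(has_local_copy: bool, sites: Iterable[str],
                  subset_requested: bool, wps_installed: bool) -> str:
    """Priority: local files first, then remote WPS, then streaming."""
    if has_local_copy:
        return "local"
    if subset_requested and wps_installed and WPS_SITES.intersection(sites):
        return "wps"
    # Fallback: stream the data and apply the subset client-side.
    return "stream"


def partition(datasets: Dict[str, dict], subset_requested: bool,
              wps_installed: bool) -> Dict[str, List[str]]:
    """Group dataset ids by the access method chosen for each."""
    plan: Dict[str, List[str]] = {"local": [], "wps": [], "stream": []}
    for dataset_id, info in datasets.items():
        method = choose_access(info["local"], info["sites"],
                               subset_requested, wps_installed)
        plan[method].append(dataset_id)
    return plan


# Hypothetical catalog contents, for illustration only.
datasets = {
    "a": {"local": True, "sites": ["ORNL"]},
    "b": {"local": False, "sites": ["DKRZ", "OTHER"]},
    "c": {"local": False, "sites": ["OTHER"]},
}
plan = partition(datasets, subset_requested=True, wps_installed=True)
```

In practice wps_installed would come from rooki_available(), so that users without the optional client fall through to streaming for every dataset that is not local.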