MeteoSwiss / mlpp-features

Define, track, share, and discover mlpp features.
BSD 3-Clause "New" or "Revised" License
Stable interface to mlpp-workflows #14

Open frazane opened 2 years ago

frazane commented 2 years ago

Every time we change how pipelines are called (e.g. by changing function arguments), we have to adapt the code in mlpp-workflows accordingly. It would be better to have a stable interface between the two libraries.

It could be in the form of an output xr.Dataset object.

This could be done by simply moving the following function (defined in mlpp-workflows) to this library.

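import logging
from datetime import datetime
from typing import Dict, List, Tuple

import xarray as xr

import mlpp_features

LOGGER = logging.getLogger(__name__)


def extract_features(
    data: Dict[str, xr.Dataset],
    feature_list: List[str],
    points: Tuple[List],
    reftimes: List[datetime],
    leadtimes: List[int],
) -> xr.Dataset:
    """Extract features from a given source."""
    ds = xr.Dataset()
    for feature in feature_list:
        LOGGER.info(f"FEATURE: {feature}")
        try:
            # Look up the pipeline by name on the mlpp_features package.
            pipeline = getattr(mlpp_features, feature)
            output = pipeline(data, points, reftimes, leadtimes, ds=ds)
        except Exception:
            # Skip failed pipelines so a stale or undefined `output` is never stored.
            LOGGER.exception(f"{feature} pipeline failed!")
            continue
        ds[feature] = output.chunk("auto").persist()
    LOGGER.info(ds)
    return ds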

It will also be easier to document how the two libraries interact since it will be just one object.

@dnerini thoughts?

dnerini commented 2 years ago

Hi @frazane, thanks for the nice suggestion. Indeed, the interface to mlpp-features is already defined as an xr.Dataset object (all pipelines return one). That said, I like the idea of moving the extract_features function to mlpp-features! Moreover, I think it could be interesting to refactor it as a class, say a FeatureStore class, and use it not only to return the feature dataset (as in the original function above), but also to discover and explore features, for example to retrieve the list of all input parameters required by a given list of features. What do you think?

frazane commented 2 years ago

Nice idea! A class with two main methods: extract and discover? And discover could be used from mlpp-workflows.
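
To make the proposal concrete, here is a minimal sketch of what a FeatureStore with extract and discover methods might look like. Nothing here is an existing mlpp-features API: the register decorator, the way pipelines declare their inputs, and the feature names are all hypothetical. To keep the sketch dependency-free, extract takes only data and returns a plain dict, whereas the real method would presumably also take points, reftimes and leadtimes and return an xr.Dataset.

```python
from typing import Any, Callable, Dict, List

Pipeline = Callable[[Dict[str, Any]], Any]


class FeatureStore:
    """Hypothetical sketch; not an existing mlpp-features class."""

    def __init__(self) -> None:
        self._pipelines: Dict[str, Pipeline] = {}
        self._inputs: Dict[str, List[str]] = {}

    def register(self, name: str, inputs: List[str]) -> Callable[[Pipeline], Pipeline]:
        """Decorator recording a pipeline and the input parameters it reads (assumed mechanism)."""

        def decorator(func: Pipeline) -> Pipeline:
            self._pipelines[name] = func
            self._inputs[name] = list(inputs)
            return func

        return decorator

    def discover(self, feature_list: List[str]) -> List[str]:
        """Return the sorted, de-duplicated input parameters needed by the given features."""
        params = set()
        for feature in feature_list:
            params.update(self._inputs[feature])
        return sorted(params)

    def extract(self, data: Dict[str, Any], feature_list: List[str]) -> Dict[str, Any]:
        """Run each requested pipeline and collect the results, skipping pipelines that fail."""
        out: Dict[str, Any] = {}
        for feature in feature_list:
            try:
                out[feature] = self._pipelines[feature](data)
            except Exception:
                # A real implementation would log the failure here.
                continue
        return out


store = FeatureStore()


# Illustrative pipelines with made-up names and inputs.
@store.register("wind_speed", inputs=["u_wind", "v_wind"])
def wind_speed(data: Dict[str, Any]) -> List[float]:
    return [(u**2 + v**2) ** 0.5 for u, v in zip(data["u_wind"], data["v_wind"])]


@store.register("temperature_celsius", inputs=["temperature_kelvin"])
def temperature_celsius(data: Dict[str, Any]) -> List[float]:
    return [t - 273.15 for t in data["temperature_kelvin"]]


needed = store.discover(["wind_speed", "temperature_celsius"])
features = store.extract({"u_wind": [3.0], "v_wind": [4.0]}, ["wind_speed", "temperature_celsius"])
```

With something like this, mlpp-workflows could call discover first to find out which input parameters to load, then pass the loaded data to extract, so changes to individual pipeline signatures would stay internal to mlpp-features.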