ecmwf / anemoi-datasets

Apache License 2.0

speed improvement #116

Open mariahpope opened 1 week ago

mariahpope commented 1 week ago

As I work with very large zarr stores in the cloud (such as an ERA5 dataset going back 60+ years), I have noticed that the framework gets hung up and takes a very long time to run even when I only want to build a small dataset from it (e.g. a couple of days or a month for a test). After some digging I noticed that the current setup loops through every time step in the dataset as it loads. This means that even if I only want a day or two of data from a 60+ year zarr store, it loops through the whole store regardless. Is this behavior intended/expected? Or could there be a fix to only look at data between the dates specified in the yaml?
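The cost difference is easy to see with a plain-Python sketch. The `times` index and `subset` helper below are hypothetical stand-ins (not anemoi-datasets code); they just illustrate why pre-selecting the requested range avoids touching every time step in the store:

```python
from bisect import bisect_left, bisect_right
from datetime import datetime, timedelta

# Hypothetical stand-in for a store's time index (one leap year, 6-hourly);
# a real 60+ year store would have hundreds of thousands of entries.
times = [datetime(1960, 1, 1) + timedelta(hours=h) for h in range(0, 24 * 366, 6)]

def subset(index, start, end):
    """Select only the timestamps in [start, end] via binary search on the
    sorted index, instead of scanning every time step in the store."""
    return index[bisect_left(index, start):bisect_right(index, end)]

wanted = subset(times, datetime(1960, 3, 1), datetime(1960, 3, 2))
# Only 5 six-hourly steps survive: Mar 1 at 00/06/12/18 and Mar 2 at 00
```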

A current idea would be to update lines 58-63 in `__init__.py` to subset the store to the time range of interest:

Current code:

    fs = XarrayFieldList.from_xarray(data, flavour)

    if len(dates) == 0:
        return fs.sel(**kwargs)
    else:
        result = MultiFieldList([fs.sel(valid_datetime=date, **kwargs) for date in dates])

Possible update/solution could look something like this:

    if len(dates) == 0:
        fs = XarrayFieldList.from_xarray(data, flavour)
        return fs.sel(**kwargs)

    # Clip the store to the requested time range before building the field list
    if len(dates) == 1:
        data = data.sel(time=dates[0])
    else:
        data = data.sel(time=slice(dates[0], dates[-1]))

    fs = XarrayFieldList.from_xarray(data, flavour)
    result = MultiFieldList([fs.sel(valid_datetime=date, **kwargs) for date in dates])

This would clip the store to just the first time step needed to initialize the dataset, and then clip to the time range specified in the yaml before the loading process starts. I tested this out and it seems to work, and it greatly speeds up the framework when working with subsets of large zarr stores. Is there a reason the code needs to process the whole store? If this update would be helpful I can gladly start a PR! Would love to hear your thoughts.
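The branching on `len(dates)` can be exercised in isolation. `FakeStore` and `clip_to_dates` below are illustrative stand-ins, not anemoi-datasets or xarray APIs; they just mimic the proposed dispatch (no dates, a single date, or a date range):

```python
from datetime import datetime

class FakeStore:
    """Toy stand-in for an xarray Dataset with a time coordinate."""
    def __init__(self, times):
        self.times = times

    def sel(self, time):
        # Mimic xarray's point selection and slice selection on "time"
        if isinstance(time, slice):
            return FakeStore([t for t in self.times if time.start <= t <= time.stop])
        return FakeStore([t for t in self.times if t == time])

def clip_to_dates(store, dates):
    """Mirror the proposed logic: no dates -> whole store,
    one date -> point selection, several -> slice over the range."""
    if len(dates) == 0:
        return store
    if len(dates) == 1:
        return store.sel(time=dates[0])
    return store.sel(time=slice(dates[0], dates[-1]))

store = FakeStore([datetime(2020, 1, d) for d in range(1, 32)])  # 31 daily steps
clipped = clip_to_dates(store, [datetime(2020, 1, 10), datetime(2020, 1, 12)])
# clipped.times now holds only Jan 10, 11, 12 instead of all 31 days
```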

Thank you!!

floriankrb commented 6 days ago

Thanks for the idea. Improving build speed is important for anemoi-datasets, and a PR that makes it more efficient would indeed be welcome.

It seems that the change you suggest would be more appropriate in the `XarrayFieldList` class: https://github.com/ecmwf/anemoi-datasets/blob/f5e09001dbae275cb5aec7f4bde0f420f45b9969/src/anemoi/datasets/create/functions/sources/xarray/fieldlist.py#L124

Or perhaps directly into earthkit.data, in `FieldList` (see `from earthkit.data.core.fieldlist import FieldList`)?

mariahpope commented 3 days ago

Thanks for the response and suggestion!

After some testing I wasn't able to create the desired solution within the `sel()` function, but was able to within the `from_xarray` function in the fieldlist script. I entered the following code here: https://github.com/ecmwf/anemoi-datasets/blob/80de4c6f6ef0d720b01fc1203211d66bcb4887e7/src/anemoi/datasets/create/functions/sources/xarray/fieldlist.py#L74

        if dates is not None:
            # The coordinate guesser finds the time coordinate even if it
            # is not named "time"
            time_coord_name = None
            for coord in ds.coords:
                c = guess.guess(ds[coord], coord)
                if c.is_time:
                    time_coord_name = c.name
            # Only subset if a time coordinate was actually found
            if time_coord_name is not None:
                if len(dates) == 1:
                    ds = ds.sel(**{time_coord_name: dates[0]})
                else:
                    ds = ds.sel(**{time_coord_name: slice(dates[0], dates[-1])})

This also requires adding `dates` as an optional argument to the `from_xarray` function. I would keep it optional in case dates are not provided in the yaml, and the coordinate guesser is necessary to make sure we have the correct name of the time coordinate (in case it is not actually named "time").
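Why the guesser matters can be sketched with plain dicts. This `guess_time_coord` is a simplified, hypothetical stand-in for `guess.guess`, not the actual anemoi-datasets code; it only checks for datetime-like values:

```python
from datetime import datetime

def guess_time_coord(coords):
    """Toy stand-in for the coordinate guesser: return the name of the
    first coordinate whose values look like datetimes, whatever it is called."""
    for name, values in coords.items():
        if values and isinstance(values[0], datetime):
            return name
    return None

coords = {
    "latitude": [10.0, 20.0],
    "valid_time": [datetime(2020, 1, 1), datetime(2020, 1, 2)],
}
found = guess_time_coord(coords)
# found is "valid_time", even though the coordinate is not named "time"
```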

Let me know your thoughts! Happy to submit a PR if a solution like this would be helpful. I have found it to greatly speed up load times from larger datasets.