ecmwf / earthkit-data

A format-agnostic Python interface for geospatial data
Apache License 2.0

Refactor the stream API #364

Closed: sandorkertesz closed this issue 4 months ago

sandorkertesz commented 5 months ago

Is your feature request related to a problem? Please describe.

The proposal is to change the usage of streams in the following way.

When stream=True, the returned object would be a Fieldlist (for GRIB data) that can be iterated only once:

from earthkit.data import from_source

ds = from_source("url", "http://..../my_data.grib", stream=True)

for f in ds:
    # f is now a Field
    pass

# at this point ds has consumed the stream
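
The "consumed" behaviour is the same as for any one-pass Python iterator. As a rough illustration only, here is a sketch that uses a plain generator in place of the GRIB stream. The `fake_stream` helper and the dict "fields" are hypothetical stand-ins, not part of earthkit-data:

```python
def fake_stream():
    """Hypothetical stand-in for a GRIB stream: yields one 'field' at a time."""
    for level in (1000, 850, 500):
        yield {"param": "t", "level": level}


ds = fake_stream()

first_pass = [f["level"] for f in ds]   # reads every message from the stream
second_pass = [f["level"] for f in ds]  # nothing left: the stream is consumed

print(first_pass)   # [1000, 850, 500]
print(second_pass)  # []
```

A second loop over the same object yields nothing, which is why operations that need the whole dataset would not be available on a plain stream.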

Iterating in batches would be a generic option, available for any source and not only for streams:

ds1 = from_source("file", "my_local_data.grib")
ds2 = from_source("url", "http://..../my_data.grib", stream=True)

for f in ds1.batched(2):
    # f is now a Fieldlist with 2 Fields
    pass

for f in ds2.batched(2):
    # f is now a Fieldlist with 2 Fields
    pass
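
Batching can work identically for in-memory and streamed data because it only needs an iterator. A minimal sketch of the idea, assuming plain Python lists as batches; the `batched` function below is a hypothetical helper, not the proposed earthkit-data method, and yielding a shorter final batch is an assumption that the proposal does not spell out:

```python
from itertools import islice


def batched(fields, n):
    """Yield lists of up to n items from any iterable (hypothetical helper)."""
    if n < 1:
        raise ValueError("n must be at least 1")
    it = iter(fields)
    while True:
        batch = list(islice(it, n))
        if not batch:
            return
        yield batch


in_memory = ["f0", "f1", "f2", "f3", "f4"]   # stands in for a file-based Fieldlist
streamed = (f"f{i}" for i in range(5))       # stands in for a stream

print(list(batched(in_memory, 2)))  # [['f0', 'f1'], ['f2', 'f3'], ['f4']]
print(list(batched(streamed, 2)))   # [['f0', 'f1'], ['f2', 'f3'], ['f4']]
```

Only `n` items are held at a time, so the streamed case never needs the full dataset in memory.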

group_by would behave in a similar way:

ds1 = from_source("file", "my_local_data.grib")
ds2 = from_source("url", "http://..../my_data.grib", stream=True)

for f in ds1.group_by("level"):
    # f is now a Fieldlist
    pass

for f in ds2.group_by("level"):
    # f is now a Fieldlist
    pass

Please note that for non-stream data, group_by would be based on the metadata of the full dataset. For a stream, however, each group would simply be built by consuming GRIB messages until the values of the metadata keys specified in group_by change. As a consequence, fields that share the same key values but are not adjacent in the stream would end up in separate groups.
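
To make the difference concrete, here is a small sketch with plain dicts standing in for GRIB messages. Both functions are hypothetical illustrations rather than earthkit-data code, and the group order in the non-stream case (first appearance of each key value) is an assumption:

```python
from itertools import groupby

# Hypothetical 'fields': plain dicts standing in for GRIB messages.
fields = [
    {"param": "t", "level": 500},
    {"param": "u", "level": 500},
    {"param": "t", "level": 850},
    {"param": "t", "level": 500},  # level 500 reappears later in the stream
]


def group_by_full(fields, key):
    """Non-stream case: group using metadata from the whole dataset."""
    groups = {}
    for f in fields:
        groups.setdefault(f[key], []).append(f)
    return list(groups.values())


def group_by_stream(stream, key):
    """Stream case: start a new group whenever the key value changes."""
    for _, group in groupby(stream, key=lambda f: f[key]):
        yield list(group)


full = group_by_full(fields, "level")
streamed = list(group_by_stream(iter(fields), "level"))

print([len(g) for g in full])      # [3, 1]
print([len(g) for g in streamed])  # [2, 1, 1]
```

With the full dataset, all three level-500 fields land in one group. With the stream, the late level-500 field forms its own group, so the two modes agree only when the data is already ordered by the group_by keys.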

We could read the whole stream into memory with the read_all option:

ds = from_source("url", "http://..../my_data.grib", stream=True, read_all=True)

# ds is now a Fieldlist in memory, so all of these work
len(ds)
r = ds.sel(param="t")

for f in ds:
    # f is now a Field
    pass

for f in ds.batched(2):
    # f is now a Fieldlist with 2 Fields
    pass
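
The effect of read_all can be pictured as materialising the iterator into a list. A rough sketch under that assumption; `open_stream` and the dict "fields" are hypothetical, and the list comprehension is only a loose analogue of `ds.sel(param="t")`:

```python
def open_stream(messages, read_all=False):
    """Hypothetical loader: a list if read_all is set, else a one-pass iterator."""
    it = iter(messages)
    return list(it) if read_all else it


raw = [
    {"param": "t", "level": 500},
    {"param": "u", "level": 500},
    {"param": "t", "level": 850},
]

ds = open_stream(raw, read_all=True)
print(len(ds))                                  # 3
t_only = [f for f in ds if f["param"] == "t"]   # loose analogue of ds.sel(param="t")
print(len(t_only))                              # 2
print(len(list(ds)), len(list(ds)))             # 3 3 (can be iterated repeatedly)

lazy = open_stream(raw)  # read_all=False: a one-pass iterator
try:
    len(lazy)
except TypeError:
    print("len() is not available on the lazy stream")
```

The trade-off is memory: the whole stream has to fit in memory before any of these operations can run.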