fsspec / kerchunk

Cloud-friendly access to archival data
https://fsspec.github.io/kerchunk/
MIT License

Represent (coordinate) variables "symbolically" #361

Open TomAugspurger opened 1 year ago

TomAugspurger commented 1 year ago

I'm working with a GRIB2 file, and am interested in minimizing the size of the references file. Currently, the largest values in the references come from the base64-encoded coordinates that were inlined in the references:

  'latitude/0': 'base64:AAAAAACAVkBmZmZ...',

This specific variable (and longitude, step, and perhaps time) can be represented "symbolically" (maybe not the right name), with something like a range(90, -90.1, -0.4).
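To make the size argument concrete, here is a rough sketch comparing an inlined base64 reference with a parametric description of the same coordinate. It assumes the inlined chunk is 451 uncompressed little-endian float64 values running from 90 to -90 in steps of 0.4; the `start`/`step`/`num` JSON layout is purely hypothetical and is not an existing kerchunk format.

```python
import base64
import json
import struct

# Assumption: 451 latitude values, 90 down to -90 in steps of 0.4,
# stored as uncompressed little-endian float64.
n = 451
values = [90.0 - 0.4 * i for i in range(n)]
raw = struct.pack(f"<{n}d", *values)

# The same "base64:" prefix style as the inlined reference shown above.
inlined = "base64:" + base64.b64encode(raw).decode("ascii")

# Hypothetical "symbolic" form: three numbers instead of 451 values.
symbolic = json.dumps({"start": 90.0, "step": -0.4, "num": n})

print("inlined reference length: ", len(inlined))
print("symbolic reference length:", len(symbolic))
```

Under these assumptions the symbolic form is a few dozen characters, against several thousand for the inlined string.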

My questions:

  1. Does something like this make sense to try?
  2. Does this belong in Zarr instead? It seems more generally useful for compressing the data, beyond just what Kerchunk inlines (though I'd still want it in Kerchunk, so that inlined references can benefit from it).

Somewhat annoyingly, there are floating point inaccuracies between what I get from np.arange and what's coming out of cfgrib. But hopefully those can be solved.
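A small sketch of the kind of mismatch described here, using only NumPy. It compares two ways of generating the same nominal grid rather than the actual cfgrib output, which is not reproduced; the 451-point count is inferred from the `range(90, -90.1, -0.4)` example above.

```python
import numpy as np

# Two ways of generating the "same" latitude grid.
lat_arange = np.arange(90, -90.1, -0.4)
lat_linspace = np.linspace(90, -90, 451)

# Any difference comes from how each function accumulates rounding error.
max_diff = np.abs(lat_arange - lat_linspace).max()

print("lengths:", len(lat_arange), len(lat_linspace))
print("largest absolute difference:", max_diff)
print("bitwise identical:", np.array_equal(lat_arange, lat_linspace))
print("equal within 1e-9:", np.allclose(lat_arange, lat_linspace, rtol=0, atol=1e-9))
```

Comparing with a tolerance (`np.allclose`) rather than exact equality is one way to decide whether a generated coordinate is "close enough" to the decoded one.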

martindurant commented 1 year ago

This is certainly something that kerchunk could do, effectively with our own codec that expands whatever representation we store into an array at read time. That would be simple for linear coordinates, but GRIB allows for many complex coordinate definitions. I suppose it's possible to extract the parameters of whatever the coordinate system is, but we probably don't want to implement the coordinate generation algorithms ourselves; ideally we would call the appropriate functions in eccodes itself, if we can.
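For the simple linear case, such a codec might look roughly like the sketch below. The class name, `codec_id`, and the JSON parameter layout are all hypothetical; the class only mimics the `encode`/`decode` shape of a numcodecs-style codec and is not an existing kerchunk or numcodecs class.

```python
import json

import numpy as np


class LinearRangeCodec:
    """Hypothetical codec storing (start, stop, num) instead of the values."""

    codec_id = "linear_range"  # made-up identifier, for illustration only

    @staticmethod
    def _expand(params):
        # Regenerate the coordinate values from the stored parameters.
        return np.linspace(params["start"], params["stop"], params["num"])

    def encode(self, values):
        values = np.asarray(values, dtype="float64")
        if values.ndim != 1 or values.size < 2:
            raise ValueError("expected a 1-D array with at least two points")
        params = {
            "start": float(values[0]),
            "stop": float(values[-1]),
            "num": int(values.size),
        }
        # Refuse anything the parameters cannot reproduce within tolerance.
        if not np.allclose(values, self._expand(params)):
            raise ValueError("values are not evenly spaced")
        return json.dumps(params).encode("utf-8")

    def decode(self, buf):
        # Expand the stored parameters into a full array at read time.
        return self._expand(json.loads(buf))


codec = LinearRangeCodec()
encoded = codec.encode(np.arange(90, -90.1, -0.4))
decoded = codec.decode(encoded)
print(len(encoded), decoded.shape)
```

Note that this round trip is only accurate to within the `np.allclose` tolerance, so it is not bit-exact; whether that is acceptable ties back to the floating point concern raised above.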

This all connects to the possibility of analytical coordinates in xarray. Perhaps we shouldn't be making arrays even at read time but making xarray indexes.

dcherian commented 1 year ago

There's a CF convention for that!

We could totally interpret those as a "functional xarray index" too.

martindurant commented 1 year ago

> There's a CF convention for that

(plus also the FITS WCS ways to define the same; you won't get these from geo-datasets, but I think they may be more general)

martindurant commented 1 year ago

People on this thread might be interested in the intake-stac sprint https://github.com/intake/intake-stac/issues/159

TomAugspurger commented 1 year ago

Thanks @dcherian. IIUC, the coordinate subsampling you linked to is essentially the same as range(0, 10, 1)? We just have two "tie points" (the first and last point) and then linearly interpolate between them?

Do you know if this decoding is implemented in cf-xarray or xarray.conventions.decode_cf_variable? I didn't see it at https://cf-xarray.readthedocs.io/en/latest/coding.html or in a glance at decode_cf_variable.

dcherian commented 1 year ago

It has not been implemented.

dcherian commented 1 year ago

> We just have two "tie points" (the first and last point) and then linearly interpolate between them?

Yes, I think so; that's why it clicked in my head. I don't know what you would do for all the other GRIB coordinate systems.
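The two-tie-point reading can be sketched in a few lines of plain Python. This is only an illustration of linear interpolation between a first and last point; `expand_tie_points` is a hypothetical helper and not an implementation of the CF coordinate subsampling convention, which covers more cases than this.

```python
def expand_tie_points(first, last, num):
    """Linearly interpolate `num` evenly spaced values from `first` to `last`."""
    if num < 2:
        raise ValueError("need at least two points to interpolate between")
    span = last - first
    # Point i sits a fraction i / (num - 1) of the way from first to last.
    return [first + span * i / (num - 1) for i in range(num)]


# Tie points 0 and 9 with 10 samples give the same values as range(0, 10, 1).
coords = expand_tie_points(0, 9, 10)
print(coords)
```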

martindurant commented 1 year ago

> We just have two "tie points"

This is also essentially the case in standard TIFF, but of course more complex geometries are possible in practice, and GRIB has many models.