fsspec / kerchunk

Cloud-friendly access to archival data
https://fsspec.github.io/kerchunk/
MIT License

Refactor file format backend openers #376

Open TomNicholas opened 1 year ago

TomNicholas commented 1 year ago

Problem

The API for Kerchunk's file format backend openers doesn't follow a consistent pattern.

Suggestion

Change the openers to each be a function returning a VirtualZarrStore (see #375), with standardized keyword arguments.
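As a rough illustration of what "a function returning a VirtualZarrStore, with standardized keyword arguments" could look like, here is a minimal sketch. Everything in it is an assumption made for illustration: the `VirtualZarrStore` dataclass is only a placeholder for the type proposed in #375, and `register_opener`, `open_virtual`, `open_hdf5`, `open_grib`, and `message_filter` are hypothetical names, not part of kerchunk.

```python
from dataclasses import dataclass, field
from typing import Any, Callable, Dict, Optional


@dataclass
class VirtualZarrStore:
    """Placeholder for the type proposed in #375; its real design is not settled."""
    refs: Dict[str, Any] = field(default_factory=dict)


# Hypothetical registry mapping a format name to its opener function.
_OPENERS: Dict[str, Callable[..., VirtualZarrStore]] = {}


def register_opener(name: str):
    """Decorator that records an opener under a format name."""
    def decorator(func):
        _OPENERS[name] = func
        return func
    return decorator


@register_opener("hdf5")
def open_hdf5(url: str, *, inline_threshold: int = 0,
              storage_options: Optional[dict] = None) -> VirtualZarrStore:
    # A real opener would scan the file; this stub only records its inputs.
    return VirtualZarrStore(refs={"_url": url, "_inline_threshold": inline_threshold})


@register_opener("grib")
def open_grib(url: str, *, inline_threshold: int = 0,
              storage_options: Optional[dict] = None,
              message_filter: Optional[dict] = None) -> VirtualZarrStore:
    # `message_filter` stands in for an option only this backend understands.
    return VirtualZarrStore(refs={"_url": url, "_filter": message_filter})


def open_virtual(url: str, *, backend: str, inline_threshold: int = 0,
                 storage_options: Optional[dict] = None,
                 **backend_kwargs: Any) -> VirtualZarrStore:
    """Call the registered opener with the shared arguments plus any extras."""
    try:
        opener = _OPENERS[backend]
    except KeyError:
        raise ValueError(f"no opener registered for {backend!r}") from None
    return opener(url, inline_threshold=inline_threshold,
                  storage_options=storage_options, **backend_kwargs)


store = open_virtual("example.grib2", backend="grib",
                     message_filter={"typeOfLevel": "surface"})
```

The shared arguments are keyword-only so that every opener can be called the same way, while backend-specific options pass through `**backend_kwargs` and fail loudly if the chosen opener does not accept them.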

Advantages

Implementation ideas

Questions

How to handle GRIB files? Combine before returning? Return as a hierarchy of multiple groups within a single store (like when opening with datatree)? Or return a list of VirtualZarrStores?

martindurant commented 1 year ago

I would first point out that there is a little bit of consistency injected via classes that call functions, e.g., kerchunk.grib2.GribToZarr is a class designed to feel similar to kerchunk.hdf.SingleHdf5ToZarr.

A general file dispatch system seems reasonable, possibly something that belongs in Intake 2 (which already tries to guess file types by URL pattern matching or reading magic bytes). We probably don't want to replicate work in pangeo-forge, though?
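To make the magic-bytes idea concrete, here is a small sketch of guessing a format from the first bytes of a file. It is not how Intake does it; the `guess_format` function and the format names it returns are hypothetical. The signatures themselves are the standard ones for HDF5, GRIB, classic NetCDF, and TIFF.

```python
from typing import Optional

# (signature, format name) pairs, checked against the start of the file.
_MAGIC = [
    (b"\x89HDF\r\n\x1a\n", "hdf5"),  # HDF5 (also NetCDF4)
    (b"GRIB", "grib"),               # GRIB edition 1 or 2
    (b"CDF\x01", "netcdf3"),         # NetCDF classic
    (b"CDF\x02", "netcdf3"),         # NetCDF 64-bit offset
    (b"II*\x00", "tiff"),            # TIFF, little-endian
    (b"MM\x00*", "tiff"),            # TIFF, big-endian
]


def guess_format(header: bytes) -> Optional[str]:
    """Return a format name if the header starts with a known signature."""
    for magic, name in _MAGIC:
        if header.startswith(magic):
            return name
    return None


print(guess_format(b"GRIB\x00\x00\x00\x02"))
```

This only checks offset zero. HDF5 files may carry a user block that pushes the signature to a later offset, and GRIB files can have leading bytes before the first message, so a real dispatcher would need more than this.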

Should there be some arguments that are valid for every backend (e.g. inline_threshold), and others that are specific to particular backends?

There are definitely operations that will be the same for all backends, like inlining.
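Inlining is a good example of a backend-independent step, because it only needs the reference set and not the original file format. The sketch below is not kerchunk's own implementation: `inline_small_chunks` is a hypothetical helper that reads local files only, and it assumes references shaped as `[path, offset, length]` with inlined binary data stored as a `base64:`-prefixed string.

```python
import base64
import tempfile


def inline_small_chunks(refs: dict, inline_threshold: int) -> dict:
    """Replace references to chunks smaller than the threshold with their bytes."""
    out = {}
    for key, ref in refs.items():
        is_pointer = isinstance(ref, list) and len(ref) == 3
        if is_pointer and ref[2] < inline_threshold:
            path, offset, length = ref
            with open(path, "rb") as f:
                f.seek(offset)
                data = f.read(length)
            out[key] = "base64:" + base64.b64encode(data).decode("ascii")
        else:
            # Already inlined, or too large to inline: keep as-is.
            out[key] = ref
    return out


# Demo with a throwaway local file standing in for an archival data file.
with tempfile.NamedTemporaryFile(delete=False) as tmp:
    tmp.write(b"0123456789")
    path = tmp.name

refs = {"a/0": [path, 2, 4], "a/1": [path, 0, 10]}
result = inline_small_chunks(refs, inline_threshold=5)
```

Here the 4-byte chunk `a/0` is inlined while the 10-byte chunk `a/1` stays a pointer. Because the function takes and returns a plain mapping, the same code could run after any opener.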

On virtual zarrs, this sounds like something between https://github.com/nsidc/earthaccess/pull/278 and a special xarray engine="scan-kerchunk". The trouble, as with everything kerchunk, is that there are many options (such as what to do with gribs...) and it becomes hard to specify them all in a reasonable way. Not all of kerchunk will be xarray friendly (and maybe not even zarr).