fsspec / kerchunk

Cloud-friendly access to archival data
https://fsspec.github.io/kerchunk/
MIT License

Refactor MultiZarrToZarr into multiple functions #377

Open TomNicholas opened 1 year ago

TomNicholas commented 1 year ago

Problem

MultiZarrToZarr is extremely powerful but rather hard to use.

This is important - kerchunk has been transformative, so we increasingly recommend it as the best way to ingest large amounts of data into the pangeo ecosystem's tools. However, that means we should make sure the kerchunk user experience is smooth, so that new users don't get stuck early on.

Part of the problem is that this one MultiZarrToZarr function can do many different things. Contrast with xarray - when combining multiple datasets into one, xarray takes some care to distinguish between a few common cases/concepts (we even have a glossary):

  1. Concatenation along a single existing dimension. Achieved by xr.concat where dim is a str
  2. Concatenation along a single new dimension (optionally providing new coordinates to use along that new dimension). Achieved by xr.concat where dim is a set of values
  3. Merging of multiple variables which already share dimensions, first aligned according to their coordinates. Achieved by xr.merge
  4. "Combining" by order given, which means some ordered combination of concatenation along one or more dimensions and/or merging. Achieved by xr.combine_nested
  5. "Combining" by coordinate order, which again means some ordered combination of concatenation along one or more dimensions and/or merging, but the order is specified by information in the datasets' coordinates. Achieved by xr.combine_by_coords

In kerchunk it seems that the recommended way to handle operations resembling all 5 of these cases is through MultiZarrToZarr. It also cannot currently easily handle certain types of multi-dimensional concatenation.

Suggestion

Break up MultiZarrToZarr by defining a set of functions similar to xarray's merge/concat/combine/unify_chunks that consume and produce VirtualZarrStore objects (EDIT: see https://github.com/fsspec/kerchunk/issues/375).

Advantages

Questions

ivirshup commented 1 year ago

I've been thinking of something similar. I'm not sure we can get all the flexibility we want out of just two functions, but I think a couple of classes may do the trick. I've been thinking of an API like:

g = kerchunk.KerchunkGroup()
# issubclass(KerchunkGroup, MutableMapping[str, KerchunkGroup | KerchunkArray])

# Concatenation + assignment, following array-api
g["x"]: kerchunk.KerchunkArray = kerchunk.concat([...], axis=0)

# Write a manifest
g.to_json(...)

# Getting a zarr group:
g.to_zarr()

# Reading from a manifest
g2 = kerchunk.KerchunkGroup.from_json(...)

# Assignment
g2["new_group"] = g

# Single level merging
g.update(g2)

# Add an array from a different store:
g.create_group("my").create_group("new")
g["my"]["new"]["array"] = zarr.open("matrix.zarr"): zarr.Array

Then leave anything more complicated to something like xarray, which could re-use its combination logic on a collection of KerchunkArrays.

This suggestion has a lot of the functionality already in concatenate_arrays and merge_vars, but:

This idea may be more of a zarr thing than a kerchunk thing though.


Also, I think you may be missing a reference to https://github.com/fsspec/kerchunk/issues/375 under the "Suggestion" heading.

rabernat commented 1 year ago

Right now we are using Kerchunk as a workaround for things that Zarr itself does not support, by implementing various types of indirection on chunk keys at the Store level. I think this is a great way to prototype new possibilities. The downside is that only kerchunk-aware clients can ever hope to read these datasets. Also, from a design point of view, stores are technically ignorant of the Zarr data model (arrays, groups, etc.); they are formally just low-level key value stores.

So when it comes time to refactor those prototypes into more stable interfaces, I think we should seriously consider upstreaming some of these features into Zarr itself, possibly via new Zarr extensions. The advantage of that approach is that

  1. We can implement these features via a specification, with an aim for multi-language support
  2. We can leverage Zarr's understanding of the data model (arrays, groups, chunks, dimensions, etc.) to make the implementation cleaner

I believe that all 5 of the features above could be implemented in this way. In particular, the concept of a "chunk manifest": a document which stores information about where to find the chunks for an array (rather than assuming they follow the Zarr chunk encoding scheme) would get us very far. The outcome of this would be a Zarr array with much more powerful virtualization capabilities, something between Xarray's duck array types and the current Kerchunk situation.
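For illustration only (the actual schema would be defined by the spec work described here), such a manifest could be as simple as a mapping from chunk key to the location of its bytes, much like kerchunk's reference format today:

# hypothetical chunk manifest for a virtual array stitched together from a1 and a2;
# each entry says which file, byte offset, and length holds that chunk's data
manifest = {
    "0.0": {"path": "s3://bucket/array1.zarr/0.0", "offset": 0, "length": 4096},
    "1.0": {"path": "s3://bucket/array2.zarr/0.0", "offset": 0, "length": 4096},
}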

We would be eager to collaborate on this and have some work in this direction planned for this quarter.

rabernat commented 1 year ago

To be more concrete about what I mean, we could imagine writing code like this

import zarr

a1 = zarr.open("array1.zarr")  # zarr.Array
a2 = zarr.open("array2.zarr")  # zarr.Array 

# this would create a Zarr array with the correct extensions to support indirection to a1 and a2
# (need to figure out the right API)
a3 = zarr.concat([a1, a2], axis=0, store="array3.zarr")

a3[0] = 1  # write some data...automatically goes into the correct underlying array

ivirshup commented 1 year ago

We can implement these features via a specification, with an aim for multi-language support

We definitely want to see multi language support (cc: @manzt)

Do you have thoughts on what extensions to zarr would need to be made? Definitely some sort of specification for being able to pass around manifests. But maybe this means needing to define some set of stores as well?

Does mapping an HDF5 file to zarr fit as a zarr extension? As a high-priority one?


we could imagine writing code like this

a3 = zarr.concat([a1, a2], axis=0, store="array3.zarr")

My idea currently would be that you don't need to assign to a store on concatenation. I think this could be a purely in memory object and you resolve everything at access or storage time.

I'd also even say I'm not sure a "reference" based array should be an instance or subclass of zarr.Array. Maybe a sibling class, or just a shared interface would work. I'm thinking of use cases like creating virtual arrays which don't use the exact same codecs (like HDF5 virtual datasets allow).

a3[0] = 1  # write some data...automatically goes into the correct underlying array

I've mostly been thinking about read access at the moment. Not sure how this complicates things.

We would be eager to collaborate on this and have some work in this direction planned for this quarter.

Me too!

martindurant commented 1 year ago

I've mostly been thinking about read access at the moment. Not sure how this complicates things.

read-only is of course easier, but also just a special case of read-write. Our arraylake friends have a well-defined model of where and how writes should go regardless of where the original mapper/chunks were defined.

manzt commented 1 year ago

We definitely want to see multi language support (cc: @manzt)

Awesome. Yah, something that should be very possible with zarrita.js.

martindurant commented 1 year ago

Repeating https://observablehq.com/@manzt/ome-tiff-as-filesystemreference using some of these newer ideas would be awesome :) (and note that that original example was not, of course, a standard xarray-compatible dataset, but a subscale pyramid)

manzt commented 1 year ago

Sorry I've been out of the kerchunk loop for a bit but I've been trying to stay abreast of changes. zarrita.js has an HTTP-based reference store (and will try to map references with s3/gcs protocols to http endpoints). This higher-level virtualization of zarr is very exciting.

martindurant commented 1 year ago

(sorry I haven't yet had the chance to review this)

TomNicholas commented 1 year ago

Thanks @rabernat - I think the proposal of defining how arrays should be concatenated at the Zarr level is really cool, and I would be interested to help. I also think that working this out consistently at the Zarr level would help expose the ways in which the current kerchunk API brushes subtleties under the rug.

One of these subtleties actually just came up for me in practice in #386 - I got confused by whether concatenation in Kerchunk was supposed to follow xarray rules or lower-level array-like rules instead.

We therefore have to be really clear about our type systems, as we now have at least 3 different data structures for which concatenation would be defined differently:

  1. numpy-like arrays - no dimension names, no index-backed coordinates, so concatenation just follows axis number
  2. zarr (v3) arrays - has dimension_names, but no index-backed coordinates, so how should concatenation work?
  3. xarray objects - has dimension names and index-backed coordinates, so concatenation follows dimension name but has to make exceptions for index-backed coordinates (see https://github.com/fsspec/kerchunk/issues/386#issuecomment-1792746248).
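For cases 1 and 3, the difference is easy to demonstrate (toy example):

import numpy as np
import xarray as xr

a = np.zeros((2, 3))
b = np.zeros((2, 3))
np.concatenate([a, b], axis=0)             # (4, 3): purely positional

da1 = xr.DataArray(a, dims=("y", "x"))
da2 = xr.DataArray(b.T, dims=("x", "y"))   # same data, axes in a different order
xr.concat([da1, da2], dim="y")             # (4, 3): aligned by dimension name, not position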

If we could get this straight it would be really powerful though. Also then ultimately kerchunk might not need to deal with combining at all, it would become exclusively a collection of zarr readers for different legacy file formats.

dcherian commented 1 year ago

zarr (v3) arrays - has dimension_names

Aren't dimension names optional? In which case these should just behave like numpy arrays?

From https://zarr-specs.readthedocs.io/en/latest/v3/core/v3.0.html#dimension-names

If specified, must be an array of strings or null objects with the same length as shape. An unnamed dimension is indicated by the null object. If dimension_names is not specified, all dimensions are unnamed.

TomNicholas commented 1 year ago

Aren't dimension names optional?

Oh I guess so!

In which case these should just behave like numpy arrays?

But obviously propagating the dimension_names in a transparent and consistent way. Doesn't that potentially raise problems already? Like

z1 = zarr.Array(data_1d, dim_names=['t'])
z2 = zarr.Array(data_1d, dim_names=['x'])

zarr.concat([z1, z2], axis=0)  # well-defined numpy-like operation, should return a 1D array, but what dim_name should the result have?

I'm just saying that having Zarr array type(s) that explicitly encode these propagation rules would be far better than implicitly doing this stuff in the depths of kerchunk.

EDIT: It reminds me of this suggestion https://github.com/data-apis/array-api/issues/698#issuecomment-1791959983

martindurant commented 1 year ago

I am weakly against extracting the logic and putting it in zarr. Zarr-python has proven to be very slow moving compared to the needs of kerchunk, and I feel like we probably will have a wider set of cases we want to cover than zarr wants. I am aware that to some extent we end up replicating stuff in xarray (like the encoding stuff), but I don't mind that for the sake of being explicit. What we really need is a good set of test cases to describe expected behaviour.

I would also point out that kerchunk aims to be useful beyond zarr, and that it will be a place to test out cutting edge zarr features (like var-chunks) too.

I wonder how we can get together kerchunk-interested people? Do we need a dedicated slack or discord or something?

ivirshup commented 1 year ago

But obviously propagating the dimension_names in a transparent and consistent way. Doesn't that potentially raise problems already?

I think it's reasonable to let people do concatenation like numpy arrays regardless of what the dimension names are. Following the array API as much as possible would be a good idea, just to reduce the amount of new things one needs to learn.

Could something like the merge logic in xarray be used for handling dimension names and other metadata (i.e. anything in attrs)? E.g. rules like "unique", "first"?
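For comparison, xarray already exposes attrs-handling rules of that kind through a combine_attrs argument, e.g.:

import xarray as xr

ds1 = xr.Dataset(attrs={"source": "file1.nc", "units": "K"})
ds2 = xr.Dataset(attrs={"source": "file2.nc", "units": "K"})

# keep attrs that agree, silently drop the ones that conflict
xr.merge([ds1, ds2], combine_attrs="drop_conflicts")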


I've been thinking about operations other than concatenation that are achievable with kerchunk:

  • Transpose/permute dims (re-order chunks + add a transpose codec)
  • Rename dimensions
  • Modify metadata
  • Limited subsetting

It would be nice to have these operations exposed in a way that doesn't require rewriting actual chunk data.


I wonder how we can get together kerchunk-interested people? Do we need a dedicated slack or discord or something?

If there was a zarr zulip/ discord/ slack maybe kerchunk could fit in there?

martindurant commented 1 year ago

I think it's reasonable to let people do concatenation like numpy arrays regardless of what the dimension names are. Following the array API as much as possible would be a good idea, just to reduce the amount of new things one needs to learn.

I second this, especially given that not all HDF or other files will be netCDF compliant anyway, and it would be good to have an opinion on what to do with them.

dcherian commented 1 year ago

  • Transpose/permute dims (re-order chunks + add a transpose codec)
  • Rename dimensions
  • Modify metadata
  • Limited subsetting

Xarray already has an array wrapper type that does all this except concatenation and metadata handling: https://github.com/pydata/xarray/issues/5081.

dcherian commented 1 year ago

For a long time I've felt like there's code we can reuse from dask for the concatenation piece. A LazilyConcatenatedArray is really just a chunked array. And then when coerced to numpy, use dask's _concatenate3 or np.block.
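For instance, np.block already assembles a full array from a nested list of chunk blocks (toy example):

import numpy as np

# chunks of a virtual array laid out on a 2x2 chunk grid, ordered by position
chunks = [[np.ones((2, 3)), np.zeros((2, 2))],
          [np.zeros((3, 3)), np.ones((3, 2))]]

np.block(chunks)  # materializes the full (5, 5) array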

martindurant commented 1 year ago

From kerchunk's point of view, though, we want to figure out where references belong, rather than building a dask graph. The code would probably be quite different?

I was wondering whether dask concatenation could get around the "different chunking" problem beyond var-chunks; i.e., each input dataset is first rechunked to a common grid and then concatenated, making an aggregate of aggregate datasets. That would be the only thing you could do in some circumstances. If the dask chunks are bigger than the native chunks, it may even be reasonably performant.

TomNicholas commented 1 year ago

Xarray already has an array wrapper type that does all this except concatenation and metadata handling:

A LazilyConcatenatedArray is really just a chunked array.

This is a really useful observation, and I agree we could probably re-use code from xarray & dask. I think we should avoid a dependency on either though.

EDIT: Avoiding a dependency on xarray would be good motivation to create the standalone lazy arrays package.

TomNicholas commented 1 year ago

I am weakly against extracting the logic and putting it in zarr. Zarr-python has proven to be very slow moving compared to the needs of kerchunk, and I feel like we probably will have a wider set of cases we want to cover than zarr wants.

I would also point out that kerchunk aims to be useful beyond zarr, and that it will be a place to test out cutting edge zarr features (like var-chunks) too.

To satisfy both of these concerns at once we would need to make a 3rd repository just for VirtualZarr. I suppose that could then eventually be moved upstream into Zarr. Not sure what people think about that idea.

martindurant commented 1 year ago

Why not make that thing here in kerchunk?

TomNicholas commented 1 year ago

Why not make that thing here in kerchunk?

Maybe it could, but I'm imagining there might be other uses for the VirtualZarr idea that aren't just in the kerchunk package? e.g. the ArrayLake stuff?

I do also think now that it would be conceptually neat to separate a concatenateable VirtualZarr class (/ "chunk manifest") from the kerchunk code that actually reads specific legacy file formats. I don't really have a strong opinion on this though.

martindurant commented 1 year ago

Since the kerchunk package doesn't depend on other things, I think it might be a good place, perhaps just to start. I am happy to include more repo collaborators, if that's a blocker.

ivirshup commented 1 year ago

Xarray already has an array wrapper type that does all this except concatenation and metadata handling: https://github.com/pydata/xarray/issues/5081.

@dcherian I think the indexing capabilities of LazilyIndexedArray may be too powerful here. My understanding was that it can handle indexing statements that kerchunk can't, since kerchunk needs all indexing to basically be picking chunks.

Though I would really love to see a split-out LazilyIndexedArray. May be able to solve a big class of issues we have over on anndata.

I do also think now that it would be conceptually neat to separate a concatenateable VirtualZarr class (/ "chunk manifest") from the kerchunk code that actually reads specific legacy file formats. I don't really have a strong opinion on this though.

I would have some interest in the in-memory class being something that can be written to a chunk manifest, but not necessarily containing a chunk manifest. Some things I think this could help with:

Why not make that thing here in kerchunk?

I think it would make sense to have some prototyping before adding these to an established library.

TomNicholas commented 9 months ago

If we have a zarr.Array/kerchunk.Array class that supports concatenation following the array API, could we imagine just using xarray as is to replace MultiZarrToZarr?

We could make a kerchunk backend for xarray that returns a zarr.Array, the user uses familiar xarray API to concatenate/merge everything together, then kerchunk provides a utility to serialize the resulting reference files. Users could literally just open their netCDF files in the same way they normally do:

ds = xr.open_mfdataset(
    '/my/files*.nc',
    engine='kerchunk',  # kerchunk registers an xarray IO backend that returns zarr.Array objects
    combine='nested',  # 'by_coords' would require actually reading coordinate data
    parallel=True,  # would use dask.delayed to generate reference dicts for each file in parallel
)

ds  # now wraps a bunch of zarr.Array / kerchunk.Array objects directly

ds.kerchunk.to_zarr(store='out.zarr')  # kerchunk defines an xarray accessor that extracts the zarr arrays and serializes them (which could also be done in parallel if writing to parquet)

Pros:

Cons:

martindurant commented 9 months ago

Additional con: that would require opening and reading potentially thousands of reference sets at open_dataset time. If they all formed a regular grid such that you know where everything goes without opening them (from some naming convention or other manifest), well then you've done MultiZarrToZarr already.

TomNicholas commented 9 months ago

opening and reading potentially thousands of reference sets

Doesn't that step have to happen however we do this?

If they all formed a regular grid such that you know where everything goes without opening them

The various keyword arguments to xr.open_mfdataset allow you to distinguish between the cases of (1) I am confident that the data lies in a regular grid, please just lazily open and concatenate everything, (2) I am not confident, so please load coordinates and check they align etc. before you concatenate anything.

then you've done MultiZarrToZarr already.

In what sense? You still need to actually create the correct json reference file, or are you saying that should be trivial if you are 100% confident in the layout of the files a priori?

martindurant commented 9 months ago

Doesn't that step have to happen however we do this?

In the kerchunk model, it happens once; in the concat model, it happens every time.

You still need to actually create the correct json reference file, or are you saying that should be trivial if you are 100% confident in the layout of the files a priori?

Both: if you need a manifest, someone has to make this, and kerchunk already can for a range of cases. xarray can do it for a set of already-open datasets but it cannot persist the overall dataset state. Do we need a third way?

If we can infer position from names (or something other that is simple), then making the manifest/reference set is indeed trivial. https://nbviewer.org/github/fsspec/kerchunk/blob/main/examples/earthbigdata.ipynb was an example of the latter (4D and 5D data), no sub-file coord decode required (but the parameter space is truly massive, and a pre-pass to see which parts are populated was really helpful).

ivirshup commented 9 months ago

You still need to actually create the correct json reference file

if you need a manifest, someone has to make this, and kerchunk already can for a range of cases. xarray can do it for a set of already-open datasets but it cannot persist the overall dataset state. Do we need a third way?

I would think that the kerchunk.Array or concatenated zarr.Array inside the xarray object could be accessed, which would then create the manifest. So the underlying manifest creation is still handled by kerchunk (or zarr if this logic ends up there), but one could use the xarray api to do the manipulation.

If this is possible, I think it would be a very nice composition of library functionality.
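Roughly, in pseudocode (every name here is hypothetical):

# the user does all the combining through xarray...
combined = xr.concat([ds_a, ds_b], dim="time")

# ...then pulls the virtual array back out of the xarray wrapper, and the
# manifest/reference creation stays in kerchunk (or zarr)
virtual = combined["temperature"].data     # a kerchunk.Array-like object
virtual.to_json("combined_refs.json")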

martindurant commented 9 months ago

would think that the kerchunk.Array or concatenated zarr.Array inside the xarray object could be accessed, which would then create the manifest. So the underlying manifest creation is still handled by kerchunk (or zarr if this logic ends up there), but one could use the xarray api to do the manipulation.

Could you perhaps describe in more detail how you see the pieces fitting together? I am having trouble envisioning this.

TomNicholas commented 9 months ago

opening and reading potentially thousands of reference sets

Doesn't that step have to happen however we do this?

In the kerchunk model, it happens once; in the concat model, it happens every time.

I'm not sure I understand. Every time you do what? Open the data?

My understanding is that kerchunk currently reads each legacy file once, does the equivalent of concatenation / checking alignment in memory once (creating longer reference dict objects), then serializes those to json once. I'm suggesting the same set of steps, but just changing the interface so that the concatenation happens behind the abstraction of a kerchunk.Array/zarr.Array. The result should be similar performance, less feature duplication, and users able to use more familiar API via xarray wrapping.
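For context, that once-only workflow looks roughly like this with today's kerchunk API (paths and concat_dims are illustrative):

import json
import fsspec
from kerchunk.hdf import SingleHdf5ToZarr
from kerchunk.combine import MultiZarrToZarr

urls = ["s3://bucket/file1.nc", "s3://bucket/file2.nc"]   # illustrative

# 1. read each legacy file once, producing a per-file reference dict
single_refs = []
for u in urls:
    with fsspec.open(u) as f:
        single_refs.append(SingleHdf5ToZarr(f, u).translate())

# 2. combine the references in memory once
combined = MultiZarrToZarr(single_refs, concat_dims=["time"]).translate()

# 3. serialize the combined reference set once
with open("combined.json", "w") as f:
    json.dump(combined, f)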

If we can infer position from names (or something other that is simple), then making the manifest/reference set is indeed trivial

It might be trivial in the algorithmic sense, but I would also like it to be trivial in the usability sense. That's the strength of xr.open_mfdataset - most of the time the user just points it at a glob of files and it "just works".

Could you perhaps describe in more detail how you see the pieces fitting together?

The simplest way to imagine making this work is defining a kerchunk.Array class (similar to https://github.com/fsspec/kerchunk/issues/375) which:

  1. Has array creation routines that read from legacy file formats (perhaps consolidating them similar to https://github.com/fsspec/kerchunk/issues/376),
  2. Stores a reference dict on each kerchunk.Array instance,
  3. Defines concatenation following the array API, implemented by perhaps calling MultiZarrToZarr under the hood and creating a new kerchunk.Array instance storing the newly-concatenated reference dict,
  4. Has a .to_json method for serializing the combined reference dict after concatenation.

This doesn't seem that hard? You could prototype this without having to wait for changes to the Zarr spec (even though I personally still think that if we can later move this upstream to zarr via a chunk manifest spec we should), it doesn't require indexing to be supported, or pulling out xarray's lazy indexing functionality.
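A rough sketch of what that could look like (the class name, attributes, and the uniform-chunking assumption are all illustrative; this shifts chunk keys directly rather than calling MultiZarrToZarr under the hood):

import json
from dataclasses import dataclass

@dataclass
class KerchunkArray:
    # Hypothetical sketch, not an existing kerchunk class.
    shape: tuple    # array shape
    chunks: tuple   # chunk shape, assumed uniform and evenly dividing `shape`
    refs: dict      # "i.j.k" chunk key -> [url, offset, length]

    def concat(self, other, axis=0):
        """Concatenate along `axis` by re-keying the other array's chunk grid."""
        assert self.chunks == other.chunks
        assert self.shape[axis] % self.chunks[axis] == 0
        offset_chunks = self.shape[axis] // self.chunks[axis]
        new_refs = dict(self.refs)
        for key, ref in other.refs.items():
            idx = [int(i) for i in key.split(".")]
            idx[axis] += offset_chunks            # shift other's chunks along `axis`
            new_refs[".".join(map(str, idx))] = ref
        new_shape = list(self.shape)
        new_shape[axis] += other.shape[axis]
        return KerchunkArray(tuple(new_shape), self.chunks, new_refs)

    def to_json(self, path):
        """Serialize the combined references in a kerchunk-style layout."""
        with open(path, "w") as f:
            json.dump({"version": 1, "refs": self.refs}, f)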

The main point is that both this kerchunk.Array suggestion and the zarr "chunk manifest" suggestion could be treated similarly by xarray (xarray doesn't have to care in which library the chunk indirection actually occurs), and hence simplify things for users.

TomNicholas commented 9 months ago

The simplest way to imagine making this work is defining a kerchunk.Array class

I had a go at this and it basically works. I didn't quite get the whole way because of an xarray issue with indexes, but I think this is a proof of principle.

Notebook here

EDIT: The KerchunkArray/zarr.Array ideas are both examples of what we might call a "ConcatenatableArray" object from xarray's perspective - see https://github.com/pydata/xarray/issues/8699#issuecomment-1925916420.

martindurant commented 9 months ago

There is a lot to look at in the notebook, and I probably don't understand xarray well enough to make too much sense of it - I hope others can have a look.

People have asked for ways to auto-kerchunk a dataset on open with xr, and you've certainly covered that (note that engine="kerchunk" is already used for passing in reference sets, but overloading it might be OK).

_concat_arr_reference_dicts: "This is the minimum I needed to get this one example to work - expect it to break on practically anything else!"

I think this is the nub. You use the xarray API, but the concat actually happens here, no? So I'm not seeing what we're getting from xarray's concat logic at all, it's just another place to put code that already works in kerchunk. Actually getting coordinates out of a given reference set (pre-scanned) is fast, so how does xarray "check" and sort those for us? The inverse function recreate_zarr_refs is also highly specific to this combine workflow.

TomNicholas commented 9 months ago

There is a lot to look at in the notebook

The main point is that it's possible to abstract away everything kerchunk.combine does and think of it instead as xarray concatenation/merging of a new KerchunkArray class.

People have asked for ways to auto-kerchunk a dataset on open with xr, and you've certainly covered that

This is a bit different from what Andrew suggested in https://discourse.pangeo.io/t/making-kerchunk-as-simple-as-a-toggle/369. His engine='kerchunk' is opening the data as numpy-backed xarray objects but creating the references automatically along the way. My suggestion never opens the data as numpy at all, and allows users to manipulate the chunks using xr.concat/merge before they choose which reference sets to save out (so it would solve issues like #362, where xarray's UI can express things that kerchunk's combine UI can't).

You use the xarray API, but the concat actually happens here, no?

Yes KerchunkArray is responsible for the array concat. But that's all - xarray now handles merging variables, dealing with dimension / variable names, and combining attrs, all of which MultiZarrToZarr currently has to handle.

So I'm not seeing what we're getting from xarray's concat logic at all, it's just another place to put code that already works in kerchunk.

Concatenation of reference sets does already work in kerchunk, but I originally raised this issue because the user experience could be greatly improved. Refactoring to use xarray's UI instead gets us:

  1. Massive usability improvements, as users can avoid learning an entirely separate concatenation interface,
  2. Reduced maintenance burden on kerchunk (through obviating the need for the entire kerchunk.combine module),
  3. The ability to combine in ways that MultiZarrToZarr can't handle,
  4. A much clearer data model within kerchunk (this really solves #375 too),
  5. Separation of concerns between libraries,
  6. Easier testing (just test that concatenating KerchunkArrays then reading from the resultant reference set gives the same result as opening as numpy arrays and concatenating directly),
  7. (Later) free parallelization using whichever parallelization framework xarray can wrap (so dask + also cubed etc.)

Actually getting coordinates out of a given reference set (pre-scanned) is fast, so how does xarray "check" and sort those for us?

I'm not totally sure I understand what you mean (because it's not super clear to me what checks kerchunk performs and when), but we could allow xarray to read coordinates, build indexes, and use those to automatically sort datasets. That's what xr.combine_by_coords does.

The inverse function recreate_zarr_refs is also highly specific to this combine workflow.

It's specific to assuming you have represented your zarr references as xarray-wrapped KerchunkArrays yes.


I mean of course this is a big suggestion, and we don't have to do any of this, but I just think it's a pretty good idea 🤷‍♂️

martindurant commented 9 months ago

The ability to combine in ways that MultiZarrToZarr can't handle,

This would be the killer motivation for me (gets around many places where the non-existence of variable chunking in zarr is fatal). How complex would it be to demonstrate?

TomNicholas commented 9 months ago

This would be the killer motivation for me (gets around many places where the non-existence of variable chunking in zarr is fatal).

To be clear: you're asking if we could use this approach to concatenate datasets of inconsistent chunk length, and then access them via fsspec, all before the variable-length chunks ZEP is accepted?

How complex would it be to demonstrate?

If you can give me a specific example I can have a go.

martindurant commented 9 months ago

To be clear: you're asking if we could use this approach to concatenate datasets of inconsistent chunk length, and then access them via fsspec, all before the variable-length chunks ZEP is accepted?

Specifically, where each input is a legal zarr, but perhaps the last chunk is not the same size as the others, or the chunk sizes from one input to the next are not the same.

TomNicholas commented 9 months ago

Specifically, where each input is a legal zarr, but perhaps the last chunk is not the same size as the others, or the chunk sizes from one input to the next are not the same.

Okay so I think you could make general variable-length chunking work just by generalizing the .zarray/chunks attribute in each array to be like an explicit list of the lengths of each chunk along each axis (i.e. what dask.array.chunks stores). Then you just slightly generalize the concat method I defined to handle that case. (This idea is now very similar to what @dcherian was talking about with LazilyConcatenatedArray above.)

This would obviously not be a valid set of zarr attributes (until a variable-length chunks ZEP was accepted), but you could make writing these variable-length chunks .zattrs opt-in during the recreate_zarr_refs step.
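As an illustration of what that generalized chunks entry would record, dask already tracks exactly this (toy in-memory arrays here):

import dask.array as da
import numpy as np

a = da.from_array(np.ones((13, 10)), chunks=(5, 10))   # chunk lengths (5, 5, 3) along axis 0
b = da.from_array(np.ones((8, 10)), chunks=(4, 10))    # chunk lengths (4, 4)

c = da.concatenate([a, b], axis=0)
c.chunks   # ((5, 5, 3, 4, 4), (10,)) - the explicit per-axis chunk lengths such a
           # generalized "chunks" entry would need to store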

martindurant commented 9 months ago

you could make general variable-length chunking work just by generalizing the .zarray/chunks attribute in each array to be like an explicit list of the lengths of each chunk along each axis

I did this within zarr as POC in https://github.com/zarr-developers/zarr-python/pull/1483 (see also https://github.com/zarr-developers/zarr-python/issues/1595 )

This all brings us to the same conversation: this general logic of which chunk goes where could be implemented in the storage layer (kerchunk/fsspec), zarr or xarray. xarray has its complex indexers, zarr has code simplicity and a natural place to store the chunks, and the storage layer is the only one where we can iterate very quickly and control everything. What we definitely don't want, though, is to repeat the work in multiple places in different ways :|

rabernat commented 9 months ago

There is lots of prior art here that would be worth studying. Specifically: NCML.

martindurant commented 9 months ago

Coming back to the OP, and ignoring the virtual arrays conversation: it occurs to me that the first 4 bullets are essentially special cases of what MZZ does. We could make them separate, simpler functions, or we could provide simple wrapper function APIs that pass the right info to MZZ.

TomNicholas commented 8 months ago

(Copying Martin's responses to some Q's of mine over on #386)

TN: Why are there multiple passes over the data [in MultiZarrToZarr's implementation]?

MD: The first is to find all the coordinate values in all of the chunks, so that in the second pass we can thereafter know, for a particular value, where in the chunk grid it fits. When using concat([...]), this ordering (on exactly one dimension) is implicit in the order of the provided list.

TN: is there a reason why MultiZarrToZarr doesn't seem to use either concatenate_arrays or merge_vars

MD: Those are much more specific and inflexible - useful in some limited circumstances. What MZZ does cannot be cleanly decomposed into a set of steps like that.

MD: To the larger point: MZZ tries to form a sensible aggregate dataset out of all the inputs, by finding the set of coordinates in each chunk of each input (where "coordinate" can include "variable name"). This includes the various combine options of open_mfdataset, which was the inspiration.

MD: If you wanted a decomposition of what MZZ does into separate functions, you would need the user to figure out which files belong together for each subconcat along x, which then get concatenated in a separate step along y. The user would need to know the groupings and the order all the way down. That's essentially how dask's stack and concat methods work. The idea, instead, is to be able to infer those values and figure out the whole grid in one operation, whether the origin of the coordinates is variables within each dataset, file names or something else.

MD: For the specific issue here, it sounds like the logic of "is this a coordinate" isn't quite right, and given the right reproducer for testing, we should be able to address it easily.

MD: (None of this is to say that having concat/stack/merge functions or array-standard methods, which can be applied tree-wise is a bad idea; but the workflow would be quite different and put a lot of up-front work on the caller)

TomNicholas commented 8 months ago

(My responses)

What MZZ does cannot be cleanly decomposed into a set of steps like that.

I disagree - I think that what MZZ does can and should be decomposed into a series of steps just like that, you just also need to include the determination of the total ordering in the series of steps. Xarray can either be told this ordering (via the order of inputs to concat/combine_nested) or it can deduce it by reading coordinates (via combine_by_coords). In both cases the ordering happens first, then atomic concat/merge functions are called to do the actual combining.

(None of this is to say that having concat/stack/merge functions or array-standard methods, which can be applied tree-wise is a bad idea; but the workflow would be quite different and put a lot of up-front work on the caller)

The workflow would be different if the user needs to go down to the level of combine_nested/concat/merge, but it would be different in a good way: it would hand over more power to them when they need it to express their desired result. However most users wouldn't even need to put in any more up-front work, because combine_by_coords (the default for open_mfdataset) suffices for many cases.

martindurant commented 8 months ago

or it can deduce [the ordering] by reading coordinates (via combine_by_coords).

I would say that this is almost entirely what MZZ does right now, and I would be very interested to see how it could be done independently for a second stage of tree-like merge/combine/concat. Would the decomposition you are after be better described as

  • find the coordinates of each chunk of each input, and the total set of coordinate values for each coordinate of interest
  • do the combine with merge/combine/concat

TomNicholas commented 8 months ago

Would the decomposition you are after be better described as

  • find the coordinates of each chunk of each input, and the total set of coordinate values for each coordinate of interest
  • do the combine with merge/combine/concat

Yes - this is exactly what xarray already does. You can see it clearly in the combine module, which has two internal functions: _infer_concat_order_from_positions and _infer_concat_order_from_coords. The former corresponds to xr.combine_nested and the latter to xr.combine_by_coords. Once the concatenation order is known, both functions use the same code to perform the (multi-dimensional) series of concat/merge operations (see _combine_nd).

EDIT: And all of this nesting is conveniently hidden from the user when they just call xr.open_mfdataset.

martindurant commented 8 months ago

OK, so all this time I have been asking about how one might use xarray's machinery to serialise the set of positions of the input data, and now I know that _infer_concat_order_from_coords exists...

The remaining differences that would need to be figured out:

  • we support many ways to find the coordinates of a chunk where these cannot be had from its internal data, particularly based on filenames
  • we need to know not just the location of each data input, but each of its chunks

It sounds like you already have ideas for all of this?

TomNicholas commented 8 months ago

we support many ways to find the coordinates of a chunk where these cannot be had from its internal data, particularly based on filenames

This is just a special case of specifying the total ordering. The simplest way is just to ask the user to deduce the ordering and pass a list of datasets of the correct order to combine_nested. A more automatic way would be for the user to write a preprocess function for open_mfdataset that adds a new dimension coordinate to the dataset, which can then be used by the combine_by_coords part of open_mfdataset. One subtlety is that then preprocess would need to have access to the filename, but that seems solvable.

EDIT: Apparently we can use preprocess as-is because xarray automatically saves the filename into ds.encoding["source"]. So MZZ'ing files that are ordered by filename could just be

import re

import cftime
import xarray as xr

def add_time_dim_coord_from_filename(ds: xr.Dataset) -> xr.Dataset:
    """Adds a new dimension coordinate containing the time value read from the datasets' filename."""
    filename = ds.encoding["source"]

    # actually derive the time value from the filename string
    subst = re.search(r"\d{4}-\d{2}", filename)[0]
    time_val = cftime.DatetimeNoLeap.strptime(subst, '%Y-%m', calendar='noleap', has_year_zero=True)

    time_dim_coord = xr.DataArray(name='time', data=[time_val], dims='time')
    return ds.assign_coords(time=time_dim_coord)

ds = xr.open_mfdataset(
    '/my/files*.nc',
    engine='kerchunk',  # kerchunk registers an xarray IO backend that returns zarr.Array objects
    combine='by_coords',  # 'by_coords' would require actually reading coordinate data
    preprocess=add_time_dim_coord_from_filename,
)
ds  # now wraps a bunch of zarr.Array / kerchunk.Array objects directly
ds.kerchunk.to_zarr(store='out.zarr')

we need to know not just the location of each data input, but each of its chunks.

Keeping track of the locations of the chunks is a separate requirement, which should be the responsibility of the Array object to be concatenated. It could be handled by a kerchunk.Array (as I tried to show in my notebook above), but I actually now think if we're going to that much effort we should just go the whole way to a zarr chunk manifest instead.

martindurant commented 8 months ago

Right, we definitely need to store that manifest (or whatever it is called/however structured), we cannot be opening all of the input files each time we want access to the dataset. So open_mf is not realistic - even listing files can be expensive - we need to use those low level functions to figure out where things go and save that, ideally (I think) in the existing kerchunk format.

TomNicholas commented 8 months ago

we cannot be opening all of the input files each time we want access to the dataset. So open_mf is not realistic - even listing files can be expensive

I am not suggesting that open_mfdataset get called every time the kerchunked dataset is accessed. I am suggesting that open_mfdataset get called once (with a special set of arguments), in the same way that MultiZarrToZarr currently gets called once.

I think of what kerchunk is doing as effectively caching the result of all the ordering, checks, and concatenation that open_mfdataset does, so that you only need to do that work once, and thereafter you can open instantly just using open_dataset('store.zarr'). The actual cached information is the kerchunk format / chunk manifest file in the zarr store written to disk.
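i.e. for the kerchunk-format case, the usual pattern once a combined reference set exists (paths illustrative):

import fsspec
import xarray as xr

# done once, up front: build and save the combined reference set (however that happens)

# done on every subsequent access: open the cached references directly
fs = fsspec.filesystem("reference", fo="combined.json", remote_protocol="s3")
ds = xr.open_dataset(fs.get_mapper(""), engine="zarr", consolidated=False)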

TomNicholas commented 8 months ago

[that information should be saved] ideally (I think) in the existing kerchunk format.

That's a separate question from the whole "split everything into array operations that can be wrapped and applied with xarray API" suggestion. It's the same thing as asking whether we combine KerchunkArray objects that are then serialized via the existing kerchunk format, or new VirtualZarrArray objects that are then serialized into valid zarr stores (requiring extensions to the Zarr spec).