TomNicholas opened this issue 1 month ago
In order to implement that Icechunk reader for virtualizarr I would need some API for getting all virtual (and non-virtual) references for a snapshot back out of the Icechunk store
I'm wondering if this is truly needed. Shouldn't it be possible to append new references without rewriting the manifest for the whole array?
Edit: as a first step towards virtualizarr interoperability, I agree that just doing a bulk rewrite is fine. But longer-term, I think more efficient strategies are possible. In general, I'd like libraries using Icechunk (like Virtualizarr) to only have to update / add new or changed references, rather than just rewriting the entire array. It's Icechunk's job to decide how to split this into one or more manifests. I'm curious what sort of API we would need to expose to enable this.
I'm wondering if this is truly needed. Shouldn't it be possible to append new references without rewriting the manifest for the whole array?
I was also wondering about this. Kerchunk does have a concept of appending to kerchunk parquet references. I think it comes down to whether or not the "appending" (which in virtualizarr space is a concatenation operation in the example above) happens in user/virtualizarr API or inside the icechunk store.
My example above assumes you want to do this in the virtualizarr API. But if we instead try to copy the existing xarray `ds.to_zarr` API then it might look more like

`new_vds.virtualize.to_icechunk(icechunkstore, append_dim='time')`

If we had pluggable writer backends for xarray we could maybe even make this work:

`new_vds.to_zarr(icechunkstore, append_dim='time', engine='virtualizarr')`

which better reflects the fact that in virtualizarr an `xr.Dataset` can actually contain both virtual `ManifestArray`s and non-virtual numpy arrays (which represent data to be "inlined" in kerchunk references).
At some point the concatenate/append operation should check that the dimensions are all consistent, but that could reasonably be done in either library.
Regardless I do still think that being able to extract all the references for one snapshot might be useful.
Should we add a note that it is not currently possible to append virtual refs to icechunk stores in https://icechunk.io/icechunk-python/virtual/ and point to this issue? cc @mpiannucci
As far as I can tell (and this is the first time I'm reading this code, so please correct me if this is wrong), the `append_dim` in xarray's `ds.to_zarr` is basically syntactic sugar for resizing an array and setting new chunks. So to support appending virtual chunks, Icechunk doesn't have to do anything beyond what `set_virtual_ref` already supports; the work to be done is in VirtualiZarr to recreate the `append_dim` logic in `vds.virtualize.to_icechunk`.

https://github.com/pydata/xarray/blob/main/xarray/backends/zarr.py#L857
@TomNicholas is correct. Appending in zarr is really just sugar around resizing an existing array and pointing it at existing chunks. An example of appending to an icechunk store is at the bottom of this notebook.
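To make the mechanics concrete, here is a minimal pure-Python sketch of the arithmetic behind an `append_dim`-style append: resize the shape along one axis, then work out which chunk indices along that axis must now be written. This is illustrative only, not real zarr or icechunk code; `append_resize` is a hypothetical helper.

```python
# Sketch of the bookkeeping behind append_dim: appending along one axis
# is just a resize plus writing the chunks that cover the new region.
# Pure-Python illustration; the real work happens inside zarr/icechunk.
import math

def append_resize(shape, chunks, append_dim, n_new):
    """Return the resized shape and the chunk indices (along append_dim)
    that must be (re)written to cover the appended region."""
    old_len = shape[append_dim]
    new_shape = list(shape)
    new_shape[append_dim] = old_len + n_new
    # First chunk index touched by the new data (it may be a partial chunk).
    first = old_len // chunks[append_dim]
    last = math.ceil(new_shape[append_dim] / chunks[append_dim])
    return tuple(new_shape), list(range(first, last))

# Appending 5 elements along axis 0 of a (10, 4) array with (4, 4) chunks:
shape, touched = append_resize((10, 4), (4, 4), append_dim=0, n_new=5)
print(shape)    # (15, 4)
print(touched)  # [2, 3] -- chunk rows 2 and 3 along axis 0
```

Note the partial-chunk case: because 10 is not a multiple of 4, chunk row 2 already exists and gets rewritten, while chunk row 3 is brand new.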
Ok, thanks for explaining how the append method works @TomNicholas - but if I understand correctly there is no way to append virtual refs at the moment, so I would still see a benefit to others to make this clear in the documentation.

Are there any plans to work on this feature in VirtualiZarr @TomNicholas @mpiannucci?
I would still see a benefit to others to make this clear in the documentation.
:+1:
Are there any plans to work on this feature in VirtualiZarr?

I would very much like to see it, but don't personally need it (yet) so will be prioritizing other things. It's a good thing for somebody else to add, especially someone who is (or wants to become) familiar with xarray's existing `to_zarr` implementation.

I'm going to use https://github.com/zarr-developers/VirtualiZarr/issues/21 to track the `append_dim` idea in VirtualiZarr.
Separately I still think that being able to pull virtual references back out would be useful. Imagine for example someone makes a public icechunk store containing virtual references, then someone from a different organisation wants to refer to those references - they would need to pull them out. (Whether or not people creating catalogs like this is a good idea is another question, but I bet people will ask for that.)
@rabernat just said in the Pangeo showcase that icechunk is not yet at a state where it guarantees that references written with the current spec will be readable even with future versions of this library. To me that makes the feature of reading virtual references back out into VirtualiZarr actually quite important to have sooner rather than later. If that existed, then if the icechunk store changed on disk, I could always roll back to a version of the icechunk library that could read the references in order to suck them out then re-save them in the new format.
Can you propose what the ideal API for this would be? It gets confusing for me thinking about how there can be mixed virtual and native chunks
It gets confusing for me thinking about how there can be mixed virtual and native chunks
Yeah good point. All variables in an icechunk store are potentially available as actual loaded data, or as references, so I would say that there should be a new virtualizarr reader, such that
`vds = open_virtual_dataset(icechunkstore, loadable_variables=...)`

returns an `xr.Dataset` containing (lazy) numpy arrays for the `loadable_variables`, and `ManifestArray`s for all the other variables.
I thought maybe we could imagine this function returning virtual vs native variables based on how they were stored in icechunk, but IIUC that's actually not possible right now because a single zarr array could be backed by some virtual and some native chunks, which VirtualiZarr has no way to represent in one array.
The API needed for icechunk-python at its most basic is just `.get_virtual_ref`, the inverse of `.set_virtual_ref`. Again, it would be more efficient to get back a whole `ManifestArray` of chunk references in one call, but that's just a performance optimization.
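As a sketch of how a reader could use such an API: loop over chunk keys, call a hypothetical `get_virtual_ref(key)` per chunk, and assemble a kerchunk-style manifest dict. `StubStore`, the key format, and the `(url, offset, length)` tuple shape are all invented for illustration; the real icechunk API does not expose this yet.

```python
# Hypothetical sketch: building a ManifestArray-style mapping by calling
# a per-chunk get_virtual_ref on a stub store. Illustrative names only.

class StubStore:
    """Stand-in for an icechunk store exposing per-chunk virtual refs."""
    def __init__(self, refs):
        self._refs = refs  # chunk key -> (url, offset, length)

    def get_virtual_ref(self, key):
        return self._refs[key]

def build_manifest(store, keys):
    """Assemble chunk key -> {"path", "offset", "length"} entries,
    the shape VirtualiZarr's ChunkManifest expects conceptually."""
    manifest = {}
    for key in keys:
        url, offset, length = store.get_virtual_ref(key)
        manifest[key] = {"path": url, "offset": offset, "length": length}
    return manifest

store = StubStore({
    "0.0": ("s3://bucket/file1.nc", 0, 100),
    "0.1": ("s3://bucket/file1.nc", 100, 100),
})
manifest = build_manifest(store, ["0.0", "0.1"])
print(manifest["0.1"])  # {'path': 's3://bucket/file1.nc', 'offset': 100, 'length': 100}
```

A bulk call returning all refs for an array at once would replace the per-key loop, which is exactly the performance optimization mentioned above.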
Cool. Would it return None if the chunk is native? Or throw an error?
I wanted to mention that having mixed virtual and non-virtual chunks in a single array is exactly what we want to enable: people starting from a virtual dataset and, over time, updating the dataset in Icechunk (which would generate non-virtual chunks). So we expect this case to be very common.
I support adding `get_virtual_ref`, but it could be more useful to have `get_chunk_ref`, which gives you the whole information, including the fact that a chunk could be Virtual, Inline or "Normal". Then the client code can filter as needed. For reference, this is what `get_chunk_ref` returns in Rust:

```rust
pub enum ChunkPayload {
    Inline(Bytes),
    Virtual(VirtualChunkRef),
    Ref(ChunkRef),
}
```
Would it return None if the chunk is native?
No, it would still return the virtual reference to that chunk, the reference would just refer to deep within this icechunk store. IIUC icechunk chunks are immutable once written, so referring to a chunk in an icechunk store from elsewhere is exactly as reliable as referring to a chunk in a traditional zarr store / legacy-formatted file.
I support adding `get_virtual_ref`, but it could be more useful to have `get_chunk_ref` that gives you the whole information, including the fact that it could be Virtual, Inline or "Normal".
That would be fine for VirtualiZarr too (and maybe more useful for reasons I can't think of yet).
Inline chunks I think are the biggest issue here. As @TomNicholas says, "normal" chunks can also be seen as virtual chunks, pointing to the icechunk store. But inline ones live only in the manifest. We could provide a way to reference them via an offset/length, but it would be pretty tricky and ugly.
Right. Should arrays containing inlined chunks always be directly loaded then? I suppose VirtualiZarr might have to raise a warning if that happened for variables that were supposed to be virtual... Also, to do that, VirtualiZarr would need a way to easily ask Icechunk whether an array contains any inlined chunks. Tbh this makes inlined chunks seem like a leaky abstraction.
Tbh this makes inlined chunks seem like a leaky abstraction.
They are, I don't love them, they are leaky and probably a premature optimization.
We could make IC not use inline chunks unless the user overrides a config, I suppose.
Inline chunks don't matter to virtualizarr at all. To virtualizarr it's the same as a native chunk
Inline chunks don't matter to virtualizarr at all. To virtualizarr it's the same as a native chunk
Except we're talking here about having native chunks also be extracted as virtual references, which @paraseba is saying is harder to provide for inline chunks than for native chunks.
Except we're talking here about having native chunks also be extracted as virtual references, which @paraseba is saying is harder to provide for inline chunks than for native chunks.
They are, I don't love them, they are leaky and probably a premature optimization.
It's unclear to me that they are necessary at all. Why can't you just "inline" a variable by storing it as a single native chunk? That requires an extra get request I guess? Or because then you can't cache it in the rust client?
That requires an extra get request I guess?
This. They are useful for things like coordinate arrays.
Except we're talking here about having native chunks also be extracted as virtual references, which @paraseba is saying is harder to provide for inline chunks than for native chunks.
Got itttt sorry I misunderstood :)
I would like to be able to read virtual references back out from an icechunk store into VirtualiZarr `ManifestArray` objects.

Note this issue is the icechunk equivalent of https://github.com/zarr-developers/VirtualiZarr/issues/118, which is about reading kerchunk references into `ManifestArray` objects.

The main use case is appending new data to an existing store (see https://github.com/zarr-developers/VirtualiZarr/issues/21#issuecomment-2114001053), so that when some new data arrives (e.g. a new grib file with today's weather data), I can add an updated snapshot just with something like:

In order to implement that Icechunk reader for virtualizarr I would need some API for getting all virtual (and non-virtual) references for a snapshot back out of the Icechunk store, ideally as a `vz.ManifestArray` or something I can cheaply coerce to one (see `ChunkManifest.from_arrays()`).

Writing the updated references as a new snapshot also requires https://github.com/earth-mover/icechunk/issues/103.

(I guess the `.virtualize.to_icechunk` method might also need to know to do `array.resize` in this example... see the Append example in this notebook.)

Running the above snippet as a cron job / event-driven serverless function should go a long way towards making ingestion of regularly-updated data archives easier. (cc @mpiannucci)
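At the manifest level, that append step boils down to shifting the new file's chunk keys along the append axis and merging them into the existing references. A minimal pure-Python illustration, assuming zarr-v2-style "i.j" chunk keys and `(path, offset, length)` refs (both assumptions, not icechunk's actual on-disk format):

```python
# Sketch of appending one file's references to an existing manifest:
# shift the new chunk indices along the append axis, then merge dicts.

def shift_keys(manifest, axis, offset):
    """Shift chunk indices along `axis` by `offset` chunks."""
    shifted = {}
    for key, ref in manifest.items():
        idx = [int(part) for part in key.split(".")]
        idx[axis] += offset
        shifted[".".join(map(str, idx))] = ref
    return shifted

existing = {"0.0": ("a.nc", 0, 100), "1.0": ("a.nc", 100, 100)}
new = {"0.0": ("b.nc", 0, 100)}

# Append along axis 0: the existing array is 2 chunks long there,
# so the new file's chunk 0 becomes chunk 2 of the combined array.
merged = {**existing, **shift_keys(new, axis=0, offset=2)}
print(sorted(merged))  # ['0.0', '1.0', '2.0']
```

Reading the existing references back out (this issue) supplies `existing`; the resize arithmetic plus `set_virtual_ref` then writes the shifted entries as the new snapshot.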
This feature might also be useful to allow using icechunk as a serialization format during large serverless reductions (xref https://github.com/zarr-developers/VirtualiZarr/issues/123).
cc @paraseba