earth-mover / icechunk

Open-source, cloud-native transactional tensor storage engine
https://icechunk.io
Apache License 2.0

Reading virtual references back out into VirtualiZarr Manifests #104

Open TomNicholas opened 1 month ago

TomNicholas commented 1 month ago

I would like to be able to read virtual references back out from an icechunk store into VirtualiZarr ManifestArray objects.

Note this issue is the icechunk equivalent of https://github.com/zarr-developers/VirtualiZarr/issues/118, which is about reading kerchunk references into ManifestArray objects.

The main use case is appending new data to an existing store (see https://github.com/zarr-developers/VirtualiZarr/issues/21#issuecomment-2114001053), so that when some new data arrives (e.g. a new grib file with today's weather data), I can add an updated snapshot just with something like:

import virtualizarr as vz
import xarray as xr

# avoids re-extracting all the metadata from all the past grib files, so should be quick
existing_vds = vz.open_virtual_dataset(icechunkstore, reader='icechunk')

new_vds = vz.open_virtual_dataset('todays_weather.grib', reader='grib')

updated_vds = xr.concat([existing_vds, new_vds], dim='time')

# commit new snapshot that includes today's data
# requires https://github.com/earth-mover/icechunk/issues/103
updated_vds.virtualize.to_icechunk(icechunkstore)
icechunkstore.commit('<todays-date>')

In order to implement that Icechunk reader for virtualizarr I would need some API for getting all virtual (and non-virtual) references for a snapshot back out of the Icechunk store, ideally as a vz.ManifestArray or something I can cheaply coerce to one (see ChunkManifest.from_arrays()).
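
To make that concrete, here is a rough sketch of the shape of reader-side code I'm imagining (get_virtual_refs is hypothetical, and the exact ChunkManifest.from_arrays / ManifestArray argument names may differ from what VirtualiZarr actually has, so treat this as pseudocode):

import numpy as np
from virtualizarr.manifests import ChunkManifest, ManifestArray

# hypothetical icechunk API: per-chunk path/offset/length arrays for one array
# in the current snapshot - this is the piece that doesn't exist yet
paths, offsets, lengths = icechunkstore.get_virtual_refs('air_temperature')

# cheap coercion into a virtualizarr manifest - no chunk data is read
manifest = ChunkManifest.from_arrays(
    paths=np.asarray(paths),
    offsets=np.asarray(offsets, dtype=np.uint64),
    lengths=np.asarray(lengths, dtype=np.uint64),
)
marr = ManifestArray(zarray=array_metadata, chunkmanifest=manifest)  # array_metadata read from icechunk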

Writing the updated references as a new snapshot also requires https://github.com/earth-mover/icechunk/issues/103.

(I guess the .virtualize.to_icechunk method might also need to know to do array.resize in this example... see the Append example in this notebook.)

Running that above snippet as a cron job / event-driven serverless function should go a long way towards making ingestion of regularly-updated data archives easier. (cc @mpiannucci)

This feature might also be useful to allow using icechunk as a serialization format during large serverless reductions (xref https://github.com/zarr-developers/VirtualiZarr/issues/123).

cc @paraseba

rabernat commented 1 month ago

In order to implement that Icechunk reader for virtualizarr I would need some API for getting all virtual (and non-virtual) references for a snapshot back out of the Icechunk store

I'm wondering if this is truly needed. Shouldn't it be possible to append new references without rewriting the manifest for the whole array?

Edit: as a first step towards virtualizarr interoperability, I agree that just doing a bulk rewrite is fine. But longer-term, I think more efficient strategies are possible. In general, I'd like libraries using Icechunk (like Virtualizarr) to only have to update / add new or changed references, rather than just rewriting the entire array. It's Icechunk's job to decide how to split this into one or more manifests. I'm curious what sort of API we would need to expose to enable this.

TomNicholas commented 1 month ago

I'm wondering if this is truly needed. Shouldn't it be possible to append new references without rewriting the manifest for the whole array?

I was also wondering about this. Kerchunk does have a concept of appending to kerchunk parquet references. I think it comes down to whether or not the "appending" (which in virtualizarr space is a concatenation operation in the example above) happens in user/virtualizarr API or inside the icechunk store.

My example above assumes you want to do this in virtualizarr API. But if we instead try to copy the existing xarray ds.to_zarr API then it might look more like

new_vds.virtualize.to_icechunk(icechunkstore, append_dim='time')

If we had pluggable writer backends for xarray we could maybe even make this work

new_vds.to_zarr(icechunkstore, append_dim='time', engine='virtualizarr')

which better reflects the fact that in virtualizarr an xr.Dataset can actually contain both virtual ManifestArrays and non-virtual numpy arrays (which represent data to be "inlined" in kerchunk references).

At some point the concatenate/append operation should check that the dimensions are all consistent, but that could reasonably be done in either library.

Regardless I do still think that being able to extract all the references for one snapshot might be useful.

abarciauskas-bgse commented 3 weeks ago

Should we add a note that it is not currently possible to append virtual refs to icechunk stores in https://icechunk.io/icechunk-python/virtual/ and point to this issue? cc @mpiannucci

TomNicholas commented 3 weeks ago

As far as I can tell (and this is the first time I'm reading this code so please correct me if this is wrong), the append_dim in xarray's ds.to_zarr is basically syntactic sugar for resizing an array and setting new chunks. So to support appending virtual chunks, Icechunk doesn't have to do anything beyond what set_virtual_ref already supports - the work to be done is in VirtualiZarr, to recreate the append_dim logic in vds.virtualize.to_icechunk.

https://github.com/pydata/xarray/blob/main/xarray/backends/zarr.py#L857
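
Roughly, the logic VirtualiZarr would need looks something like this (purely a sketch - the chunk-key format and the exact set_virtual_ref signature here are assumptions on my part):

# sketch only: what vds.virtualize.to_icechunk(store, append_dim=...) might do for one array
def append_virtual(store, zarr_group, name, manifest_array, axis):
    arr = zarr_group[name]                 # existing array, opened from the icechunk store
    old_len = arr.shape[axis]
    new_shape = list(arr.shape)
    new_shape[axis] += manifest_array.shape[axis]
    arr.resize(new_shape)                  # step 1: the "syntactic sugar" - grow along append_dim

    # assumes the existing length is a whole number of chunks along the append dim
    shift = old_len // arr.chunks[axis]
    for chunk_key, ref in manifest_array.manifest.dict().items():  # '0.0.0' -> {path, offset, length}
        idx = [int(i) for i in chunk_key.split('.')]
        idx[axis] += shift
        # step 2: point each new chunk position at the existing file - no data is copied
        store.set_virtual_ref(
            f"{name}/c/" + "/".join(str(i) for i in idx),
            ref['path'], offset=ref['offset'], length=ref['length'],
        )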

mpiannucci commented 3 weeks ago

@TomNicholas is correct. Appending in zarr is really just sugar around resizing an existing array and pointing it at existing chunks. An example of appending to an icechunk store is at the bottom of this notebook.

abarciauskas-bgse commented 3 weeks ago

Ok, thanks for explaining how the append method works @TomNicholas - but if I understand correctly there is no way to append virtual refs at the moment, so I would still see a benefit to others in making this clear in the documentation.

Are there any plans to work on this feature in VirtualiZarr, @TomNicholas @mpiannucci?

TomNicholas commented 3 weeks ago

I would still see a benefit to others to make this clear in the documentation.

:+1:

Is there any plans to work on this feature in VirtualiZarr

I would very much like to see it, but don't personally need it (yet) so will be prioritizing other things. It's a good thing for somebody else to add, especially someone who is (or wants to become) familiar with xarray's existing to_zarr implementation.

TomNicholas commented 3 weeks ago

I'm going to use https://github.com/zarr-developers/VirtualiZarr/issues/21 to track the append_dim idea in VirtualiZarr.

Separately I still think that being able to pull virtual references back out would be useful. Imagine for example someone makes a public icechunk store containing virtual references, then someone from a different organisation wants to refer to those references - they would need to pull them out. (Whether or not people creating catalogs like this is a good idea is another question, but I bet people will ask for that.)

TomNicholas commented 3 weeks ago

@rabernat just said in the Pangeo showcase that icechunk is not yet at a state where it guarantees that references written with the current spec will be readable even with future versions of this library. To me that makes the feature of reading virtual references back out into VirtualiZarr actually quite important to have sooner rather than later. If that existed, then if the icechunk store changed on disk, I could always roll back to a version of the icechunk library that could read the references in order to suck them out then re-save them in the new format.

mpiannucci commented 3 weeks ago

Can you propose what the ideal API for this would be? It gets confusing for me thinking about how there can be mixed virtual and native chunks

TomNicholas commented 3 weeks ago

It gets confusing for me thinking about how there can be mixed virtual and native chunks

Yeah good point. All variables in an icechunk store are potentially available as actual loaded data, or as references, so I would say that there should be a new virtualizarr reader, such that

vds = open_virtual_dataset(icechunkstore, loadable_variables=...)

returns an xr.Dataset containing (lazy) numpy arrays for the loadable_variables, and ManifestArrays for all the other variables.

I thought maybe we could imagine this function returning virtual vs native variables based on how they were stored in icechunk, but IIUC that's actually not possible right now because a single zarr array could be backed by some virtual and some native chunks, which VirtualiZarr has no way to represent in one array.

The API needed for icechunk-python at its most basic is just .get_virtual_ref, the inverse of .set_virtual_ref. Again it would be more efficient to get back a whole ManifestArray of chunk references in one call but that's just a performance optimization.
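
Per variable, that reader could then do something like this (get_virtual_ref is the hypothetical inverse of set_virtual_ref, and the ChunkManifest/ManifestArray constructor calls are from memory, so again just a sketch):

from itertools import product

from virtualizarr.manifests import ChunkManifest, ManifestArray

def read_manifest_array(store, name, zarray):
    # number of chunks along each dimension of the array
    grid = [-(-size // chunk) for size, chunk in zip(zarray.shape, zarray.chunks)]
    entries = {}
    for idx in product(*(range(n) for n in grid)):
        key = f"{name}/c/" + "/".join(str(i) for i in idx)
        path, offset, length = store.get_virtual_ref(key)  # hypothetical, one call per chunk
        entries['.'.join(str(i) for i in idx)] = {'path': path, 'offset': offset, 'length': length}
    return ManifestArray(zarray=zarray, chunkmanifest=ChunkManifest(entries))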

mpiannucci commented 3 weeks ago

Cool. Would it return None if the chunk is native? Or throw an error?

paraseba commented 3 weeks ago

I wanted to mention that having mixed virtual and non-virtual chunks in a single array is exactly what we want to enable: people start from a virtual dataset and, over time, update the dataset in Icechunk (which generates non-virtual chunks). So we expect this case to be very common.

paraseba commented 3 weeks ago

I support adding get_virtual_ref, but it could be more useful to have a get_chunk_ref that gives you the full information, including whether the chunk is Virtual, Inline or "Normal". Then the client code can filter as needed. For reference, this is what get_chunk_ref returns in Rust:

pub enum ChunkPayload {
    Inline(Bytes),
    Virtual(VirtualChunkRef),
    Ref(ChunkRef),
}
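
On the Python side, client code could then branch on the payload kind, roughly like this (the Python attribute names here are invented, just to show the filtering):

# hypothetical Python mirror of the Rust enum above
payload = store.get_chunk_ref('air_temperature/c/0/0/0')

if payload.is_virtual:
    # reference to a chunk in an external file - exactly what virtualizarr wants
    path, offset, length = payload.location, payload.offset, payload.length
elif payload.is_inline:
    # bytes live only in the manifest; there is no file to point a reference at
    raw_bytes = payload.data
else:
    # "normal" chunk written by icechunk itself, still addressable within the store
    path, offset, length = payload.location, payload.offset, payload.length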

TomNicholas commented 3 weeks ago

Would it return None if the chunk is native?

No, it would still return a virtual reference to that chunk; the reference would just point deep within this icechunk store. IIUC icechunk chunks are immutable once written, so referring to a chunk in an icechunk store from elsewhere is exactly as reliable as referring to a chunk in a traditional zarr store / legacy-formatted file.
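
For example (path and layout purely illustrative - I'm assuming icechunk stores its chunks as addressable objects under the store prefix), a native chunk could come back as something like:

# a "native" chunk expressed as a virtual reference into the icechunk store's own objects
{'path': 's3://my-bucket/my-icechunk-store/chunks/4F7A...', 'offset': 0, 'length': 1439872}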

I support adding get_virtual_ref, but it could be more useful to have a get_chunk_ref that gives you the full information, including whether the chunk is Virtual, Inline or "Normal".

That would be fine for VirtualiZarr too (and maybe more useful for reasons I can't think of yet).

paraseba commented 3 weeks ago

Inline chunks I think are the biggest issue here. As @TomNicholas says, "normal" chunks can also be seen as virtual chunks, pointing to the icechunk store. But inline ones live only in the manifest. We could provide a way to reference them via an offset/length, but it would be pretty tricky and ugly.

TomNicholas commented 3 weeks ago

Right. Should arrays containing inlined chunks always be directly loaded then? I suppose VirtualiZarr might have to raise a warning if that happened for variables that were supposed to be virtual... Also to do that, VirtualiZarr would need a way to easily ask Icechunk whether an array contains any inlined chunks. Tbh this makes inlined chunks seem like a leaky abstraction.

paraseba commented 3 weeks ago

Tbh this makes inlined chunks seem like a leaky abstraction.

They are. I don't love them; they are leaky and probably a premature optimization.

We could make IC not use inline chunks unless the user overrides a config, I suppose.

mpiannucci commented 3 weeks ago

Inline chunks don't matter to virtualizarr at all. To virtualizarr it's the same as a native chunk

TomNicholas commented 3 weeks ago

Inline chunks don't matter to virtualizarr at all. To virtualizarr it's the same as a native chunk

Except what we're talking about here is having native chunks also be extractable as virtual references, which @paraseba is saying is harder to provide for inlined chunks than for native chunks.

TomNicholas commented 3 weeks ago

Except what we're talking about here is having native chunks also be extractable as virtual references, which @paraseba is saying is harder to provide for inlined chunks than for native chunks.

They are, I don't love them, they are leaky and probably a premature optimization.

It's unclear to me that they are necessary at all. Why can't you just "inline" a variable by storing it as a single native chunk? That requires an extra get request I guess? Or because then you can't cache it in the rust client?

paraseba commented 3 weeks ago

That requires an extra get request I guess?

This. They are useful for things like coordinate arrays.

mpiannucci commented 3 weeks ago

Except we're talking about here having native chunks also be extracted as virtual references, which @paraseba is saying is harder to provide for inclined chunks than for native chunks.

Got itttt sorry I misunderstood :)