earth-mover / icechunk

Open-source, cloud-native transactional tensor storage engine
https://icechunk.io
Apache License 2.0

Use Case: [C]Worthy OAE dataset #119

Open TomNicholas opened 3 weeks ago

TomNicholas commented 3 weeks ago

This issue but for icechunk: https://github.com/zarr-developers/VirtualiZarr/issues/132

I was originally planning to virtualize this [C]Worthy dataset and save the references using the kerchunk parquet format, but now the timelines have changed such that both icechunk and the [C]Worthy OAE atlas are planned to release on the same day (Oct 15th 2024)! So I could use icechunk's format instead (or just write both)...

I think it's pretty unlikely that virtualizing using icechunk happens by then (I have enough work to do just to release the un-virtualized version of the dataset), but I do need to do all this by December anyway because I submitted this as a talk to AGU 🙃 Regardless of when it happens, this dataset is a good real-world test case for icechunk, as I said in https://github.com/zarr-developers/VirtualiZarr/issues/132:

If we can virtualize this we should be able to virtualize most things 💪

Wishlist:

dcherian commented 3 weeks ago

Datetime support

you probably don't need this since Xarray encodes datetimes by default.

TomNicholas commented 3 weeks ago

you probably don't need this since Xarray encodes datetimes by default.

You mean if I save the time coordinates as non-virtual zarr arrays then xarray's decoding should handle this as normal?

dcherian commented 3 weeks ago

if you go through xarray, yes.

rabernat commented 3 weeks ago

Should also work with virtual data. Typical CF datasets use int as the raw array dtype and carry attributes like units: days since X, which Xarray / cftime decode to Python datetimes. There is no native datetime type in netCDF.
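
To make the convention described above concrete, here is a minimal standard-library sketch of how integers plus a units attribute of the form "days since X" map to datetimes. It is not Xarray's or cftime's implementation (in Xarray this decoding happens on open, or explicitly via xr.decode_cf); the array values, the attribute dictionary, and the helper name decode_cf_time are all hypothetical, and only the standard calendar and a few units are handled.

```python
from datetime import datetime, timedelta

# Hypothetical raw values and attributes, as they might be stored in a
# netCDF or Zarr array: plain integers plus CF-style metadata.
raw_time = [0, 1, 31, 366]
attrs = {"units": "days since 2000-01-01", "calendar": "standard"}

# Only a few units are supported in this sketch.
_STEPS = {
    "seconds": timedelta(seconds=1),
    "hours": timedelta(hours=1),
    "days": timedelta(days=1),
}


def decode_cf_time(values, units):
    """Decode integers with a CF-style units string into Python datetimes.

    Illustrative helper (an assumption, not a library function); it ignores
    non-standard calendars, which real decoders such as cftime handle.
    """
    unit, _, reference = units.partition(" since ")
    step = _STEPS[unit.strip()]
    epoch = datetime.fromisoformat(reference.strip())
    return [epoch + int(v) * step for v in values]


decoded = decode_cf_time(raw_time, attrs["units"])
for raw, dt in zip(raw_time, decoded):
    print(raw, "->", dt.isoformat())
```

Because the stored array is just integers, the same decoding applies whether the chunks are native Zarr chunks or virtual references to bytes in the original files; only the attributes need to travel with the array.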