MITgcm / xmitgcm

Read MITgcm mds binary files into xarray
http://xmitgcm.readthedocs.io

Replace this package with a VirtualiZarr reader? #337

Open · TomNicholas opened this issue 1 month ago

I don't really know anything about the format of MITgcm output files other than that it is some bespoke binary format, but I can't help wondering whether it would actually be easier to create a cloud-optimized version of MITgcm data by writing a reader for VirtualiZarr (i.e. a kerchunk-style reader) rather than actually converting the binary data to zarr.

The advantage would be that we get a cloud-optimized, zarr-readable view of the dataset without having to copy or convert the underlying binary files at all.

Implementing it would essentially mean rewriting this function https://github.com/MITgcm/xmitgcm/blob/63ba7511c6ada3bb7c56e4c6f7a3f770c9f9c62f/xmitgcm/utils.py#L87 to look like one of the kerchunk readers, or ideally more like https://github.com/zarr-developers/VirtualiZarr/pull/113

Because it seems MITgcm output already separates metadata from data to some degree, this could potentially work really nicely...
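
To make that concrete, here is a minimal sketch of what such a reader might emit, using the plain kerchunk reference format rather than VirtualiZarr's internal classes. Everything here is hypothetical: `mds_to_kerchunk_refs` is an invented name, and in reality the shape, dtype, and record count would come from parsing the matching .meta file (which is exactly the metadata/data separation mentioned above).

```python
# Hypothetical sketch: turn one uncompressed MDS .data file into
# kerchunk-style references, without touching the bytes themselves.
import json

import numpy as np


def mds_to_kerchunk_refs(data_path, shape, dtype=">f4", nrecords=1):
    zarray = {
        "shape": [nrecords, *shape],
        "chunks": [1, *shape],  # one chunk per record: chunking is fixed by the file layout
        "dtype": dtype,         # MDS output is big-endian flat binary...
        "compressor": None,     # ...and uncompressed
        "filters": None,
        "fill_value": None,
        "order": "C",
        "zarr_format": 2,
    }
    record_nbytes = int(np.prod(shape)) * np.dtype(dtype).itemsize
    refs = {".zarray": json.dumps(zarray)}
    for rec in range(nrecords):
        # each chunk is just (path, byte offset, byte length) into the original file
        key = ".".join([str(rec)] + ["0"] * len(shape))
        refs[key] = [data_path, rec * record_nbytes, record_nbytes]
    return {"version": 1, "refs": refs}
```

A VirtualiZarr reader would produce essentially the same information, just as a ManifestArray per variable instead of a JSON-style dict.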

See also https://github.com/zarr-developers/VirtualiZarr/issues/218

One downside of that approach, though, would be the inability to alter the chunking: the references can only point at bytes as they are already laid out on disk.

cc @cspencerjones

TomNicholas commented 1 month ago

Turns out there is already an issue discussing something very similar (which didn't appear when I searched "kerchunk") - see https://github.com/MITgcm/xmitgcm/issues/28#issuecomment-2284292414.

cspencerjones commented 2 weeks ago

I've been thinking about this, and I'm not 100% sure that it's a good idea in the end. The main issue is that most MITgcm output is not compressed at all, so direct upload to the cloud may not be something we want to encourage, especially for realistic-geometry simulations, which contain a lot of land (compression usually does not reduce the size of the ocean points very much, but land points compress almost to nothing). The upside of the format is that flexible chunking should be possible in theory.

LLC2160 & LLC4320 data is in a bespoke "shrunk" (still binary) format, where the land points have been removed, so further compression would have very limited benefit. But reading it would require writing code that is very specific to this dataset, and I do not believe that further datasets will be generated in this bespoke format. Some of the data-access problem with this data also has nothing to do with the format and is simply caused by the limited bandwidth out of Pleiades.

Still, given the choice between a general MITgcm reader and a more specific reader for LLC2160/4320, I think the more specific reader would be most useful: this data is still by far the heaviest lift most people are doing, and many people cannot use it because of how difficult access still is. (This is all just my opinion and I am prepared to hear other arguments.)

rabernat commented 2 weeks ago

I actually started something like this three years ago! https://github.com/rabernat/mds2zarr - of course VirtualiZarr is a much better and more robust approach.

I agree with @cspencerjones that the funky compression of the LLC data is potentially a blocker. If we can make this Zarr-compatible, it should be possible.

However, that is really an edge case: most "normal" MDS data output from MITgcm should be perfectly fine as uncompressed flat binary.
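
For that uncompressed case, references like the sketch above can already be opened lazily through fsspec's reference filesystem. A hypothetical usage, assuming the per-variable refs have been assembled into a complete zarr group:

```python
import xarray as xr

# refs: a kerchunk references dict, e.g. built per variable as sketched above
ds = xr.open_dataset(
    "reference://",
    engine="zarr",
    backend_kwargs={
        "consolidated": False,
        "storage_options": {"fo": refs, "remote_protocol": "file"},
    },
)
```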

TomNicholas commented 2 weeks ago

> the funky compression of the LLC data is potentially a blocker. If we can make this Zarr-compatible, it should be possible.

This seems like an analogous problem to https://github.com/zarr-developers/zarr-specs/issues/303 - i.e. it could be solved by defining a special zarr codec that is specific to this data format.
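
To make that concrete, here is a minimal sketch of what such a codec might look like on top of numcodecs' Codec interface, assuming a precomputed flat index of wet points. Everything here is hypothetical; no such codec exists upstream.

```python
import numpy as np
from numcodecs import register_codec
from numcodecs.abc import Codec


class MITgcmShrunk(Codec):
    """Hypothetical codec that re-inflates "shrunk" LLC records.

    wet_indices is derived from the null mask, i.e. an external dataset
    with a different shape than the data itself; codec configs are also
    supposed to be small and JSON-serializable, so a grid-sized index
    array sits uneasily inside the .zarray metadata.
    """

    codec_id = "mitgcm_shrunk"  # invented id, not registered anywhere official

    def __init__(self, wet_indices, full_size, fill_value=0.0):
        self.wet_indices = np.asarray(wet_indices, dtype=np.int64)
        self.full_size = int(full_size)
        self.fill_value = fill_value

    def decode(self, buf, out=None):
        wet = np.frombuffer(buf, dtype=">f4")   # shrunk record: ocean points only
        full = np.full(self.full_size, self.fill_value, dtype=">f4")
        full[self.wet_indices] = wet            # scatter back onto the full grid
        if out is not None:
            out[...] = full.reshape(np.shape(out))
            return out
        return full

    def encode(self, buf):
        full = np.asarray(buf).astype(">f4").ravel()
        return full[self.wet_indices].tobytes()  # drop the land points


register_codec(MITgcmShrunk)
```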

rabernat commented 2 weeks ago

Except it's really complicated because the "codec" for decoding each array relies on an external dataset (the null mask) which doesn't even have the same shape as the data. This breaks many of the abstractions implicit in the "codec" interface.