MITgcm / xmitgcm

Read MITgcm mds binary files into xarray
http://xmitgcm.readthedocs.io

Replace this package with a VirtualiZarr reader? #337

Open · TomNicholas opened this issue 1 month ago

I don't really know anything about the format of MITgcm output files other than that it is some bespoke binary format, but I can't help wondering whether it would actually be easier to create a cloud-optimized version of MITgcm data by writing a reader for VirtualiZarr (i.e. a kerchunk-style reader) rather than actually converting the binary data to zarr.

The advantage would be that we get a cloud-optimized, zarr-readable view of the dataset without having to copy or convert the underlying binary files at all.

Implementing it would essentially mean rewriting this function https://github.com/MITgcm/xmitgcm/blob/63ba7511c6ada3bb7c56e4c6f7a3f770c9f9c62f/xmitgcm/utils.py#L87 to look like one of the kerchunk readers, or ideally more like https://github.com/zarr-developers/VirtualiZarr/pull/113

Because it seems MITgcm output already separates metadata from data to some degree, this could potentially work really nicely...
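
To make that concrete, here is a minimal sketch of what such a reader might emit, using the plain kerchunk reference format rather than VirtualiZarr's internal classes. Everything here is hypothetical: `mds_to_kerchunk_refs` is an invented name, and in reality the shape, dtype, and record count would come from parsing the matching .meta file (which is exactly the metadata/data separation mentioned above).

```python
# Hypothetical sketch: turn one uncompressed MDS .data file into
# kerchunk-style references, without touching the bytes themselves.
import json

import numpy as np


def mds_to_kerchunk_refs(data_path, shape, dtype=">f4", nrecords=1):
    zarray = {
        "shape": [nrecords, *shape],
        "chunks": [1, *shape],  # one chunk per record: chunking is fixed by the file layout
        "dtype": dtype,         # MDS output is big-endian flat binary...
        "compressor": None,     # ...and uncompressed
        "filters": None,
        "fill_value": None,
        "order": "C",
        "zarr_format": 2,
    }
    record_nbytes = int(np.prod(shape)) * np.dtype(dtype).itemsize
    refs = {".zarray": json.dumps(zarray)}
    for rec in range(nrecords):
        # each chunk is just (path, byte offset, byte length) into the original file
        key = ".".join([str(rec)] + ["0"] * len(shape))
        refs[key] = [data_path, rec * record_nbytes, record_nbytes]
    return {"version": 1, "refs": refs}
```

A VirtualiZarr reader would produce essentially the same information, just as a ManifestArray per variable instead of a JSON-style dict.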

See also https://github.com/zarr-developers/VirtualiZarr/issues/218

One downside of that approach, though, would be the inability to alter the chunking: the references can only point at bytes as they are already laid out on disk.

cc @cspencerjones

TomNicholas commented 1 month ago

Turns out there is already an issue discussing something very similar (which didn't appear when I searched "kerchunk") - see https://github.com/MITgcm/xmitgcm/issues/28#issuecomment-2284292414.

cspencerjones commented 2 weeks ago

I've been thinking about this, and I'm not 100% sure that it's a good idea in the end. The main issue is that most MITgcm output is not compressed at all, so direct upload to the cloud may not be something we want to encourage, especially for realistic-geometry simulations, which contain a lot of land (compression usually does not reduce the size of the ocean points very much, but land points compress almost to nothing). The upside of the format is that flexible chunking should be possible in theory.

LLC2160 & LLC4320 data is in a bespoke "shrunk" (still binary) format, where the land points have been removed, so further compression would have very limited benefit. But reading it would require writing code that is very specific to this dataset, and I do not believe that further datasets will be generated in this bespoke format. Some of the data-access problem with this data also has nothing to do with the format and is simply caused by the limited bandwidth out of Pleiades.

Still, given the choice between a general MITgcm reader and a more specific reader for LLC2160/4320, I think the more specific reader would be most useful: this data is still by far the heaviest lift most people are doing, and many people cannot use it because of how difficult access still is. (This is all just my opinion and I am prepared to hear other arguments.)

rabernat commented 2 weeks ago

I actually started something like this three years ago! https://github.com/rabernat/mds2zarr - of course VirtualiZarr is a much better and more robust approach.

I agree with @cspencerjones that the funky compression of the LLC data is potentially a blocker. If we can make this Zarr-compatible, it should be possible.

However, that is really an edge case: most "normal" MDS data output from MITgcm should be perfectly fine as uncompressed flat binary.
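
For that uncompressed case, references like the sketch above can already be opened lazily through fsspec's reference filesystem. A hypothetical usage, assuming the per-variable refs have been assembled into a complete zarr group:

```python
import xarray as xr

# refs: a kerchunk references dict, e.g. built per variable as sketched above
ds = xr.open_dataset(
    "reference://",
    engine="zarr",
    backend_kwargs={
        "consolidated": False,
        "storage_options": {"fo": refs, "remote_protocol": "file"},
    },
)
```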

TomNicholas commented 2 weeks ago

> the funky compression of the LLC data is potentially a blocker. If we can make this Zarr-compatible, it should be possible.

This seems like an analogous problem to https://github.com/zarr-developers/zarr-specs/issues/303 - i.e. it could be solved by defining a special zarr codec that is specific to this data format.
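
To make that concrete, here is a minimal sketch of what such a codec might look like on top of numcodecs' Codec interface, assuming a precomputed flat index of wet points. Everything here is hypothetical; no such codec exists upstream.

```python
import numpy as np
from numcodecs import register_codec
from numcodecs.abc import Codec


class MITgcmShrunk(Codec):
    """Hypothetical codec that re-inflates "shrunk" LLC records.

    wet_indices is derived from the null mask, i.e. an external dataset
    with a different shape than the data itself; codec configs are also
    supposed to be small and JSON-serializable, so a grid-sized index
    array sits uneasily inside the .zarray metadata.
    """

    codec_id = "mitgcm_shrunk"  # invented id, not registered anywhere official

    def __init__(self, wet_indices, full_size, fill_value=0.0):
        self.wet_indices = np.asarray(wet_indices, dtype=np.int64)
        self.full_size = int(full_size)
        self.fill_value = fill_value

    def decode(self, buf, out=None):
        wet = np.frombuffer(buf, dtype=">f4")   # shrunk record: ocean points only
        full = np.full(self.full_size, self.fill_value, dtype=">f4")
        full[self.wet_indices] = wet            # scatter back onto the full grid
        if out is not None:
            out[...] = full.reshape(np.shape(out))
            return out
        return full

    def encode(self, buf):
        full = np.asarray(buf).astype(">f4").ravel()
        return full[self.wet_indices].tobytes()  # drop the land points


register_codec(MITgcmShrunk)
```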

rabernat commented 2 weeks ago

Except it's really complicated because the "codec" for decoding each array relies on an external dataset (the null mask) which doesn't even have the same shape as the data. This breaks many of the abstractions implicit in the "codec" interface.