COSIMA / access-om3

ACCESS-OM3 global ocean-sea ice-wave coupled model

Analysis-ready chunking of diagnostic output files #203

Open aekiss opened 3 months ago

aekiss commented 3 months ago

Following from @Thomas-Moore-Creative's talk today, we should think about the NetCDF chunking we use to write to disk, so that the native chunking is OK for typical workflows.

Note that in a compressed, chunked NetCDF file, if you access any data in a chunk, the whole chunk needs to be read and uncompressed. So that can be a pitfall if the chunking doesn't match the access requirements, e.g. chunks that are too big in the wrong dimensions. For example, we had that problem with ERA5 forcing in ACCESS-OM2: https://github.com/COSIMA/access-om2/issues/242
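As a starting point, it's easy to check what chunking a file actually has. A minimal sketch with netCDF4-python (the file and variable names here are hypothetical):

```python
from netCDF4 import Dataset

# Hypothetical output file and variable name - substitute your own.
with Dataset("ocean_month.nc") as nc:
    var = nc.variables["temp"]
    print(var.dimensions)   # e.g. ('time', 'zl', 'yh', 'xh')
    print(var.chunking())   # per-dimension chunk sizes, or 'contiguous'
    print(var.filters())    # compression settings (zlib, complevel, ...)
```

The same information shows up as `_ChunkSizes` etc. in `ncdump -hs`.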

Maybe we should set up a discussion/poll on the forum?

Related:

Thomas-Moore-Creative commented 3 months ago

@aekiss - after all my bluster about how important the choice of "native chunking" on the raw output is, what do we know about the limitations (if any) on different models' ability to control chunking of output at run-time? Where do modellers have that control in, say, MOM6? Is that dependent on / limited by how the model tiling is set up?

A recent conversation I had with @dougiesquire mused about choosing a native chunking that is suited to, and facilitates, easier rechunking later. One of the problems that comes up is if you, for example, have very large chunks and are forced to load most or all of the dataset into memory to rechunk it into another arrangement.
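For the case where the native chunks are small enough to make this tractable, a tool like rechunker can do the rechunk with bounded memory by staging through a temporary store. A minimal sketch (file, variable, chunk sizes, and store paths are all hypothetical):

```python
import xarray as xr
from rechunker import rechunk  # https://rechunker.readthedocs.io

# Hypothetical names and sizes throughout.
ds = xr.open_dataset("ocean_month.nc", chunks={"time": 1})
source = ds["temp"].data  # dask array with the file's native chunking

# Rechunk to long-in-time chunks without loading the full array:
# intermediate results are staged in temp_store, bounded by max_mem.
plan = rechunk(source,
               target_chunks=(120, 1, 300, 300),
               max_mem="2GB",
               target_store="temp_rechunked.zarr",
               temp_store="temp_intermediate.zarr")
plan.execute()
```

The pathological case is the reverse: if the native chunks already span all of time, any spatial rechunk has to pull entire chunks (most of the dataset) into memory, which is the problem described above.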

That being said, I'm not clear what the current COSIMA native chunking is and whether it would need or benefit from change? (Other products I've come across very much do.)

aekiss commented 3 months ago

Good questions.

In terms of output directly from the model components,

Model runs are broken into short segments to fit into queue limits (so segments are shortest at high resolution, e.g. a few months), so post-processing would be required to change the chunking in time.
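As an illustration of what that post-processing could look like (a sketch; the glob pattern, variable name, dimensionality, and chunk sizes are hypothetical), segments can be concatenated and rewritten with larger time chunks:

```python
import xarray as xr

# Hypothetical layout: one file per run segment.
ds = xr.open_mfdataset("output*/ocean_month.nc",
                       combine="by_coords", chunks={"time": 1})

# Write a single file with 12-month time chunks for timeseries access,
# assuming a 4D (time, z, y, x) variable.
encoding = {"temp": {"chunksizes": (12, 1, 300, 300),
                     "zlib": True, "complevel": 1}}
ds[["temp"]].to_netcdf("ocean_month_all.nc", encoding=encoding)
```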

aekiss commented 3 months ago

The other consideration is the impact of chunking on IO performance of the model itself (which can become a bottleneck at high resolution). There's a lot of discussion of this in https://gmd.copernicus.org/articles/13/1885/2020/

It would be nice if there were a compromise that worked well both for runtime performance and analysis, but maybe these are incompatible and raw model outputs would require post-processing to suit analysis.

anton-seaice commented 3 months ago

I believe MOM chunk sizes are set in the FMS namelist:

```
&fms2_io_nml
    ncchksz = 4194304
...
```

which is 4 MB. I think part of the goal in keeping that size quite small is that it avoids splitting chunks during analysis as much as practical (and some other reason about cache sizes, I guess?)
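To get a feel for that number (the grid size below is hypothetical, roughly a quarter-degree global grid): a 4 MB chunk holds about a million float32 values, which is smaller than a single full 2D horizontal slice.

```python
# Back-of-envelope chunk sizing (hypothetical grid dimensions).
nx, ny = 1440, 1080          # horizontal grid points
bytes_per_value = 4          # float32
slice_mb = nx * ny * bytes_per_value / 2**20
print(f"One 2D slice: {slice_mb:.1f} MB")  # ~5.9 MB, larger than a 4 MB chunk
```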

It's hard to imagine model output having a chunk size in time of anything other than 1. Like either it needs:

So I think it's a question of how much extra time we want to spend running the model vs how much extra time it costs in analysis?

angus-g commented 3 months ago

I think that is a poorly-named parameter that refers only to the internal library chunking (and maybe even only to NetCDF classic files, rather than the HDF5-backed NetCDF4 files). The per-dimension chunking is defined in the netcdf var_def calls, and needs an array of chunk sizes rather than figuring it out from an overall chunk size. I think it is indeed the case that it depends on the IO_LAYOUT in the case of diagnostic output.
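For reference, this is what per-dimension chunking at variable definition time looks like, shown here via netCDF4-python for illustration (names and sizes are hypothetical; the Fortran analogue the model would use is nf90_def_var_chunking):

```python
from netCDF4 import Dataset

# Illustrative only - dimension names and sizes are hypothetical.
with Dataset("example.nc", "w") as nc:
    nc.createDimension("time", None)   # unlimited
    nc.createDimension("yh", 1080)
    nc.createDimension("xh", 1440)
    # Chunk sizes are fixed per dimension when the variable is defined.
    nc.createVariable("temp", "f4", ("time", "yh", "xh"),
                      zlib=True, complevel=1,
                      chunksizes=(1, 540, 720))
```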

anton-seaice commented 3 months ago

Thanks Angus! We might need to revisit ncchksz (which is more of a cache size) when we tune the IO_LAYOUT. And it makes sense that the chunk sizes in x/y are related to the IO_LAYOUT.