Thomas-Moore-Creative / Climatology-generator-demo

A demonstration / MVP to show how one could build an "interactive" climatology & compositing tool on Gadi HPC.
MIT License

Generating descriptive statistics composited over ENSO phase for very large ocean reanalysis data using xarray, dask, and flox. #17

Closed: Thomas-Moore-Creative closed this issue 3 months ago

Thomas-Moore-Creative commented 4 months ago

Title: Generating descriptive statistics composited over ENSO phase for very large ocean reanalysis data using xarray, dask, and flox.

  1. Objective: efficiently **generate the following set of descriptive statistical monthly and daily climatologies (including “all-time” climatologies and climatologies over gappy data masked month-by-month by ENSO phase based on the ONI index) on BRAN2020 output for a single 3D variable at a daily timestep**:
    1. mean
    2. median
    3. standard deviation
    4. min
    5. max
    6. quantile
  2. Data description: BRAN2020 (the 2020 version of the Bluelink ReANalysis) is an ocean reanalysis that combines observations with an eddy-resolving, near-global ocean general circulation model to produce a four-dimensional estimate of the ocean state. Overall, BRAN2020 is on the order of 50TB of float32 data spread over nearly 9000 netcdf file assets. For this problem we’ll focus only on the daily 3D temp variable.
    1. data source: 366 individual netcdf files on Australian HPC system Gadi
    2. this is accessed directly on the Gadi supercomputer ( national science resources - not public ) using a custom beta catalog I wrote using intake-esm and ecgtools - https://github.com/Thomas-Moore-Creative/BRAN2020-intake-catalog
    3. netcdf files are each 5GB, compressed, with the potential temperature temp variable stored in a short data type with native chunking like: temp:_ChunkSizes = 1, 1, 300, 300
    4. each individual netcdf file holds one month of daily data, so the length of Time varies from 28 to 31 across the 366 files concatenated together. A single compressed 5GB netcdf file becomes 34GB when loaded as float32
    5. When loaded, the temp variable xarray object has dimensions `Time: 11138, st_ocean: 51, yt_ocean: 1500, xt_ocean: 3600` and is **11.16TB** in float32
    6. At native chunk sizes the default chunks load at 351.56 kiB, so xarray_open_kwargs are employed to increase that. However, I believe Time chunks on a lazy load can’t be larger than the Time length of each individual netcdf file?
      1. so with xarray_open_kwargs of Time: 31 we’ll get actual chunks like [31, 28, 31, 30, 31, 30, ...] and leap years will be [..., 31, 29, 31, 30, ...]
  3. Resources: we are privileged and grateful to have meaningful HPC resources that could make working with 11 or 50TB possible given good code practice.
    1. A goal is to make things work for a 28 cpu & 252GB node
    2. Larger memory queues are possible but less available
  4. Desired outcome: develop a detailed understanding of best-practice for running all of these statistics (including median and quantile ) as both monthly and daily climatologies from daily data organised as 366 x one-month-per-netcdf file.
  5. Questions on considerations for best practice (a rough sketch of the intended call pattern is included after this list):
    1. What chunking xarray_open_kwargs do we use on the initial loading .open_mfdataset?
      1. Chosen based on some memory metric? ( some fraction of the memory per worker )?
      2. Chosen to be some multiple of the native ChunkSizes?
      3. Chosen to meet some requirements for better groupby operations via flox?
    2. If these calculations will be done in different ways, many times over, and by many people, would rechunking and writing to an “analysis-ready” zarr collection be useful? If so, what Time chunking should we choose: all-time ({'Time': -1}) or a monthly frequency?
    3. Which flox methods are best to choose?
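For concreteness, a minimal sketch of the call pattern I have in mind (the file list bran2020_temp_files, the chunk sizes, and the open settings are placeholders rather than settled choices):

import xarray as xr

xr.set_options(use_flox=True)  # flox is installed, so let xarray dispatch groupby reductions to it

# bran2020_temp_files stands in for the 366 netcdf paths from the intake-esm catalog
ds = xr.open_mfdataset(
    bran2020_temp_files,
    parallel=True,
    chunks={"Time": 31, "st_ocean": 10, "yt_ocean": 1500, "xt_ocean": 3600},
)

# monthly climatology of the daily 3D temp variable
monthly_mean = ds["temp"].groupby("Time.month").mean("Time")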
Thomas-Moore-Creative commented 4 months ago

Keeping in mind these helpful suggestions from @dcherian here: https://github.com/Thomas-Moore-Creative/Climatology-generator-demo/issues/16#issuecomment-2080144828

/// quote:@dcherian ///

Some general comments (applicable to latest flox/xarray)

  1. If you are not grouping by multiple variables, you can use the much nicer ds.groupby("time.month").mean() syntax and it will use flox automatically if installed.
  2. Over the past few months, I've worked on setting method automatically by looking at how the groups are distributed across chunks. You should not have to set method. This will let flox choose what's appropriate for whatever chunking you have.
  3. That said (2) runs on heuristics so it's possible that it doesn't make the right choice. Details will help, particularly what are you grouping by, and how is the big array chunked along the core dimensions of the groupby operation.
  4. Are you running into trouble with the groupby reduction, or on calculating anomalies w.r.t. the grouped-reduced values? With dask, you want to do those two steps separately, otherwise memory issues will often result.
  5. It's tempting to simply say .chunk(time=30) for e.g. but really you should think about rechunking to a frequency (e.g. monthly or two-monthly). We don't have nice syntax for this yet (https://github.com/pydata/xarray/issues/7559) but you can quite easily figure out the appropriate chunk tuple with ds.time.resample(time="2M").count(). If properly done, flox will then choose either "cohorts" or "blockwise" automatically for you, and save some memory. Here's an example: https://flox.readthedocs.io/en/latest/user-stories/climatology.html#rechunking-data-for-cohorts
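To make point 5 concrete, a small sketch (assuming the time dimension is named Time and using the two-month frequency from the flox example):

# count how many daily timestamps fall in each two-month bin
counts = ds.Time.resample(Time="2MS").count().values

# use those counts as an explicit chunk tuple so chunk boundaries follow the calendar
ds_freq = ds.chunk({"Time": tuple(int(n) for n in counts)})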
Thomas-Moore-Creative commented 4 months ago

What chunking xarray_open_kwargs do we use on the initial loading .open_mfdataset?

See: Dask chunking Best-Practices

First, "Select a good chunk size":

A common performance problem among Dask Array users is that they have chosen a chunk size that is either too small (leading to lots of overhead) or poorly aligned with their data (leading to inefficient reading).

While optimal sizes and shapes are highly problem specific, it is rare to see chunk sizes below 100 MB in size. If you are dealing with float64 data then this is around (4000, 4000) in size for a 2D array or (100, 400, 400) for a 3D array.

You want to choose a chunk size that is large in order to reduce the number of chunks that Dask has to think about (which affects overhead) but also small enough so that many of them can fit in memory at once. Dask will often have as many chunks in memory as twice the number of active threads.

and second, "Orient your chunks":

When reading data you should align your chunks with your storage format. Most array storage formats store data in chunks themselves. If your Dask array chunks aren’t multiples of these chunk shapes then you will have to read the same data repeatedly, which can be expensive. Note though that often storage formats choose chunk sizes that are much smaller than is ideal for Dask, closer to 1MB than 100MB. In these cases you should choose a Dask chunk size that aligns with the storage chunk size and that every Dask chunk dimension is a multiple of the storage chunk dimension.

So for example if we have an HDF file that has chunks of size (128, 64), we might choose a chunk shape of (1280, 6400).

Note that if you provide chunks='auto' then Dask Array will look for a .chunks attribute and use that to provide a good chunking.

For the 366 BRAN2020 temp netcdf files, chunks='auto' yields 88MB float32 chunks at (4, 4, 1200, 1200)


print_chunks output:

Time chunks: (4, 4, 4, 4, 4, 4, 4, 3, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 3, 4, 4, 4, 4, 4, 4, 4, 2, ...) [full tuple truncated: a repeating run of 4s with a trailing 1, 2, or 3 wherever a month's 28-31 daily steps don't divide evenly by 4]
st_ocean chunks: (4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 3)
yt_ocean chunks: (1200, 300)
xt_ocean chunks: (1200, 1200, 1200)
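For reference, a sketch of what the print_chunks helper used above might look like (its actual implementation isn't shown in this thread):

def print_chunks(da):
    # print the dask chunk tuple for each dimension of a chunked DataArray
    for dim, chunks in zip(da.dims, da.data.chunks):
        print(f"{dim} chunks: {chunks}")

print_chunks(DS.temp)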

Print # of dask tasks:

num_tasks = len(DS.temp.data.dask)
print(f"Number of tasks : {num_tasks}")

Number of tasks : 453390

ValueError: Zarr requires uniform chunk sizes except for final chunk. Variable named 'temp' has incompatible dask chunks:

However, each of the 366 netcdf files has a Time dimension of different length (28, 29, 30, or 31), which isn't evenly divisible by the chunking {Time: 4}. The resulting ragged chunks are not compatible with writing intermediate zarr ARD collections, since all chunks must be the same size except for the last.
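A quick way to confirm the problem (a sketch; DS is the lazily opened dataset from above):

# chunk sizes along the Time axis of the temp variable
time_chunks = DS.temp.data.chunks[0]

# zarr needs every chunk along a dimension, except the final one, to be the same size
print(len(set(time_chunks[:-1])) <= 1)  # False here, hence the ValueError from to_zarr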

Thomas-Moore-Creative commented 4 months ago

Let's assume that working from a zarr collection, written once from the 366 netcdf files and then re-loaded, is much more performant than working from an object loaded via xr.open_mfdataset from the 366 netcdf files. (We'll test this later once we have the zarr collections to compare.)

Given that the native netcdf storage format is "monthly" files with 28, 29, 30, or 31 Time steps, and that both 29 and 31 are prime, {'Time': 1} is the only uniform Time chunking available to us for the first write to zarr.

xarray_open_kwargs = {"chunks": {"Time": 1, "st_ocean": 20, "xt_ocean": 1200, "yt_ocean": 1200}} yields 109.86 MiB chunks, 200,484 chunks in total, and 401,334 tasks

The number of chunks, number of tasks, and size of chunks for this open with these kwargs are similar to the chunks='auto' job.

Next step: write "analysis ready data" (ARD) zarr collection from the 366 netcdf files using xarray_open_kwargs = {"chunks": {"Time": 1, "st_ocean": 20, "xt_ocean": 1200, "yt_ocean": 1200}}
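A sketch of that write step (the file list and the /scratch output path are placeholders):

import xarray as xr

# bran2020_temp_files stands in for the 366 netcdf paths from the intake catalog
ds = xr.open_mfdataset(
    bran2020_temp_files,
    parallel=True,
    chunks={"Time": 1, "st_ocean": 20, "yt_ocean": 1200, "xt_ocean": 1200},
)

# write the "analysis ready data" zarr collection once, then reuse it for all the statistics
ds[["temp"]].to_zarr("/scratch/BRAN2020_temp_T1.zarr", mode="w")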

dcherian commented 4 months ago

  1. I would just call ds.chunk({"Time": 4}).to_zarr(). This will get you even-size chunks.
  2. Your files are set up nicely for climatological calculations (both monthly ds.groupby("Time.month") and daily ds.groupby(ds.Time.dt.strftime('%h-%d'))) with method="cohorts". median and quantile won't work, but the rest should just work(TM). In my checks flox does detect the patterns and use method="cohorts" automatically.
  3. I suspect running ds, = dask.optimize(ds) right after open_mfdataset may help a bit by reducing the number of 'layers' in the graph.
  4. Your objectives here are to reduce the data in time. Choosing {"Time": -1} chunksize is a good idea for both daily and monthly climatologies. By choosing the biggest size along time, you'll reduce the data immediately upon read. To write these to Zarr, you can again rechunk to a uniform size just before writing.
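In code, points 1 and 3 might look like this sketch (the output path is a placeholder):

import dask

# collapse the graph layers accumulated by open_mfdataset (point 3)
ds, = dask.optimize(ds)

# rechunk to a uniform Time size so the intermediate zarr write is valid (point 1)
ds.chunk({"Time": 4}).to_zarr("/scratch/BRAN2020_temp_T4.zarr", mode="w")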

Thomas-Moore-Creative commented 4 months ago

(4) Your objectives here are to reduce the data in time. Choosing {"Time": -1} chunksize is a good idea for both daily and monthly climatologies. By choosing the biggest size along time, you'll reduce the data immediately upon read. To write these to Zarr, you can again rechunk to a uniform size just before writing.

Thanks heaps for all the advice above, @dcherian. Re: (4) We're blessed with a fair bit of /scratch space, and my current thinking for all these kinds of time-reduction operations is that it's always best to first convert the many netcdf files into a {"Time": -1} zarr collection for many people to compute against many times? Is this what you're saying here in (4)? And by the last sentence, are you flagging that the climatological output will end up with small chunks and that it's easy to rechunk before writing out the results?

It sounds like my next step is figuring out the best possible way to get that very large {"Time": -1} zarr collection written from the 11TB spread across 366 netcdf files without blowing up the cluster?

dcherian commented 4 months ago

Sorry I meant {"Time": -1} in open_mfdataset will chunk so that all timestamps in one file are in one chunk. That is going to be a monthly frequency chunking which will work nicely for the climatologies.

is that it's always best to first convert the many netcdf files into a {"Time": -1} zarr collection for many people to compute against many times

Well yes but then you have a different problem. But yeah if you get this then even medians and quantiles will work, and will work quite well.

It sounds like my next step is figuring out the best possible way to get that very large {"Time": -1} zarr collection written from the 11TB spread across 366 netcdf files without blowing up the cluster?

https://rechunker.readthedocs.io/en/latest/
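A hedged sketch of how rechunker might be applied here (paths, target chunk sizes, and max_mem are placeholders; the target_chunks layout follows the rechunker documentation for xarray inputs):

import xarray as xr
from rechunker import rechunk

# intermediate store written earlier with {"Time": 1} chunks (hypothetical path)
ds = xr.open_zarr("/scratch/BRAN2020_temp_T1.zarr")

target_chunks = {
    "temp": {"Time": 11138, "st_ocean": 5, "yt_ocean": 300, "xt_ocean": 300},
    "Time": None,       # None leaves the coordinate arrays as they are
    "st_ocean": None,
    "yt_ocean": None,
    "xt_ocean": None,
}

plan = rechunk(
    ds,
    target_chunks,
    max_mem="8GB",      # per-worker memory budget for the rechunk
    target_store="/scratch/BRAN2020_temp_alltime.zarr",
    temp_store="/scratch/rechunker_tmp.zarr",
)
plan.execute()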

Thomas-Moore-Creative commented 4 months ago

Thanks again @dcherian - have been using rechunker and trying to tune that for the 11TB float32 objects, this specific problem, and our clusters.

So the approach sounds like:

  1. {"Time": -1} in open_mfdataset, yielding monthly-frequency chunking, fed directly into flox-enabled ds.groupby("Time.month") for mean, std, min, & max
  2. a rechunker-written {"Time": -1} aka {'Time': 11138} zarr collection to load into flox-enabled ds.groupby("Time.month") for median & quantile (method would be "blockwise"?)

Lastly, by

Well yes but then you have a different problem.

I assume you are saying that a {"Time": -1} aka {'Time':11138} chunked array is poorly optimised for the ds.groupby("Time.month").mean() problem and will end up with many more tasks and smaller output chunks than is needed?

dcherian commented 4 months ago

I assume you are saying that a {"Time": -1} aka {'Time':11138} chunked array is poorly optimised for the ds.groupby("Time.month").mean() problem and will end up with many more tasks and smaller output chunks than is needed?

Not at all. It will work very well indeed. Just saying that the need for a rechunked array isn't in your "Objective" listed above.

Thomas-Moore-Creative commented 4 months ago

all-time base stats completed for 11TB 3D @ daily variables

Cluster: MegaMem, ncpus=48, mem=2990GB
Walltime: approx. 4 hours per variable
Approach: via PBS batch >> intake search >> lazy load with xarray_open_kwargs = {"chunks": {"Time": -1, 'st_ocean': 10}} >> .groupby().mean() w/ engine='flox', method="cohorts", skipna=False >> write to netcdf

( "base" stats doesn't include median or quantile )
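A minimal sketch of that pipeline (file list, variable name, and output path are placeholders):

import xarray as xr

xr.set_options(use_flox=True)

# lazy load straight from the 366 netcdf files, one month of Time per chunk
ds = xr.open_mfdataset(
    bran2020_temp_files,
    parallel=True,
    chunks={"Time": -1, "st_ocean": 10},
)

# monthly climatology; the engine/method/skipna settings mirror the run described above
gb = ds["temp"].groupby("Time.month")
clim_mean = gb.mean(engine="flox", method="cohorts", skipna=False)
clim_mean.to_netcdf("/scratch/temp_monthly_clim_mean.nc")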

dcherian commented 4 months ago

With that chunking quantile with flox will work quite well too. I recommend just adding q=0.5 to get the median at the same time. Flox will calculate grouped quantiles for all quantiles in a vectorized manner

Thomas-Moore-Creative commented 4 months ago

With that chunking quantile with flox will work quite well too. I recommend just adding q=0.5 to get the median at the same time. Flox will calculate grouped quantiles for all quantiles in a vectorized manner

Thanks for those recommendations!

I have a 3D @ daily variable chunked "for all-time" in an ARD zarr collection - so I'll test your advice above with the {"Time":[31,28,31,30 . . .]} vs. using that {"Time":11138} lazy object. [ EDIT FOR CLARITY ]

Thomas-Moore-Creative commented 4 months ago

Flox will calculate grouped quantiles for all quantiles in a vectorized manner

If I specify engine="flox" I get GroupBy.quantile() got an unexpected keyword argument 'engine'. I'm probably misunderstanding how this works (again). So since I have flox imported "Flox will calculate grouped quantiles for all quantiles in a vectorized manner" even without the engine="flox" argument?

Thomas-Moore-Creative commented 4 months ago

@dcherian - further when I try to run (with flox imported and the option to use in xr set to True) the following:

quant = ds.groupby(time_dim+'.month').quantile(q_list,skipna=skipna_flag,dim=time_dim)

I get this error: ValueError: Aggregation quantile is only supported for method='blockwise', but the chunking is not right. ???

I print out the chunking right before I call the above:

importing functions ...
Spinning up a dask cluster...
<Client: 'tcp://127.0.0.1:42425' processes=48 threads=48, memory=2.91 TiB>
variable requested: salt
Time chunks: (31, 28, 31, 30, 31, 30, 31, 31, 30, 31, 30, 31, 31, 28, 31, 30, ...) [full tuple truncated: 366 monthly chunks of 28-31 days each]
st_ocean chunks: (10, 10, 10, 10, 10, 1)
yt_ocean chunks: (300, 300, 300, 300, 300)
xt_ocean chunks: (300, 300, 300, 300, 300, 300, 300, 300, 300, 300, 300, 300)
dcherian commented 4 months ago

So since I have flox imported "Flox will calculate grouped quantiles for all quantiles in a vectorized manner" even without the engine="flox" argument?

o_O nice find. But yes you just need it installed and xr.set_options(use_flox=True). I think the top level quantile doesn't take arbitrary kwargs yet, so it complains.

quant = ds.groupby(time_dim+'.month').quantile(q_list,skipna=skipna_flag,dim=time_dim)

Another aside, it's nice to save the result of ds.groupby(time_dim+'.month') like so:

gb = ds.groupby(time_dim + '.month')
quant = gb.quantile(...)

This will save some repeated effort in setting up the groupby problem.

get this error: ValueError: Aggregation quantile is only supported for method='blockwise', but the chunking is not right.

Hmmm.. potentially a bug.

dcherian commented 4 months ago

Oh right this won't work.

Quantiles work by sorting (really, partitioning the data), so you need all the time points for each month in a single chunk.

Since you have a time-rechunked dataset, use that as an input.
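In code, that might look like the following sketch (store path, variable name, and quantile list are placeholders):

import xarray as xr

xr.set_options(use_flox=True)

# all-time-chunked ARD collection produced by rechunker (hypothetical path)
ds = xr.open_zarr("/scratch/BRAN2020_salt_alltime.zarr")

# every month's time points sit inside the single Time chunk, so the grouped
# quantile can run blockwise; including q=0.5 gives the median in the same pass
gb = ds["salt"].groupby("Time.month")
quant = gb.quantile([0.05, 0.5, 0.95], dim="Time", skipna=False)
quant.to_netcdf("/scratch/salt_monthly_clim_quantiles.nc")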

Thomas-Moore-Creative commented 4 months ago

Another aside, it's nice to save the result of ds.groupby(time_dim+'.month') like so:

gb = ds.groupby(time_dim + '.month')
quant = gb.quantile(...)

This will save some repeated effort in setting up the groupby problem.

Nice tip, thank you. Always keen to shave off walltime.

Thomas-Moore-Creative commented 4 months ago

Oh right this won't work.

Quantiles work by sorting (really, partitioning the data), so you need all the time points for each month in a single chunk.

Since you have a time-rechunked dataset, use that as an input.

Thanks for thinking this through. So, it looks like I need to fix the problems with my rechunker code to get chunk:"all-time" zarr collections for all my 3D variables.

It has been really useful to learn through this discussion that chunk:"all-time" isn't required for fast floxy aggregations like mean, std, min, and max. My code to generate these stats finishes in about 3 hours for an 11TB 3D & time float32 variable on 28 x 8GB (and possibly could drop some of that memory and reduce the resources used).

dcherian commented 4 months ago

Yes everything but quantiles. We could do an approximate_quantile with that chunking.

See https://github.com/xarray-contrib/flox/pull/284/files if you want to test that out.

Thomas-Moore-Creative commented 3 months ago

What we learned here has been factored into new code.