NetCDF datasets being slow/not scaling well has come up a lot. This PR adds a new benchmark that loads the nex-gddp-cmip6 dataset (https://registry.opendata.aws/nex-gddp-cmip6/) from AWS, which is stored as a bunch of .nc files, and converts that dataset to Zarr, a more modern, cloud-optimized format.
This uses xr.open_mfdataset(..., parallel=True), a pattern that is both common and really slow when opening lots of NetCDF files; I like that because I've seen it hit many users in practice.
One thing I'm not sure about is how representative this benchmark is as is. I don't know if folks do this NetCDF --> Zarr conversion in isolation, or always in conjunction with other "cloud optimizing" steps like rechunking.
EDIT: Here's a cluster link for the "small" version of this test https://cloud.coiled.io/clusters/594106/account/dask-engineering/information. It takes ~20 minutes and costs ~$0.75