coiled / benchmarks

BSD 3-Clause "New" or "Revised" License

Add benchmark for NetCDF --> Zarr cloud-optimization #1551

Closed jrbourbeau closed 1 month ago

jrbourbeau commented 1 month ago

Slow, poorly scaling NetCDF workloads have come up a lot. This PR adds a new benchmark that loads the nex-gddp-cmip6 dataset (https://registry.opendata.aws/nex-gddp-cmip6/) from AWS, where it is stored as many .nc files, and converts it to Zarr, a more modern, cloud-optimized format.

This uses xr.open_mfdataset(..., parallel=True), which is both common and really slow when opening lots of NetCDF files. I like that about this benchmark because it matches what I've seen many users do in practice.
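
For readers who haven't seen the pattern, here is a minimal sketch of the open-then-write workflow described above. It is not the benchmark code from this PR: the function name `netcdf_to_zarr`, its parameters, and the `combine="by_coords"` choice are assumptions made for illustration. Only `xr.open_mfdataset(..., parallel=True)` comes from the description.

```python
def netcdf_to_zarr(paths, store):
    """Open many NetCDF files as one lazy dataset and write it to a Zarr store.

    `paths` is a list of NetCDF paths (or opened file objects) and `store`
    is the Zarr destination. Both names are hypothetical, not from the PR.
    """
    # Imported inside the function so that defining it does not require xarray.
    import xarray as xr

    # parallel=True opens the individual files as dask tasks instead of one by
    # one in the client process. With many files this step can still be slow,
    # which is the behavior the benchmark is meant to capture.
    ds = xr.open_mfdataset(
        paths,
        combine="by_coords",  # assumption: files line up along shared coordinates
        parallel=True,
    )

    # mode="w" overwrites an existing store at the destination.
    ds.to_zarr(store, mode="w")
    return ds
```

A call would look something like `netcdf_to_zarr(list_of_nc_files, "output.zarr")`, with both arguments supplied by the caller.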

One thing I'm not sure about is how representative this benchmark is as is. I don't know if folks do this NetCDF --> Zarr conversion in isolation, or always in conjunction with other "cloud optimizing" steps like rechunking.
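
If rechunking does turn out to be part of the typical workflow, one way it might be added is sketched below. This is an assumption about what such a step could look like, not something this PR implements: `rechunk_and_write`, the `"time"` dimension name, and the chunk size in the usage note are all hypothetical.

```python
def rechunk_and_write(ds, store, chunks):
    """Rechunk an already opened xarray Dataset, then write it to Zarr.

    `chunks` maps dimension names to chunk sizes, e.g. {"time": 365}.
    All names here are hypothetical, chosen for illustration.
    """
    rechunked = ds.chunk(chunks)

    # Variables read from NetCDF can carry a "chunks" entry in their encoding
    # that describes the on-disk layout of the source files. Removing it lets
    # the Zarr chunking follow the new dask chunks instead.
    for var in rechunked.variables.values():
        var.encoding.pop("chunks", None)

    # mode="w" overwrites an existing store at the destination.
    rechunked.to_zarr(store, mode="w")
    return rechunked
```

Usage would be along the lines of `rechunk_and_write(ds, "output.zarr", {"time": 365})`, where the right chunk sizes depend on how the data will be read afterwards.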

EDIT: Here's a cluster link for the "small" version of this test: https://cloud.coiled.io/clusters/594106/account/dask-engineering/information. It takes ~20 minutes and costs ~$0.75.