ecmwf-lab / ecml-tools

Apache License 2.0
Why did you choose the Zarr format? #1

Open skaae opened 8 months ago

skaae commented 8 months ago

Hi,

I'm writing a data loader for GRIB weather data and found this project while browsing GitHub. I'm currently considering which format to use in the data loader, and I hope you have time to explain some of your design choices. I need to download data in GRIB format from ECMWF, GFS or HRRR and store it in a format that can be used for ML.

For the dataloader I experimented with storing the data as:

Zarr is easy to load and compatible with Xarray, but the result was much larger than the original GRIB files. Currently I'm leaning towards storing the data as individual GRIB files, one per field, because that requires the least disk space. Could you shed some light on why you chose the Zarr format? Is it because it's fast to load, or is it to stay compatible with WeatherBench2?

b8raoult commented 8 months ago

Because that format fits our need to run 100 epochs over multi-terabyte datasets when training a weather forecasting model. Each chunk holds one date/time with all the variables. Our datasets range between 7TB and 70TB.
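
To illustrate why that chunking matters for training, here is a minimal sketch in plain Python (no Zarr calls) that counts how many chunks one training sample touches under two layouts. The array shape, the sample index and the `chunks_touched` helper are all hypothetical, chosen only for illustration; they are not taken from ecml-tools.

```python
from math import prod


def chunks_touched(start, stop, chunks):
    """Count chunks intersected by the half-open box [start, stop) on each axis."""
    per_axis = [
        (hi - 1) // c - lo // c + 1
        for lo, hi, c in zip(start, stop, chunks)
    ]
    return prod(per_axis)


# Hypothetical dataset: 1000 date/times, 80 variables, 40320 grid points.
shape = (1000, 80, 40320)

# One training sample: a single date/time with all variables and grid points.
sample_start = (17, 0, 0)
sample_stop = (18, 80, 40320)

# Layout A: one chunk per date/time holding all variables (as described above).
per_date = (1, 80, 40320)
# Layout B (for contrast): one chunk per variable spanning all date/times.
per_variable = (1000, 1, 40320)

print(chunks_touched(sample_start, sample_stop, per_date))      # 1
print(chunks_touched(sample_start, sample_stop, per_variable))  # 80
```

With one chunk per date/time, each sample is a single read. With the per-variable layout in this sketch, the same sample touches 80 chunks, each of which also carries every other date/time.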