regDaniel opened this issue 1 year ago
I think we have gained some more experience with this during the development of icon_timeseries. Can we close this one @clairemerker, or do you think it is still relevant for iconarray? If yes, I should probably update the timings.
In a sense the issue is still relevant: @victoria-cherkas and I will write a new version of open_dataset() for iconarray based on what we learned in icon-timeseries. No need to update the timings in my opinion, but maybe keep the issue open; we can close it after the new implementation.
This issue serves mainly as documentation for us. We are trying to optimize the read-in with @clairemerker.
Some timings:

- `cfgrib.open_datasets(filelist[0], backend_kwargs={'indexpath': '', 'errors': 'ignore'}, encode_cf=("time", "geography", "vertical"))`: ~270 s
- `cfgrib.open_datasets(filelist[0], backend_kwargs={'indexpath': '', 'errors': 'ignore', "filter_by_keys": {"typeOfLevel": "generalVerticalLayer"}}, encode_cf=("time", "geography", "vertical"))`: ~40 s
- `xr.open_dataset(filelist[0], engine="cfgrib", backend_kwargs={'indexpath': '', 'errors': 'ignore', "filter_by_keys": {"typeOfLevel": "generalVerticalLayer"}}, encode_cf=("time", "geography", "vertical"))`: ~4-5 s (lazy loading; a subsequent `da.load()` takes ~80 s)
- `xr.open_dataset(filelist[0], engine="cfgrib", backend_kwargs={'indexpath': '', 'errors': 'ignore', "filter_by_keys": {"typeOfLevel": "generalVerticalLayer", "short_name": "T"}}, encode_cf=("time", "geography", "vertical"))`: ~8 s
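For reference, a minimal runnable sketch of the fastest (filtered, lazy) variant above; this is not copied from the timings, and `filelist`, the glob path, and the variable name `t` are placeholders. The filter uses the ecCodes key name `shortName`:

```python
import glob

import xarray as xr

# Placeholder path to the ICON GRIB forecast files
filelist = sorted(glob.glob("/store/path/to/icon/forecast/*.grb"))

ds = xr.open_dataset(
    filelist[0],
    engine="cfgrib",
    backend_kwargs={
        "indexpath": "",     # do not write .idx index files next to the data
        "errors": "ignore",  # skip GRIB messages cfgrib cannot decode
        # restrict the index to one level type and one variable
        "filter_by_keys": {"typeOfLevel": "generalVerticalLayer", "shortName": "T"},
    },
)

# open_dataset is lazy; the actual read happens only when the values are needed
da = ds["t"]    # variable name as chosen by cfgrib, check ds.data_vars
da = da.load()  # triggers the eager read of the filtered fields
```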
More timings with Dask (open 10 ICON forecast files and extract one variable):

- `xr.open_dataset` followed by `xr.concat` is ~5-10% faster than `xr.open_mfdataset`.
- when first merging the files with `cat`:
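A sketch of the two multi-file approaches compared above, under the same placeholder assumptions (`filelist` and the filter keys as in the single-file example; the concatenation dimension `time` depends on the actual file layout):

```python
import xarray as xr

backend_kwargs = {
    "indexpath": "",
    "errors": "ignore",
    "filter_by_keys": {"typeOfLevel": "generalVerticalLayer", "shortName": "T"},
}

# Variant 1: open each forecast file separately, then concatenate
# (this was ~5-10% faster in the timings above).
datasets = [
    xr.open_dataset(f, engine="cfgrib", backend_kwargs=backend_kwargs)
    for f in filelist
]
ds_concat = xr.concat(datasets, dim="time")

# Variant 2: let xarray open and combine the files in one call.
ds_mf = xr.open_mfdataset(
    filelist,
    engine="cfgrib",
    backend_kwargs=backend_kwargs,
    combine="nested",
    concat_dim="time",
    parallel=True,  # open the files in parallel via dask.delayed
)
```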
All timings were measured on Tsa reading from `/store`; reading from `/scratch` reduces read-in times by approximately 10%.
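Timings like the ones above can be reproduced with a simple harness along these lines (an assumption, not necessarily how the numbers were measured), separating the lazy open from the eager load as in the 4-5 s / 80 s entry:

```python
import time

import xarray as xr

t0 = time.perf_counter()
ds = xr.open_dataset(
    filelist[0],
    engine="cfgrib",
    backend_kwargs={
        "indexpath": "",
        "errors": "ignore",
        "filter_by_keys": {"typeOfLevel": "generalVerticalLayer"},
    },
)
t_open = time.perf_counter() - t0

t0 = time.perf_counter()
ds.load()  # forces the actual read of all filtered fields
t_load = time.perf_counter() - t0

print(f"open: {t_open:.1f} s, load: {t_load:.1f} s")
```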