clessig / atmorep

AtmoRep model code
MIT License
44 stars 11 forks source link

move to Dask array + first working version of the multiformer #37

Closed iluise closed 1 month ago

iluise commented 2 months ago

adding the following features:

1) Dask arrays in multifield_dala_sampler 2) faster way to retrieve the normalizers when lats-lons are in increasing order

3) fix embed/embeds mismatch when loading checkpoints 4) running multiformer configuration (multiples changes in train_multi.py), also from single field checkpoints 5) a bit of cleaning

kacpnowak commented 2 months ago

Would it be possible to use xarray to open zarr files? It works much better with timestamps and it's compatible with dask

clessig commented 2 months ago

Would it be possible to use xarray to open zarr files? It works much better with timestamps and it's compatible with dask

Could you reply a clear example where the benefit would be?

Dask is significantly faster than the native python zarr library. What is the performance of xarray?

kacpnowak commented 2 months ago

I didn't mean to use xarray instead of dask. I mean using it: here

self.ds = zarr.open( file_path)

In case of xarray it's just:

self.ds = xr.open_zarr( file_path)

In such case for datetime type casting becomes uncessary, eg. self.year_base = self.ds['time'][0].astype(datetime) as one can simply use self.ds.index['time'][0]

Additionally it better decodes timestamps. In case of FESOM data when timestamps are in daily resolution, I had to manually do conversion from int. Xarray figured it out on it's own

clessig commented 2 months ago

for datetime type casting becomes uncessary, eg. self.year_base = self.ds['time'][0].astype(datetime) as one can simply use self.ds.index['time'][0]

Additionally it better decodes timestamps. In case of FESOM data when timestamps are in daily resolution, I had to manually do conversion from int. Xarray figured it out on it's own

But by using xarray we would lose the performance advantage of dask, I assume. I would want to see that it's performance neutral before considering to use xarray instead of dask.

mlangguth89 commented 2 months ago

xarray has a bulit-in support for dask: https://docs.xarray.dev/en/latest/user-guide/dask.html

kacpnowak commented 2 months ago

@clessig There's no performance penalty for using Xarray. here is quite extensive comparison between zarr and xarray performance: https://github.com/pydata/xarray/issues/9111#issuecomment-2164233488

clessig commented 2 months ago

The question is wrt to dask (that might parallelize things etc) and not the native zarr library. Could you run with dask and xarray and num_worker_loaders=0 and see how many sampler/sec we get. Thanks!

kacpnowak commented 2 months ago

I'm on vacation so I can do it next week

clessig commented 1 month ago

Where do we stand here? Ready to merge?

@kacpnowak : did you check the performance of xarray?

kacpnowak commented 1 month ago

So far it's too difficult to make xarray only version. On developer meeting with @mlangguth89 we've talked about doing during hackaton in November, but I think it would be issue/branch