Would it be possible to use xarray to open zarr files? It works much better with timestamps and it's compatible with dask
Could you give a clear example of where the benefit would be?
Dask is significantly faster than the native Python zarr library. What is the performance of xarray?
I didn't mean using xarray instead of dask; I meant using it here:
self.ds = zarr.open(file_path)
With xarray it's just:
self.ds = xr.open_zarr(file_path)
In that case the datetime type casting becomes unnecessary, e.g.
self.year_base = self.ds['time'][0].astype(datetime)
as one can simply use self.ds.indexes['time'][0]
Additionally, it decodes timestamps better. In the case of FESOM data, where timestamps are at daily resolution, I had to do the conversion from int manually; xarray figured it out on its own.
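For illustration, a minimal sketch of the two approaches (the store path is a placeholder, and the CF-encoded time coordinate is an assumption, not something taken from the repo):

import datetime
import xarray as xr
import zarr

store = 'path/to/data.zarr'  # placeholder for the actual dataset path

# native zarr: values come back raw, so manual casting is needed
ds_zarr = zarr.open(store)
year_base = ds_zarr['time'][0].astype(datetime.datetime)

# xarray: CF time metadata is decoded automatically into datetime64 values
ds_xr = xr.open_zarr(store)
year_base = ds_xr.indexes['time'][0]  # a pandas.Timestamp, no manual casting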
But by using xarray we would lose the performance advantage of dask, I assume. I would want to see that it is performance-neutral before considering using xarray instead of dask.
xarray has built-in support for dask: https://docs.xarray.dev/en/latest/user-guide/dask.html
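A minimal sketch of what that looks like (the path, the variable name t2m, and the chunking are placeholders):

import xarray as xr

# passing `chunks` makes xarray back the variables with dask arrays,
# so everything below builds a lazy task graph instead of reading data eagerly
ds = xr.open_zarr('path/to/data.zarr', chunks={'time': 24})

mean_t2m = ds['t2m'].mean('time')  # still lazy
result = mean_t2m.compute()        # executed in parallel by dask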
@clessig There's no performance penalty for using xarray. Here is a fairly extensive comparison between zarr and xarray performance: https://github.com/pydata/xarray/issues/9111#issuecomment-2164233488
The question is w.r.t. dask (which might parallelize things, etc.), not the native zarr library. Could you run with dask and with xarray, with num_worker_loaders=0, and see how many samples/sec we get? Thanks!
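Something as rough as the timing loop below would already be informative (store path, variable name, and sample count are placeholders, not from the repo):

import time
import xarray as xr
import zarr

store = 'path/to/data.zarr'  # placeholder
var = 't2m'                  # placeholder field name
N = 256                      # number of samples to time

def samples_per_sec(read_one):
    t0 = time.perf_counter()
    for i in range(N):
        read_one(i)
    return N / (time.perf_counter() - t0)

z = zarr.open(store)
x = xr.open_zarr(store)

print('zarr  :', samples_per_sec(lambda i: z[var][i]))
print('xarray:', samples_per_sec(lambda i: x[var][i].values))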
I'm on vacation, so I can do it next week.
Where do we stand here? Ready to merge?
@kacpnowak: did you check the performance of xarray?
So far it has been too difficult to make an xarray-only version. At the developer meeting with @mlangguth89 we talked about doing it during the hackathon in November, but I think it would be a separate issue/branch adding the following features:
1) Dask arrays in multifield_data_sampler (see the sketch after this list)
2) a faster way to retrieve the normalizers when lats/lons are in increasing order
3) a fix for the embed/embeds mismatch when loading checkpoints
4) running the multiformer configuration (multiple changes in train_multi.py), also from single-field checkpoints
5) a bit of cleaning
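For point 1), a generic sketch of wrapping a zarr array as a dask array (store path and field name are placeholders; the actual multifield_data_sampler internals are not shown here):

import dask.array as da
import zarr

group = zarr.open('path/to/data.zarr', mode='r')  # placeholder store
field = da.from_zarr(group['t2m'])                # lazy dask view of the zarr array

batch = field[0:16, ...].compute()                # only the touched chunks are read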