casangi / xradio

Xarray Radio Astronomy Data IO

Fix memory issue by using `dask.config.set(scheduler="synchronous")` … #149

Closed · Jan-Willem closed this 4 months ago

Jan-Willem commented 4 months ago

The purpose of xradio._utils.zarr.common._load_no_dask_zarr is to load a processing set without using dask, so that it can be called inside a function that is wrapped in dask.delayed. However, when the selected slice spans multiple chunks on disk, the function consumes considerably more memory than xarray.open_zarr does.
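For context, a minimal sketch of the calling pattern, assuming one zarr array per partition; the names and the selection format below are illustrative, not the actual xradio signatures:

import dask
import zarr

def load_partition_eagerly(path, selection):
    # Stand-in for _load_no_dask_zarr: read the requested slice eagerly and
    # return plain numpy, so no dask graph is built inside the task.
    arr = zarr.open(path, mode="r")
    return arr[selection]

# The caller wraps the eager loader in dask.delayed, so each partition load
# becomes a single task in the outer graph.
lazy = dask.delayed(load_partition_eagerly)(
    "partition.zarr", (slice(0, 10), slice(None), slice(0, 800), slice(None))
)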

Example: if I have a zarr array on disk with shape (816, 1275, 3840, 2) and chunk shape (816, 1275, 200, 2), the following call loads the entire array into memory and then applies the subselection:

import zarr

store = zarr.DirectoryStore("path/to/array.zarr")  # placeholder for the on-disk store
array = zarr.Array(store=store)
sliced_array = array[0:10, :, 0:800, :]

The plot of the memory consumption:

[figure: memory_problem — memory consumption of the zarr.Array slice]
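For reference, a scaled-down, self-contained version of the call pattern above that can be used as a starting point for memory profiling; the shapes and the "example.zarr" path are illustrative, and the real arrays are far larger:

import numpy as np
import zarr

# Create a small on-disk array chunked along the third axis, so that a
# slice like 0:800 crosses several chunk boundaries.
z = zarr.open(
    "example.zarr", mode="w",
    shape=(16, 25, 3840, 2), chunks=(16, 25, 200, 2), dtype="float32",
)
z[:] = np.random.random(z.shape).astype("float32")

arr = zarr.open("example.zarr", mode="r")
sliced = arr[0:10, :, 0:800, :]  # the selection spans 4 chunks along axis 2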

This can be fixed by opening the store with xarray.open_zarr and loading it with the synchronous dask scheduler:

import dask
import xarray

xds = xarray.open_zarr(store)  # same on-disk store as above
with dask.config.set(scheduler="synchronous"):
    xds = xds.load()
[figure: ideal — memory consumption after the fix]

Using dask.config.set(scheduler="synchronous") forces .load() to run on dask's single-threaded synchronous scheduler: the task graph is executed in the calling thread, with no worker pool or additional threads (https://docs.dask.org/en/stable/scheduler-overview.html#debugging-the-schedulers).

This solution was suggested in the first comment of https://github.com/pydata/xarray/issues/3386.
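A sketch of how this could look inside a loader function, assuming the slice is applied with isel before loading; the function name, dimension names, and selection format are illustrative, not the actual change in this PR:

import dask
import xarray as xr

def load_slice_synchronously(store, isel_kwargs):
    # Open lazily: open_zarr wraps the on-disk chunks in dask arrays.
    xds = xr.open_zarr(store)
    # Subset first, so only the chunks overlapping the slice are read from disk.
    xds = xds.isel(**isel_kwargs)
    # Load in the calling thread: the synchronous scheduler spawns no workers,
    # which keeps it safe to run inside a dask.delayed task.
    with dask.config.set(scheduler="synchronous"):
        return xds.load()

# e.g. load_slice_synchronously(store, {"time": slice(0, 10), "frequency": slice(0, 800)})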

src/xradio/_utils/zarr/common.py:

src/xradio/vis/load_processing_set.py:

src/xradio/vis/read_processing_set.py: