casangi / xradio

Xarray Radio Astronomy Data IO

Fix memory issue by using `dask.config.set(scheduler="synchronous")` … #149

Closed · Jan-Willem closed this 4 months ago

Jan-Willem commented 4 months ago

The purpose of xradio._utils.zarr.common._load_no_dask_zarr is to load a processing set without using dask, so that it can be called inside a function that is wrapped in dask.delayed. However, when the selected slice spans multiple chunks on disk, the function consumes considerably more memory than xarray.open_zarr does.
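For context, a minimal sketch of the calling pattern, assuming one zarr array per partition; the names and the selection format below are illustrative, not the actual xradio signatures:

import dask
import zarr

def load_partition_eagerly(path, selection):
    # Stand-in for _load_no_dask_zarr: read the requested slice eagerly and
    # return plain numpy, so no dask graph is built inside the task.
    arr = zarr.open(path, mode="r")
    return arr[selection]

# The caller wraps the eager loader in dask.delayed, so each partition load
# becomes a single task in the outer graph.
lazy = dask.delayed(load_partition_eagerly)(
    "partition.zarr", (slice(0, 10), slice(None), slice(0, 800), slice(None))
)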

Example: if I have a zarr array on disk with shape (816, 1275, 3840, 2) and chunk shape (816, 1275, 200, 2), the following call loads the entire array into memory and then applies the subselection:

import zarr

store = zarr.DirectoryStore("path/to/array.zarr")  # placeholder for the on-disk store
array = zarr.Array(store=store)
sliced_array = array[0:10, :, 0:800, :]

The plot of the memory consumption:

[figure: memory_problem — memory consumption of the zarr.Array slice]
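For reference, a scaled-down, self-contained version of the call pattern above that can be used as a starting point for memory profiling; the shapes and the "example.zarr" path are illustrative, and the real arrays are far larger:

import numpy as np
import zarr

# Create a small on-disk array chunked along the third axis, so that a
# slice like 0:800 crosses several chunk boundaries.
z = zarr.open(
    "example.zarr", mode="w",
    shape=(16, 25, 3840, 2), chunks=(16, 25, 200, 2), dtype="float32",
)
z[:] = np.random.random(z.shape).astype("float32")

arr = zarr.open("example.zarr", mode="r")
sliced = arr[0:10, :, 0:800, :]  # the selection spans 4 chunks along axis 2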

This can be fixed by opening the store with xarray.open_zarr and loading it with the synchronous dask scheduler:

import dask
import xarray

xds = xarray.open_zarr(store)  # same on-disk store as above
with dask.config.set(scheduler="synchronous"):
    xds = xds.load()
[figure: ideal — memory consumption after the fix]

Using dask.config.set(scheduler="synchronous") forces .load() to run on dask's single-threaded synchronous scheduler: the task graph is executed in the calling thread, with no worker pool or additional threads (https://docs.dask.org/en/stable/scheduler-overview.html#debugging-the-schedulers).

This solution was suggested in the first comment of https://github.com/pydata/xarray/issues/3386.
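A sketch of how this could look inside a loader function, assuming the slice is applied with isel before loading; the function name, dimension names, and selection format are illustrative, not the actual change in this PR:

import dask
import xarray as xr

def load_slice_synchronously(store, isel_kwargs):
    # Open lazily: open_zarr wraps the on-disk chunks in dask arrays.
    xds = xr.open_zarr(store)
    # Subset first, so only the chunks overlapping the slice are read from disk.
    xds = xds.isel(**isel_kwargs)
    # Load in the calling thread: the synchronous scheduler spawns no workers,
    # which keeps it safe to run inside a dask.delayed task.
    with dask.config.set(scheduler="synchronous"):
        return xds.load()

# e.g. load_slice_synchronously(store, {"time": slice(0, 10), "frequency": slice(0, 800)})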

src/xradio/_utils/zarr/common.py:

src/xradio/vis/load_processing_set.py:

src/xradio/vis/read_processing_set.py: