frolovsa opened this issue 1 year ago
@timothyas I am asking Pete to look into this problem. Do you have any pointers from a technical perspective?
My one recommendation at this stage is to use xarray as an interface to zarr/dask/numpy. It is fairly intuitive to use in my opinion, and you don't necessarily have to be an expert to interface with all of them. An example of its usage could look something like:

```python
import xarray

xds = xarray.open_zarr("/path/to/my-data.zarr")  # automatically reads the data lazily, so does not pull it into memory
idx = 0  # e.g., an integer index or slice
xds = xds.isel(time=idx)  # selects the data along the dimension labelled "time", as an example
xds["my_variable"] = xds["my_variable"].persist()  # persist() operates on the underlying dask array, loading it into memory
```
Here's a quick reference to dask's persist, and here's the reference for Dask Arrays, which are how xarray accesses data lazily under the hood for operations on larger-than-memory data.
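To make the distinction concrete, here is a minimal illustration (generic dask, not specific to this repo): `persist` keeps the result as a dask array whose chunks are held in memory, while `compute` materializes a plain numpy array.

```python
import dask.array as da

# A chunked dask array; nothing is computed yet
x = da.random.random((1000, 1000), chunks=(100, 100))

y = (x + 1).persist()  # still a dask array, but its chunks are now held in memory
z = (x + 1).compute()  # a concrete numpy ndarray, fully materialized locally
```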
Currently, we use numpy arrays to store pre-formatted training data; for example, see https://github.com/NOAA-PSL/model_error_correction_with_ai/blob/dd0ce349c1e235a30ca3eb81848a36fde7ea8d07/training.py#L292
Unfortunately, that read statement: 1) doesn't work well for reading slices of the data, and 2) is sequential, and hence does not scale as the size of the dataset grows.
So the suggestion/hypothesis is that if we store the data as a zarr archive, we can use parallel I/O capabilities and read slices of the data more effectively; an illustrative sketch of such a read follows below. To test this, let's modify the prep code below to store a zarr archive instead of a numpy archive.
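To illustrate the hypothesis (the store path, variable name, and batch size here are hypothetical, not taken from the repo): with a zarr archive chunked along time, selecting a slice only touches the chunks that overlap it, and dask reads those chunks in parallel rather than scanning the whole file sequentially.

```python
import xarray

# Lazy open: only zarr metadata is read at this point
xds = xarray.open_zarr("/path/to/training-data.zarr")  # hypothetical path

# Slicing is still lazy; no array bytes have been read yet
subset = xds["training_data"].isel(time=slice(0, 32))  # hypothetical variable name

# Materializing reads only the chunks that overlap the slice,
# and dask fetches those chunks in parallel
batch = subset.values  # a numpy array, ready to feed a training loop
```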
This numpy array is prepared with this program: https://github.com/NOAA-PSL/model_error_correction_with_ai/blob/main/preprocess.py
The request is to: 1) copy preprocess.py as preprocess_zarr.py, and 2) modify preprocess_zarr.py to output a zarr archive instead of a numpy archive; a hedged sketch of the core change is below.
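A minimal sketch of what the core change in preprocess_zarr.py might look like, assuming the script currently ends by saving a large float32 array with np.save; the dimension names, shapes, and chunk size below are illustrative assumptions, not taken from the repo:

```python
import numpy as np
import xarray

# Stand-in for the array preprocess.py builds; the real script would reuse its own array
data = np.zeros((100, 4, 180, 360), dtype="float32")  # hypothetical (time, channel, lat, lon)

# Instead of np.save(out_path, data), wrap the array with labeled dimensions ...
xds = xarray.Dataset({"training_data": (("time", "channel", "lat", "lon"), data)})

# ... chunk along time so each sample is a separate chunk (cheap, parallel reads) ...
xds = xds.chunk({"time": 1})

# ... and write a zarr archive
xds.to_zarr("/path/to/training-data.zarr", mode="w")
```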
The test data is available on Hera. This install of Python should have the xarray dependencies available: /scratch2/BMC/gsienkf/Sergey.Frolov/anaconda3/bin/python