NOAA-PSL / model_error_correction_with_ai

0 stars 2 forks source link

enable parallel IO for training data #35

Open frolovsa opened 10 months ago

frolovsa commented 10 months ago

Currently, we use numpy arrays to store pre-formatted training data. For example as here: https://github.com/NOAA-PSL/model_error_correction_with_ai/blob/dd0ce349c1e235a30ca3eb81848a36fde7ea8d07/training.py#L292

Unfortunately, that read statement: 1) doesnt work well with reading slices of data. 2) It is sequential and hence does not scale as the size of the dataset grows.

So suggestion/hypothesis is that if we store data as a zarr archive we can use parallel io capabilities and read slices of data more effectively. To test this, lets modify this prep code below to store a zarr archive instead of a numpy archive.

This numpy array is prepared with this program https://github.com/NOAA-PSL/model_error_correction_with_ai/blob/main/preprocess.py

The request is to: 1) copy preprocess.py as preprocess_zarr.py 2) Modify preprocess_zarr.py to output zarr archive instead of numpy archive.

The test data is available on hera. this install of python should have xarray dependencies available. /scratch2/BMC/gsienkf/Sergey.Frolov/anaconda3/bin/python

frolovsa commented 10 months ago

@timothyas I am asking Pete to look into this problem. Do you have any pointers from a technical perspective?

timothyas commented 10 months ago

My one recommendation at this stage is to use xarray as an interface to zarr/dask/numpy. It is fairly intuitive to use in my opinion, and you don't necessarily have to be an expert to interface with all of them. An example of its usage could be something like

xds = xarray.from_zarr("/path/to/my-data.zarr") # automatically reads the data lazily, so does not pull it into memory
xds = xds.isel(time=idx) # selects the data based on the dimension labelled "time", as an example
xds["my_variable"] = xds["my_variable"].data.persist() # using .data accesses the underlying dask array

Here's a quick reference to dask's persist Here's the reference for Dask Arrays, which under the hood is how xarray accesses data lazily for out-of-memory operations