azavea / noaa-hydro-data

NOAA Phase 2 Hydrological Data Processing

Workaround for converting data to Parquet #81

Closed lewfish closed 2 years ago

lewfish commented 2 years ago

We have had difficulty converting the NWM subset from Zarr to Parquet in parallel using xarray and Dask. See https://github.com/pydata/xarray/issues/6811 and https://dask.discourse.group/t/workers-dont-have-promised-key-error-and-delayed-computation/936. If we cannot resolve these problems by reading the Dask documentation and forum posts, we should find a workaround. One potential workaround is to run a script in parallel on AWS Batch using its array job functionality, where each job selects a specific piece of the dataset and saves it as a Parquet file.
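
A rough sketch of what the per-job script could look like, not a tested implementation. AWS Batch sets `AWS_BATCH_JOB_ARRAY_INDEX` for each child of an array job; everything else here is an assumption for illustration: the `ZARR_URI`, `OUT_DIR`, and `N_JOBS` environment variables, the helper names, splitting along a `time` dimension, and having `pyarrow` (plus `s3fs` for S3 paths) installed in the job's container.

```python
import os


def chunk_bounds(n_items, n_jobs, job_index):
    """Return the half-open [start, stop) range of items owned by one job.

    Items are split as evenly as possible; the first `n_items % n_jobs`
    jobs each take one extra item.
    """
    if not 0 <= job_index < n_jobs:
        raise ValueError(f"job_index {job_index} outside [0, {n_jobs})")
    base, extra = divmod(n_items, n_jobs)
    start = job_index * base + min(job_index, extra)
    stop = start + base + (1 if job_index < extra else 0)
    return start, stop


def convert_piece(zarr_uri, out_dir, n_jobs, job_index, dim="time"):
    """Write this job's slice of the Zarr dataset as one Parquet file."""
    # Imported lazily so chunk_bounds can be used without xarray installed.
    import xarray as xr

    ds = xr.open_zarr(zarr_uri)
    start, stop = chunk_bounds(ds.sizes[dim], n_jobs, job_index)
    # Assumption: slicing along `dim` is the right way to split the data.
    piece = ds.isel({dim: slice(start, stop)})
    # The slice is loaded into memory here, so N_JOBS must be large
    # enough for one piece to fit on a single Batch instance.
    df = piece.to_dataframe()
    df.to_parquet(f"{out_dir}/part-{job_index:05d}.parquet")


if __name__ == "__main__":
    # AWS_BATCH_JOB_ARRAY_INDEX is set by Batch; the other variables are
    # hypothetical names that the job definition would need to provide.
    convert_piece(
        zarr_uri=os.environ["ZARR_URI"],
        out_dir=os.environ["OUT_DIR"],
        n_jobs=int(os.environ["N_JOBS"]),
        job_index=int(os.environ.get("AWS_BATCH_JOB_ARRAY_INDEX", "0")),
    )
```

Because each job opens the Zarr store on its own and writes its own file, no Dask scheduler or shared state is involved, which is the point of the workaround. `N_JOBS` would need to match the array size given when the job is submitted.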