Wanted: a script to extract one or more HUCs from an NWM short-range prediction NetCDF file and append them to an existing .parquet dataset.
I think the hard part of this issue is figuring out if/how it's possible to append to a Parquet file from Python, and what schema for the streams file is friendly to appending.
Assumptions:
clip_nwm.py will at some point be called in AWS Batch (or similar) with the required parameters to extract multiple HUCs
Because the NetCDF file contains only one prediction window, it makes sense to read it once and extract multiple HUCs rather than re-reading it once per HUC
We want to append to the Parquet file, at least at the row-group level
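The read-once, extract-many assumption above could be sketched roughly as follows. This is a hypothetical illustration, not code from the repo: it assumes the CHRTOUT data has already been loaded into a DataFrame with one row per reach (`feature_id`), and that a `feature_id` → HUC crosswalk table is available; the names `extract_hucs`, `nwm_df`, and `huc_lookup` are invented for the sketch.

```python
import pandas as pd

# Hypothetical sketch: nwm_df holds the single prediction window (one row per
# feature_id) loaded once from the NetCDF; huc_lookup maps feature_id -> HUC.
def extract_hucs(nwm_df: pd.DataFrame, huc_lookup: pd.DataFrame, hucs: list) -> dict:
    """Read the prediction data once, then slice out each requested HUC."""
    merged = nwm_df.merge(huc_lookup, on="feature_id", how="inner")
    wanted = merged[merged["huc"].isin(hucs)]
    # One DataFrame per requested HUC, keyed by HUC code
    return {huc: grp.reset_index(drop=True) for huc, grp in wanted.groupby("huc")}
```

The NetCDF is touched once no matter how many HUCs are requested; each per-HUC frame could then be handed to whatever does the Parquet write.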
Questions:
What is the layout of the Parquet file?
Is it actually possible to append to Parquet from Python libraries?
Should we be trying to append at all?
How big is a Parquet file for the Delaware basin? (Maybe it's large enough that we don't want to grow it.)
Notes:
It is not clear that appending to Parquet files is easy.
Many Stack Overflow examples re-read and re-write the whole file. That's not an option because we expect to read
It appears to be possible based on this Java implementation: https://github.com/apache/parquet-mr/pull/278
It does not appear to be possible using PyArrow