azavea / noaa-hydro-data

NOAA Phase 2 Hydrological Data Processing

Evaluate storing each feature_id as column in parquet #84

Closed vlulla closed 2 years ago

vlulla commented 2 years ago

Currently we convert the NWM zarr data into parquet, stored in long format with feature_id and time as columns. One of the lessons Terence learned from Rich Signell at the ESIP conference was that Rich got incredible performance by saving each feature_id as its own column. So we are trying to evaluate whether, instead of saving the data like this:

feature_id time val
feat1 time1 val_1_1
feat1 time2 val_1_2
feat1 time3 val_1_3
feat1 time4 val_1_4
feat2 time1 val_2_1
feat2 time2 val_2_2
feat2 time3 val_2_3
feat2 time4 val_2_4

would it be better to save it like this:

time feat1_val feat2_val
time1 val_1_1 val_2_1
time2 val_1_2 val_2_2
time3 val_1_3 val_2_3
time4 val_1_4 val_2_4

Since there are about 2.7e6 feature_ids, the wide table would have about 2.7e6 columns, and we are not sure whether parquet can handle that. This needs more investigation.

This is essentially a long-to-wide table transformation (see pandas.pivot, especially its examples).
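As a rough illustration of the transformation described above, here is a minimal pandas sketch on a toy table. The column names follow the layout in this issue; the numeric values and the `featN_val` renaming step are made up for the example, and this is not the actual conversion code used in the repo.

```python
import pandas as pd

# Toy long-format table mirroring the layout above (values are invented).
long_df = pd.DataFrame(
    {
        "feature_id": ["feat1"] * 4 + ["feat2"] * 4,
        "time": ["time1", "time2", "time3", "time4"] * 2,
        "val": [1.1, 1.2, 1.3, 1.4, 2.1, 2.2, 2.3, 2.4],
    }
)

# Pivot: one row per time, one column per feature_id.
wide_df = long_df.pivot(index="time", columns="feature_id", values="val")

# Rename columns to the featN_val convention and turn time back into a column.
wide_df.columns = [f"{fid}_val" for fid in wide_df.columns]
wide_df = wide_df.reset_index()

print(wide_df)
```

Note that `pivot` raises an error if any (time, feature_id) pair appears more than once, so it assumes each pair is unique in the long table.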

vlulla commented 2 years ago

The comment https://github.com/azavea/noaa-hydro-data/issues/89#issuecomment-1218235862 includes a workaround for converting from long to wide parquet. It is still not clear how we could convert the complete data set (2.7e6 feature_ids) from zarr to parquet in wide format.
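One possible direction, sketched here purely as an assumption and not taken from the linked comment, is to pivot the data in batches of feature_ids so that no single wide table has to hold every column at once. The helper name `pivot_in_batches` and the batch size are hypothetical, and whether this scales to 2.7e6 feature_ids is untested.

```python
import pandas as pd


def pivot_in_batches(long_df, batch_size):
    """Yield one wide DataFrame per batch of feature_ids (hypothetical helper)."""
    feature_ids = sorted(long_df["feature_id"].unique())
    for start in range(0, len(feature_ids), batch_size):
        batch = feature_ids[start:start + batch_size]
        # Keep only the rows for this batch, then pivot them to wide format.
        subset = long_df[long_df["feature_id"].isin(batch)]
        yield subset.pivot(index="time", columns="feature_id", values="val")


# Toy long-format input with three features and two time steps (invented values).
long_df = pd.DataFrame(
    {
        "feature_id": ["feat1", "feat1", "feat2", "feat2", "feat3", "feat3"],
        "time": ["time1", "time2"] * 3,
        "val": [1.1, 1.2, 2.1, 2.2, 3.1, 3.2],
    }
)

wide_parts = list(pivot_in_batches(long_df, batch_size=2))
for part in wide_parts:
    print(part)
```

Each yielded part could then be written to its own file with `DataFrame.to_parquet`, at the cost of having the wide data split across several files keyed by the same time index.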