azavea / noaa-hydro-data

NOAA Phase 2 Hydrological Data Processing

Create narrow Parquet file for retrospective analysis data #98

Closed jpolchlo closed 1 year ago

jpolchlo commented 2 years ago

Following the failure of #96 and #97, we'd like to aggregate the channel routing data from the historical retrospective data set into a single Parquet file. This file should use row grouping to enable speedy access to only the reach IDs that are relevant. (Here is a useful SO post regarding filtering by row groups.)

This conversion can be done in whichever environment is convenient (Dask, Spark, etc.).

rajadain commented 1 year ago

When we make a narrow file, we lose the advantages of columnar sorting, so it doesn't give us the performance benefits we might otherwise have gotten.

jpolchlo commented 1 year ago

As a slight clarification for posterity: I think that the use of row groups mentioned in the issue text can provide the speed you're looking for. Depending on how many ways you want to index the files, this may lead to a huge number of constituent files for the stored Parquet, which can have implications for the cost of creating the resource. It's also clear from benchmarks that Python-ecosystem Parquet libraries are not as optimized as Zarr. However, there's no specific technical limitation of narrow Parquet files that makes them unable to access a subset of data in a performant fashion.

But, I'm OK with closing this, since there's not exactly a drum-beat of voices clamoring for more Parquet.