Improve Parquet results from ESIP

lewfish commented 2 years ago

As part of this, we should re-run the benchmarks in https://github.com/azavea/noaa-hydro-data/blob/master/src/esip-2022-presentation/benchmark_queries.ipynb against the wide-format dataset Vijay created at s3://azavea-noaa-hydro-data/esip-experiments/datasets/reanalysis-chrtout/parquet/vl/wide-parquets-all-feature_ids/

If we can't get the numbers close to the Zarr ones, we should also try other ways of formatting the Parquet data.

vlulla commented 2 years ago

I ran the benchmark against the wide Parquet dataset in a separate notebook (s3://noaa-notebooks/vlulla/benchmark-using-wide-parquet.ipynb) and found that the wide format did not really improve the numbers. The notebook's input and outputs are here:

Wide parquet dataset: s3://azavea-noaa-hydro-data/esip-experiments/datasets/reanalysis-chrtout/parquet/vl/wide-parquets-all-feature_ids/streamflow-1990-1999-consolidated-wide.parquet
Benchmark results: s3://azavea-noaa-hydro-data/esip-experiments/benchmarks/vl/08-22-2022-with-wide-parquet.csv
Plot of the results: s3://azavea-noaa-hydro-data/esip-experiments/plots/parquet/vl/08-22-2022-wide-parquets/parquet.png
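
For readers unfamiliar with the long/wide distinction, here is a minimal sketch of the reshaping involved. It assumes the wide layout means one row per timestamp and one column per feature_id; the thread does not spell out the actual schema, and the records, function names, and values below are hypothetical and use only the Python standard library rather than the notebook's code.

```python
from collections import defaultdict

# Hypothetical long-format records: (time, feature_id, streamflow).
# These values are made up for illustration, not taken from the dataset.
long_rows = [
    ("1990-01-01T00", 101, 1.5),
    ("1990-01-01T00", 102, 0.7),
    ("1990-01-01T01", 101, 1.6),
    ("1990-01-01T01", 102, 0.8),
]


def pivot_to_wide(rows):
    """Return (sorted feature_ids, {time: [streamflow per feature_id]})."""
    feature_ids = sorted({fid for _, fid, _ in rows})
    column_of = {fid: i for i, fid in enumerate(feature_ids)}
    # Each timestamp gets one row with a slot per feature_id column.
    wide = defaultdict(lambda: [None] * len(feature_ids))
    for time, fid, flow in rows:
        wide[time][column_of[fid]] = flow
    return feature_ids, dict(wide)


def series_for_feature(feature_ids, wide, fid):
    """Read a single column: the time series for one feature_id."""
    i = feature_ids.index(fid)
    return [(t, wide[t][i]) for t in sorted(wide)]


feature_ids, wide = pivot_to_wide(long_rows)
series = series_for_feature(feature_ids, wide, 101)
```

Because Parquet is a columnar format, a per-feature time-series query against a layout like this only needs to read that feature's column, which is the usual motivation for trying a wide layout.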

rajadain commented 1 year ago

Not pursuing this at this time.

Improve Parquet results from ESIP #94