Output files in parquet (pyarrow) format

ClimateCompatibleGrowth / gui_workflow

The snakemake workflow for the Gulf UnderSea Interconnector feasibility study

MIT License

Output files in parquet (pyarrow) format #23

Closed abhishek0208 closed 3 years ago

abhishek0208 commented 3 years ago

Aggregated results files are now written in parquet (pyarrow) format to reduce file sizes. Initial tests show a reduction from around 400 MB to 27 MB for the aggregated results of a 10-replicate set of model runs.

willu47 commented 3 years ago

Nice @abhishek0208! Did you investigate pd.CategoricalIndex? It might be worth investigating casting your indices to this type to obtain extra savings. I believe the CategoricalIndex is understood by parquet and further reduces duplication of data. The duplicates are stored in a lookup table, and referenced by an integer, rather like you would in a normalised database.
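To illustrate the lookup-table idea, here is a minimal sketch using pandas only. The technology labels and the repeat count are hypothetical and are not taken from the workflow's actual results. It compares the in-memory footprint of a plain index with that of a `pd.CategoricalIndex`; whether the same saving carries over to the parquet file on disk would need to be measured separately.

```python
import pandas as pd

# Hypothetical labels: a handful of distinct values repeated many times,
# as is typical for index levels in long-format model results.
labels = ["GAS_PLANT", "SOLAR_PV", "INTERCONNECTOR"] * 10_000

plain_index = pd.Index(labels, name="TECHNOLOGY")
cat_index = pd.CategoricalIndex(labels, name="TECHNOLOGY")

# The lookup table holds each distinct label once ...
print(list(cat_index.categories))

# ... and every row stores only a small integer code pointing into it.
print(cat_index.codes[:6].tolist())

# Compare in-memory sizes, including the string contents.
plain_bytes = plain_index.memory_usage(deep=True)
cat_bytes = cat_index.memory_usage(deep=True)
print(f"plain: {plain_bytes} bytes, categorical: {cat_bytes} bytes")
```

An existing column can be cast in the same spirit with `df["TECHNOLOGY"].astype("category")` before it is set as the index.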

willu47 commented 3 years ago

Also, any reason to prefer parquet over feather?

abhishek0208 commented 3 years ago

Thanks @willu47, I'll look into incorporating pd.CategoricalIndex. We're already getting a pretty significant reduction in file sizes but it's worth maximising it for the long term. I'll also work on filtering out technologies and commodities to further reduce file sizes. Just wanted to get this up and running for our immediate use.

abhishek0208 commented 3 years ago

And I did a quick search for performance benchmarks comparing feather and parquet. According to this, parquet (written via pyarrow) produces smaller files but takes longer to read them back in. So I went with parquet, since file size is our more pressing concern.

willu47 commented 3 years ago

I got a different take-away:

“As our little test shows, it seems that feather format is an ideal candidate to store the data between Jupyter sessions. It shows high I/O speed, doesn’t take too much memory on the disk and doesn’t need any unpacking when loaded back…” — Ilia Zaitsev https://towardsdatascience.com/the-best-format-to-save-pandas-data-414dca023e0d?source=social.tw

It would be interesting to see if there's a difference between the two for our data, but as you say, 27 MB versus 400 MB is a great improvement.