Closed: abhishek0208 closed this 3 years ago
Nice @abhishek0208! Did you investigate pd.CategoricalIndex? It might be worth casting your indices to this type to obtain extra savings. I believe the CategoricalIndex is understood by parquet and further reduces duplication of data. Each unique value is stored once in a lookup table and referenced by an integer code, rather like you would in a normalised database.
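To illustrate the lookup-table idea, here is a minimal sketch using pandas only. The technology labels and the series are made-up stand-ins, not the actual model results, and the savings on real data will depend on how often labels repeat.

```python
import pandas as pd

# Hypothetical labels: a handful of technology names repeated many times.
labels = ["coal", "gas", "wind", "solar"] * 25_000
values = range(len(labels))

# The same data indexed two ways: plain labels vs. a CategoricalIndex.
plain = pd.Series(values, index=pd.Index(labels, name="technology"))
categorical = pd.Series(
    values, index=pd.CategoricalIndex(labels, name="technology")
)

# In-memory footprint of each index, including the label strings themselves.
plain_bytes = plain.index.memory_usage(deep=True)
categorical_bytes = categorical.index.memory_usage(deep=True)

# The CategoricalIndex keeps each unique label once (the lookup table)
# and stores a small integer code per row that points into it.
categories = categorical.index.categories
codes = categorical.index.codes

print(f"plain index:       {plain_bytes} bytes")
print(f"categorical index: {categorical_bytes} bytes")
print(f"lookup table:      {list(categories)}")
```

Whether the same saving carries over to the file on disk depends on the writer; parquet already dictionary-encodes repeated strings in many cases, so the gain may be smaller there than in memory.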
Also, any reason not to use feather versus parquet?
Thanks @willu47, I'll look into incorporating pd.CategoricalIndex. We're already getting a pretty significant reduction in file sizes but it's worth maximising it for the long term. I'll also work on filtering out technologies and commodities to further reduce file sizes. Just wanted to get this up and running for our immediate use.
And I did a quick search for performance benchmarks comparing feather and parquet: According to this, parquet (written via pyarrow) produces smaller files but takes longer to read them in. So I went with parquet, since file size is our more pressing concern.
I got a different take-away:
“As our little test shows, it seems that feather format is an ideal candidate to store the data between Jupyter sessions. It shows high I/O speed, doesn’t take too much memory on the disk and doesn’t need any unpacking when loaded back…” — Ilia Zaitsev https://towardsdatascience.com/the-best-format-to-save-pandas-data-414dca023e0d?source=social.tw
Be interesting to see if there's a difference between the two, but as you say - 27 MB over 400 MB is a great improvement.
Aggregated results files are now in parquet (pyarrow) format to reduce file sizes. Initial tests show a reduction from around 400 MB to 27 MB for aggregated results of a 10-replicate set of model runs.