ActivitySim / activitysim

An Open Platform for Activity-Based Travel Modeling
https://activitysim.github.io
BSD 3-Clause "New" or "Revised" License

New Data Pipeline File Format #645

Open jpn-- opened 1 year ago

jpn-- commented 1 year ago

CS will change ActivitySim’s default storage format for tabular data from HDF5 to Parquet. CS will benchmark the associated change in ActivitySim performance by documenting the resulting file sizes and read/write times, using the MTC full-scale example model. The consultant will ensure that users can continue to use HDF5 files via a switch in the top-level settings, and will also enable HDF5 data compression options.
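To make the proposal concrete, here is a minimal sketch of how a top-level switch and a write benchmark might look. The setting name `pipeline_format`, its allowed values, and the helper functions are all hypothetical and are not ActivitySim's actual API. The writers assume pandas with a Parquet engine (pyarrow or fastparquet) and PyTables installed.

```python
import time
from pathlib import Path

# Hypothetical setting values; the real setting name and choices may differ.
SUPPORTED_FORMATS = {"parquet", "hdf5"}


def resolve_pipeline_format(settings: dict) -> str:
    """Return the tabular storage format, defaulting to Parquet."""
    fmt = str(settings.get("pipeline_format", "parquet")).lower()
    if fmt not in SUPPORTED_FORMATS:
        raise ValueError(f"unsupported pipeline_format: {fmt!r}")
    return fmt


def write_table(df, path: Path, fmt: str) -> Path:
    """Write a pandas DataFrame in the chosen format and return the output path."""
    if fmt == "parquet":
        out = path.with_suffix(".parquet")
        df.to_parquet(str(out))  # needs pyarrow or fastparquet
    else:
        out = path.with_suffix(".h5")
        # complib/complevel enable HDF5 compression; needs PyTables
        df.to_hdf(str(out), key="table", mode="w", complib="zlib", complevel=5)
    return out


def benchmark_write(df, path: Path, fmt: str) -> dict:
    """Time a single write and report the resulting file size in bytes."""
    start = time.perf_counter()
    out = write_table(df, path, fmt)
    elapsed = time.perf_counter() - start
    return {"format": fmt, "seconds": elapsed, "bytes": out.stat().st_size}
```

Calling `benchmark_write` once per format on the same table would give the file-size and write-time comparison described above. Read times could be measured the same way around `pd.read_parquet` and `pd.read_hdf`.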

stefancoe commented 1 year ago

The release of pandas 2.0 includes Apache Arrow as an optional backend, and it can be used when reading from Parquet files. This could result in performance improvements, including reduced RAM usage, as string columns would use the Apache Arrow string type instead of the Python object type (if I am understanding things correctly). @jpn-- , @i-am-sijia any thoughts on these impacts? Perhaps we can touch on this at an upcoming Consortium meeting if it seems like the improvements could be non-trivial.

https://arrow.apache.org/blog/2019/02/05/python-string-memory-0.12/ https://datapythonista.me/blog/pandas-20-and-the-arrow-revolution-part-i