ActivitySim / activitysim

An Open Platform for Activity-Based Travel Modeling
https://activitysim.github.io
BSD 3-Clause "New" or "Revised" License
190 stars 99 forks source link

Update to use pandas 2.x #838

Open jpn-- opened 6 months ago

jpn-- commented 6 months ago

Addresses #794.

The update from pandas 1.x to 2.x introduces a number of small but material changes that affect ActivitySim:

jpn-- commented 6 months ago

While I've made these updates and all the regular CI tests pass (i.e. the results look correct), I have discovered the change to pandas 2.x incurs a significant runtime penalty when running without sharrow.

non-sharrow test timings for pandas 1.x:

58.60s call     activitysim/examples/prototype_mtc/test/test_mtc.py::test_mtc_mp
53.71s call     activitysim/examples/prototype_mtc/test/test_mtc.py::test_mtc
53.66s call     activitysim/examples/prototype_mtc/test/test_mtc.py::test_mtc_chunkless
53.23s call     activitysim/examples/prototype_mtc/test/test_mtc.py::test_mtc_recode

non-sharrow test timings for pandas 2.x:

148.50s call     activitysim/examples/prototype_mtc/test/test_mtc.py::test_mtc
148.14s call     activitysim/examples/prototype_mtc/test/test_mtc.py::test_mtc_chunkless
147.83s call     activitysim/examples/prototype_mtc/test/test_mtc.py::test_mtc_recode
140.09s call     activitysim/examples/prototype_mtc/test/test_mtc.py::test_mtc_mp

It will require some research to figure out why this is happening, and whether it can be solved relatively easily... or at all. Initial profiling suggests the problem is in pandas.core.internals.managers.BlockManager.get_dtypes, which is getting called from df.eval, but we almost certainly do not want to mess around with pandas internals.