arup-group / genet

Manipulate MATSim networks via a Python API.
MIT License

Significant increase in time to build `SpatialTree` #199

Closed KasiaKoz closed 1 year ago

KasiaKoz commented 1 year ago

There is a significant increase in time to build SpatialTree.

The command intermodal-access-egress-network, run on the quite detailed Londinium test network, spends a long time at the step:

... - Building Spatial Tree

Runtimes increased from <1min to ~25min. This test network, while detailed, is still a drop in the ocean compared to what our usual networks look like. This kind of increase in time is prohibitive for us.

Before Python 3.11, it took:

38.46s user; 51.189 total

using commit f112d7c8de52cfd31f1d8623d6429cf2f414dbf3. It now takes:

1483.67s user; 25:26.03 total

KasiaKoz commented 1 year ago

Found the root of the problem: the operation pd.DataFrame().T.to_dict() is to blame. In the version of pandas we're using now, it takes a silly amount of time to convert a wide DataFrame to a dictionary (though converting a long one is actually faster than before).

Below I also test a 'custom rearrange' that takes the long-form .to_dict() output (d) and converts it to wide form with just a dict comprehension:

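def _long_wide_dict(d):
    # Regroup {col: {row: value}} into {row: {col: value}}, keeping every column of each row.
    row_ids = next(iter(d.values()), {})
    return {row_idx: {col: col_val[row_idx] for col, col_val in d.items()} for row_idx in row_ids}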
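
To make the intended rearrangement concrete, here is a minimal sketch on a hand-written dictionary, assuming the goal is for every row to keep all of its columns. The name `long_to_wide` and the `link_1`/`link_2` sample data are made up for illustration and are not taken from genet. It uses an explicit loop with `setdefault` rather than a single comprehension, so that values for the same row accumulate instead of replacing one another.

```python
def long_to_wide(d):
    """Turn {column: {row: value}} into {row: {column: value}}."""
    wide = {}
    for col, col_val in d.items():
        for row_idx, row_val in col_val.items():
            # Create the per-row dict on first sight, then add this column to it.
            wide.setdefault(row_idx, {})[col] = row_val
    return wide


# Hypothetical long-form data, shaped like the output of DataFrame.to_dict().
long_form = {
    "length": {"link_1": 10.0, "link_2": 25.5},
    "freespeed": {"link_1": 4.0, "link_2": 12.5},
}
wide_form = long_to_wide(long_form)
print(wide_form)
```

Each row key ends up mapped to a dictionary holding both `length` and `freespeed`, which is the same shape `.T.to_dict()` produces.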

Here are the times:

Python 3.7 + pandas==1.3.5

DataFrame shape: (50000, 20)
it took 0.00048732757568359375s to `.T`
it took 0.5953989028930664s to `.to_dict()`
it took 2.153273105621338s to `.T.to_dict()`
it took 0.7090747356414795s to custom rearrange `.to_dict()`

Python 3.11 + pandas==2.1.1

DataFrame shape: (50000, 20)
it took 0.0002789497375488281s to `.T`
it took 0.15897417068481445s to `.to_dict()`
it took 41.112441062927246s to `.T.to_dict()`
it took 0.32709574699401855s to custom rearrange `.to_dict()`

brynpickering commented 1 year ago

Quite an impact! Could you try with pandas 2.0.3? Pandas 2.1 has some regressions that impact runtimes.

KasiaKoz commented 1 year ago

Nice one @brynpickering

Python 3.11 + pandas==2.0.3

DataFrame shape: (50000, 20)
it took 0.00023674964904785156s to `.T`
it took 0.12676477432250977s to `.to_dict()`
it took 1.0371029376983643s to `.T.to_dict()`
it took 0.24923920631408691s to custom rearrange `.to_dict()`
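
For anyone wanting to repeat the comparison on their own pandas version, a rough sketch of a timing harness follows. It is not the script used for the numbers above: the random float data, the column names, and the helper names `long_to_wide` and `time_conversions` are assumptions. It also times `to_dict(orient="index")`, a pandas option that returns the same row-keyed shape and was not measured in this thread, so no claim is made here about how fast it is on any given version.

```python
import time

import numpy as np
import pandas as pd


def long_to_wide(d):
    """Regroup {column: {row: value}} into {row: {column: value}}."""
    rows = next(iter(d.values()), {})
    return {r: {c: vals[r] for c, vals in d.items()} for r in rows}


def time_conversions(n_rows=50_000, n_cols=20, seed=0):
    """Time three ways of building a row-keyed dict from a DataFrame."""
    rng = np.random.default_rng(seed)
    df = pd.DataFrame(
        rng.random((n_rows, n_cols)),
        columns=[f"col_{i}" for i in range(n_cols)],
    )
    candidates = {
        ".T.to_dict()": lambda: df.T.to_dict(),
        "to_dict(orient='index')": lambda: df.to_dict(orient="index"),
        "custom rearrange": lambda: long_to_wide(df.to_dict()),
    }
    timings, results = {}, {}
    for label, convert in candidates.items():
        start = time.perf_counter()
        results[label] = convert()
        timings[label] = time.perf_counter() - start
    return timings, results
```

Calling `time_conversions(50_000, 20)` matches the DataFrame shape reported above; printing the returned `timings` gives one figure per approach, and the returned `results` can be compared to check that all three approaches agree.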