Transport-for-the-North / caf.toolkit

Toolkit of transport planning and appraisal functionalities.
https://caftoolkit.readthedocs.io/en/stable/
Other
0 stars 2 forks source link

Optimising the slow matrix translation methods #79

Closed BenTaylor-TfN closed 1 year ago

BenTaylor-TfN commented 1 year ago

pandas_multi_vector_zone_translation() is doing the same thing as half of the pandas_matrix_zone_translation() when it falls back to the slow method. This should be consolidated.

Furthermore, pandas_multi_vector_zone_translation() is faster than pandas_matrix_zone_translation() when the latter needs to fall back to it's slow method. This should be optimised.

The initial code

    import pandas as pd
    from caf.toolkit import timing
    from caf.toolkit import translation

    # Read in data
    lookup = pd.read_csv(r"I:\Data\Zone Translations\cache\noham_ntem\noham_to_ntem_pop_2018.csv")
    mat = pd.read_csv(r"I:\NorMITs Demand\Forecasting\NTEM\8.0\iter3.1\Core\car_and_passenger\Matrices\OD\hb_od_from_yr2028_p1_m3_tp1.csv.bz2", index_col=0)

    # pandas_matrix_zone_translation() method
    start = timing.current_milli_time()
    trans = translation.pandas_matrix_zone_translation(
        matrix=mat,
        translation=lookup,
        translation_from_col='noham_id',
        translation_to_col='ntem_id',
        translation_factors_col='noham_to_ntem',
        from_unique_index=lookup['noham_id'].unique(),
        to_unique_index=lookup['ntem_id'].unique()
    )
    end = timing.current_milli_time()
    print(f"matrix took: {end - start}ms")

    # pandas_multi_vector_zone_translation() method
    start = timing.current_milli_time()
    half_done = translation.pandas_multi_vector_zone_translation(
        vector=mat,
        translation=lookup,
        translation_from_col='noham_id',
        translation_to_col='ntem_id',
        translation_factors_col='noham_to_ntem',
        from_unique_index=lookup['noham_id'].unique(),
        to_unique_index=lookup['ntem_id'].unique(),
        check_totals=False,
    )
    done = translation.pandas_multi_vector_zone_translation(
        vector=half_done.transpose(),
        translation=lookup,
        translation_from_col='noham_id',
        translation_to_col='ntem_id',
        translation_factors_col='noham_to_ntem',
        from_unique_index=lookup['noham_id'].unique(),
        to_unique_index=lookup['ntem_id'].unique(),
        check_totals=False,
    ).transpose()
    end = timing.current_milli_time()
    print(f"multi-vector took: {end - start}ms")

Run results

The above code results in the following.

It's worth noting a few things:

Finally, I've known this is an issue for a while and never got round to optimising the "slow" translation. I think this is the route we should now take.

Further tests

I think there's a few things we should investigate further here:

Furthermore, it's clear the "slow" method needs optimising. I see two options:

  1. Just replace it with the multi-vector function. Although, this is written in pandas so it will need updating for numpy data.
  2. Investigate numba / cython optimisations for the current code. It's slow because it's a for loop.

The two options should all be tested with the above tests as well to make sure we end up with the optimal version of code.

I'll open a new Issue for this issue specifically.

BenTaylor-TfN commented 1 year ago

Had a look at the numba optimisations today and it's not good news! Code can be found here: https://github.com/Transport-for-the-North/caf.toolkit/tree/numba-slow-translation

Bottom line, numba isn't well optimised for broadcasting, so while the code is faster, it's still >1min.

Feel the best option here might be to trust pandas optimisation and use that when we fall back to the slow method. Why re-invent the wheel?

isaac-tfn commented 1 year ago

Feel the best option here might be to trust pandas optimisation and use that when we fall back to the slow method. Why re-invent the wheel?

Agreed. I think pandas is generally going to be optimised and tested more extensively than anything we'll come up with in a reasonable time frame.