MIERUNE / gtfs-parser

parse and aggregate GTFS
https://pypi.org/project/gtfs-parser/
MIT License

Improve performance of stop unification for frequency aggregation #5

Closed takohei closed 3 months ago

takohei commented 4 months ago

Problem

Frequency aggregation is slow: it takes about 10 seconds on sample data containing 1,580 trips. That is roughly 20 times longer than reading stops and routes (0.5 s).

Cause

In frequency aggregation, stop aggregation accounts for 94% of the total processing time, and most of that time is spent in __get_similar_stop_tuple(). The cause is that __get_similar_stop_tuple() is called once per stop via map() on the stops data frame, and each call searches and sorts all of the stops, so the total cost grows roughly quadratically with the number of stops.
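As a rough illustration of the slow pattern described above (not the actual gtfs-parser code; the column names and matching rule are simplified assumptions), each map() call re-runs query() and sort_values() over the full stops table:

```python
import pandas as pd

# Hypothetical stops table; schema is illustrative only.
stops = pd.DataFrame({
    "stop_id": ["A1", "A2", "B1"],
    "stop_name": ["Alpha", "Alpha", "Beta"],
})

def get_similar_stop_tuple(stop_id: str) -> tuple:
    # Each call scans and sorts the whole DataFrame, so calling it
    # once per stop via map() costs roughly O(n^2) overall.
    name = stops.query("stop_id == @stop_id")["stop_name"].iloc[0]
    similar = stops.query("stop_name == @name").sort_values("stop_id")
    return tuple(similar["stop_id"])

# map() invokes the function for every stop -- the pattern seen
# in the profile (1,226 calls dominating the runtime).
grouped = stops["stop_id"].map(get_similar_stop_tuple)
```

The repeated df.query() calls also explain the large cumulative time under frame.py:query/eval in the profile.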

Profiling results

Sun Apr 21 02:20:36 2024    chitetsu.prof
         18400825 function calls (17686952 primitive calls) in 17.311 seconds
   Ordered by: cumulative time
   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
    720/1    0.010    0.000   17.334   17.334 {built-in method builtins.exec}
        1    0.000    0.000   17.334   17.334 test_vary_gtfs.py:1(<module>)
        1    0.001    0.001   15.349   15.349 test_vary_gtfs.py:82(main)
        1    0.004    0.004   15.349   15.349 test_vary_gtfs.py:50(exec_test)
        1    0.001    0.001   14.435   14.435 aggregate.py:12(__init__)
        1    0.003    0.003   14.434   14.434 aggregate.py:34(__aggregate_similar_stops)
        7    0.017    0.002   14.391    2.056 {pandas._libs.lib.map_infer}
        6    0.000    0.000   14.127    2.354 series.py:3908(map)
        6    0.000    0.000   14.124    2.354 base.py:1078(_map_values)
     1226    0.071    0.000   13.863    0.011 aggregate.py:91(<lambda>)
     1226    0.082    0.000   13.792    0.011 aggregate.py:134(__get_similar_stop_tuple)
     1228    0.017    0.000    6.339    0.005 frame.py:3197(query)
     1228    0.019    0.000    5.668    0.005 frame.py:3359(eval)
     9877    0.071    0.000    4.191    0.000 frame.py:2869(__getitem__)
     1228    0.021    0.000    3.950    0.003 eval.py:161(eval)
     7376    0.076    0.000    2.737    0.000 managers.py:1436(take)
27257/27244    0.209    0.000    2.714    0.000 series.py:201(__init__)
       25    0.000    0.000    2.666    0.107 __init__.py:1(<module>)
     6149    0.019    0.000    2.654    0.000 generic.py:3355(_take_with_is_copy)
     9820    0.027    0.000    2.383    0.000 common.py:50(new_method)

Solution

I expect the process to be much faster if the aggregation is performed on the entire stops data frame at once, instead of once per stop.
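A minimal sketch of that vectorized direction, using the same illustrative schema as above (grouping by stop_name is an assumption, not the library's actual similarity rule):

```python
import pandas as pd

# Hypothetical stops table; schema is illustrative only.
stops = pd.DataFrame({
    "stop_id": ["A1", "A2", "B1"],
    "stop_name": ["Alpha", "Alpha", "Beta"],
})

# One sort and one groupby pass over the whole frame replace the
# per-stop search-and-sort: each group is materialized once and the
# resulting tuple is broadcast back to every member of the group.
similar = (
    stops.sort_values("stop_id")
    .groupby("stop_name")["stop_id"]
    .agg(tuple)
)
stops["similar_stops"] = stops["stop_name"].map(similar)
```

This turns O(n) full-table scans into a single grouped pass, which is where the expected speedup would come from.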

Sample data

GTFS: feed_chitetsu_chitetsubus_20240326_191913.zip
Results of cProfile: chitetsu_prof.txt