performance improve - GitHub Issues

Kanahiro commented 1 year ago

Large dataset for benchmarking: https://opentransportdata.swiss/de/dataset/timetable-2021-gtfs2020

time python -m gtfs_parser parse gtfs_fp2021_2021-12-08_09-10.zip swiss
extracting zipfile...
GTFS loaded.

real    218m5.224s
user    56m11.213s
sys     0m11.147s

liorsteinberg commented 5 months ago

Hey @Kanahiro ,

I think the read_routes function can be made much faster.

I noticed the current implementation for generating GeoJSON features from the merged DataFrame iterates over each unique route_id, which can be quite slow, especially for large DataFrames. This is due to the repetitive filtering and sorting operations within a loop.

To enhance performance, I suggest leveraging pandas' groupby and apply methods. This approach efficiently groups the DataFrame by both route_id and trip_id and then applies a function to each group to construct the GeoJSON feature. This method minimizes the repetitive operations and leverages pandas' optimized group processing, which can significantly improve the execution time.

Locally, I replaced this code:

# parse routes
for route_id in merged["route_id"].unique():
    route = merged[merged["route_id"] == route_id]
    trip_id = route["trip_id"].unique()[0]
    route = route[route["trip_id"] == trip_id].sort_values("stop_sequence")
    features.append(
        {
            "type": "Feature",
            "geometry": {
                "type": "LineString",
                "coordinates": route[
                    ["stop_lon", "stop_lat"]
                ].values.tolist(),
            },
            "properties": {
                "route_id": str(route_id),
                "route_name": route.route_concat_name.values.tolist()[0],
            },
        }
    )

with this:

def create_feature(group):
    # Assuming the group is already sorted by stop_sequence, if not, uncomment the next line
    # group = group.sort_values("stop_sequence")
    route_id = group["route_id"].iloc[0]
    route_name = group["route_concat_name"].iloc[0]

    feature = {
        "type": "Feature",
        "geometry": {
            "type": "LineString",
            "coordinates": group[["stop_lon", "stop_lat"]].values.tolist(),
        },
        "properties": {
            "route_id": str(route_id),
            "route_name": route_name,
        },
    }

    return feature

# Ensure the DataFrame is sorted by stop_sequence before applying the function
merged_sorted = merged.sort_values(["route_id", "trip_id", "stop_sequence"])

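# Build one feature per (route_id, trip_id) group. Iterating over the groups
# keeps the grouping columns available inside create_feature; passing them
# through groupby(...).apply(...) is deprecated as of pandas 2.2.
features = [
    create_feature(group)
    for _, group in merged_sorted.groupby(["route_id", "trip_id"])
]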
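
One caveat with the groupby version: it emits one feature per (route_id, trip_id) pair, whereas the original loop keeps only the first trip_id of each route. A minimal sketch that keeps the one-feature-per-route behaviour while still avoiding the repeated per-route filtering could look like the following. `route_features` is a hypothetical helper name, and the column names are taken from the snippets in this comment rather than from the current gtfs-parser code.

```python
import pandas as pd


def route_features(merged: pd.DataFrame) -> list:
    """Build one LineString feature per route_id, using its first trip only."""
    # For every row, look up the first trip_id seen for that row's route.
    # This mirrors route["trip_id"].unique()[0] in the loop version.
    first_trip = merged.groupby("route_id", sort=False)["trip_id"].transform("first")

    # Keep only the rows that belong to that first trip of each route.
    one_trip = merged[merged["trip_id"] == first_trip]

    features = []
    # sort=False keeps routes in order of first appearance, like unique().
    for route_id, group in one_trip.groupby("route_id", sort=False):
        group = group.sort_values("stop_sequence")
        features.append(
            {
                "type": "Feature",
                "geometry": {
                    "type": "LineString",
                    "coordinates": group[["stop_lon", "stop_lat"]].values.tolist(),
                },
                "properties": {
                    "route_id": str(route_id),
                    "route_name": group["route_concat_name"].iloc[0],
                },
            }
        )
    return features
```

Each route is filtered once with a vectorised comparison instead of once per loop iteration, and the per-group sort only touches the stops of a single trip.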

Kanahiro commented 3 months ago

Hi @liorsteinberg, sorry for the late response. The newest version of gtfs-parser is dramatically faster, so please try it if you are still interested :)

Kanahiro commented 3 months ago

Roughly 350x faster in wall-clock time (218m5s before, 36.8s now)...

time poetry run python -m gtfs_parser aggregate /Users/kanahiro/Downloads/GTFS_FP2021_2021-12-08_09-10 output
GTFS loaded.

real    0m36.841s
user    0m32.781s
sys     0m3.893s
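
As a rough check of the speedup, the two `time` outputs in this thread can be compared directly. This sketch assumes the two runs are comparable, although the first used the `parse` subcommand and the second `aggregate`, and the hardware may differ. `to_seconds` is a hypothetical helper.

```python
def to_seconds(minutes: int, seconds: float) -> float:
    """Convert a `time`-style XmY.ZZZs reading to seconds."""
    return minutes * 60 + seconds


# Timings copied from the two benchmark runs in this issue.
before_real = to_seconds(218, 5.224)
after_real = to_seconds(0, 36.841)
before_user = to_seconds(56, 11.213)
after_user = to_seconds(0, 32.781)

real_speedup = before_real / after_real
user_speedup = before_user / after_user

print(f"wall-clock speedup: {real_speedup:.0f}x")
print(f"user CPU speedup:   {user_speedup:.0f}x")
```

This gives a wall-clock ratio of about 355x and a user CPU ratio of about 103x.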

Closing this :)

liorsteinberg commented 3 months ago

Amazing work! Thanks

MIERUNE / gtfs-parser

performance improve #1