stop names matched incorrectly

polettif commented 3 years ago

Running the following snippet gives me a route to Zürich, Berninaplatz instead of Bern. Does this work as intended? If to matches multiple stop names I'd expect it to use the exact match.

library(gtfsrouter)
packageVersion("gtfsrouter")
#> [1] '0.0.4.147'
gtfs_url = "https://opentransportdata.swiss/dataset/6f55f96d-7644-4901-b927-e9cf05a8c7f0/resource/5a12e34f-7536-4648-885c-f9febf9f6e0c/download/gtfs_fp2020_2020-12-09.zip"
gtfs_file = tempfile(fileext = ".zip")
download.file(gtfs_url, gtfs_file)

gtfs = extract_gtfs(gtfs_file, quiet = T)
gtfs <- gtfs_timetable(gtfs, date = 20200514)

gtfs_route(gtfs, "Zürich HB", "Bern", start_time = 11*3600)
#>   route_name                trip_name                   stop_name arrival_time departure_time
#> 1         14          Zürich, Seebach      Zürich, Bahnhofquai/HB     10:59:00       11:00:00
#> 2         14          Zürich, Seebach   Zürich, Stampfenbachplatz     11:01:00       11:01:00
#> 3         14          Zürich, Seebach           Zürich, Beckenhof     11:02:00       11:02:00
#> 4         14          Zürich, Seebach       Zürich, Kronenstrasse     11:03:00       11:03:00
#> 5         14          Zürich, Seebach   Zürich, Schaffhauserplatz     11:05:00       11:05:00
#> 6         14          Zürich, Seebach      Zürich, Guggachstrasse     11:06:00       11:06:00
#> 7         10 Zürich Flughafen, Fracht Zürich, Hirschwiesenstrasse     11:08:00       11:08:00
#> 8         14          Zürich, Seebach           Zürich, Milchbuck     11:07:00       11:08:00
#> 9         10 Zürich Flughafen, Fracht        Zürich, Berninaplatz     11:09:00       11:09:00
#>

On a side note, the resulting route does not seem correct, there are some weird transfers happening on Hirschwiesenstrasse/Milchbuck, the actual best route has no transfers.

mpadge commented 3 years ago

I'll check out the main issue right now, but not in the meantime in regard to "wierd transfers" #61, which reflects some huge re-engineering of new gtfs_traveltimes() functions. That also returns trips with fewer transfers than gtfs_routes(), and I would trust that new algorithm over the present gtfs_routes(). The whole package will be updated once gtfs_traveltimes() has reached a stable state.

mpadge commented 3 years ago

That commit implements a grep_fixed parameter, giving the following results, where the warnings at the top of each are the key.

library (gtfsrouter)
packageVersion("gtfsrouter")
#> [1] '0.0.4.150'

gtfs = extract_gtfs("routing.zip") # downloaded from your URL above
#> ▶ Unzipping GTFS archive
#> ✔ Unzipped GTFS archive
#> ▶ Extracting GTFS feed✔ Extracted GTFS feed 
#> ▶ Converting stop times to seconds✔ Converted stop times to seconds 
#> ▶ Converting transfer times to seconds✔ Converted transfer times to seconds
gtfs <- gtfs_timetable(gtfs, date = 20200514)

gtfs_route(gtfs, "Zürich HB", "Bern", start_time = 11*3600, grep_fixed = TRUE)
#> Warning in station_name_to_ids(i, gtfs, from_to_are_ids, grep_fixed): The name [Bern] matches multiple stops spread up to 333.7km apart.
#> Considering refining matching via `grep` with `grep_fixed = FALSE`.
#>   route_name                trip_name                   stop_name arrival_time
#> 1         14          Zürich, Seebach      Zürich, Bahnhofquai/HB     10:59:00
#> 2         14          Zürich, Seebach   Zürich, Stampfenbachplatz     11:01:00
#> 3         14          Zürich, Seebach           Zürich, Beckenhof     11:02:00
#> 4         14          Zürich, Seebach       Zürich, Kronenstrasse     11:03:00
#> 5         14          Zürich, Seebach   Zürich, Schaffhauserplatz     11:05:00
#> 6         14          Zürich, Seebach      Zürich, Guggachstrasse     11:06:00
#> 7         10 Zürich Flughafen, Fracht Zürich, Hirschwiesenstrasse     11:08:00
#> 8         14          Zürich, Seebach           Zürich, Milchbuck     11:07:00
#> 9         10 Zürich Flughafen, Fracht        Zürich, Berninaplatz     11:09:00
#>   departure_time
#> 1       11:00:00
#> 2       11:01:00
#> 3       11:02:00
#> 4       11:03:00
#> 5       11:05:00
#> 6       11:06:00
#> 7       11:08:00
#> 8       11:08:00
#> 9       11:09:00
gtfs_route(gtfs, "Zürich HB", "^Bern", start_time = 11*3600, grep_fixed = FALSE)
#> Warning in station_name_to_ids(i, gtfs, from_to_are_ids, grep_fixed): The name [^Bern] matches multiple stops spread up to 306.1km apart.
#> Considering refining matching via `grep` with `grep_fixed = FALSE`.
#>   route_name trip_name stop_name arrival_time departure_time
#> 1          8      Brig Zürich HB     10:55:00       11:02:00
#> 2          8      Brig      Bern     11:58:00       12:06:00
gtfs_route(gtfs, "Zürich HB", "^Bern$", start_time = 11*3600, grep_fixed = FALSE)
#>   route_name trip_name stop_name arrival_time departure_time
#> 1          8      Brig Zürich HB     10:55:00       11:02:00
#> 2          8      Brig      Bern     11:58:00       12:06:00

^{Created on 2020-12-14 by the reprex package (v0.3.0)}

Let me know if you think there's a better way to do this, and please re-open with suggestions if so. Also, do you think it would be worth adding an auxilliary function to check station name matching? gtfs_station_name(), which would return a filtered version of gtfs$stops containing matching rows, to enable easy checking of grep patterns when using grep_fixed = FALSE.

Thank you for the inspiration for these improvements!

polettif commented 3 years ago

Thank you for the inspiration for these improvements!

No worries, happy to help and discuss this :)

IMO with R being used for analysis in scripts I'd like to reduce "black box input handling" as much as possible. If an input does not work, throw a meaningful error. A function shouldn't try to fix wrong inputs and, more importantly, users should not have to anticipate a function's error-fixing (like "^Bern$", depending on the default). It should be up to the user to decide how to change the input. That's why tidytransit's raptor only uses stop_ids and the slightly higher level travel_times only accepts exact stop_names. That might be verbose and less convenient but I prefer it that way.

Now, with that philosophical preamble out of the way, I would handle stop names like this:

Exact string comparison of a stop name (with ==)
If theres no match after 1: match with regex. If theres only one matching stop name, carry on (maybe with a message).
If the input matches multiple names, throw an error (although the distance you provided is a nice touch).

If 3 happens I'd say the best way is to use gtfs_station_name() (or find_stop_name or find_stop_id) before running the routing functions. That way the user needs to decide which stop to use because "Bern" and "Zürich, Berninaplatz" are both equally likely to be the right stop. This has the added benefit (IMO) that analysis steps are in more separated functions. I guess my main point is: Finding the right stop based on a string is a separate task from routing, so it should be handled separately.

UrbanAnalyst / gtfsrouter

stop names matched incorrectly #66