UrbanAnalyst / gtfsrouter

Routing and analysis engine for GTFS (General Transit Feed Specification) data
https://urbananalyst.github.io/gtfsrouter/

bug with multiple from stop_ids for gtfs_traveltimes #70

Closed: AlexandraKapp closed this issue 3 years ago

AlexandraKapp commented 3 years ago

A weird bug occurred while working with the Stuttgart feed:

Using several stops as the from input returns fewer results than using only a single station (100 results vs. 8000 results). Feed: https://www.openvvs.de/dataset/gtfs-daten/resource/bfbb59c7-767c-4bca-bbb2-d8d32a3e0378

library(gtfsrouter)

gtfs <- extract_gtfs(file.path("gtfs.zip"))
#> ▶ Unzipping GTFS archive✔ Unzipped GTFS archive
#> ▶ Extracting GTFS feed
#> Warning in data.table::fread(flist[f], integer64 = "character", showProgress
#> = FALSE): Found and resolved improper quoting out-of-sample. First healed
#> line 109: <<"21-10-j21-1","","10","Marienplatz - Degerloch (Zahnradbahn
#> "Zacke")","1400","FFB300","004299">>. If the fields are not quoted (e.g. field
#> separator does not appear within any field), try quote="" to avoid this warning.
#> ✔ Extracted GTFS feed
#> ▶ Converting stop times to seconds✔ Converted stop times to seconds
#> ▶ Converting transfer times to seconds✔ Converted transfer times to seconds
ttable <- gtfs_timetable(gtfs, day = "tuesday")

print(gtfs$agency[1,])
#>    agency_id agency_name        agency_url agency_timezone agency_lang
#> 1:         1         VVS http://www.vvs.de   Europe/Berlin          DE
#>     agency_phone
#> 1: +49 711 66060
print(gtfs$calendar[1,])
#>    service_id monday tuesday wednesday thursday friday saturday sunday
#> 1:         T0      1       1         1        1      1        0      0
#>    start_date end_date
#> 1:   20201220 20210320

start_ids <- ttable$stops[grepl("Charlottenplatz", ttable$stops$stop_name)] 
start_ids <- start_ids[1:9, ]$stop_id # exclude two "Charlottenplatz" stops in Esslingen

print(start_ids)
#> [1] "de:08111:6075:1:1" "de:08111:6075:2:2" "de:08111:6075:3:7"
#> [4] "de:08111:6075:4:8" "de:08111:6075:5:1" "de:08111:6075:6:2"
#> [7] "de:08111:6075:7:3" "de:08111:6075:8:5" "de:08111:6075:9:4"

pt_traveltimes <- gtfs_traveltimes(ttable, start_ids, from_is_id = TRUE, start_time = 8 * 3600)
nrow(pt_traveltimes)
#> [1] 101

# only taking the first start station of the previous run results in more results
start_ids <- c("de:08111:6075:1:1")

pt_traveltimes <- gtfs_traveltimes(ttable, start_ids, from_is_id = TRUE, start_time = 8 * 3600)
nrow(pt_traveltimes)
#> [1] 8188

Created on 2021-01-13 by the reprex package (v0.3.0)

mpadge commented 3 years ago

Wow, that uncovered a major :bug: on my machine. Because of locale-specific character translations, reading that feed failed entirely: in the original format, the name of the trip_id column in the stop_times table is rendered as a name object rather than the expected character object. I have no idea why this happens, but have implemented a work-around regardless via this commit. That then enables me to reproduce your result, which I'll get onto asap. Thanks!
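For anyone hitting something similar, here is a minimal, purely illustrative sketch of that kind of work-around, and not the actual fix from the commit; it assumes the feed has been read into gtfs as above, and simply repairs a mangled trip_id column name:

# Illustrative sketch only, NOT the actual fix from the commit:
# check whether the expected "trip_id" column name survived the
# read, and repair it if locale-specific translation mangled it.
library (data.table)
st <- gtfs$stop_times
if (!"trip_id" %in% names (st)) {
    # find the column whose cleaned-up name reduces to "trip_id"
    i <- which (gsub ("[^a-z_]", "", tolower (names (st))) == "trip_id")
    if (length (i) == 1L) {
        setnames (st, old = names (st) [i], new = "trip_id")
    }
}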

mpadge commented 3 years ago

The good news: it's not a bug, but rather an indication that the cutoff parameter needs to be modified somewhat:

library (gtfsrouter)
gtfs <- extract_gtfs("./stuttgart.zip")
#> ▶ Unzipping GTFS archive
#> Warning in utils::unzip(filename, exdir = tempdir()): error -1 in extracting
#> from zip file
#> ✔ Unzipped GTFS archive
#> ▶ Extracting GTFS feed
#> Warning in data.table::fread(flist[f], integer64 = "character", showProgress
#> = FALSE): Found and resolved improper quoting out-of-sample. First healed
#> line 109: <<"21-10-j21-1","","10","Marienplatz - Degerloch (Zahnradbahn
#> "Zacke")","1400","FFB300","004299">>. If the fields are not quoted (e.g. field
#> separator does not appear within any field), try quote="" to avoid this warning.
#> Warning in data.table::fread(flist[f], integer64 = "character", showProgress =
#> FALSE): Discarded single-line footer: <<"47.T0.78-666-j21-1.>>
#> ✔ Extracted GTFS feed 
#> ▶ Converting stop times to seconds✔ Converted stop times to seconds 
#> ▶ Converting transfer times to seconds✔ Converted transfer times to seconds
ttable <- gtfs_timetable(gtfs, day = "tuesday")

start_ids <- ttable$stops[grepl("Charlottenplatz", ttable$stops$stop_name)] 
start_ids <- start_ids[1:9, ]$stop_id # exclude two "Charlottenplatz" stops in Esslingen
nrow (gtfs_traveltimes(ttable, start_ids, from_is_id = TRUE, start_time = 8 * 3600))
#> [1] 101
nrow (gtfs_traveltimes(ttable, start_ids [1], from_is_id = TRUE, start_time = 8 * 3600))
#> [1] 8165

Those are the results you observed, but then note what happens with cutoff = 0 as described in the documentation:

nrow (gtfs_traveltimes(ttable, start_ids, from_is_id = TRUE, start_time = 8 * 3600, cutoff = 0))
#> [1] 8982
nrow (gtfs_traveltimes(ttable, start_ids [1], from_is_id = TRUE, start_time = 8 * 3600, cutoff = 0))
#> [1] 8982

Created on 2021-01-13 by the reprex package (v0.3.0)

This cutoff parameter is just a time-saver: it estimates a point beyond which the rate of increase in numbers of stations reached starts to drop significantly, and timetables are only scanned up to that point, according to this function. A value of cutoff = 0 switches the cutoff algorithm off and forces the whole timetable to be scanned. The whole analysis in that function is only approximate, with your case clearly triggering some strange behaviour. Using only one station likely produces a more regular increase in stops reached versus time, whereas with several stations a stop threshold is identified much too quickly, so the scan is cut short after having reached only very few stations with the default cutoff = 10.
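To see where that happens, here is a small diagnostic sketch, using only functions and arguments already shown above, that compares the default cutoff against a full scan for each start stop individually:

# Diagnostic sketch: compare the default cutoff against a full scan
# (cutoff = 0) for each candidate start stop separately.
n_default <- vapply (start_ids, function (s)
    nrow (gtfs_traveltimes (ttable, s, from_is_id = TRUE,
                            start_time = 8 * 3600)),
    integer (1L))
n_full <- vapply (start_ids, function (s)
    nrow (gtfs_traveltimes (ttable, s, from_is_id = TRUE,
                            start_time = 8 * 3600, cutoff = 0)),
    integer (1L))
data.frame (stop_id = start_ids, n_default, n_full)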

My plans in regard to this parameter are:

  1. Ignore it for the moment until the real bugs can be ironed out; and only then
  2. Likely leave it as is, and simply document that any unexpected behaviour like this can generally be overcome by using cutoff = 0 to disable the cutoff algorithm.

... alternatively, maybe it'll be more sensible to set a default of cutoff = 0, and add a note that values > 0 can be used to speed up queries, but that users should first confirm that no strange behaviour like this occurs. And thanks to this issue, we've got a really good concrete example of what such "strange behaviour" looks like.
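In the meantime, users could guard against this themselves. The following is a hypothetical convenience wrapper (traveltimes_checked is not part of the package) that compares a cutoff run against a full scan and warns when the cutoff looks suspicious:

# Hypothetical wrapper, not part of gtfsrouter: run with the
# requested cutoff, verify against a full scan, and warn if the
# cutoff discarded a substantial share of reachable stops.
traveltimes_checked <- function (ttable, from, ..., cutoff = 10) {
    res <- gtfs_traveltimes (ttable, from, ..., cutoff = cutoff)
    full <- gtfs_traveltimes (ttable, from, ..., cutoff = 0)
    if (nrow (res) < 0.5 * nrow (full)) {
        warning ("cutoff = ", cutoff, " reached only ", nrow (res),
                 " of ", nrow (full), " stops; consider cutoff = 0")
    }
    res
}
# example call:
traveltimes_checked (ttable, start_ids, from_is_id = TRUE,
                     start_time = 8 * 3600)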

AlexandraKapp commented 3 years ago

Ah I see, that makes sense - thanks for the quick response! Then I'll work with cutoff = 0.

That reminds me - I forgot to ask the other day: What was the advantage again of using the cutoff function instead of using some kind of latest_departure_time?

mpadge commented 3 years ago

cutoff works on arrival time, and attempts to approximately reach as many stations as possible. A latest_departure_time would be different, because you'd still never know whether waiting at some intermediate transfer station for a really long time was going to get you anywhere faster.
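As an aside, a latest-arrival limit could still be emulated by post-filtering a full scan. The sketch below assumes the returned table has a duration column holding seconds; check the actual output of gtfs_traveltimes before relying on that.

# Sketch: emulate a latest-arrival limit by filtering a full scan.
# Assumes a "duration" column in seconds; adapt to the actual output.
res <- gtfs_traveltimes (ttable, start_ids [1], from_is_id = TRUE,
                         start_time = 8 * 3600, cutoff = 0)
latest_arrival <- 10 * 3600 # 10:00, as seconds from midnight
res <- res [res$duration <= (latest_arrival - 8 * 3600), ]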

AlexandraKapp commented 3 years ago

I see. So you would need another latest_arrival parameter at which the search would break off, and making the user set these two additional parameters is less intuitive than just setting cutoff to a reasonable default value, which (in the best case) gives you a more holistic picture, as paths are followed as long as they still add newly reached stops?

mpadge commented 3 years ago

I'll leave this issue open for now, because the cutoff parameter was also implemented because of some seemingly odd behaviour at extreme stations, and it was an easy way to clean that up. But once the whole algorithm works with no odd behaviour at all, then cutoff should hopefully also behave more regularly, and could maybe be set to a default of 0, as described above.

Maybe there could be some kind of minimal set of control parameters, potentially submitted as a named list (control = list(cutoff = 10, latest_arrival = XXX)), as sketched below. First get it working, then work out which parameters could usefully be exposed. The main issue in doing that is always just documentation: as long as everything is clearly documented, it doesn't really matter how many additional parameters can be added that way.
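Purely as an interface sketch of that idea (the function name, the control argument, and the latest_arrival handling are all assumptions, not package API):

# Hypothetical interface sketch for the control-list idea; none of
# these names exist in the package.
gtfs_traveltimes2 <- function (ttable, from, ...,
                               control = list (cutoff = 10,
                                               latest_arrival = NULL)) {
    res <- gtfs_traveltimes (ttable, from, ..., cutoff = control$cutoff)
    if (!is.null (control$latest_arrival)) {
        # interpreted here, for simplicity, as a maximum duration in
        # seconds after the start time; assumes a "duration" column
        res <- res [res$duration <= control$latest_arrival, ]
    }
    res
}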

mpadge commented 3 years ago

@AlexandraKapp you okay with this issue being closed now?

AlexandraKapp commented 3 years ago

yes sure :)