AlexandraKapp closed this issue 3 years ago.
Wow, that uncovered a major :bug: on my machine. Because of locale-specific character translations, the whole reading of that feed failed: in the original format, the name of the `trip_id` column in the `stop_times` table is rendered as a `name` object rather than the expected `character` object. I have no idea why this happens, but have implemented a work-around regardless via this commit. That then enables me to reproduce your result, which I'll get onto asap. Thanks!
The good news: it's not a bug, rather just an indication that the `cutoff` parameter needs to be modified somewhat:
``` r
library (gtfsrouter)
gtfs <- extract_gtfs("./stuttgart.zip")
#> ▶ Unzipping GTFS archive
#> Warning in utils::unzip(filename, exdir = tempdir()): error -1 in extracting
#> from zip file
#> ✔ Unzipped GTFS archive
#> ▶ Extracting GTFS feed
#> Warning in data.table::fread(flist[f], integer64 = "character", showProgress
#> = FALSE): Found and resolved improper quoting out-of-sample. First healed
#> line 109: <<"21-10-j21-1","","10","Marienplatz - Degerloch (Zahnradbahn
#> "Zacke")","1400","FFB300","004299">>. If the fields are not quoted (e.g. field
#> separator does not appear within any field), try quote="" to avoid this warning.
#> Warning in data.table::fread(flist[f], integer64 = "character", showProgress =
#> FALSE): Discarded single-line footer: <<"47.T0.78-666-j21-1.>>
#> ✔ Extracted GTFS feed
#> ▶ Converting stop times to seconds✔ Converted stop times to seconds
#> ▶ Converting transfer times to seconds✔ Converted transfer times to seconds
ttable <- gtfs_timetable(gtfs, day = "tuesday")
start_ids <- ttable$stops[grepl("Charlottenplatz", ttable$stops$stop_name)]
start_ids <- start_ids[1:9, ]$stop_id # exclude two "Charlottenplatz" stops in Esslingen
nrow (gtfs_traveltimes(ttable, start_ids, from_is_id = T, start_time = 8 * 3600))
#> [1] 101
nrow (gtfs_traveltimes(ttable, start_ids [1], from_is_id = T, start_time = 8 * 3600))
#> [1] 8165
```
Those are the results you observed, but then note what happens with `cutoff = 0`, as described in the documentation:
``` r
nrow (gtfs_traveltimes(ttable, start_ids, from_is_id = T, start_time = 8 * 3600, cutoff = 0))
#> [1] 8982
nrow (gtfs_traveltimes(ttable, start_ids [1], from_is_id = T, start_time = 8 * 3600, cutoff = 0))
#> [1] 8982
```
Created on 2021-01-13 by the reprex package (v0.3.0)
This `cutoff` parameter is just a time-saver: it estimates a point beyond which the rate of increase in numbers of stations reached starts to drop significantly, and timetables are only scanned to that point. A value of `cutoff = 0` switches the cutoff algorithm off and forces the whole timetable to be scanned. The whole analysis in that function is only approximate, with your case clearly triggering some strange behaviour. Using only one station likely achieves some kind of more regular increase in stops reached versus time, enabling a much quicker identification of a stop threshold, so with the default `cutoff = 10` the scan is cut short after having reached only very few stations.
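To illustrate the flavour of that heuristic outside of gtfsrouter: the following Python sketch (an illustration only, not the package's actual implementation; the function name and percentage interpretation of `cutoff` are invented here) stops scanning once the rate of newly reached stations drops below `cutoff` percent of the peak rate seen so far, with `cutoff = 0` disabling the heuristic entirely.

```python
# Illustrative sketch only -- NOT gtfsrouter's actual code.
# Scan a sequence of timetable steps, stopping once the number of newly
# reached stations in a step falls below `cutoff` percent of the peak
# rate observed so far; cutoff = 0 disables the heuristic.
def scan_with_cutoff(new_stations_per_step, cutoff=10):
    total, peak = 0, 0
    for i, n in enumerate(new_stations_per_step):
        total += n
        peak = max(peak, n)
        if cutoff > 0 and n < peak * cutoff / 100:
            # Rate of increase has dropped off: cut the scan short here.
            return total, i + 1  # (stations reached, steps scanned)
    return total, len(new_stations_per_step)
```

With a rate profile that rises and then tails off, `cutoff = 10` truncates the scan early, while `cutoff = 0` scans every step; a profile with several irregular humps is exactly where such a heuristic can cut off far too early, as in the multi-station case above.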
My plans in regard to this parameter are to keep `cutoff = 0` as the way to disable the cutoff algorithm.... alternatively, maybe it'll be more sensible to set a default of `cutoff = 0`, and add a note that values > 0 can be used to speed up queries, but that users should first confirm that no strange behaviour like this occurs. And thanks to this issue, we've got a really good concrete example of what such "strange behaviour" looks like.
ah I see, that makes sense - thanks for the quick response! Then I'll work with `cutoff = 0`.
That reminds me - I forgot to ask the other day: what was the advantage again of using the cutoff function instead of some kind of `latest_departure_time`?
`cutoff` works on arrival time, and attempts to approximately reach as many stations as possible. A `latest_departure_time` would be different, because you'd still never know whether waiting at some intermediate transfer station for a really long time was going to get you anywhere faster.
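To make that point concrete, here is a toy numeric example (all journey times invented for illustration) showing how a hard departure-time prune at a transfer stop can discard the journey with the earliest arrival:

```python
# Toy example with hypothetical journeys: pruning by a fixed
# latest_departure_time at a transfer stop can discard an express
# connection that, despite a long wait, still arrives earliest.
journeys = [
    # departure from the transfer stop and arrival at the destination,
    # both in seconds after midnight
    {"name": "frequent slow service", "depart_transfer": 8.25 * 3600, "arrive": 9.50 * 3600},
    {"name": "express after long wait", "depart_transfer": 9.00 * 3600, "arrive": 9.25 * 3600},
]

latest_departure = 8.5 * 3600  # naive pruning threshold at the transfer stop

# Best arrival when journeys departing after the threshold are pruned away:
pruned_best = min(j["arrive"] for j in journeys if j["depart_transfer"] <= latest_departure)
# True best arrival when nothing is pruned:
true_best = min(j["arrive"] for j in journeys)
# The pruned search misses the express, so pruned_best > true_best.
```

An arrival-based criterion like `cutoff` avoids this particular failure mode, because it never discards a path merely for involving a long intermediate wait.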
I see. So you would need another `latest_arrival` parameter at which the search would break off. And making the user set these two additional parameters is less intuitive than just setting `cutoff` to a reasonable default value, which (in the best case) gives you a more holistic picture, as paths are followed that still add to reaching new stops?
I'll leave this issue open for now, because the `cutoff` parameter was also implemented because of some seemingly odd behaviour at extreme stations, and it was an easy way to clean that up. But once the whole algorithm works with no odd behaviour at all, then `cutoff` should hopefully also behave more regularly, and maybe be set to a default of 0, as described above.
Maybe there could be some kind of minimal set of control parameters, potentially submitted as a named list (`control = list(cutoff = 10, latest_arrival = XXX)`). First get it working, then work out which parameters could be usefully exposed. The main issue in doing that is always just documentation. As long as everything is clearly documented, it doesn't really matter how many additional parameters are added that way.
@AlexandraKapp you okay with this issue being closed now?
yes sure :)
A weird bug occurred while working with the Stuttgart feed: using more stops for the input `from` returns fewer results than if only one station is used (100 results vs. 8000 results). Feed: https://www.openvvs.de/dataset/gtfs-daten/resource/bfbb59c7-767c-4bca-bbb2-d8d32a3e0378

Created on 2021-01-13 by the reprex package (v0.3.0)