UrbanAnalyst / gtfsrouter

Routing and analysis engine for GTFS (General Transit Feed Specification) data
https://urbananalyst.github.io/gtfsrouter/
80 stars 17 forks source link

Fix cutoff parameter of gtfs_traveltimes #75

Closed mpadge closed 3 years ago

mpadge commented 3 years ago

Following on from #57 and comment by @AlexandraKapp, setting cutoff = 0 now takes way too long, and moreover produces this result:

library (gtfsrouter)
gtfs <- extract_gtfs("vbb.zip")
#> ▶ Unzipping GTFS archive
#> ✔ Unzipped GTFS archive
#> ▶ Extracting GTFS feed✔ Extracted GTFS feed 
#> ▶ Converting stop times to seconds✔ Converted stop times to seconds 
#> ▶ Converting transfer times to seconds✔ Converted transfer times to seconds

transfers <- gtfsrouter::gtfs_transfer_table (gtfs, network_times = FALSE)
transfers <- gtfs$transfers %>%
  select(.data$from_stop_id, .data$to_stop_id, .data$transfer_type, .data$min_transfer_time) %>%
  rbind(transfers) %>%
  group_by(.data$from_stop_id, .data$to_stop_id) %>%
  summarise(across(everything(), first)) %>% # take min_transfer_time from gtfs$transfers if present
  data.table::data.table()
gtfs$transfers <- transfers

gtfs <- gtfs_timetable(gtfs, day = "tuesday")
from <- "Berlin Hauptbahnhof"
start_time <- 8 * 3600

system.time (
    x <- gtfs_traveltimes (gtfs, from = from, start_time = start_time, cutoff = 10)
)
#>    user  system elapsed 
#>   0.991   0.003   0.926
format (nrow (x), big.mark = ",")
#> [1] "13,711"
system.time (
    x <- gtfs_traveltimes (gtfs, from = from, start_time = start_time, cutoff = 0)
)
#>    user  system elapsed 
#>  32.005   0.087  32.055
format (nrow (x), big.mark = ",")
#> [1] "40,433"
format (length (unique (gtfs$timetable$arrival_station)), big.mark = ",")
#> [1] "27,594"

Created on 2021-02-01 by the reprex package (v1.0.0)

So the full scan (cutoff = 0) is scanning way too many lines of the timetable. (The number reached can be more than the total number reached in the timetable because of transfers to stations that aren't otherwise reachable from the timetable itself.) First thing to do will be to implement a stop clause that stops scanning the timetable as soon as all possible stations have been reached.

mpadge commented 3 years ago

The second of the above commits replaces the non-transparent cutoff parameter with a much more transparent prop_stops parameter, which enables the timetable scan to stop once the specified proportion of all stations have been reached. This is the result:

library (gtfsrouter)
gtfs <- extract_gtfs("vbb.zip")
#> ▶ Unzipping GTFS archive
#> ✔ Unzipped GTFS archive
#> ▶ Extracting GTFS feed✔ Extracted GTFS feed 
#> ▶ Converting stop times to seconds✔ Converted stop times to seconds 
#> ▶ Converting transfer times to seconds✔ Converted transfer times to seconds

transfers <- gtfsrouter::gtfs_transfer_table (gtfs, network_times = FALSE)
transfers <- gtfs$transfers %>%
  select(.data$from_stop_id, .data$to_stop_id, .data$transfer_type, .data$min_transfer_time) %>%
 rbind(transfers) %>%
  group_by(.data$from_stop_id, .data$to_stop_id) %>%
  summarise(across(everything(), first)) %>% # take min_transfer_time from gtfs$transfers if present
  data.table::data.table()
gtfs$transfers <- transfers

gtfs <- gtfs_timetable(gtfs, day = "tuesday")
from <- "Berlin Hauptbahnhof"
start_time <- 8 * 3600

p <- c (5:9 / 10, 0.95, 0.99, 1) # values for new prop_stops parameter
res <- lapply (p, function (i) {
    st <- system.time (
        x <- gtfs_traveltimes (gtfs, from = from, start_time = start_time, prop_stops = i)
    )
    c (nrow (x), max (x$duration), st [3])    })
res <- do.call (rbind, res)
dat <- data.frame (prop_stops = p,
                   n_stations = res [, 1] / 1000,
                   max_duration = res [, 2] / 3600,
                   time = res [, 3])
scl <- max (dat$time) / as.integer (max (dat$max_duration))

library (ggplot2)
ggplot(dat, aes (x = prop_stops)) + 
    geom_point (aes (y = max_duration, colour = "max_duration"), size = 2) +
    geom_line (aes (y = max_duration, linetype = "max_duration", colour = "max_duration"), size=1.1) + 
    geom_line (aes (y = time / scl, linetype = "time", colour = "time"), size=1.1) + 
    geom_point (aes (y = time / scl, colour = "time"), size = 2) +
    scale_y_continuous (sec.axis = sec_axis (~.*scl, name = "Calculation time (seconds)")) + 
    labs (y = "Max. duration (hours)") + 
    scale_linetype_manual ("colour", values = c ("time" = "solid", "max_duration" = "solid")) + 
    theme (legend.title = element_blank (), 
           legend.background = element_blank (),
           legend.position = c (0.1, 0.9))

Created on 2021-02-01 by the reprex package (v1.0.0)

We could easily expose a function like this that could be used to help determine an appropriate value for prop_stops for a given feed. In the case here for the VBB feed, the value would look like this:

imax <- which.max (diff (max_duration)) - 1L
dat$prop_stops [imax]
#>[1] 0.9

With that value applied to the data above giving:

system.time (
    x <- gtfs_traveltimes (gtfs, from = from, start_time = start_time, prop_stops = 0.9)
)
#>    user  system elapsed 
#>   3.684   0.016   3.632
format (nrow (x), big.mark = ",")
#> [1] "35,870"

Created on 2021-02-01 by the reprex package (v1.0.0)

So that takes quite a bit longer than the default value for the previous cutoff parameter, but that one (with cutoff = 0.1) only reached 13,711 stations, whereas this one reaches almost 3 times as many stations. Times basically scale linearly with prop_stops up to around 0.95 or so, with the previous cutoff = 0.1 parameter corresponding to prop_stops = 0.5. This new prop_stops parameter is, I think, much clearer and enables much more direct control over what is being returned.

AlexandraKapp commented 3 years ago

sorry, I totally missed this issue!

is there an advantage of using the proportion of all stops instead of fixed times? I feel like it's a lot easier to interpret, if I can say: 'the fastest travel time for trips started between 8 and 9 AM' instead of the proportion of stops? (like you said in this issue https://github.com/ATFutures/gtfs-router/issues/57#issuecomment-769022429)

That would then e.g. also allow to check how many stops I can reach in what time during the night. Otherwise the start_time parameter is always a little misleading, as the start time might be 12 hours later.

mpadge commented 3 years ago

sorry, I totally missed this issue!

no worries

is there an advantage of using the proportion of all stops instead of fixed times?

In my opinion, yes. It reflects exactly how the algorithm itself traverses the timetable, so the control is much more direct, and possible via a single, fairly intuitive parameter. It is of course even more intuitive to have parameters directly control times, but that would need at least 3:

So you'd need at least 3 parameters, instead of one. But I guess even more importantly: with a single, sufficiently large, value of prop_stops, you get pretty much all results anyway, and can easily filter afterwards to desired ranges of start times and isochrone durations. Exposing the 3 temporal parameters described above would then require re-running the algorithm every time you tweaked one or more of them, but the actual calculations would be identical each time, so that's kinda wasteful.

That said, it's obviously advantageous to have some more intuitive way to filter the result using temporal parameters for what is after all a set of travel times. Ideas?

AlexandraKapp commented 3 years ago

I see your point.

you get pretty much all results anyway, and can easily filter afterwards to desired ranges of start times and isochrone durations.

start and endtime are not returned by the gtfs_traveltime funtion, are they? I think its only duration. Would it be easy to do so? Then I would agree with you :)

mpadge commented 3 years ago

Would it be easy to do so?

... good question! Those values are currently not returned from the C++ routines back into R, but that shouldn't be too difficult, and I can definitely see the use of that. I'll give that a go as soon as I have a chance.

AlexandraKapp commented 3 years ago

...just one more thought on that. If I had a connection from A to B at 8AM with duration of 15 min and at 3PM with 10 min (same transfers). Would I then get the result of 10min for 3PM and if I filtered that by start time only up to 9AM, then this would not leave me any result for the connection A to B - right? 🤔

mpadge commented 3 years ago

@AlexandraKapp the above commit starts the implementation of a start_time_limits parameter which has 2 values defining lower and upper limits for start times of traveltimes calculations. I'll paste a reprex here when that's done

mpadge commented 3 years ago

That commit implements most of it, but still leaves the problem of when to stop scanning. Currently still has the prop_stops parameter, with results like this:

library (gtfsrouter)
packageVersion ("gtfsrouter")
#> [1] '0.0.4.192'

gtfs <- extract_gtfs("vbb.zip")
#> ▶ Unzipping GTFS archive
#> ✔ Unzipped GTFS archive
#> ▶ Extracting GTFS feed✔ Extracted GTFS feed 
#> ▶ Converting stop times to seconds✔ Converted stop times to seconds 
#> ▶ Converting transfer times to seconds✔ Converted transfer times to seconds
gtfs <- gtfs_timetable(gtfs, day = "tuesday")
from <- "Berlin Hauptbahnhof"

start_time_limits <- 8 * 3600 + c (0, 60) * 60

p <- 5:10 / 10
res <- lapply (p, function (i) {
                   s <- system.time (
                                  x <- gtfs_traveltimes (gtfs,
                                                         from,
                                                         start_time_limits,
                                                         prop_stops = i)
                                  )
                   c (nrow (x), s [3])  })

res <- do.call (rbind, res)
res <- data.frame (prop = p,
                   stops_reached = res [, 1],
                   calc_time = res [, 2])

par (mar = c (5, 4, 2, 4))
plot (res$prop, res$stops_reached, "l", lwd = 2,
      xlab = "prop_stops", ylab = "stops reached")
par (new = TRUE)
plot (res$prop, res$calc_time, "l", lwd = 2, col = "gray",
      yaxt = "n", xlab = "", ylab = "")
axis (side = 4)
mtext ("calculation time (s)", side = 4, line = 2)
legend ("topleft", lwd = 2, col = c ("black", "grey"), bty = "n",
        legend = c ("stops reached", "calcualtion time"))

Created on 2021-03-04 by the reprex package (v1.0.0)

Note that the calculation times are around half of what they were above, simply because only a restricted set of initial departure_time values are used to trace the algorithm. Now I just have to replace the prop_stops parameter with an equivalent duration or max_traveltime parameter.

mpadge commented 3 years ago

Here it is in action:

library (gtfsrouter)
packageVersion ("gtfsrouter")
#> [1] '0.0.4.195'

gtfs <- extract_gtfs("vbb.zip")
#> ▶ Unzipping GTFS archive
#> ✔ Unzipped GTFS archive
#> ▶ Extracting GTFS feed✔ Extracted GTFS feed 
#> ▶ Converting stop times to seconds✔ Converted stop times to seconds 
#> ▶ Converting transfer times to seconds✔ Converted transfer times to seconds
gtfs <- gtfs_timetable(gtfs, day = "tuesday")
from <- "Berlin Hauptbahnhof"

start_time_limits <- 8 * 3600 + c (0, 60) * 60

# confirm that default max_traveltime of 1 hour works:
x <- gtfs_traveltimes (gtfs, from, start_time_limits)
dim (x)
#> [1] 9117    7
hms::hms (as.integer (max (x$duration)))
#> 01:00:00

tlims <- 1:12 * 60 * 60
res <- lapply (tlims, function (i) {
                   s <- system.time (
                                  x <- gtfs_traveltimes (gtfs,
                                                         from,
                                                         start_time_limits,
                                                         max_traveltime = i)
                                  )
                   c (nrow (x), s [3])  })

res <- do.call (rbind, res)
res <- data.frame (max_traveltime = tlims / 3600,
                   stops_reached = res [, 1],
                   calc_time = res [, 2])

par (mar = c (5, 4, 2, 4))
plot (res$max_traveltime, res$stops_reached, "l", lwd = 2,
      xlab = "max traveltime (hours)", ylab = "stops reached")
par (new = TRUE)
plot (res$max_traveltime, res$calc_time, "l", lwd = 2, col = "gray",
      yaxt = "n", xlab = "", ylab = "")
axis (side = 4)
mtext ("calculation time (s)", side = 4, line = 2)
legend ("topleft", lwd = 2, col = c ("black", "grey"), bty = "n",
        legend = c ("stops reached", "calculation time"))

Created on 2021-03-04 by the reprex package (v1.0.0)

For the record here: the whole reason I implemented the prop_stops in the first place was because it is the only way to directly control algorithmic stopping, and so the only way to generate a (roughly) linear relationship between the parameter value and numbers of stops reached. The relationship in this version is quite non-linear, reflecting the rather indirect control over algorithmic stopping behaviour via the max_traveltime parameter. But I agree that allowing direct control over max_traveltime is almost certainly much more intuitive for most users.

Any further comments @AlexandraKapp?

AlexandraKapp commented 3 years ago

looks good! If sth comes up while testing I'll comment here :)