Handling uncertainty in travel time calculations

Hussein-Mahfouz commented 10 months ago

How to address travel time uncertainty when calculating travel time matrices? For background, see:

- Transport Access Manual: A Guide for Measuring Connection between People and Places, section 5.3 on the Modifiable Temporal Unit Problem
- this vignette, for a description and a solution using the time_window parameter in r5r

The time_window parameter (in combination with the percentiles parameter) is ideal, but according to the vignette it can only be used with frequency-based GTFS feeds:

Please keep in mind that the time_window only affects the results when the GTFS feeds contain a frequencies.txt table.

Solution using `time_window` parameter in r5r

One solution is to write a function that converts stop_times to frequencies, and use it to edit the GTFS feeds so that they become frequency-based feeds.

See my comment https://github.com/ipeaGIT/gtfstools/issues/69#issuecomment-1693191738 for a starting point for the function, and https://github.com/ipeaGIT/r5r/issues/282#issuecomment-1693179215 to understand how R5 handles the time_window argument when the GTFS feed has no frequencies.txt file.

Hacky manual solution

We can pass different departure times to the travel_time_matrix function (e.g. for 8:00 am, use 7:55, 8:00 and 8:05). This is a hacky way of recreating the time_window functionality, and it will definitely be a lot slower.
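To make the idea concrete, here is a minimal sketch of why nearby departure times give different answers. It is a toy model in plain Python, not an r5r call: the timetable, the constant ride time and the function name `travel_time` are all made up for illustration.

```python
# Toy model only: one hypothetical bus line with fixed scheduled departures.
# Times are minutes after midnight. Nothing here calls r5r.
from bisect import bisect_left

SCHEDULED_DEPARTURES = [7 * 60 + 50, 8 * 60 + 4, 8 * 60 + 18]  # 07:50, 08:04, 08:18
IN_VEHICLE_MINUTES = 20  # assumed constant ride time


def travel_time(departure_minute):
    """Wait for the next scheduled bus, then ride. Returns total minutes."""
    i = bisect_left(SCHEDULED_DEPARTURES, departure_minute)
    if i == len(SCHEDULED_DEPARTURES):
        raise ValueError("no service after this departure time")
    wait = SCHEDULED_DEPARTURES[i] - departure_minute
    return wait + IN_VEHICLE_MINUTES


# The manual approach: query several departure times around 08:00.
departures = [7 * 60 + 55, 8 * 60, 8 * 60 + 5]  # 07:55, 08:00, 08:05
results = {d: travel_time(d) for d in departures}

for d, minutes in results.items():
    print(f"{d // 60:02d}:{d % 60:02d} -> {minutes} min")
```

In this toy timetable a ten-minute shift in departure time changes the result by up to nine minutes, which is the effect the manual approach tries to average out.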

Hussein-Mahfouz commented 10 months ago

I am trying to create a frequencies.txt file so that the routing can use the time_window parameter.

I tried to use the get_route_frequency() function in tidytransit, but it depends on having a direction_id column in the trips.txt file. This is an optional column in a GTFS feed, and it is not present in BODS data.

I tried to create the column by grouping trips by route_id and service_id, expecting two trips in each group that I could label 0 and 1, but it turns out that some routes have more than two trips.

I tried to plot these trips to see how they are different. Here is a facet plot (by trip_id):

It looks like two of them are the same (they even have the same stop sequence rather than opposite ones, which seems wrong to me). The other three are all different.

Based on these results, I think I should treat each trip separately if I calculate frequencies from stop_times (and ignore the route-level logic used in get_route_frequency()). This is more in line with the GTFS frequencies.txt file, which has the following columns: trip_id | start_time | end_time | headway_secs

Robinlovelace commented 10 months ago

GTFS datasets and official timetables are notoriously inaccurate in Leeds and presumably beyond. No comments on this other than: great that you're considering this and that there are already some implementations. One question: is system reliability/uncertainty measured? Not my area, fascinated to learn of methods + eventually results.

Hussein-Mahfouz commented 10 months ago

Reliability

I read a bit of literature on system reliability, and listened to a nice episode about it with Niels van Oort. I've seen some analysis of actual vs scheduled services. You could probably use the live bus location API to compare scheduled services with what actually ran. Reliability is a whole area of research, and I would prefer not to get into it for the first research question as I am not up to date on the literature. Let me know if you have any thoughts about it for this research question or for later on in the research.

Uncertainty

The one thing that I think would be useful to use is the percentiles argument in r5r::travel_time_matrix(). From the documentation:

In this case, there isn’t a single estimate of travel time / accessibility, but a distribution of several estimates that reflect the travel time / accessibility uncertainties in the specified time window. To get our heads around so many estimates, we can use the percentiles parameter to specify the percentiles of the distribution we are interested in. For example, if we select the 25th travel time percentile and the results show that the travel time estimate between A and B is 15 minutes, this means that 25% of all trips taken between these points within the specified time window are shorter than 15 minutes.

It's a useful parameter that deals with the uncertainty of matching a very specific departure time with fixed scheduled services. A high percentile (say 75%) could be used.
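As a rough illustration of how a percentile summarises a set of travel times, here is a minimal sketch in plain Python. The travel times are invented, and the nearest-rank rule used here is an assumption chosen for simplicity; r5r may compute its percentiles differently.

```python
import math


def percentile_nearest_rank(values, p):
    """Smallest observation such that at least p% of observations are <= it."""
    if not values:
        raise ValueError("values must not be empty")
    if not 0 < p <= 100:
        raise ValueError("p must be in (0, 100]")
    ordered = sorted(values)
    rank = math.ceil(p / 100 * len(ordered))
    return ordered[rank - 1]


# Hypothetical travel times (minutes) for one OD pair,
# one value per departure minute in a 12-minute window.
travel_times = [24, 23, 22, 21, 20, 33, 32, 31, 30, 29, 28, 27]

p25 = percentile_nearest_rank(travel_times, 25)
p50 = percentile_nearest_rank(travel_times, 50)
p75 = percentile_nearest_rank(travel_times, 75)

print(p25, p50, p75)
```

A high percentile such as the 75th is the more conservative summary: in this made-up window, three quarters of the departures reach the destination within that time.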

Robinlovelace commented 10 months ago

The percentiles argument sounds like a reasonable and simple approach. :+1: to not getting too sidetracked also.

Hussein-Mahfouz commented 8 months ago

stop_times_to_frequencies() is a difficult function to implement.

A frequencies.txt file has trip_id | start_time | end_time | headway_secs.
In the DfT GTFS feeds (as with most other feeds), the trip_id is unique to one departure on a specific route; if 10 buses have the same route_id and direction_id, they will still have 10 different trip_ids. _How do you group trips to get headway_secs?_
I tried grouping trips by looking at their stop_sequence and creating a column that holds the stop_ids in order. https://github.com/Hussein-Mahfouz/drt-potential/blob/c42453c02aa62b307bcc86595d85c373a583942e/R/stop_times_to_frequencies.R#L28-L30 This column could then be used to group trips that run on exactly the same itinerary. We could then count the vehicles and calculate a headway.
This solution doesn't account for service_id. Different service_ids reflect the same trip on different days, so a trip will be repeated multiple times in stop_times.txt. This means vehicles are overcounted and our calculated headway_secs is inaccurate. How do we calculate headway while accounting for different services?
- We could filter the feed to a specific date. We could then use the above logic normally. I don't like this solution as it reduces the data in a feed from weeks / months to a single day.
- We could get the headway for each trip + service combination. We can join the service_id column to the stop_times file, and then group by service_id + stop_id_order (the column we created to identify unique trips).
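The second option can be sketched as follows. This is a minimal illustration in plain Python on hand-made rows, not the code in the linked R script: the data, the function name `headways` and the choice of the mean gap between first-stop departures as headway_secs are all assumptions.

```python
from collections import defaultdict

# Hypothetical rows standing in for stop_times.txt joined to trips.txt:
# (trip_id, service_id, stop_sequence, stop_id, departure_time)
ROWS = [
    ("t1", "weekday", 1, "A", "08:00:00"), ("t1", "weekday", 2, "B", "08:07:00"),
    ("t2", "weekday", 1, "A", "08:10:00"), ("t2", "weekday", 2, "B", "08:17:00"),
    ("t3", "weekday", 1, "A", "08:20:00"), ("t3", "weekday", 2, "B", "08:27:00"),
    ("t4", "saturday", 1, "A", "08:00:00"), ("t4", "saturday", 2, "B", "08:07:00"),
    ("t5", "saturday", 1, "A", "08:30:00"), ("t5", "saturday", 2, "B", "08:37:00"),
]


def to_seconds(hms):
    """GTFS times may exceed 24:00:00, so parse by hand, not as a clock time."""
    h, m, s = (int(part) for part in hms.split(":"))
    return h * 3600 + m * 60 + s


def headways(rows, by_service=True):
    """Group trips by ordered stop pattern (and optionally service_id)."""
    stops_by_trip = defaultdict(list)
    service_of = {}
    for trip_id, service_id, seq, stop_id, dep in rows:
        stops_by_trip[trip_id].append((seq, stop_id, to_seconds(dep)))
        service_of[trip_id] = service_id

    first_departures = defaultdict(list)
    for trip_id, stops in stops_by_trip.items():
        stops.sort()  # order by stop_sequence
        pattern = "-".join(stop_id for _, stop_id, _ in stops)
        key = (service_of[trip_id], pattern) if by_service else pattern
        first_departures[key].append(stops[0][2])

    out = {}
    for key, deps in first_departures.items():
        deps.sort()
        gaps = [later - earlier for earlier, later in zip(deps, deps[1:])]
        if gaps:  # a single trip has no headway
            out[key] = {
                "start_time": deps[0],
                "end_time": deps[-1],
                "headway_secs": sum(gaps) / len(gaps),  # assumption: mean gap
            }
    return out


per_service = headways(ROWS)
pattern_only = headways(ROWS, by_service=False)
print(per_service)
print(pattern_only)
```

On this toy data, grouping by stop pattern alone mixes the weekday and Saturday trips and gives a shorter headway than either service really has, which is the problem described above.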

Hussein-Mahfouz commented 8 months ago

One important thing to note is that the time_window parameter in r5r DOES work with feeds that don't have a frequencies.txt file. Here are the results of using the expanded_travel_time_matrix function with a 30-minute time_window:

For the same departure time, the results are the same for each draw_number. However, if our time_window = 30, we have 30 different departure times for each OD pair, and each one has a different travel_time.

The percentiles argument also works, as shown here:

The documentation says frequencies.txt is needed because a frequency-based feed lets the routing simulate changes in the start time of each service. That would lead to different draws for the same OD pair having different travel times. For our purposes this is not necessary.
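The distinction can be illustrated with a toy model. This is only a sketch of the idea in plain Python, not how R5 is implemented: the headway, the ride time and both function names are hypothetical.

```python
import random

HEADWAY = 600  # seconds between vehicles (hypothetical)
RIDE = 900     # in-vehicle seconds (hypothetical)


def scheduled_travel_time(departure, first_bus=0):
    """Fixed timetable: the wait depends only on the departure time."""
    wait = (first_bus - departure) % HEADWAY
    return wait + RIDE


def frequency_travel_time(departure, rng):
    """Only the headway is known, so each draw picks a random start offset."""
    first_bus = rng.uniform(0, HEADWAY)
    return scheduled_travel_time(departure, first_bus)


rng = random.Random(0)  # seeded so the sketch is reproducible
departure = 120         # seconds after the start of the window

# Five draws for the SAME departure time.
scheduled_draws = [scheduled_travel_time(departure) for _ in range(5)]
frequency_draws = [frequency_travel_time(departure, rng) for _ in range(5)]

# One value per departure minute with the fixed timetable.
by_departure = [scheduled_travel_time(d) for d in range(0, HEADWAY, 60)]

print(scheduled_draws)
print(frequency_draws)
print(by_departure)
```

With the fixed timetable, repeated draws for one departure time are identical, but the travel time still varies across the departure times in the window, and that variation is what the percentiles summarise.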

What this means is that a stop_times_to_frequencies() function is not necessary for our purposes.
