Open dabreegster opened 2 years ago
I've been at Modelling World for the last 2 days and didn't get much done. I started looking at ticketing events to figure out which GTFS route_short_name each vehicle covers, and over what timespan. Many cases are nice and easy -- one route per vehicle. But plenty of vehicles serve 2 or more routes in one day. I tried a super naive approach to segment the ticketing events, but there are overlapping results like:
VehicleName("03257") serves 2 routes
- from 00:00:00.0 to 23:55:44.0: 341A
- from 07:17:55.0 to 22:06:58.0: 315
VehicleName("03279") serves 2 routes
- from 00:07:22.0 to 23:56:49.0: 212
- from 04:57:33.0 to 15:43:05.0: 323
I need to look more closely, but I think they actually flip between serving two routes through the day, and I have an idea how to segment better.
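One way to segment without overlaps, as a rough sketch: sort a vehicle's ticketing events by time and start a new run whenever the route name changes, rather than taking the first and last event per route. The `TicketingEvent` and `RouteRun` types and the `segment_by_route` function are hypothetical names for illustration, not the project's actual code, and times are assumed to be seconds since midnight.

```rust
/// One ticketing event, reduced to what segmentation needs.
/// (Hypothetical type; the real data has more fields.)
#[derive(Clone, Debug, PartialEq)]
pub struct TicketingEvent {
    /// Seconds since midnight (an assumption for this sketch)
    pub time: u32,
    pub route_short_name: String,
}

/// A stretch of time during which consecutive events all name the same route.
#[derive(Clone, Debug, PartialEq)]
pub struct RouteRun {
    pub route_short_name: String,
    pub start: u32,
    pub end: u32,
}

/// Sort events by time, then start a new run whenever the route changes.
/// Runs never overlap, because each event belongs to exactly one run.
pub fn segment_by_route(mut events: Vec<TicketingEvent>) -> Vec<RouteRun> {
    events.sort_by_key(|e| e.time);
    let mut runs: Vec<RouteRun> = Vec::new();
    for ev in events {
        let continues_run = runs
            .last()
            .map_or(false, |run| run.route_short_name == ev.route_short_name);
        if continues_run {
            // Extend the current run to cover this event
            if let Some(run) = runs.last_mut() {
                run.end = ev.time;
            }
        } else {
            runs.push(RouteRun {
                route_short_name: ev.route_short_name,
                start: ev.time,
                end: ev.time,
            });
        }
    }
    runs
}
```

If a vehicle really does flip between two routes through the day, this produces an alternating list of short runs instead of two day-long overlapping ranges. A single mislabeled event would split a run in two, so some smoothing would probably be needed on real data.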
I did start thinking about the UI to organize all the info we'll have. I think the main view should show the bus network -- drawing stops and lines to show route variants. The main control is filtering. By default, all data is shown. You can optionally filter by a set of dates (either a range, individual days, or weekends or Thursdays or something). This will hide any route variants not operating on those days (and hide any stops not visited). At all times, you also have a list of matching route variants, and you can select just one as well.
In this main view, there can be ways to show spatial patterns by coloring/styling the route lines and stops. There could be ways to show frequency, ridership, delays, and capacity of vehicles filled. Based on the date filters, most of the time this would aggregate over a bunch of matching dates and show average / sum / median / something meaningful.
From this main view, you could click a stop. You'd get a bunch of info, broken down by the route variants that pass through there. There could just be the timetable from GTFS to start, along with some summary/visualization of delays or how the actual AVL trajectory matches the intended timetable. Then there could be a ridership tab, showing both 1st boardings and transfers for that stop + route variant combo. This probably gets shown as a timeseries, with configurable ranges / aggregation rules. Options to export CSV, maybe some kind of simple anomaly detection if a large date range is selected.
Aside from this main network view, I still feel like some kind of "playback one day" mode could be useful, even if just for debugging. This would have the time slider and show real bus positions, with styling / tooltips to indicate current capacity and delays. Maybe we draw a "ghost" bus to show how far behind schedule the actual bus is.
https://user-images.githubusercontent.com/1664407/173131762-6ff6f5ed-53c6-4780-84af-607c42be5401.mp4
Part of the UI described above is now working. The main view shows stops and route variants, and you can filter by date. Next up will be clicking a stop to see different stats broken down by route variant.
For the moment, I've decided route variant is the main unit of analysis that makes sense. The concept of a route is useful for reporting and communicating service to riders, but seeing exactly the stops it visits and the days it runs seems pretty vital.
You can now click a stop to see all of the route variants passing through it. (And they're filtered down based on the days selected.)
https://user-images.githubusercontent.com/1664407/173396759-d485bfab-722c-4baa-bd64-a3dcc792b989.mp4
Next step is to put stats on boarding events and on delays for this route+stop combo here. But to do that, we still need to join all the datasets. As a small next step there for debugging, I'm drawing a green circle to show ticketing events. You can hover over one to see how far the event is from the bus's current position according to AVL. I haven't taken any stats yet to make sure these datasets match up, but eyeballing it so far, things look good.
https://user-images.githubusercontent.com/1664407/173397272-4b0c29bf-8623-4662-96ed-ca19d78d59bc.mp4
A slew of changes today:
The main work was trying to narrow down which route variant a vehicle was serving based on ticketing data. I tried snapping each ticketing event position to the nearest point on the route's shape, then figuring out the distance along that polyline. If you sort those distances by time of the ticketing events, you'd expect them to smoothly increase or decrease. But they jump all over the place, and I don't know why yet.
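For reference, the snapping step can be sketched like this. It assumes positions are already in a projected coordinate system measured in meters; the `Pt` type and `snap_to_polyline` function are made-up names for this example, not the project's API.

```rust
/// A point in a projected coordinate system, in meters. (Assumption: positions
/// have already been converted from longitude/latitude.)
#[derive(Clone, Copy, Debug, PartialEq)]
pub struct Pt {
    pub x: f64,
    pub y: f64,
}

/// Snap `pt` to the nearest point on `polyline`. Returns
/// (distance along the polyline to the snapped point, distance from `pt` to it),
/// or None if the polyline has fewer than 2 points.
pub fn snap_to_polyline(polyline: &[Pt], pt: Pt) -> Option<(f64, f64)> {
    let mut best: Option<(f64, f64)> = None;
    let mut dist_so_far = 0.0;
    for pair in polyline.windows(2) {
        let (a, b) = (pair[0], pair[1]);
        let (dx, dy) = (b.x - a.x, b.y - a.y);
        let len_sq = dx * dx + dy * dy;
        let len = len_sq.sqrt();
        // Fraction along this segment of the closest point, clamped to the segment
        let t = if len_sq == 0.0 {
            0.0
        } else {
            (((pt.x - a.x) * dx + (pt.y - a.y) * dy) / len_sq).clamp(0.0, 1.0)
        };
        let (px, py) = (a.x + t * dx, a.y + t * dy);
        let offset = ((pt.x - px).powi(2) + (pt.y - py).powi(2)).sqrt();
        // Keep the segment whose closest point is nearest to `pt`
        if best.map_or(true, |(_, best_offset)| offset < best_offset) {
            best = Some((dist_so_far + t * len, offset));
        }
        dist_so_far += len;
    }
    best
}
```

Nearest-point snapping returns only one candidate. On a shape that loops back near itself, two nearby positions can map to very different distances along the line. That is a property of this sketch worth keeping in mind, not a diagnosis of the jumps described above.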
To try and figure it out, I also made a way to draw vehicle trajectory by interpolating between these ticketing event positions. First, this only works for 236 of the 376 vehicles -- so a bunch of vehicles don't have any ticketing data for the day? Maybe those are the ones that just kind of idle around near what's probably a bus depot. There are also a few cases where ticketing data refers to vehicles not in AVL.
But the bigger problem is when I draw the trajectories built from ticketing, they look like they jump all over the place. They roughly cover the same area as AVL, but the ordering by time seems dodgy:
https://user-images.githubusercontent.com/1664407/173632994-7607d8a2-77f6-4ac6-9375-f21410d48605.mp4
I'm at a Turing conference in Exeter next two days, so expect less progress
First the easy update: on the web version (only), the app now has Mapbox drawn behind it. It doesn't sync super smoothly when you move around, but it's functional. I'll start thinking through how to display the extra routes/vehicles/stops on top with sane colors.
https://user-images.githubusercontent.com/1664407/174327174-8a609e01-d5b0-4953-a574-92de956016a3.mp4
I've otherwise been trying to figure out how to match up the times/positions from ticketing data to narrow down a route variant. I modified the UI to be able to inspect the trajectories better:
https://user-images.githubusercontent.com/1664407/174328324-39501ae8-bec8-428b-9035-d0d5fe8e1e21.mp4
The yellow circle is the selected bus. The pink line shows the "alternate theory" for where that bus is at the current time. The trajectory built from the BIL data will differ from the AVL trajectory when the bus doesn't pick up any passengers at some stops. That explains all of the straight lines between stops shown.
So I still need to come up with a metric for scoring how likely a route variant "explains" AVL or BIL data (or both). I think I'm going to snap ticketing events to possible stops and check the order of those next.
I got kind of stuck today, but I at least wrote down the structure that I'm trying to assemble:
pub struct BoardingEvent {
    pub vehicle: VehicleID,
    // For convenience
    pub variant: RouteVariantID,
    pub trip: TripID,
    pub stop: StopID,
    pub arrival_time: Time,
    pub departure_time: Time,
    pub new_riders: Vec<JourneyID>,
    pub transfers: Vec<JourneyID>,
}
My latest idea to do the matching is to create a time-space trajectory from the GTFS schedule, then try to match AVL trajectories to that. To prune the search, the BIL data will at least say what possible routes (based on short name) a vehicle handles. GTFS trips have an effective start and end time, so that can be used to clip the AVL trajectory. If buses aren't horribly delayed, then some kind of distance function between trajectories might do the trick. I'll give that a shot first.
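To make the "distance function between trajectories" idea concrete, here is one possible sketch: sample both trajectories at a fixed time step over the scheduled trip's time range and average the straight-line distance between them. The `Sample` type, `position_at`, `mean_distance`, and the choice of a mean (rather than a max or some other measure) are all assumptions for illustration.

```rust
/// (time in seconds, x in meters, y in meters). Trajectories are slices of
/// samples sorted by time. (Hypothetical representation for this sketch.)
pub type Sample = (f64, f64, f64);

/// Linearly interpolate the position at time `t`; None outside the time range.
pub fn position_at(traj: &[Sample], t: f64) -> Option<(f64, f64)> {
    let first = traj.first()?;
    let last = traj.last()?;
    if t < first.0 || t > last.0 {
        return None;
    }
    // Index of the first sample at or after t
    let idx = traj.partition_point(|s| s.0 < t);
    let (t2, x2, y2) = traj[idx];
    if idx == 0 || t2 == t {
        return Some((x2, y2));
    }
    let (t1, x1, y1) = traj[idx - 1];
    let f = (t - t1) / (t2 - t1);
    Some((x1 + f * (x2 - x1), y1 + f * (y2 - y1)))
}

/// Mean distance between the two trajectories, sampled every `step` seconds
/// over the scheduled trip's time range. Times where the actual trajectory has
/// no data are skipped. Returns None if nothing could be compared.
pub fn mean_distance(scheduled: &[Sample], actual: &[Sample], step: f64) -> Option<f64> {
    if step <= 0.0 {
        return None;
    }
    let start = scheduled.first()?.0;
    let end = scheduled.last()?.0;
    let mut sum = 0.0;
    let mut n = 0u32;
    let mut t = start;
    while t <= end {
        if let (Some((sx, sy)), Some((ax, ay))) = (position_at(scheduled, t), position_at(actual, t)) {
            sum += ((sx - ax).powi(2) + (sy - ay).powi(2)).sqrt();
            n += 1;
        }
        t += step;
    }
    if n == 0 {
        None
    } else {
        Some(sum / f64::from(n))
    }
}
```

Because both trajectories are compared at the same instant, this kind of score is sensitive to any time offset between them: a bus that follows the right shape but runs late would score badly.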
Flying to New Orleans tomorrow for family stuff, next update might be Wednesday
Hiatus ending now. Some of the updates from the last few days are summarized in #9. I'll just show a few things. First, you can choose a vehicle, find all possible GTFS trips that it might serve, and scroll through trajectories. The pink line shows the equivalent position in the AVL data for that time, to eyeball if the AVL data matches the trip trajectory well or not:
https://user-images.githubusercontent.com/1664407/176040294-23e808d8-5e88-40e7-9473-099deb851e07.mp4
And I made progress splitting the AVL trajectory into non-overlapping pieces, which should make it easier to reason about matching to trips. This splitting is still kind of brittle and has issues near interstate cloverleaves/fly-overs, but it's better:
https://user-images.githubusercontent.com/1664407/176041262-fff531ef-4141-4873-ade5-8638644f57e4.mp4
Today, a small bit of visualization of how well the trajectory from an expected GTFS trip matches AVL:
https://user-images.githubusercontent.com/1664407/176278107-b5bb8399-48f7-409e-95b6-0cd976e29f42.mp4
The red line is AVL, clipped to the time range of the cyan GTFS trip. The geometry often matches pretty decently. The problem is that it looks like there's usually a big time offset / delay from the schedule. So if we look at where the bus is during one trip that's supposed to happen, it's almost meaningless -- the bus could still be working on another trip in the opposite direction.
A vague idea: try to first chop up one vehicle's trajectory into a sequence of route variants. In the common case, that's just two variants (opposite directions) back and forth. Then we know the sequence of expected stop positions, so we can find all times the vehicle passes close to that stop, and put things in order based on the expectation.
The other idea I started today was an inversion of how matching has been happening. Instead of trying to match vehicles to a list of trips, look at all of the trips that're supposed to happen as "demand", and vehicles that might be serving each trip as "supply." A vehicle can only serve one trip at a time, so make a table of non-overlapping time intervals per vehicle, and try doing the assignment backwards.
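A minimal sketch of that supply/demand assignment, under some assumptions: trips are processed in order of start time, a vehicle is a candidate only if it is known to handle the trip's route short name, and the first free candidate wins. `ScheduledTrip`, `Vehicle`, and `assign_trips` are hypothetical names, and times are seconds since midnight.

```rust
use std::collections::HashMap;

/// A trip from the schedule ("demand"). Hypothetical type for this sketch.
pub struct ScheduledTrip {
    pub id: usize,
    pub route_short_name: String,
    pub start: u32,
    pub end: u32,
}

/// A vehicle ("supply") and the routes it is known to handle.
pub struct Vehicle {
    pub id: usize,
    pub routes: Vec<String>,
}

/// Greedily assign each trip to the first candidate vehicle that is free for
/// the trip's whole time interval. Returns trip ID -> vehicle ID; trips with
/// no free candidate are left out of the map.
pub fn assign_trips(mut trips: Vec<ScheduledTrip>, vehicles: &[Vehicle]) -> HashMap<usize, usize> {
    trips.sort_by_key(|t| t.start);
    // Per vehicle, the non-overlapping intervals already claimed
    let mut busy: HashMap<usize, Vec<(u32, u32)>> = HashMap::new();
    let mut assignment = HashMap::new();
    for trip in &trips {
        for vehicle in vehicles {
            if !vehicle.routes.contains(&trip.route_short_name) {
                continue;
            }
            let intervals = busy.entry(vehicle.id).or_default();
            // Touching endpoints (one trip ends exactly when the next starts) are allowed
            let free = intervals
                .iter()
                .all(|&(start, end)| trip.end <= start || trip.start >= end);
            if free {
                intervals.push((trip.start, trip.end));
                assignment.insert(trip.id, vehicle.id);
                break;
            }
        }
    }
    assignment
}
```

Being greedy, this can leave a trip unassigned even when a different ordering would have fit it in.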
The naive / greedy approach assigns about 3000 (half) of the expected trips to vehicles, with the other half unassigned. That's a huge missing chunk. I still need to understand the programmed file exceptions (#8) to account for these, probably.
But a few wins -- for route 349, all 47 trips successfully get assigned to 1 vehicle. This is at least one simple case where we could now try to refine the assignment details.
I feel like I'm close to something working for matching:
https://user-images.githubusercontent.com/1664407/176562974-7460a250-36c5-4072-a971-d9f0d5eba7ba.mp4
You can click a vehicle, see the possible route variants that match, then compare the trajectory to that shape. I'm checking every time the vehicle passes close to each stop position. In the simplest case, we should then be able to just reconstruct trips by looking at the 1st time we're near stop 1, 2, 3... then the 2nd time, and so on.
An example timetable of one trip: 05:56:00.0, 05:59:00.0, 05:59:55.0, 06:00:34.8, 06:01:23.0, 06:02:16.0, 06:03:16.0, 06:04:03.0, 06:04:58.0, 06:05:42.3, 06:06:02.7, 06:06:36.0, 06:05:43.7, 06:08:42.8, 06:09:09.2, 06:09:31.2, 05:49:26.4, 05:48:59.0, 06:11:32.0, 05:48:07.0, 06:13:09.1, 05:47:04.3, 06:14:44.0, 06:15:39.0, 05:45:21.3, 05:44:55.3, 05:44:04.6, 05:43:28.1, 05:43:09.1, 05:42:30.9, 05:41:44.0, 06:23:55.0, 06:25:00.0, 06:25:47.0, 06:26:37.0, 06:28:00.0, 06:28:41.8, 06:30:04.9, 06:31:11.7, 06:32:35.0, 06:33:10.8, 06:33:39.9, 06:33:53.4, 06:36:27.5, 06:37:15.3, 06:38:40.7, 06:40:19.6
Seems good at first, but the time jumps backwards at multiple points. I think part of the problem is that the stops are very close together, so snapping isn't always clear:
But assuming we can work through some of these issues, I think the next steps would be:
1) construct a set of "actual" trips the vehicle makes along each route variant, with arrival time per stop
2) look at all the ticketing events for that vehicle and just snap them to the previous stop time. We can assume the person boarded at that stop, and so we can group all the people boarding together.
3) figure out how to compare "actual" trips to GTFS trips, so we can talk about delay. This could get really strange if there's so much delay that a bus only manages 5 trips instead of the expected 6, for example.
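Step 2 could look roughly like the following, assuming we already have one vehicle's arrival times sorted in ascending order (seconds since midnight). `previous_stop_index` and `count_boardings` are hypothetical helper names for this sketch.

```rust
/// Given one vehicle's stop arrival times, sorted ascending, return the index
/// of the latest arrival at or before `event_time`. None if the event happens
/// before the first arrival.
pub fn previous_stop_index(arrivals: &[u32], event_time: u32) -> Option<usize> {
    // partition_point returns how many arrivals are <= event_time
    arrivals
        .partition_point(|&t| t <= event_time)
        .checked_sub(1)
}

/// Count ticketing events per stop arrival by snapping each event to the
/// previous arrival. Events before the first arrival are dropped.
pub fn count_boardings(arrivals: &[u32], event_times: &[u32]) -> Vec<usize> {
    let mut counts = vec![0; arrivals.len()];
    for &t in event_times {
        if let Some(i) = previous_stop_index(arrivals, t) {
            counts[i] += 1;
        }
    }
    counts
}
```

Note that this sketch puts no upper bound on the gap between the arrival and the event, so every event after the final arrival lands on the last stop. A real version would probably want a maximum gap.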
Huge progress with matching! A refinement of the idea from yesterday just enforces that times between stops increase. There's noise when a bus passes close to stops out-of-order, especially when there are two stops on opposite sides of a street. So if we can assume the first time a bus is near the first stop is correct, then we can build things up from there. Some examples for vehicle 224 (showing new UI bits useful for debugging):
https://user-images.githubusercontent.com/1664407/176757674-870d3f3f-5010-4346-97e5-95203bf8fdc3.mp4
Vehicle 224 follows route 430 from 5:56 to 6:40 -- verifying manually, it's a great match! The same heuristic claims another round of this happens from 7:42 to 10:12 -- but that's much longer, what's happening? The problem is between stop 22 -> 23:
https://user-images.githubusercontent.com/1664407/176758653-040824d2-cc75-458f-9713-8e4bb1b3953f.mp4
Using a 10 meter threshold, the trajectory doesn't get close enough until 8:59, even though from watching manually, it clearly happens at 8:01. This particular issue goes away if I increase the threshold to 20 meters, but then a few more resulting trips are found with inexplicably large ~hour long intervals between nearby stops. I'll keep iterating on this to only wind up with good trips.
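The "times between stops must increase" heuristic can be sketched as below. It assumes an earlier step has already collected, for each stop in route order, every time the vehicle came within the distance threshold of that stop. The `reconstruct_trip` function and its input layout are assumptions for illustration, not the project's actual code.

```rust
/// `near_times[i]` holds every time (seconds since midnight, ascending) at
/// which the vehicle was within the distance threshold of stop `i`, with stops
/// in route order. Starting at `start_after`, pick one time per stop so that
/// times strictly increase from stop to stop. Returns None if some stop is
/// never reached after the previous one.
pub fn reconstruct_trip(near_times: &[Vec<u32>], start_after: u32) -> Option<Vec<u32>> {
    let mut result = Vec::with_capacity(near_times.len());
    let mut prev = start_after;
    for (i, times) in near_times.iter().enumerate() {
        let next = times
            .iter()
            .copied()
            // The first stop may be hit exactly at `start_after`; later stops
            // must come strictly after the previous stop's time
            .find(|&t| if i == 0 { t >= prev } else { t > prev })?;
        result.push(next);
        prev = next;
    }
    Some(result)
}
```

Calling this repeatedly, each time passing the end of the previous trip as `start_after`, would yield successive trips. It also shows how one missed stop (for example, a threshold that is too tight) stretches a trip: the search simply waits for the next time the vehicle passes that stop.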
Let's finally try to put together a schedule of actual trips for a vehicle by using the above. First attempt, just sorting trips by start time and seeing what fits:
A bunch of trips for RouteVariantID(870) were skipped. 870 and 873 look like opposite directions of each other. So I'd expect more back-and-forth / interleaving of the schedule.
How about sorting by trip duration and greedily slotting things in where they fit? (Long trips are usually matching bugs right now.) All 4 of the variant 873 trips were skipped, because they overlapped. Those were also long, so maybe those were a weird case anyway. I'll keep checking these results for more vehicles, but more importantly, move on to matching ticketing events and starting to calculate delays from the GTFS schedule. As this underlying trip matching is improved, everything built on top will get better.
Some solid progress:
1) Once we've matched a vehicle to a real sequence of stop times, we match it to the scheduled GTFS trip by minimizing the sum of time differences.
2) We can then crunch through all the ticketing events and find the most likely vehicle that picked them up, by looking for the closest stop time.
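Step 1 can be illustrated with a short sketch. It assumes the actual trip and every candidate GTFS trip list one time per stop (seconds since midnight) in the same stop order, and it uses absolute differences so that early and late arrivals don't cancel out. `best_scheduled_trip` is a hypothetical name.

```rust
/// Pick the scheduled trip whose stop times are closest to the actual stop
/// times, by minimizing the sum of absolute differences. Candidates with a
/// different number of stops are skipped. Returns (index into `scheduled`,
/// total difference in seconds).
pub fn best_scheduled_trip(actual: &[i64], scheduled: &[Vec<i64>]) -> Option<(usize, i64)> {
    scheduled
        .iter()
        .enumerate()
        .filter(|(_, times)| times.len() == actual.len())
        .map(|(idx, times)| {
            let cost: i64 = times
                .iter()
                .zip(actual)
                .map(|(sched, real)| (sched - real).abs())
                .sum();
            (idx, cost)
        })
        .min_by_key(|&(_, cost)| cost)
}
```

The per-stop early/late value for the chosen trip is then just the signed difference between the actual and scheduled time at each stop.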
For a route variant, we can then display a (very ugly) table: 4 different vehicles served this route over the day, a total of 6 trips. The arrival time at each stop is shown, along with how early/late that is compared to the GTFS trip.
This table also shows people boarding... when there's actually data. Clearly this isn't working yet. 257 + 108 people board one bus at the very last stop, and not during any other stop. While matching ticketing events to actual stop times, I'm building a histogram of the delay between the bus arriving and the tap-on event. The same bug is visible here:
58,399 ticketing events matched to actual trips. 78,619 unmatched
Of the matched, how long between the bus arriving and the ticketing? 58,399 count, 50%ile 6hr 38min 51s, 90%ile 13hr 24min 51.5s, 99%ile 17hr 13min 28.6s, min 1.9s, mean 7hr 7min 17.4s, max 18hr 48min 9.1s
No major algorithmic advances, but I'm now getting much better results for matching everything together. The trick has been to increase the distance between a bus position and stop for a possible match. Currently, the total stats for one day are:
My strategy to keep improving this:
1) Look for matched trips that're really long. There's usually a bug.
2) Or look for long gaps where a vehicle isn't doing anything, but is moving around. It's probably actually serving a trip.
3) Try things to fix the found problems. Validate in that one example case, and by making sure the overall stats trend in the expected direction.
The matching process is now getting to the point where it produces enough useful data to really think about next steps. I have some scattered thoughts about that, but I'll try and post about that tomorrow.
Edit: I also posted sample data and started a discussion at https://github.com/anitagraser/movingpandas/discussions/229 to get more ideas from people working with trajectory data
I fixed a bug with sorting vehicle schedules, and it really improved the sanity check on boarding times vs bus arrival times:
Of the matched, how long between the bus arriving and the ticketing? 99,375 count, 50%ile 43s, 90%ile 57min 26s, 99%ile 5hr 57min 14.6s, min 0s, mean 23min 9.5s, max 14hr 13min 57s
Of the matched, how far between the bus stop and the ticketing event? 99,375 count, 50%ile 37.97m, 90%ile 5536.49m, 99%ile 12205.43m, min 0.0372m, mean 5768.81m, max 5336214.9853m
I made a bunch of UI changes to make it easier to debug vehicles doing strange things (not serving routes for a while, serving unusual trips with high delays) and jump back and forth between different bits:
https://user-images.githubusercontent.com/1664407/178543696-31a76554-f034-42fd-89e4-544c1b421cc5.mp4
Finally, I wrote up https://github.com/dabreegster/bus_spotting/blob/main/design.md, which breaks down the work so far as 3 layers. The focus for the remaining weeks will be on layer 3, though the matching in layer 2 still needs work.
I spent the last few days rearranging code to have a clear split between the single day and multiday model. The split includes the UIs -- the interface that lets you replay a single day and debug trajectory matching is totally unrelated to the multiday UI (which right now just displays the GTFS scheduled stuff). Remaining work will be on the multiday UI + analyses.
Edit: and I tried a fun route visualization experiment in #11
Was focused on other projects for the last week while I was at a conference in Paris.
Today I got some UIs started to show the results of multiple days of importing (5 from the provided data). You can color stops to see which ones have the most total boardings and the most daily trips:
https://user-images.githubusercontent.com/1664407/180847353-450bb484-dc16-458e-b9e0-bf2d6fce72c4.mp4
There's also a table breaking down total boardings per route variant, by the hour. The table UI is awful / barely usable. This sort of thing will take me lots of time to improve, so I'm not planning to do it for the scope of this project -- I think the effort would be better spent in the future investigating writing the UI in JS and leveraging a bunch of dataviz libraries that exist.
The results have major outliers near the center of the study area.
I got inspired to work on #11 again. The import now optionally takes a .osm.xml file, pipes it through https://github.com/a-b-street/osm2streets, does extremely simple snapping of bus routes to the street network (with many known limitations), and attempts to draw non-overlapping routes. And as a side effect of bringing in a street network, we can subtly draw that underneath for additional context.
https://user-images.githubusercontent.com/1664407/181508535-0ef83e64-9a83-4a2c-b41d-9048ce52775d.mp4
Now working on docs / a summary for tomorrow's meeting.
I'll post here every few days with current progress/problems. Here's the first. Since this is the first update, I'll also briefly cover what's happened in the past few days.
AVL
There's a mode to load an AVL file and animate bus trajectory, but it doesn't do anything useful yet:
https://user-images.githubusercontent.com/1664407/172400514-dff6dbf1-d7d8-4eec-b701-cc9d6ce88169.mp4
Loading data
Most of the focus the past few days has been getting the import flow to work. When running natively, it's not hard to point to a folder with all the GTFS, BIL, AVL, etc data. But to be a web-first app, it's unclear how to read a whole directory in a browser. So instead, I decided to make the input just be a single .zip file. This can be loaded either natively or in the browser; nothing is uploaded anywhere in either case. That import takes a few seconds (currently) and will increase as we start importing longer timespans.
The import only needs to happen once. So ideally after we import, we save the final model, and can more cheaply load that in the future. That works fine on native, but in the browser, I'm having trouble getting https://github.com/a-b-street/abstreet/blob/1df9eb940a3464ecdab4361cb191ada6807b696c/abstio/src/io_web.rs#L185 to work consistently. There seem to be file size limits with this trick.
GTFS
Most of the focus has been on rearranging GTFS data to be meaningful. Here's a demo of how things work now:
https://user-images.githubusercontent.com/1664407/172401625-92e023ee-1999-4ee2-b4c7-e2c28255cf78.mp4
You can select a single route at a time. A route consists of many individual trips, but these can be further grouped into "variants" that better match intuition. A single route usually has about 4 variants -- outbound and inbound, for weekdays and weekends. A different set of stops is visited in each variant.
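As a rough illustration of grouping trips into variants: trips that share a route, a service calendar, and an identical ordered stop sequence can be treated as one variant. `route_id`, `service_id`, and `trip_id` are standard GTFS fields, but the `Trip` struct, the `group_into_variants` function, and the exact choice of grouping key are assumptions for this sketch, not necessarily how the import defines a variant.

```rust
use std::collections::BTreeMap;

/// One GTFS trip with its ordered stop sequence already joined in from
/// stop_times.txt. (Hypothetical type for this sketch.)
pub struct Trip {
    pub trip_id: String,
    pub route_id: String,
    pub service_id: String,
    pub stop_ids: Vec<String>,
}

/// The key that defines a variant here: (route, service calendar, ordered stops)
pub type VariantKey = (String, String, Vec<String>);

/// Group trips into variants. Each value lists the trip IDs belonging to that
/// variant, in input order.
pub fn group_into_variants(trips: &[Trip]) -> BTreeMap<VariantKey, Vec<String>> {
    let mut variants: BTreeMap<VariantKey, Vec<String>> = BTreeMap::new();
    for trip in trips {
        let key = (
            trip.route_id.clone(),
            trip.service_id.clone(),
            trip.stop_ids.clone(),
        );
        variants.entry(key).or_default().push(trip.trip_id.clone());
    }
    variants
}
```

With this key, an outbound and an inbound trip on the same route fall into different variants because their stop sequences differ, and weekday and weekend trips differ by `service_id`.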
There's a "filter dates" widget at the top, but it doesn't work yet. GTFS describes routes that occur over a long timespan, so to ask questions about a route, it may be necessary to also specify when we want to view that route.
I'm trying to build up to the point where we can also load the AVL and BIL datasets, and start to link everything together. For a particular day, we can subset the GTFS schedule and find only the trips scheduled for that day. Then we can link up route_short_name and vehicle IDs, and figure out what GTFS trips a vehicle is serving. That'll let us interpret the AVL trajectory and produce an "actual" schedule of stop times (a list of (stop ID, arrival time, departure time) tuples). We can compare that to the "idealized" schedule in GTFS and start to measure delay. At first that definition will just focus on one day at a time, but then we can think about aggregating over longer timespans.