Comparing trajectories - Githubissues

dabreegster commented 2 years ago

The core problem of this project is joining a few different datasets, and the difficulty comes down to comparing and segmenting trajectories. So I'll try to explain my understanding so far.

Definitions

A trajectory is a list of (timestamp, point) representing the movement of a vehicle over time. There's a basic interpolate operation that takes a time and returns the vehicle's position at that time. If the time happens to be one of the exact observations in the list, then that position is used. Otherwise we find the two nearest observations and linearly interpolate along a straight line -- aka, we assume constant speed between observed points.

Sources of trajectories

Trajectories come from 3 places.

1) AVL is real data, from a GPS tracker on the buses. Observations are usually under 30s apart. 2) GTFS trips represent an idealized movement according to a timetable. We have a polyline shape for a certain trip, and we know when the bus is supposed to be at different stops along that polyline. So we can use the stops + times as "keyframes" and linearly interpolate along the rest of the polyline. Note that a single trip does not loop back on itself; it's only one direction of a "route." 3) BIL ticketing events record a time and position when somebody boards a bus. We can draw a trajectory out of these, but it often looks funny and doesn't follow any valid route. If a bus doesn't pick up any passengers at some stop, this trajectory won't go that way. Or if the real route's shape between consecutive stops isn't a straight line, then this trajectory will deviate a fair bit.

The problem

We want to chop up the real AVL trajectory into segments, each of them matching up to a scheduled GTFS trip. (Because this will let us calculate delayed arrivals and match up ticketing events to a route + stop + vehicle combination.) An AVL trajectory usually lasts the whole day; a single vehicle will serve multiple GTFS trips over time. It might also spend some of the time not serving any trip, or just driving from the end of one route to the start of another.

For simplicity, assume that I've already narrowed down the possible trip trajectories that might "explain" part of an AVL trajectory. I just need to figure out how to match them up. As an example, here's what the AVL for one bus looks like: Screenshot from 2022-06-27 12-33-50 As the tooltip shows, the trajectory loops back on itself as the bus alternates between serving one direction of a route and another. It's in the same position at many times.

Here is one specific GTFS trip that this bus might be serving: Screenshot from 2022-06-27 12-34-58 The trajectory doesn't loop back on itself. The pink line is comparing to the AVL -- at this moment in time, the AVL trajectory says the bus is somewhere off to the NE. It's pretty far away -- does that mean the bus isn't serving this trip? Or is it just delayed and farther back? (In this example, the GTFS trip goes from SW to NE, so the bus is actually ahead of schedule, if we assume it's serving this trip.)

There are about 40 trips that match this vehicle. The AVL data maps over to some subset of them in a specific order -- but which?

Score functions

I've tried a bunch of different ideas for seeing how closely an AVL trajectory "explains" a trip.

Score only at stops

A GTFS trip is scheduled to arrive at stops at a certain time. Only score at those points. For each arrival time, interpolate the AVL position and sum distance from the stop to the actual vehicle position.

For a few cases, the total score isn't too crazy, but usually it's thousands of kilometers. This is sensitive to delays -- if a bus is really behind schedule, its position at each scheduled time will always be far away.

A possible variation: calculate when the AVL data says a bus is close to each stop, getting a list of actual times. Compare to the expected times. If the trip goes in the wrong direction, the times should be very out-of-order.

Score by position (naively)

Ignore time completely and just look at the shape of the trajectories. Walk along the GTFS trip trajectory and the AVL trajectory, sampling points every 100 meters. Sum the distance between these.

This doesn't work well at all, unless the AVL trajectory happens to start at the first stop of the GTFS trip.

Score by position (offset)

To fix the above, find the first time when the AVL trajectory is close to the first stop of the GTFS trip. Ignore the AVL trajectory before that, and then do the above scoring by position. Still no use of time.

Score by interpolated times

Walk along the trip trajectory every 60 seconds and interpolate the position on both trajectories, scoring those. Sort of like "score by position" in that it compares at regular intervals, but uses time and position. Is this better than scoring only at stops? We don't really know the expected position of the bus at all times, only at stops, so maybe it just adds noise.

Other ideas

Just by direction?

The possible trips that explain AVL are often nearly the same polyline, just reversed. In other words, we know a bus is either the outbound or inbound variation of a route, and we just need to decide which. Should we calculate a really coarse bearing and just use that to prune?

Looking for direction changes

It might be simpler to figure things out if the AVL didn't loop back on itself. If we're looking at just one trajectory, is there a way to chop it into pieces? I tried walking along the shape one line segment at a time, seeing if that segment intersects any previous line segment. This idea could potentially work, but my implementation has too many false positives right now -- it's finding intersections too eagerly:

https://user-images.githubusercontent.com/1664407/176005127-ad33471a-d306-471f-9a1c-1141794c9aef.mp4

Prior art

I haven't looked around much yet to see if there are well-known solutions to these kinds of problems. https://movingpandas.readthedocs.io/en/master/ is one possibly relevant project -- its StopDetector could be a good idea for splitting AVL, if it's often the case that a bus waits some amount of time before starting to serve the opposite direction of a route.

myyong commented 2 years ago

@dabreegster does there exist a ground truth for this?