CivicTechTO / ttc_subway_times

A scraper to grab and publish TTC subway arrival times.
GNU General Public License v3.0

Thoughts on processing NTAS data #1

Open samkodes opened 7 years ago

samkodes commented 7 years ago

Hi Raphael – I have a few brief thoughts on beginning to process the NTAS data that I thought I’d share in case they’re helpful. I’d be happy to start playing with implementing this processing once some data is collected (even a day’s worth).

There are two general approaches that I think could be fruitfully combined. The first approach tries to reconstruct the system’s prediction model by extracting predicted travel times between stations and looking for typical and exceptional patterns (if the system is really dumb, there will be no exceptional patterns and all we’ll get is constant travel times between stations; if the system is smart, we’ll get more information – see below). The second approach tracks variation in predicted times for each train as it moves through the system.

Both approaches assume a database that stores a single train prediction per record, with some pre-processing done to create a field called PAT (predicted arrival time) – just createDate + timeInt. So a record would have stationId, trainId, PAT, createDate, etc. I’m assuming a trainId refers to a “run” of the train, as Sammi deGuzman’s blog suggests. If the same trainId appears on multiple runs, some time-based filtering will have to happen below to make sure we’re picking up only a single run of a train.
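For concreteness, here's a minimal pandas sketch of that pre-processing, assuming the raw output lands in a CSV with createDate and timeInt columns and that timeInt is a minutes-until-arrival value (both assumptions – adjust to whatever the scraper actually stores):

```python
import pandas as pd

# Hypothetical file/column names; adjust to the actual scraper output.
records = pd.read_csv("ntas_records.csv", parse_dates=["createDate"])

# PAT = createDate + timeInt, assuming timeInt is minutes until arrival.
records["PAT"] = records["createDate"] + pd.to_timedelta(records["timeInt"], unit="m")

# Each row is now a single prediction: stationId, trainId, PAT, createDate, ...
```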

1) Reconstructing the system’s prediction model by extracting predicted travel times between stations.

Suppose we have two records with the same trainId and different stationIds. Then subtracting PATs gives us a travel time estimate (TTE) between those stations (technically, it also includes load/unload times).
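As a sketch on the hypothetical records frame above (the IDs below are made up):

```python
# Two predictions for the same (made-up) trainId at two different stations.
rec_a = records[(records["trainId"] == 123) & (records["stationId"] == 10)].iloc[0]
rec_b = records[(records["trainId"] == 123) & (records["stationId"] == 14)].iloc[0]

# Travel time estimate between the two stations
# (technically also includes load/unload time along the way).
tte = rec_b["PAT"] - rec_a["PAT"]
```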

If the system is stupid, TTEs between any pair of stations will be constant. This means that there’s a very high degree of redundancy in the NTAS data and there’s no reason to save observations of the same train from multiple stations for future analysis (or alternatively, observations of the same train from multiple stations at different times can be combined very easily).

If the system is smart, TTEs could vary for a number of reasons.

Simply making a histogram of TTEs for any pair of stations should tell us whether the system is smart or not and what kinds of variations it might be picking up. If the system is smart, looking at unusual TTEs and seeing how they move around between stations might give us insight into how local delays propagate through the prediction model.
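A rough sketch of that check, assuming a table of TTEs like the one sketched after the filtering note below (column names are placeholders):

```python
import matplotlib.pyplot as plt

# Hypothetical `ttes` frame with one row per TTE. A single narrow spike
# means the prediction model is effectively constant between these two
# stations; spread or multiple modes suggest it reacts to conditions.
pair = ttes[(ttes["stationId_a"] == 10) & (ttes["stationId_b"] == 14)]  # made-up station IDs
pair["tte_seconds"].hist(bins=50)
plt.xlabel("predicted travel time between stations (seconds)")
plt.ylabel("count")
plt.show()
```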

If we build a table of TTEs, it's probably a good idea to record where each TTE came from – i.e. the two original records that generated it. The table should also contain a createDate, though it's not clear which date that should be when the two source records were sampled at different times (they certainly will be, since we're doing low-frequency sampling) – so perhaps record both createDates.

Some filtering will be required when creating TTEs to use only records sampled close together in time (say, choose the closest times possible, and enforce some maximum time difference); this avoids junk estimates produced if traffic conditions change between the sampling of the two original records.
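A sketch of that TTE-building step, with the five-minute cap as an arbitrary placeholder and the column names assumed rather than taken from the real schema:

```python
import itertools
import pandas as pd

MAX_GAP = pd.Timedelta(minutes=5)  # arbitrary placeholder cap; tune on real data

def build_ttes(records):
    """Pair predictions for the same trainId at two different stations into TTEs.

    For each (trainId, station pair), keeps the two observations sampled closest
    together in time, and records both source createDates so each TTE can be
    traced back to the records that produced it.
    """
    rows = []
    for train_id, grp in records.groupby("trainId"):
        for (sta_a, obs_a), (sta_b, obs_b) in itertools.combinations(grp.groupby("stationId"), 2):
            # All candidate pairings of one observation from each station.
            pairs = obs_a.merge(obs_b, how="cross", suffixes=("_a", "_b"))
            pairs["gap"] = (pairs["createDate_a"] - pairs["createDate_b"]).abs()
            best = pairs.loc[pairs["gap"].idxmin()]
            if best["gap"] > MAX_GAP:
                continue  # sampled too far apart; would give a junk estimate
            rows.append({
                "trainId": train_id,
                "stationId_a": sta_a,
                "stationId_b": sta_b,
                # May be negative, depending on which station the train reaches first.
                "tte_seconds": (best["PAT_b"] - best["PAT_a"]).total_seconds(),
                "createDate_a": best["createDate_a"],
                "createDate_b": best["createDate_b"],
            })
    return pd.DataFrame(rows)
```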

2) Tracking train predictions

Suppose we have multiple records with the same trainId and stationId. Order them by createDate and subtract the first PAT from all the others (alternatively, we could calculate running differences); augment each record by putting this difference in a field called "localDelay". This seems good enough to start identifying problems. Comparing local delays across stations will also help describe how they propagate through the prediction model.
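A minimal sketch of that augmentation on the hypothetical records frame from above:

```python
# Order each (trainId, stationId) group's predictions by createDate and
# express every PAT as an offset from the first prediction in the group.
records = records.sort_values("createDate")
records["localDelay"] = records.groupby(["trainId", "stationId"])["PAT"].transform(
    lambda pats: pats - pats.iloc[0]
)

# Running differences instead of offsets from the first prediction:
# records.groupby(["trainId", "stationId"])["PAT"].diff()
```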