CUTR-at-USF / ontime-performance-calculator

An application to calculate on-time performance using archived GTFS-realtime data

Populating schedule_deviation column in dbo.vehicle_positions table #6

mohangandhiGH closed this issue 7 years ago

mohangandhiGH commented 7 years ago

schedule_deviation = (GPS timestamp - scheduled arrival time from GTFS stop_times.txt), in milliseconds. Positive numbers mean the vehicle is running late, while negative numbers mean the vehicle is arriving early.
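A minimal sketch of the definition above, assuming the GPS timestamp has already been converted to seconds after midnight of the service day in the agency's local time. The helper names `parse_gtfs_time` and `schedule_deviation_ms` are hypothetical and not taken from the project:

```python
def parse_gtfs_time(hms: str) -> int:
    """Convert a GTFS HH:MM:SS string (hours may exceed 24) to seconds."""
    hours, minutes, seconds = (int(part) for part in hms.split(":"))
    return hours * 3600 + minutes * 60 + seconds


def schedule_deviation_ms(gps_seconds: int, scheduled_arrival: str) -> int:
    """Return GPS time minus scheduled arrival time, in milliseconds.

    Positive means the vehicle is late, negative means it is early.
    """
    return (gps_seconds - parse_gtfs_time(scheduled_arrival)) * 1000


# A vehicle observed at 08:05:00 for a stop scheduled at 08:00:00 is 5 minutes late.
late = schedule_deviation_ms(parse_gtfs_time("08:05:00"), "08:00:00")
# A vehicle observed at 07:58:00 for the same stop is 2 minutes early.
early = schedule_deviation_ms(parse_gtfs_time("07:58:00"), "08:00:00")
```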

barbeau commented 7 years ago

It should be noted that this is just an approximation of schedule deviation, as the GPS location will never be at the exact same position as the bus stop (we're recording distance from stop to GPS location in another database field). Future work would include creating heuristics to approximate actual schedule deviation by extrapolating/interpolating when the vehicle actually arrived at the stop, perhaps using distance to stop, street network information, GPS speed, etc.

barbeau commented 7 years ago

@mohangandhiGH One more item I just thought of: I just added another field to the database, timepoint, which is a boolean value. In the HART GTFS data, stop_times.txt contains a timepoint value for each stop in the trip. When writing the closest_stop_id to the database, please check whether this closest_stop_id has timepoint set to 1 in stop_times.txt. If yes, set timepoint to true in the database. If timepoint is set to 0 in stop_times.txt, or if the timepoint field is missing from the GTFS data, then set timepoint to false in the database.

barbeau commented 7 years ago

Actually, I need to revise the above:

If timepoint is set to 0 in stop_times.txt, or if the timepoint field is missing from the GTFS data, then set timepoint to false in the database.

...should say:

If timepoint is set to 0 in stop_times.txt, then set timepoint to false in the database. If the timepoint field is missing from the GTFS data, then leave the timepoint field empty (null) in the database.
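The revised rule maps to a three-state value. A minimal sketch, assuming each stop_times.txt row is available as a dict (for example from `csv.DictReader`); the function name `timepoint_for_db` is hypothetical, and treating an empty value the same as a missing column is an assumption of this sketch rather than something stated in the thread:

```python
from typing import Optional


def timepoint_for_db(stop_time_row: dict) -> Optional[bool]:
    """Map the GTFS timepoint field to the nullable boolean database column."""
    raw = stop_time_row.get("timepoint")
    # Column absent from the feed (or empty, by assumption): leave the field null.
    if raw is None or str(raw).strip() == "":
        return None
    # "1" becomes true; "0" becomes false.
    return str(raw).strip() == "1"
```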

mohangandhiGH commented 7 years ago

@barbeau Hi Sean, there is a limitation in calculating schedule_deviation. You may have mentioned this to me at some point, but I want to record it once more. The problem arises when the real-time feed contains an arrival timestamp of 1:00:00 am but the static feed contains an arrival_time of 26:00:00. First I tried converting 26:00:00 to 2:00:00 am (26:00:00 - 24:00:00) and then calculating the deviation as -1 hr (i.e., 1:00:00 - 2:00:00). In this case the result is correct.

But if the rt arrival time is 23:00:00, schedule_deviation comes out as +21 hrs (i.e., 23:00:00 - 2:00:00). This is not correct; it should be -3 hrs.

If we did not subtract 24 hrs from static arrival_time values greater than 24 hrs, we would get the same results but flipped (i.e., the first calculation wrong and the second right): 1:00:00 - 26:00:00 = -25 hrs and 23:00:00 - 26:00:00 = -3 hrs. So subtracting 24 hours when the static arrival_time is greater than 24 hrs does not really solve the problem. I overlooked this at first.
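The four cases can be reproduced with a few lines of arithmetic. This is only an illustration of the problem, not the project's code; `to_seconds` and `naive_deviation_hours` are hypothetical names:

```python
HOUR = 3600


def to_seconds(hms: str) -> int:
    """Convert HH:MM:SS (hours may exceed 24) to seconds."""
    h, m, s = (int(part) for part in hms.split(":"))
    return h * HOUR + m * 60 + s


def naive_deviation_hours(rt_hms: str, static_hms: str, wrap_static: bool) -> float:
    """Real-time minus static arrival time, in hours, with optional 24 h wrap."""
    static = to_seconds(static_hms)
    if wrap_static and static > 24 * HOUR:
        static -= 24 * HOUR  # 26:00:00 becomes 2:00:00
    return (to_seconds(rt_hms) - static) / HOUR


for wrap in (True, False):
    for rt in ("01:00:00", "23:00:00"):
        print(wrap, rt, naive_deviation_hours(rt, "26:00:00", wrap))
# With the wrap:    01:00 gives -1.0 (right),  23:00 gives 21.0 (wrong)
# Without the wrap: 01:00 gives -25.0 (wrong), 23:00 gives -3.0 (right)
```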

Overall, the problem lies in how timestamp values are stored in the rt feed. If a trip runs past midnight before the service day has finished, the rt timestamps are not stored as values greater than 24 hrs the way they are in the static feed.

Currently, the code works fine if both the scheduled arrival_time and the rt arrival time fall between 0 and 24 hours.

I need your suggestion on how to proceed: do we document this as a limitation, or think of some other workaround?

barbeau commented 7 years ago

I might need to think a little more about this, but off the top of my head - in the case of HART, they only run service from around 5am to 1am the following morning: http://www.gohart.org/Pages/maps-schedules.aspx

So I think for now it's ok to assume that if a GPS timestamp is between midnight and 3am, then it is for the service from the previous day, and you can convert the time to seconds after midnight from the previous day.
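A minimal sketch of this cutoff rule, assuming the GPS timestamp has already been converted to the agency's local time. The 3am cutoff comes from the comment above; the name `service_day_seconds` is hypothetical:

```python
from datetime import date, datetime, timedelta
from typing import Tuple

SERVICE_DAY_CUTOFF_HOUR = 3  # local times before 3am belong to the previous service day
DAY_SECONDS = 24 * 3600


def service_day_seconds(local_dt: datetime) -> Tuple[date, int]:
    """Return (service date, seconds after midnight of that service date)."""
    seconds = local_dt.hour * 3600 + local_dt.minute * 60 + local_dt.second
    if local_dt.hour < SERVICE_DAY_CUTOFF_HOUR:
        # Attribute the observation to yesterday's service and count past 24:00:00.
        return local_dt.date() - timedelta(days=1), seconds + DAY_SECONDS
    return local_dt.date(), seconds


# 1:00 am on March 2 becomes 25:00:00 on the March 1 service day, so against
# a scheduled arrival_time of 26:00:00 (93600 s) the deviation is -1 hour.
service_date, secs = service_day_seconds(datetime(2017, 3, 2, 1, 0, 0))
deviation_hours = (secs - 93600) / 3600
```

With both values expressed as seconds after midnight of the same service day, the 23:00:00 case also comes out as -3 hours without any special handling.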

A way to generalize to other agencies might be to look for a gap in the GPS data that indicates no service in the middle of the night, then calculate the middle of that gap and apply the same cutoff calculation. There will be corner cases for agencies that run 24/7 service where this doesn't work, but we can deal with that later.

I'll think about this a bit more, but I think the above strategy would be reasonable for now.

mohangandhiGH commented 7 years ago

In our current RT database, there are many records whose timestamp is between 3am and 5am. Can we ignore these records for now?

barbeau commented 7 years ago

GPS timestamps in the database should be in UTC, so you'll need to convert them to Eastern time to match the GTFS.
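A minimal sketch of that conversion using Python's standard `zoneinfo` module. It assumes the stored timestamp is epoch milliseconds in UTC, which the thread does not confirm, and uses `America/New_York` for Eastern time; the name `utc_millis_to_local` is hypothetical:

```python
from datetime import datetime, timezone
from zoneinfo import ZoneInfo

AGENCY_TZ = ZoneInfo("America/New_York")  # Eastern time, including DST rules


def utc_millis_to_local(epoch_ms: int) -> datetime:
    """Convert a UTC epoch timestamp in milliseconds to agency local time."""
    utc_dt = datetime.fromtimestamp(epoch_ms / 1000, tz=timezone.utc)
    return utc_dt.astimezone(AGENCY_TZ)


# 06:00 UTC on March 2, 2017 is 01:00 Eastern Standard Time the same day.
sample_ms = int(datetime(2017, 3, 2, 6, 0, tzinfo=timezone.utc).timestamp() * 1000)
local = utc_millis_to_local(sample_ms)
```

For other agencies, the time zone would more reasonably be read from agency_timezone in the GTFS agency.txt file than hard-coded.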