ANIALLATOR114 / SimplyTransport

SimplyTransport - API - Website - Ingesting and presenting Transport Information
https://simplytransport.ie
Apache License 2.0

Measure Realtime Accuracy Over Time #89

Open ANIALLATOR114 opened 9 months ago

ANIALLATOR114 commented 9 months ago

Problem to solve

Be able to view or calculate historical arrival delays at a stop for a particular trip.

Challenges

  1. The TFI API does not provide the data I expect and need
  2. Storage and querying of large quantities of data will be an issue
  3. Visualisation of this could be done in multiple ways

Working so far

I've already set up the database to store updates over time. Right now it only keeps 30 minutes, but I could expand this to days or months with a one-liner. I've also created the query to return this data, so I can visualise what I'm working with.

Data

[Image: annotated query results] I'll explain below:

  1. This query shows a particular trip, on a particular stop (sequence 15), joined to every realtime update on this trip.
  2. This is where the bus is right now, between stop 12 and 20 somewhere
  3. Each row in this box is a realtime update for various points along the trip, see they all have the same timestamp
  4. These are the delays in minutes. See how sequence 28 is 0, meaning on time. See how the values near where the bus currently is are nonsensical. How can the bus be on time at 28, but 63 seconds early for 26?

There are a couple of things at play here.

  1. TFI may not be handling midnight correctly, leading to bugs that swing the times in strange ways.
  2. The bus is a million miles away from stop 28, yet that stop has the actual correct delay value.
  3. Values are not present for every stop, only on some stops.

So if my query tries to return realtime updates only for stop sequence 15, I will not get any rows back.

If I return values near the expected arrival time, these will wildly swing to massive positive or negative values.

Notice that the 2 updates (11:54 & 11:55) contained identical delays, despite a minute passing. It's no surprise their own homegrown realtime predictions are so strange (counting upwards, ghost buses, etc.) when this is what they're using.
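The missing-rows problem for stop sequence 15 can be illustrated with a small sketch. This is not the project's actual query: `RealtimeUpdate`, `updates_for_stop` and `nearest_update` are hypothetical names, the delay of 120 at sequence 20 is invented for illustration, and the nearest-stop fallback is only one possible workaround, with the caveat already noted that values near the bus can be unreliable.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class RealtimeUpdate:
    """One realtime row for a trip (hypothetical shape, not the real model)."""
    stop_sequence: int
    delay: int  # in whatever unit the feed reports


def updates_for_stop(updates: list[RealtimeUpdate], stop_sequence: int) -> list[RealtimeUpdate]:
    """Exact match on stop sequence; returns nothing when the feed skipped that stop."""
    return [u for u in updates if u.stop_sequence == stop_sequence]


def nearest_update(updates: list[RealtimeUpdate], stop_sequence: int) -> Optional[RealtimeUpdate]:
    """Fallback: the update whose stop sequence is closest to the requested one."""
    if not updates:
        return None
    # Ties are broken in favour of the lower stop sequence.
    return min(updates, key=lambda u: (abs(u.stop_sequence - stop_sequence), u.stop_sequence))


# Updates exist only for some stops; the value for sequence 20 is made up.
updates = [RealtimeUpdate(20, 120), RealtimeUpdate(26, -63), RealtimeUpdate(28, 0)]

exact = updates_for_stop(updates, 15)   # empty: no row for sequence 15
fallback = nearest_update(updates, 15)  # closest stop that has an update
```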

github-actions[bot] commented 9 months ago

Please make sure you have read and understood the contributing guidelines: CONTRIBUTING.md

ANIALLATOR114 commented 1 month ago

Progress

TimescaleDB

I have created a new Postgres database with the Timescale extension to support time series data sets. This involved a bit of database refactoring to support both databases at once.

Time series data

Time series data (delays first) will be recorded every minute using the realtime data available at that exact minute. This will build up historical data over time. I decided not to keep recording the raw realtime data over time and instead created a model dedicated to this job. I need as few rows as possible to avoid storage problems.
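As a rough sketch of the per-minute recording described above: the snippet below turns the realtime updates available at one moment into compact delay rows stamped with that minute. `DelayRecord`, `snapshot_delays` and all field names are assumptions made for illustration, not the actual model, and the sample values are invented.

```python
from dataclasses import dataclass
from datetime import datetime, timezone


@dataclass(frozen=True)
class DelayRecord:
    """A minimal time series row (hypothetical fields)."""
    recorded_at: datetime
    trip_id: str
    stop_id: str
    delay: int


def snapshot_delays(current_updates: list[dict], now: datetime) -> list[DelayRecord]:
    """Build one row per realtime update, all stamped with the same minute."""
    minute = now.replace(second=0, microsecond=0)
    return [
        DelayRecord(minute, u["trip_id"], u["stop_id"], u["delay"])
        for u in current_updates
    ]


# Invented example input; a scheduler would call this once per minute.
records = snapshot_delays(
    [{"trip_id": "trip-1", "stop_id": "stop-a", "delay": -63}],
    datetime(2024, 1, 1, 11, 54, 37, tzinfo=timezone.utc),
)
```

Storing only these few fields per row, rather than the full realtime payload, is what keeps the row count and size down.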

Viewing the timeseries data

I have installed a front-end library for this. It is intended for financial graphing, but it will work perfectly fine for displaying the delays too. I haven't yet worked on creating the graphs or writing the endpoints to retrieve the data.

ANIALLATOR114 commented 3 weeks ago

Progress

API Data Access

Delay data is accessible via the API.

There are 4 endpoints so far to view aggregated or specific data by stop / stoptime / route.

Data buildup

More data is being recorded constantly. During the day, 500-800 data points are recorded per minute. A route-level query potentially goes over millions of rows in the DB.

Here is a sample aggregation for the 15 route.

{
  "avg": 105,
  "max": 1231,
  "min": -1182,
  "standard_deviation": 286.28,
  "p50": 0,
  "p75": 240,
  "p90": 513,
  "samples": 213065
}

Over 200k records were used to calculate these metrics.
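For reference, the same set of metrics can be computed in memory with Python's standard library. This is only a sketch of what each field means: the real aggregation presumably runs inside the database, `aggregate_delays` is a hypothetical name, the input list is invented, and the choice of sample standard deviation and inclusive percentile interpolation is an assumption.

```python
import statistics


def aggregate_delays(delays: list[int]) -> dict:
    """Summarise delay samples with the same keys as the sample above."""
    if len(delays) < 2:
        raise ValueError("need at least two samples to aggregate")
    # 99 cut points; cuts[k - 1] is the k-th percentile.
    cuts = statistics.quantiles(delays, n=100, method="inclusive")
    return {
        "avg": round(statistics.fmean(delays)),
        "max": max(delays),
        "min": min(delays),
        "standard_deviation": round(statistics.stdev(delays), 2),
        "p50": cuts[49],
        "p75": cuts[74],
        "p90": cuts[89],
        "samples": len(delays),
    }


# Invented input, not real route data.
summary = aggregate_delays([0, 0, 60, 120, 300])
```

A p50 of 0 alongside a positive average, as in the route 15 sample, means at least half of the samples are on time or early while a tail of late samples pulls the mean up.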

Performance

Queries are saved in a cache so they don't need to be rerun often. Currently it's taking about 1-2s to go over the millions of rows to aggregate an entire route, which is still quite acceptable given the scale of the data. This is currently the heaviest possible query.
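The caching idea can be sketched as a small time-to-live cache around an expensive query. This is an illustration only, not the cache the project uses: `TTLCache`, the key format, the 60 second lifetime and `slow_route_aggregation` are all assumptions.

```python
import time
from typing import Any, Callable


class TTLCache:
    """Keep computed results for a fixed number of seconds."""

    def __init__(self, ttl_seconds: float, clock: Callable[[], float] = time.monotonic):
        self._ttl = ttl_seconds
        self._clock = clock
        self._store: dict[str, tuple[float, Any]] = {}

    def get_or_compute(self, key: str, compute: Callable[[], Any]) -> Any:
        """Return the cached value if still fresh, otherwise recompute and store it."""
        now = self._clock()
        hit = self._store.get(key)
        if hit is not None and hit[0] > now:
            return hit[1]
        value = compute()
        self._store[key] = (now + self._ttl, value)
        return value


def slow_route_aggregation() -> dict:
    """Stand-in for the 1-2s route-level query (hypothetical)."""
    return {"samples": 0}


route_cache = TTLCache(ttl_seconds=60)
first = route_cache.get_or_compute("route:15", slow_route_aggregation)
second = route_cache.get_or_compute("route:15", slow_route_aggregation)  # served from cache
```

The trade-off is staleness: with new rows arriving every minute, a cached aggregate lags behind by up to the chosen lifetime.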

The specific stop time queries, which will be run much more often, resolve in just 5-20ms. That is excellent and very promising, as it should allow me to load many of them on a single page.