Open ANIALLATOR114 opened 9 months ago
Please make sure you have read and understood the contributing guidelines: CONTRIBUTING.md
I have created a new postgres database with the timescale plugin to support timeseries data sets. This involved a bit of database refactoring to support both databases at once.
Time series data ( delays first ) will be recorded every minute using the relatime data available at that exact minute. This will build up historical data over time. I decided to not further record the realtime data over time and instead created a model dedicated to this job. I need as few rows as possible just to storage problems.
I have installed a FE library for this which is intended for financial graphing but it will work perfectly fine for displaying the delays too. I havn't yet worked on actually creating the graphs or writing the endpoints to retreive it.
Delay data is acessible via the API.
4 endpoints so far to view aggregated or specific data by stop / stoptime / route.
More data is being recorded constantly. During the day it exceeds 500-800 data points recorded per minute. When a route level query is run it's going over potentially millions of rows in the DB.
Here is a sample aggregation for the 15 route.
{
"avg": 105,
"max": 1231,
"min": -1182,
"standard_deviation": 286.28,
"p50": 0,
"p75": 240,
"p90": 513,
"samples": 213065
}
200k records used to calculate the metrics.
Queries are saved in a cache so they don't need to be rerun often. Currently its taking about 1-2s to go over the millions of rows to aggregate an entire route which is still quite acceptable given the scale of the data. This is the heaviest possible query currently.
The specific stop time queries which will be run much for often resolve in just 5-20ms which is excellent and very promising to allow me to load many of them on a single page.
Problem to solve
Be able to view or calculate historical arrival delays at a stop for a particular trip.
Challenges
Working so far
I've already setup the database to store updates over time. Right now its just 30 minutes but I could expand this to days or months with a 1 liner. I've created the query already to return this data so I could visualise the data I'm working with.
Data
I'll explain below:
There are a couple of things at play here.
So if my query trys to only return realtime updates for stop sequence 15 I will not get any rows back.
If I return values near the expected arrival time, these will wildly swing to massive positive or negative values.
Notice the 2 updates (11:54 & 11:55) contained identical delays, despite a minute passing. Its no surprise their own home grown realtime predictions are so strange, counting upwards, ghost busses etc when this is what they're using.