Ken: Got it. I will get ready to code the performance measure calculations when the 5-minute data is ready.
I don't think I agree that we should be loading the 5-minute aggregated data from db96. It's a significant lift, and in my opinion it undermines what we are trying to do here, which is to re-architect the data analytics pipeline using the raw data.
A few questions that I think we should answer before doing any loading or modeling based on the 5-minute aggregates:
@ian-r-rose I understand your concerns, and if we were in a situation where we had the time to understand and calculate the imputation logic, speeds, and truck flow variables, I would agree. But I do not think we can solve the logic for these variables within the timeframe we have left with ODI (June 2024). We have a high-level understanding of the imputation logic, and Caltrans staff has been working on it for some time, but my understanding is that it will take more time than we have to figure out. We will still be able to calculate most of the data points in the 5-minute table from the raw data and configuration files, but we do not yet have a good handle on the imputed values, speed, truck flow, and method variables.
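To illustrate the part we can already do with the raw data, here is a minimal sketch of rolling 30-second samples up into 5-minute bins with pandas. The column names, sample values, and the sum/mean aggregation rules are my assumptions for illustration, not the actual raw-table schema:

```python
import pandas as pd

# Hypothetical 30-second raw detector samples; column names are placeholders,
# not the actual raw-table schema.
raw = pd.DataFrame({
    "sample_time": pd.date_range("2024-03-14 08:00", periods=10, freq="30s"),
    "station_id": 400001,
    "flow": [5, 6, 4, 7, 5, 6, 8, 5, 4, 6],      # vehicles per 30-second interval
    "occupancy": [0.08, 0.09, 0.07, 0.10, 0.08,
                  0.09, 0.11, 0.08, 0.07, 0.09],  # fraction of interval occupied
})

# 5-minute bins: counts (flow) are summed, ratios (occupancy) are averaged.
five_min = (
    raw.set_index("sample_time")
       .groupby("station_id")
       .resample("5min")
       .agg({"flow": "sum", "occupancy": "mean"})
       .reset_index()
)
print(five_min)
```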
I believe we can come up with a new way to use the raw data to evaluate the "occupancy is constant" detector health diagnostic, though it would be a change from the current methodology. I have some thoughts on how we can approach this and will add them to #83.
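For concreteness, one possible shape for a raw-data version of that check is sketched below. Both the window length and the "no variation at all" criterion are my placeholders to be tuned, not the current PeMS methodology:

```python
import pandas as pd

def occupancy_is_constant(occ: pd.Series, min_samples: int = 120) -> bool:
    """Flag a detector whose 30-second occupancy never varies.

    occ: occupancy samples for one detector over the evaluation window
    (120 samples = one hour of 30-second data). The window size and the
    zero-variation test are placeholder assumptions, not the established
    diagnostic.
    """
    occ = occ.dropna()
    if len(occ) < min_samples:
        return False              # too little data to make a call
    return occ.nunique() == 1     # every sample identical -> suspect detector
```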
Based on our sprint discussion from 3/14/2024, we will do the following to develop the 5-minute table and its associated aggregations:
We currently have blocks associated with the imputation logic that calculate speed and fill in data holes in the 30-second raw data. Caltrans staff has multiple ongoing efforts to tackle the current and future calculations associated with the imputation logic. In order to move forward with the performance measure calculations, detector diagnostics, and their associated aggregations, however, Caltrans staff would like to build a data relay to bring the STATION_5MIN_SUMMARY table from the data warehouse into Snowflake.
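As a rough sketch of what such a relay could look like (the actual pipeline being built may use entirely different tooling), this pulls one day of rows from the warehouse and lands them in Snowflake. The driver choice, connection details, and the `sample_time` column name are all placeholder assumptions:

```python
from datetime import datetime

import oracledb                      # assuming db96 is reachable via Oracle drivers
import pandas as pd
import snowflake.connector
from snowflake.connector.pandas_tools import write_pandas

# Placeholder credentials -- real values would come from a secrets store.
src = oracledb.connect(user="...", password="...", dsn="db96-host/service")
dst = snowflake.connector.connect(account="...", user="...", password="...",
                                  database="RAW", schema="PEMS")

# Relay one day of 5-minute rows at a time; "sample_time" is a guess at the
# warehouse column name.
cur = src.cursor()
cur.execute(
    """SELECT * FROM STATION_5MIN_SUMMARY
       WHERE sample_time >= :start_ts AND sample_time < :end_ts""",
    start_ts=datetime(2024, 3, 14),
    end_ts=datetime(2024, 3, 15),
)
day = pd.DataFrame(cur.fetchall(), columns=[d[0] for d in cur.description])
write_pandas(dst, day, table_name="STATION_5MIN_SUMMARY", auto_create_table=True)
```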
Once the data set is brought into Snowflake, we can calculate performance measures including Vehicle Hours Traveled (VHT), Vehicle Miles Traveled (VMT), Delay, Q (VMT/VHT) and Travel Time Index (TTI). These values can then be aggregated to the hourly, daily and spatial data sets as needed. These performance measures form the basis for many of the Performance reports and visualizations our users interact with on the PeMS website (e.g. https://pems.dot.ca.gov/?s_time_id=1707609600&e_time_id=1710115140&q=vmt&q2=truck_vht&html_x=34&report_form=1&dow_0=on&dow_1=on&dow_2=on&dow_3=on&dow_4=on&dow_5=on&dow_6=on&tod=all&tod_from=0&tod_to=0&holidays=on&gb=district&dnode=State&content=loops&tab=det_summary).
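In case it helps scope the work, a minimal sketch of these calculations for a single station-period is below. The 35 mph delay threshold and 60 mph free-flow speed are my assumptions and should be checked against the thresholds PeMS actually uses:

```python
def performance_measures(flow_veh: float, speed_mph: float, length_mi: float,
                         threshold_mph: float = 35.0,
                         free_flow_mph: float = 60.0) -> dict:
    """Performance measures for one station over one 5-minute period.

    flow_veh:  vehicles counted during the period
    speed_mph: 5-minute average speed from STATION_5MIN_SUMMARY
    length_mi: miles of roadway attributed to the station
    The threshold and free-flow speeds are placeholder constants.
    """
    vmt = flow_veh * length_mi                   # Vehicle Miles Traveled
    vht = vmt / speed_mph                        # Vehicle Hours Traveled
    q = vmt / vht                                # Q = VMT/VHT (space-mean speed)
    delay = max(vht - vmt / threshold_mph, 0.0)  # hours lost below threshold speed
    tti = free_flow_mph / q                      # Travel Time Index
    return {"VMT": vmt, "VHT": vht, "Q": q, "Delay": delay, "TTI": tti}
```

Note that for a single record Q is just the measured speed; it only becomes informative after aggregation, which is why the hourly, daily, and spatial rollups should sum VMT and VHT first and recompute Q and TTI from the sums rather than averaging the per-period values.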
@pingpingxiu-DOT-ca-gov has already started building the pipeline and will need to work with ODI to get the STATION_5MIN_SUMMARY table data into Snowflake. Once the data is in Snowflake, @ZhenyuZhu-Caltrans and ODI staff can begin working on the performance measure calculations. In the diagram below, the STATION_5MIN_SUMMARY table represents the "5-minute data with speeds, no holes" data set: