Ken: Got it. I will get ready to code the performance measure calculations when the 5-minute data is ready.
I don't think I agree that we should be loading the 5-minute aggregated data from db96. It's a significant lift, and in my opinion it undermines what we are trying to do here, which is to re-architect the data analytics pipeline using the raw data.
A few questions that I think we should answer before doing any loading or modeling based on the 5-minute aggregates:
@ian-r-rose I understand your concerns, and if we were in a situation where we had the time to understand and calculate the imputation logic, speeds, and truck flow variables, I would agree. But I do not think we can solve the logic for these variables within the timeframe we have left with ODI (June 2024). We have a high-level understanding of the imputation logic, and Caltrans staff has been working on it for some time, but my understanding is that it will take more time than we have to figure out. We will still be able to calculate most of the data points in the 5-minute table from the raw data and configuration files, but we do not yet have a good handle on the imputed values, speed, truck flow, and method variables.
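To illustrate the part we can already do with the raw data, here is a minimal sketch of rolling 30-second samples up into 5-minute bins with pandas. The column names, sample values, and the sum/mean aggregation rules are my assumptions for illustration, not the actual raw-table schema:

```python
import pandas as pd

# Hypothetical 30-second raw detector samples; column names are placeholders,
# not the actual raw-table schema.
raw = pd.DataFrame({
    "sample_time": pd.date_range("2024-03-14 08:00", periods=10, freq="30s"),
    "station_id": 400001,
    "flow": [5, 6, 4, 7, 5, 6, 8, 5, 4, 6],      # vehicles per 30-second interval
    "occupancy": [0.08, 0.09, 0.07, 0.10, 0.08,
                  0.09, 0.11, 0.08, 0.07, 0.09],  # fraction of interval occupied
})

# 5-minute bins: counts (flow) are summed, ratios (occupancy) are averaged.
five_min = (
    raw.set_index("sample_time")
       .groupby("station_id")
       .resample("5min")
       .agg({"flow": "sum", "occupancy": "mean"})
       .reset_index()
)
print(five_min)
```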
I believe we can come up with a new way to use the raw data to evaluate the "occupancy is constant" detector health diagnostic, though it would be a change from the current methodology. I have some thoughts on how we can approach this and will add them to #83.
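For concreteness, one possible shape for a raw-data version of that check is sketched below. Both the window length and the "no variation at all" criterion are my placeholders to be tuned, not the current PeMS methodology:

```python
import pandas as pd

def occupancy_is_constant(occ: pd.Series, min_samples: int = 120) -> bool:
    """Flag a detector whose 30-second occupancy never varies.

    occ: occupancy samples for one detector over the evaluation window
    (120 samples = one hour of 30-second data). The window size and the
    zero-variation test are placeholder assumptions, not the established
    diagnostic.
    """
    occ = occ.dropna()
    if len(occ) < min_samples:
        return False              # too little data to make a call
    return occ.nunique() == 1     # every sample identical -> suspect detector
```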
Based on our sprint discussion from 3/14/2024, we will do the following to develop the 5-minute table and its associated aggregations:
We currently have blocks associated with the imputation logic that calculate speed and fill in data holes in the 30-second raw data. Caltrans staff has multiple ongoing efforts to tackle the current and future calculations associated with the imputation logic. In order to move forward with the performance measure calculations, detector diagnostics, and their associated aggregations, however, Caltrans staff would like to build a data relay to bring the STATION_5MIN_SUMMARY table from the data warehouse into Snowflake.
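As a rough sketch of what such a relay could look like (the actual pipeline being built may use entirely different tooling), this pulls one day of rows from the warehouse and lands them in Snowflake. The driver choice, connection details, and the `sample_time` column name are all placeholder assumptions:

```python
from datetime import datetime

import oracledb                      # assuming db96 is reachable via Oracle drivers
import pandas as pd
import snowflake.connector
from snowflake.connector.pandas_tools import write_pandas

# Placeholder credentials -- real values would come from a secrets store.
src = oracledb.connect(user="...", password="...", dsn="db96-host/service")
dst = snowflake.connector.connect(account="...", user="...", password="...",
                                  database="RAW", schema="PEMS")

# Relay one day of 5-minute rows at a time; "sample_time" is a guess at the
# warehouse column name.
cur = src.cursor()
cur.execute(
    """SELECT * FROM STATION_5MIN_SUMMARY
       WHERE sample_time >= :start_ts AND sample_time < :end_ts""",
    start_ts=datetime(2024, 3, 14),
    end_ts=datetime(2024, 3, 15),
)
day = pd.DataFrame(cur.fetchall(), columns=[d[0] for d in cur.description])
write_pandas(dst, day, table_name="STATION_5MIN_SUMMARY", auto_create_table=True)
```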
Once the data set is brought into Snowflake, we can calculate performance measures including Vehicle Hours Traveled (VHT), Vehicle Miles Traveled (VMT), Delay, Q (VMT/VHT) and Travel Time Index (TTI). These values can then be aggregated to the hourly, daily and spatial data sets as needed. These performance measures form the basis for many of the Performance reports and visualizations our users interact with on the PeMS website (e.g. https://pems.dot.ca.gov/?s_time_id=1707609600&e_time_id=1710115140&q=vmt&q2=truck_vht&html_x=34&report_form=1&dow_0=on&dow_1=on&dow_2=on&dow_3=on&dow_4=on&dow_5=on&dow_6=on&tod=all&tod_from=0&tod_to=0&holidays=on&gb=district&dnode=State&content=loops&tab=det_summary).
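In case it helps scope the work, a minimal sketch of these calculations for a single station-period is below. The 35 mph delay threshold and 60 mph free-flow speed are my assumptions and should be checked against the thresholds PeMS actually uses:

```python
def performance_measures(flow_veh: float, speed_mph: float, length_mi: float,
                         threshold_mph: float = 35.0,
                         free_flow_mph: float = 60.0) -> dict:
    """Performance measures for one station over one 5-minute period.

    flow_veh:  vehicles counted during the period
    speed_mph: 5-minute average speed from STATION_5MIN_SUMMARY
    length_mi: miles of roadway attributed to the station
    The threshold and free-flow speeds are placeholder constants.
    """
    vmt = flow_veh * length_mi                   # Vehicle Miles Traveled
    vht = vmt / speed_mph                        # Vehicle Hours Traveled
    q = vmt / vht                                # Q = VMT/VHT (space-mean speed)
    delay = max(vht - vmt / threshold_mph, 0.0)  # hours lost below threshold speed
    tti = free_flow_mph / q                      # Travel Time Index
    return {"VMT": vmt, "VHT": vht, "Q": q, "Delay": delay, "TTI": tti}
```

Note that for a single record Q is just the measured speed; it only becomes informative after aggregation, which is why the hourly, daily, and spatial rollups should sum VMT and VHT first and recompute Q and TTI from the sums rather than averaging the per-period values.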
@pingpingxiu-DOT-ca-gov has already started building the pipeline and will need to work with ODI to get the STATION_5MIN_SUMMARY table data into Snowflake. Once the data is in Snowflake, @ZhenyuZhu-Caltrans and ODI staff can begin working on the performance measure calculations. In the diagram below, the STATION_5MIN_SUMMARY table represents the "5-minute data with speeds, no holes" data set: