cagov / caldata-mdsa-caltrans-pems

CalData's MDSA project with Caltrans on Performance Measurement System (PeMS) data
https://cagov.github.io/caldata-mdsa-caltrans-pems/
MIT License

Aggregations by Station, Space and Time #205

Closed by kengodleskidot 5 days ago

kengodleskidot commented 1 month ago

Aggregation by Station: After we have a set of 5-minute, lane-by-lane data that consists of flow, occupancy and speed which doesn't contain any holes, our next step is to aggregate this data to one set of flow, occupancy and speed values at each detector location. This means that we aggregate across the lanes. We perform the following computations for each variable:

- Flow: sum the flow across the lanes.
- Occupancy: take the average across the lanes.
- Speed: compute the flow-weighted harmonic mean speed.

Since we started with a complete set of data (which may have been imputed), we can perform these operations.
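The across-lanes step can be sketched as follows. This is an illustrative Python sketch, not the project's actual model code; the function name and tuple layout are my own assumptions:

```python
def aggregate_across_lanes(lanes):
    """Combine per-lane 5-minute samples into one station-level sample.

    `lanes` is a list of (flow, occupancy, speed) tuples, one per lane.
    Assumes a complete (possibly imputed) set of observations, so no
    lane is missing and no speed is zero.
    """
    total_flow = sum(flow for flow, _, _ in lanes)
    avg_occupancy = sum(occ for _, occ, _ in lanes) / len(lanes)
    # Flow-weighted harmonic mean speed: total flow divided by the
    # sum of per-lane flow/speed ratios (i.e., per-lane travel "effort").
    harmonic_speed = total_flow / sum(flow / speed for flow, _, speed in lanes)
    return total_flow, avg_occupancy, harmonic_speed


# Two lanes: 100 veh at 60 mph and 50 veh at 30 mph.
flow, occ, speed = aggregate_across_lanes([(100, 0.10, 60.0), (50, 0.20, 30.0)])
```

The flow-weighted harmonic mean (45 mph here, well below the 50 mph arithmetic mean of 60 and 30) is the space-mean speed, which is the appropriate average for computing travel times.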

Aggregation over Time and Space: Once we have the 5-minute raw data of flow and occupancy, and the calculated values of speed and the performance measures, we then aggregate up in time and space. One of the major goals of the PeMS system is to report performance measures over geographical segments for long periods of time. For example, we would like to be able to report the total VMT per day for a particular District over a year. This involves a lot of data, so to answer such queries in a timely manner we pre-aggregate the values in the database.

Over time we aggregate the 5-minute values to hourly, daily, weekly and monthly values. For flow, VMT, VHT and delay, we simply sum to get the next level up. For occupancy and speed, we average, although above the hourly level occupancy and speed are no longer very meaningful. Q and TTI are computed from the aggregates at each level rather than aggregated directly.
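A minimal sketch of one hourly rollup, assuming per-station dicts of 5-minute values (the field names are illustrative, not the project's schema):

```python
def rollup_hour(samples):
    """Roll twelve 5-minute station samples up to one hourly record.

    Additive measures (flow, VMT, VHT, delay) are summed; occupancy and
    speed are averaged. Q and TTI are NOT rolled up here: they are
    recomputed from the aggregates at each level.
    """
    n = len(samples)
    out = {k: sum(s[k] for s in samples) for k in ("flow", "vmt", "vht", "delay")}
    out.update({k: sum(s[k] for s in samples) / n for k in ("occupancy", "speed")})
    return out


# Twelve identical 5-minute samples for one station.
five_min = [{"flow": 10, "vmt": 1.0, "vht": 0.02, "delay": 0.5,
             "occupancy": 0.1, "speed": 55.0} for _ in range(12)]
hourly = rollup_hour(five_min)
```

The same function composes upward: daily values sum/average the 24 hourly records, and so on for weekly and monthly.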

We also aggregate over larger geographical segments. While we have the ability to display the information over several different geographical segments (loop, county, district, state), we have found that we only need to aggregate to the county level. That provides enough aggregation that queries over the state are reasonably fast, while still letting us easily see the detail county by county. Over larger geographical segments, users are interested in quantities that can be summed over different physical locations and still retain their meaning. This includes VMT, VHT and delay. The measures Q and TTI are simply ratios of these values.
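Because Q and TTI are ratios of summed quantities, they can be recomputed at any aggregation level from the summed VMT and VHT. A sketch, where the 60 mph free-flow speed and the function name are illustrative assumptions rather than values from the project:

```python
def performance_ratios(vmt, vht, free_flow_mph=60.0):
    """Compute Q and TTI from aggregated VMT and VHT.

    Q is the space-mean speed (miles traveled per hour traveled).
    TTI compares actual travel time to free-flow travel time; since
    free-flow time is vmt / free_flow_mph, TTI reduces to
    free_flow_mph / Q. Both stay meaningful because vmt and vht were
    summed over locations and time before taking the ratio.
    """
    q = vmt / vht                 # space-mean speed, mph
    tti = free_flow_mph / q       # >= 1.0 when traffic is slower than free flow
    return q, tti


# County-level daily aggregates: 3000 vehicle-miles over 60 vehicle-hours.
q, tti = performance_ratios(3000.0, 60.0)
```

Summing VMT and VHT first and dividing last avoids the bias you would get from averaging per-station ratios directly.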

mmmiah commented 1 month ago

@kengodleskidot, I can start developing models for these aggregations if you have not already started! Thank you!

ian-r-rose commented 1 month ago

Thanks for writing this up @kengodleskidot. On the topic of (pseudo)geographic aggregations: is there also interest in aggregating by highway? Or by highway-district?

The other thing I would note is that we needn't necessarily wait on the imputation work to start doing some of these aggregations: the imputed data should have the same structure as the unimputed data, so we can easily swap in imputed data later as it comes online. (you're probably already thinking of doing this)

junlee-analytica commented 3 weeks ago

Mintu has completed this issue and is awaiting review.

junlee-analytica commented 1 week ago

Issue has been reviewed. Metadata for daily/hourly aggregation is costly to compute and takes a long time (25-35 minutes) to extract. Mintu is looking into optimizing this process.