Overall package vision

Access and clean raw data

The fundamental purpose of this package is to access loop detector data from MnDOT's JSON feed. The data can be somewhat "dirty", so the package will include functions for finding nulls and interpolating values, flagging impossible values, and formatting column names and classes.
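A minimal sketch of the "flag, then interpolate" idea, written in Python only for illustration (the package itself would do this in R). The function names, the 40-vehicle ceiling, and the two-interval gap limit are all assumptions made for the example, not decisions recorded in this issue.

```python
from typing import List, Optional

# Hypothetical ceiling: assume no single lane passes more than 40 vehicles
# in one 30-second interval. The real threshold is a project decision.
MAX_VOLUME_PER_30S = 40


def flag_impossible(volumes: List[Optional[float]],
                    max_volume: float = MAX_VOLUME_PER_30S) -> List[Optional[float]]:
    """Replace negative or implausibly large counts with None."""
    return [v if v is not None and 0 <= v <= max_volume else None
            for v in volumes]


def interpolate_nulls(values: List[Optional[float]],
                      max_gap: int = 2) -> List[Optional[float]]:
    """Linearly interpolate runs of None that are at most max_gap long.

    Gaps at the start or end of the series, and gaps longer than max_gap,
    are left as None so that long outages are not papered over.
    """
    out = list(values)
    i, n = 0, len(out)
    while i < n:
        if out[i] is not None:
            i += 1
            continue
        start = i
        while i < n and out[i] is None:
            i += 1
        gap = i - start
        if start > 0 and i < n and gap <= max_gap:
            left, right = out[start - 1], out[i]
            for k in range(gap):
                out[start + k] = left + (right - left) * (k + 1) / (gap + 1)
    return out


raw = [4, None, 8, 999, 10, None, None, None, 6]
cleaned = interpolate_nulls(flag_impossible(raw))
```

Flagging runs first so that an impossible value is treated like any other missing reading; the three-interval outage near the end stays null because it exceeds the assumed gap limit.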

Data storage

There are pros and cons to putting the cleaned sensor data into a database.

By putting it in a database, we can flexibly specify the time period we're interested in (at the moment, 3 years' worth of data), the time interval that is relevant (we currently use daily data), and the geographic resolution that we need (nodes composed of multiple sensors).
Sometimes we will want 15-minute or hourly data; sometimes we will want data going back a decade (for the congestion report); and sometimes we will want data aggregated to the level of corridors, split out by individual linked lanes of a corridor, or kept at individual sensors, in any combination of those three kinds of resolution (historic scope, temporal resolution, spatial resolution).
On the other hand, once the data is in our database, outside folks can't really use it. Keeping the data in an internal-only database defeats the open-source idea behind the package, which is to let people see how we calculate the various measures and QA/QC the data.
We also need to consider the cost of physically storing the data.
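To make the flexibility argument concrete, here is a small sketch using Python's built-in `sqlite3` module. It stores fine-grained values once and picks the time window and interval at query time. The table name, columns, and sample rows are invented for the example; nothing here describes an existing database.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Hypothetical schema: one row per sensor per timestamp.
conn.execute("CREATE TABLE sensor_volume (sensor_id TEXT, ts TEXT, volume REAL)")
rows = [
    ("S1", "2021-03-01 07:00:00", 10),
    ("S1", "2021-03-01 07:30:00", 12),
    ("S1", "2021-03-01 08:00:00", 15),
    ("S1", "2021-03-02 07:00:00", 11),
]
conn.executemany("INSERT INTO sensor_volume VALUES (?, ?, ?)", rows)


def totals(conn, time_format, start, end):
    """Sum volume per sensor and time bucket for start <= ts < end.

    time_format is an SQLite strftime() pattern that defines the bucket,
    e.g. '%Y-%m-%d' for daily or '%Y-%m-%d %H:00' for hourly.
    """
    query = """
        SELECT sensor_id, strftime(?, ts) AS bucket, SUM(volume)
        FROM sensor_volume
        WHERE ts >= ? AND ts < ?
        GROUP BY sensor_id, bucket
        ORDER BY sensor_id, bucket
    """
    return conn.execute(query, (time_format, start, end)).fetchall()


daily = totals(conn, "%Y-%m-%d", "2021-03-01", "2021-03-03")
hourly = totals(conn, "%Y-%m-%d %H:00", "2021-03-01", "2021-03-02")
```

The same stored rows answer both the daily and the hourly question; only the query changes. Spatial resolution could be handled the same way by joining to a sensor-to-node lookup table.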

Aggregate

The raw data is provided in 30-second intervals. Common temporal aggregations include 10, 15, and 30 minutes, 1 hour, morning and evening peak periods, and 24 hours.
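As a rough illustration of fixed-width temporal aggregation, the Python sketch below sums 30-second volumes into bins counted from midnight. `aggregate_volume` is a hypothetical name for this example and is not the package's `aggregate_sensor_data()`; peak periods would need explicit start and end times and are not covered here.

```python
from collections import defaultdict
from datetime import datetime, timedelta
from typing import Dict, Iterable, Optional, Tuple


def aggregate_volume(readings: Iterable[Tuple[datetime, Optional[float]]],
                     interval_minutes: int = 15) -> Dict[datetime, float]:
    """Sum 30-second volumes into fixed-width bins that start at midnight.

    Null readings are skipped, so a bin with missing data is undercounted;
    a real implementation would also report how complete each bin is.
    """
    bins: Dict[datetime, float] = defaultdict(float)
    for ts, volume in readings:
        if volume is None:
            continue
        midnight = ts.replace(hour=0, minute=0, second=0, microsecond=0)
        minutes = (ts - midnight).seconds // 60
        bin_offset = (minutes // interval_minutes) * interval_minutes
        bins[midnight + timedelta(minutes=bin_offset)] += volume
    return dict(sorted(bins.items()))


# Thirty minutes of made-up data: 2 vehicles in every 30-second interval.
start = datetime(2021, 1, 1, 0, 0, 0)
readings = [(start + timedelta(seconds=30 * i), 2) for i in range(60)]
by_15_min = aggregate_volume(readings, interval_minutes=15)
by_hour = aggregate_volume(readings, interval_minutes=60)
```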

The raw data is accessed for an individual sensor. Sensors can be aggregated up to nodes/stations, corridors, lanes (?). We need functionality for aggregating nodes, stations, and corridors up to polylines.
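A sketch of the sensor-to-node step, again in Python for illustration only. It assumes a lookup from sensor ID to node ID is available and that volumes are simply summed across the sensors in a node; both the lookup and the IDs below are made up.

```python
from collections import defaultdict
from typing import Dict, Hashable, Optional


def aggregate_to_nodes(
    sensor_volumes: Dict[str, Dict[Hashable, Optional[float]]],
    sensor_to_node: Dict[str, str],
) -> Dict[str, Dict[Hashable, float]]:
    """Sum per-sensor volumes into per-node volumes for each interval.

    sensor_volumes maps sensor_id -> {interval: volume}.
    Sensors missing from sensor_to_node, and null volumes, are skipped.
    """
    node_totals = defaultdict(lambda: defaultdict(float))
    for sensor_id, series in sensor_volumes.items():
        node_id = sensor_to_node.get(sensor_id)
        if node_id is None:
            continue  # sensor is not assigned to any node
        for interval, volume in series.items():
            if volume is not None:
                node_totals[node_id][interval] += volume
    return {node: dict(series) for node, series in node_totals.items()}


sensor_volumes = {
    "S1": {"08:00": 100, "08:15": 120},
    "S2": {"08:00": 90, "08:15": None},
    "S3": {"08:00": 50},
}
sensor_to_node = {"S1": "N1", "S2": "N1", "S3": "N2"}
node_volumes = aggregate_to_nodes(sensor_volumes, sensor_to_node)
```

The same pattern would extend from nodes to corridors by swapping in a node-to-corridor lookup. Measures such as speed would need a weighted average rather than a sum.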

Calculate

Aggregated data can be used to calculate various measures.

Flow: The number of vehicles that pass through a detector per hour
Headway: The number of seconds between each vehicle
Density: The number of vehicles per mile
Speed: The average speed of the vehicles that pass in a sampling period
- UPDATE: Speed is calculated as part of aggregate_sensor_data()
Lost/Spare Capacity: The average flow that a roadway is losing, either due to low traffic or high congestion, throughout the sampling period.
- Flow > 1800: 0
- Density > 43: Lost Capacity: Flow - 1800
- Density <= 43: Spare Capacity: 1800 - Flow
Vehicle Miles Traveled (VMT)
Others(?)
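The definitions above can be sketched as small functions. This is an illustration in Python, not the package's implementation. Density is derived here from the standard relation flow = density x speed, which may differ from how the package derives it. The capacity function reads the third rule as applying when density is at or below 43, which is an assumption. All function names are hypothetical.

```python
from typing import Optional

CAPACITY_VPH = 1800    # flow threshold from the rules above (vehicles/hour)
DENSITY_CUTOFF = 43    # density threshold from the rules above (vehicles/mile)


def flow_per_hour(vehicle_count: float, period_seconds: float) -> float:
    """Scale a vehicle count over a sampling period to vehicles per hour."""
    return vehicle_count * 3600 / period_seconds


def headway_seconds(flow: float) -> Optional[float]:
    """Average seconds between vehicles; undefined when no vehicles pass."""
    return 3600 / flow if flow > 0 else None


def density_per_mile(flow: float, speed_mph: float) -> Optional[float]:
    """Vehicles per mile, using flow = density * speed."""
    return flow / speed_mph if speed_mph > 0 else None


def capacity_delta(flow: float, density: float) -> float:
    """Apply the three lost/spare capacity rules in the order listed.

    Assumption: the third rule covers density <= DENSITY_CUTOFF.
    Negative results mean lost capacity; positive results mean spare capacity.
    """
    if flow > CAPACITY_VPH:
        return 0.0
    if density > DENSITY_CUTOFF:
        return flow - CAPACITY_VPH
    return CAPACITY_VPH - flow


flow = flow_per_hour(300, 900)          # 300 vehicles in 15 minutes
headway = headway_seconds(flow)
density = density_per_mile(flow, 60.0)  # assumed average speed of 60 mph
delta = capacity_delta(flow, density)
```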

General practices

The loop detector data is very large, particularly when working with multiple detectors and days. Generally, rely on {data.table} rather than {dplyr}, {tidyr}, and other packages.
- If you are making the transition from {dplyr} to {data.table}, use {dtplyr} to "translate" between the two. However, don't forget to remove all {dtplyr} functions before pushing.

I just got off the phone with Tim Johnson (MNIT), a software developer working with the MnDOT loop detector data. It was a really nice call - he's super kind and easy to talk to, and totally on board with what we want to do with the data. I called because we had been discussing the server issues, and he had said that we should chat if I had thoughts about things they could do on the server side (aggregations and transformations of the data) to make our work easier.

He said several things that were really promising. One was that the work we are doing with the traffic data (downloading, aggregating, transforming, and loading it into our own database) is also being duplicated by other groups (academia, gov't), and that they see it as work better performed closer to the server side to keep things standardized. I mentioned our issues with identifying which data/sensors are trustworthy, and with tracking changes to the "field_length" (vehicle length) attribute over time. He completely agreed and said that they were having similar discussions in his own group.

Another thing is that his team is all about open-source software development. Currently the traffic data server is written in a language called Rust. I'm not familiar with it at all, but he said not to worry: we could submit issues or ideas for things we might want built on the server side, and they could build them.

He also talked about internal discussions on whether they should start to store the data in a formal database, especially if we were going to create derived fields. I said that ideally they would have a database I could query for just the data I needed, at the spatial and temporal aggregation that made sense for me. He agreed that this work needed to be done and said that he'd like to involve us more in brainstorming what exactly that would look like.

Metropolitan-Council / tc.sensors

Overall package vision #9
