cagov / caldata-mdsa-caltrans-pems

CalData's MDSA project with Caltrans on Performance Measurement System (PeMS) data
https://cagov.github.io/caldata-mdsa-caltrans-pems/
MIT License
2 stars 0 forks source link

When should we join the VDS raw data with the metadata? #214

Open ian-r-rose opened 1 month ago

ian-r-rose commented 1 month ago

We've gone back and forth on this a few times with @kengodleskidot in the context of some of our large aggregations on 30-second VDS data. Some of our downstream metrics depend on detector metadata, which lives in the STATION_META and related tables, and not on the raw tables. In order to compute these metrics, we need to join the two tables together on STATION_ID.

Normally, we'd do that join as late as possible, so that filtering and especially aggregation can occur, allowing for a cheaper join. For instance, if we join on the 5-minute data, that's 10x smaller of a join than the 30 second data, so we would want to do that if we don't need any of the metadata to do the 5-minute aggregation.

However, a quirk of our base tables is that they are delivered in wide format, with FLOW_1, FLOW_2, FLOW_3 ...FLOW_8, etc. But not every station has eight lanes of traffic! In fact, the modal one has two or three. So when we do some of our large aggregations, we are doing them on more lanes than exist, wasting a lot of effort.

Right now we are doing the aggregations first, and joining with the metadata second. But we should experiment with joining at other stages! We'd be paying higher costs for the joins, but lower costs on the aggregations.

Example DAGs showing the late join with the metadata:

image image
junlee-analytica commented 2 weeks ago

Current path forward is to join on the 5-minute model.