cagov / caldata-mdsa-caltrans-pems

CalData's MDSA project with Caltrans on Performance Measurement System (PeMS) data
https://cagov.github.io/caldata-mdsa-caltrans-pems/
MIT License

Use consistent naming conventions for volume, speed, occupancy columns #277

Open ian-r-rose opened 1 month ago

ian-r-rose commented 1 month ago

We are currently not very consistent about column names for volume, occupancy, and especially speed. A few things I see in the current project:

  1. In some places we use the source system's flow, and in others volume. We've mostly standardized on volume, but we should validate that we are doing that consistently.
  2. In some places we are using occupancy, and in others occupancy_avg.
  3. In some places we are using speed, in others we are using speed_five_mins, and in others we are using speed_weighted
  4. In some places we are using volume and in others we are using volume_sum
  5. In some of our aggregates, the temporal aggregation is in the column name (e.g. weekly_volume), and in others it is not.

Proposal

I propose the following conventions:

  1. Always use "volume" over "flow"
  2. Don't include the aggregation type in the name. Volume is essentially always summed, occupancy is essentially always averaged, and speed is always weighted.
  3. Validate that we are in fact using weighted speed for all aggregations
  4. Don't use temporal aggregation in the name. This would make it easier to swap between models of different temporal aggregation levels, since the columns would have the same name.

Basically, the above amounts to always using the simplest names occupancy, volume, and speed, rather than trying to encode more information about the aggregations in the column names.
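
To make conventions 2 and 3 concrete, here is a minimal sketch of rolling 5-minute rows up to hourly rows while keeping the plain names. It is not the project's actual model code: the function name aggregate_hourly, the station_id and hour keys, and the choice of volume as the speed weight are all assumptions made for illustration.

```python
from collections import defaultdict


def aggregate_hourly(rows):
    """Roll 5-minute rows up to hourly rows, keeping the plain column names.

    Hypothetical helper: volume is summed, occupancy is averaged, and speed
    is weighted by volume (the weight is an assumption, not confirmed here).
    """
    groups = defaultdict(list)
    for row in rows:
        groups[(row["station_id"], row["hour"])].append(row)

    out = []
    for (station_id, hour), members in sorted(groups.items()):
        total_volume = sum(r["volume"] for r in members)
        mean_occupancy = sum(r["occupancy"] for r in members) / len(members)
        # A weighted speed is undefined when no vehicles were counted.
        if total_volume > 0:
            speed = sum(r["speed"] * r["volume"] for r in members) / total_volume
        else:
            speed = None
        out.append(
            {
                "station_id": station_id,
                "hour": hour,
                "volume": total_volume,
                "occupancy": mean_occupancy,
                "speed": speed,
            }
        )
    return out


five_minute = [
    {"station_id": 1, "hour": 8, "volume": 10, "occupancy": 0.10, "speed": 60.0},
    {"station_id": 1, "hour": 8, "volume": 30, "occupancy": 0.30, "speed": 40.0},
]
hourly = aggregate_hourly(five_minute)
```

In this toy input the volume-weighted speed is 45.0, whereas a plain average of the two speeds would be 50.0. That gap is the kind of difference item 3 would be checking for.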

Thoughts?

kengodleskidot commented 1 month ago

Below are my thoughts:

  1. Agree that we should be consistent, and using volume over flow makes sense to me.

  2. I believe we can drop the aggregation type from the name, but I see value in indicating whether a value is observed, imputed, or normalized, and which imputation method was used. There are a variety of use cases where users want to see the difference between observed (non-imputed), imputed, and normalized values. Below are some screenshots from the current PeMS showing associated reports: [4 images]

  3. This is a QA/QC step that we should validate. For imputed speed, I do not believe we are aggregating at the 5-minute detector level, but the higher-level aggregations (hourly, daily, etc.) and the station level should be confirmed.

  4. I see the convenience of using the same names for occupancy, volume, and speed across multiple models, but there is potential for misusing these values depending on the level of aggregation. This is primarily a concern for me on the reporting side: we need to ensure that the correct value is used for the level of aggregation a report displays (e.g. not using the 5-minute speed in an hourly aggregated report). Any best practices for minimizing misuse of values that share a name but differ in aggregation would be helpful.
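
One possible way to guard against the misuse described in item 4, offered only as a sketch and not as an established practice in this project: keep the simple column names, carry the aggregation level in a separate column, and have report code check it before use. The grain column and the select_for_report function are hypothetical names introduced for this example.

```python
def select_for_report(rows, report_grain):
    """Return rows for a report, refusing rows whose grain differs.

    Hypothetical guard: each row is assumed to carry a "grain" field such as
    "five_minute", "hourly", or "daily" alongside the plain speed column.
    """
    mismatched = sorted({r["grain"] for r in rows if r["grain"] != report_grain})
    if mismatched:
        raise ValueError(
            f"report expects grain {report_grain!r} but got rows at {mismatched}"
        )
    return rows


hourly_rows = [{"grain": "hourly", "speed": 45.0}]
five_minute_rows = [{"grain": "five_minute", "speed": 60.0}]

# Matching grain passes through unchanged.
ok = select_for_report(hourly_rows, "hourly")

# A 5-minute speed fed to an hourly report is rejected.
try:
    select_for_report(five_minute_rows, "hourly")
    rejected = False
except ValueError:
    rejected = True
```

The column names stay identical across models, so swapping aggregation levels remains easy, while the mismatch case fails loudly instead of silently producing a wrong report.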

jkarpen commented 1 month ago

@ian-r-rose @mmmiah Please add your thoughts when you have the chance.