Establish better naming convention for aggregate models

ian-r-rose commented 2 months ago

As we've been adding new models to our organization, our model names have become a bit of a mess, without much consistency to them. This is not really a bad thing: we've been learning more about the needs of the data models, and useful naming conventions really only come out of having a strong idea of those needs.

I've noticed a few pain points/inconsistencies with our aggregate models in particular:

- It's usually not possible to tell from the name whether the aggregate is at the station or detector (or controller!) level. Sometimes the name says station, but it's actually at the detector level. Sometimes it says neither.
- We don't have much consistency about how to name different levels of temporal aggregation.
- We don't have a convention for how to name different levels of spatial aggregation.

I'm opening this issue to discuss possible naming conventions going forward. Here is one proposal (but I'm interested in hearing other ideas!)

Proposal

Overall name structure is:

{model stage}_{category}__{device type}_agg_{temporal aggregation}_{spatial_aggregation}

This is pretty verbose, but would capture most of the relevant information about the intent of an aggregate model from the name. Some description of the components:

model stage: This is the stage of a model, such as stg or int. As we move towards "mart" models, it could be something like dim or fct, or nothing.

category: This is either the data source system (e.g., clearinghouse or db96) or the broad category of intention of the data model (like diagnostics or imputation). This would usually correspond to the name of the directory in which the model file sits.

device type: For VDS data this would usually be "controller", "station", or "detector", and indicates the "grain" of the model. For instance, most of our VDS tables are at the "detector" level because they include a lane column. But when the lanes are aggregated up to the station level, it would be "station". Note that this could also be considered a spatial aggregation, but in most cases I felt it was important enough information to have a dedicated part in the name.

temporal aggregation: One of five_minute, hourly, daily, weekly, monthly, or yearly.

spatial aggregation: Things like district or freeway or county or city.

If one part of the naming convention didn't make sense for a given model, we could drop it from the name. In particular, I'm thinking that with "mart" models, which are intended to be shared with downstream data consumers who might not know our conventions and project structure, we might drop model stage and category.

A couple of examples:

- int_clearinghouse__five_minute_station_agg --> int_clearinghouse__detector_agg_five_minute
- int_imputation__five_minute_station_agg --> int_imputation__detector_agg_five_minute
- int_clearinghouse__station_temporal_hourly_agg --> int_clearinghouse__station_agg_hourly
- When we create a freeway-district five-minute agg model: int_clearinghouse__detector_agg_five_minute_freeway_district
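To make the composition rule concrete, here is a minimal Python sketch of how a name could be assembled from the components above. The helper `build_agg_model_name` and its parameter names are hypothetical and exist only for illustration; nothing like it is part of the project. Optional parts are simply dropped when they are not supplied.

```python
from typing import Optional

# Temporal levels listed in the proposal.
TEMPORAL_LEVELS = {"five_minute", "hourly", "daily", "weekly", "monthly", "yearly"}


def build_agg_model_name(
    device_type: str,
    temporal: str,
    stage: Optional[str] = None,
    category: Optional[str] = None,
    spatial: Optional[str] = None,
) -> str:
    """Compose an aggregate model name (hypothetical helper, illustration only)."""
    if temporal not in TEMPORAL_LEVELS:
        raise ValueError(f"unknown temporal aggregation: {temporal!r}")
    # Stage and category form the prefix, separated from the rest by "__".
    prefix = "_".join(part for part in (stage, category) if part)
    # The body always contains the device type, "agg", and the temporal level.
    body = "_".join(part for part in (device_type, "agg", temporal, spatial) if part)
    return f"{prefix}__{body}" if prefix else body


print(build_agg_model_name("detector", "five_minute", stage="int", category="clearinghouse"))
print(build_agg_model_name("station", "hourly", stage="int", category="clearinghouse"))
```

Calling it with only a device type and temporal level (as might be done for a "mart" model) yields a name with no stage or category prefix.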

Thoughts @kengodleskidot and @mmmiah?

mmmiah commented 2 months ago

I am good with your proposal. Some additional clarification could be added. Literally all temporal aggregation is also spatial aggregation as we have city, county, lane and vice versa. It would be better to indicate whether the agg is at the station level only or at the station-lane level. Maybe something like this: int_clearinghouse__station_lane_agg_hourly (by lane by station) and int_clearinghouse__station_agg_hourly (by station only).

We can avoid adding too many spatial features, since that would produce a long name like int_clearinghouse__station_lane_agg_hourly_city_county_district_freeway_type, which may not be needed at all. Otherwise I see the value of the consistent, meaningful names that you proposed. My aggregation models also need to be renamed. Let me rename those based on this proposal. Thanks for coming up with these!

ian-r-rose commented 2 months ago

> I am good with your proposal. Some additional clarification can be added. Literally all temporal aggregation is also spatial aggregation as we have city, county, lane and vice versa.

I'm not sure what you mean here: if we are only aggregating by a timestamp, then we aren't doing any spatial aggregation. I don't think we have any spatial aggregations yet in this project (until we do a group by city, county, what have you). I'm treating lane aggregations a bit differently from other spatial aggregations in this proposal.

> It would be better to include whether the agg is only station level or station-lane level. Maybe something like this: int_clearinghouse__station_lane_agg_hourly (by lane by station) int_clearinghouse__station_agg_hourly (by station only)

In the above proposal, a model with the grain of station_lane is synonymous with "detector".

mmmiah commented 2 months ago

I believe that we grouped by id, timestamp (year, month, week, day) and also district, county, city, type in the spatial and temporal aggregation! Yeah, the term 'detector' makes sense in place of 'station_lane'. I will change my spatial/temporal agg names soon and have them reviewed.

ian-r-rose commented 2 months ago

In order to do a spatial aggregation, we would need to group by district, county, city, etc without grouping by ID. Otherwise each unique combination of those keys would still be set by the id and timestamp.

Put another way: a spatial group by should have multiple stations per group in order to actually be aggregating anything (similar to how a temporal group by should have multiple timestamps per time bucket)
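A small Python sketch of that distinction, using made-up rows (the station ids, districts, and volumes below are invented purely for illustration): when the station id stays in the grouping key, every group still contains exactly one station, so nothing is aggregated spatially.

```python
from collections import defaultdict

# Made-up rows for illustration: (station id, district, timestamp, volume).
rows = [
    ("S1", "D4", "08:00", 10),
    ("S2", "D4", "08:00", 20),
    ("S3", "D7", "08:00", 5),
]


def group_sizes(rows, key):
    """Count how many rows fall into each group for the given key function."""
    sizes = defaultdict(int)
    for row in rows:
        sizes[key(row)] += 1
    return dict(sizes)


# Grouping by id, district, and timestamp: each group still holds one station.
by_id = group_sizes(rows, key=lambda r: (r[0], r[1], r[2]))
# Grouping by district and timestamp only: stations in a district are combined.
by_district = group_sizes(rows, key=lambda r: (r[1], r[2]))

print(max(by_id.values()))            # 1
print(by_district[("D4", "08:00")])   # 2
```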

ian-r-rose commented 2 months ago

A thought about the above proposal: it can miss some important context about the purpose of a model: it only has the device level and the aggregation level. What if we also included an optional "purpose" component that could be, e.g., "metrics"? This starts to get quite verbose, but seems like it could be useful to me...

{model stage}_{category}__{device type}_{purpose}_agg_{temporal aggregation}_{spatial_aggregation}

So int_performance__five_min_perform_metrics would then become int_performance__detector_metrics_agg_five_minutes.
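For illustration, the extended structure could be checked with a regular expression along these lines. This is only a sketch: the function `parse_agg_model_name`, the exact character classes, and the list of accepted stages are assumptions, and the pattern accepts both `five_minute` and `five_minutes` because both spellings appear in this thread.

```python
import re
from typing import Optional

# Hypothetical pattern for the proposed structure; stage/category and
# purpose/spatial parts are optional.
AGG_NAME = re.compile(
    r"^(?:(?P<stage>stg|int|dim|fct)_(?P<category>[a-z0-9]+)__)?"
    r"(?P<device>controller|station|detector)"
    r"(?:_(?P<purpose>[a-z]+))?"
    r"_agg_"
    r"(?P<temporal>five_minutes?|hourly|daily|weekly|monthly|yearly)"
    r"(?:_(?P<spatial>[a-z_]+))?$"
)


def parse_agg_model_name(name: str) -> Optional[dict]:
    """Return the named components, or None if the name does not fit the pattern."""
    match = AGG_NAME.match(name)
    return match.groupdict() if match else None


print(parse_agg_model_name("int_performance__detector_metrics_agg_five_minutes"))
print(parse_agg_model_name("int_performance__five_min_perform_metrics"))  # None
```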

Thoughts @mmmiah?

mmmiah commented 2 months ago

> In order to do a spatial aggregation, we would need to group by district, county, city, etc without grouping by ID. Otherwise each unique combination of those keys would still be set by the id and timestamp.

Based on your definition, that seems good to me, and I agree that we have not yet done spatial aggregation in that sense.

mmmiah commented 2 months ago

> A thought about the above proposal: it can miss some important context about the purpose of a model: it only has the device level and the aggregation level. What if we also included an optional "purpose" component that could be, e.g., "metrics"? This starts to get quite verbose, but seems like it could be useful to me...
>
> {model stage}_{category}__{device type}_{purpose}_agg_{temporal aggregation}_{spatial_aggregation}
>
> So int_performance__five_min_perform_metrics would then become int_performance__detector_metrics_agg_five_minutes.

Makes more sense!

kengodleskidot commented 2 months ago

Is there a need to include _agg_, since it appears to be common to most (if not all) of our models? For the mart models, dropping the model stage makes sense, but the category could prove useful for report/visualization designers, making it easier for them to identify which data sets are associated with a report/visualization type (diagnostic reports vs. performance metrics reports vs. bottleneck reports, etc.). I like making {purpose} optional, since it may not always be required, and that helps keep the model name as short as possible.

On a related note, we probably need to have someone go through the .yml files for QA/QC as well as understanding. I'm thinking this may be a good task for the new staff to help them understand what the models are and what they contain. Any thoughts @mmmiah @ian-r-rose @britt-allen?

ian-r-rose commented 2 months ago

I don't think I agree that most of our models are aggregates (at least, not in the sense that the VDS models are). Our metadata models aren't aggregates. The coefficients models use aggregation in computing coefficients, but I don't think they are aggregates in the same way as our VDS ones.

> On a related note, we probably need to have someone go through the .yml files for QA/QC as well as understanding. I'm thinking this may be a good task for the new staff to help them understand what the models are and what they contain.

I think this is a good idea!

ian-r-rose commented 1 month ago

Fixed by #243

cagov / caldata-mdsa-caltrans-pems

Establish better naming convention for aggregate models #241
