Standardize unique_key for incremental models

JamesSLogan commented 1 month ago

Many incremental models were developed before we had access to detector_id, so they were configured to use both station_id and lane to merge in new records. We now have some models using detector_id and some models using the prior method. This can lead to issues due to (potentially invalid) metadata updates. Wherever possible, these unique_key values should be updated to use detector_id, ideally. Part of this task is determining if this strategy will work correctly.

List of files under transform/models currently specifying any unique_key:

intermediate/clearinghouse/int_clearinghouse__detector_agg_five_minutes_with_missing_rows.sql
intermediate/clearinghouse/int_clearinghouse__detector_g_factor_based_speed.sql
intermediate/clearinghouse/int_clearinghouse__detector_agg_five_minutes.sql
intermediate/imputation/int_imputation__detector_agg_five_minutes.sql
intermediate/imputation/int_imputation__global_coefficients.sql
intermediate/imputation/int_imputation__detector_imputed_agg_five_minutes.sql
intermediate/imputation/int_imputation__local_regional_regression_coefficients.sql
intermediate/diagnostics/int_diagnostics__constant_occupancy.sql
intermediate/diagnostics/int_diagnostics__no_data_status.sql
intermediate/diagnostics/int_diagnostics__detector_status.sql
intermediate/diagnostics/int_diagnostics__samples_per_detector.sql
intermediate/db96/int_db96__detector_agg_five_minutes.sql
intermediate/performance/int_performance__station_metrics_agg_hourly.sql
intermediate/performance/int_performance__bottlenecks.sql
intermediate/performance/int_performance__detector_metrics_agg_five_minutes.sql
intermediate/performance/int_performance__station_metrics_agg_five_minutes.sql
intermediate/performance/int_performance__detector_metrics_agg_hourly.sql
marts/imputation/imputation__detector_summary.sql
staging/db96/stg_db96__vds30sec.sql

summer-mothwood commented 1 week ago

I agree James, there are several incremental models that have unique keys set that are not at the most unique / lowest level of grain, which leaves us open to potential CI issues with duplicates when detectors move to different lanes, etc. However, defining the unique keys properly will still not completely eliminate the risk of failures if data comes in weird. For example, if the unique key is set to detector_id + sample_date + sample_timestamp, but the detector moves lanes in the middle of a day, it might show up with two different rows values for the same date with different time stamps in one of the diagnostics tables for the same time period, so in addition to the unique key changes listed below, we should also build uniqueness tests so that we are being alerted properly when duplicates do slip through. There's already a ticket for me for that #340 https://github.com/cagov/data-infrastructure/issues/340 When reviewing all of our incremental models, I found they fell into the following categories:

Category 1 - model's unique keys are already set at the detector_id level [no action needed]
Category 2 - model's unique keys are at the station_id level, which is the correct strategy for that model [no action needed]
Category 3 - model's unique keys are at the station_id + lane level, which should be changed to detector_id, and detector_id is already an output of the model [unique key config should be changed]
Category 4 - model's unique keys are the station_id + lane level, which should be changed to detector_id, but detector_id is not yet present in the output of the model [code for the model should be changed and unique key config should be changed]

Category 5 - model is not currently set to be incremental, but we might want to change that, in which case we would need to define the appropriate unique keys I have listed all incremental models that fall into each of these categories below. The next step in this ticket will be to make the changes to models where appropriate.

Category 1: UNIQUE KEY = DETECTOR_ID + TIME -- no action needed

intermediate/clearinghouse/int_clearinghouse__detector_agg_five_minutes_with_missing_rows.sql
intermediate/clearinghouse/int_clearinghouse__detector_g_factor_based_speed.sql
intermediate/clearinghouse/int_clearinghouse__detector_agg_five_minutes.sql
intermediate/diagnostics/int_diagnostics__detector_status.sql
intermediate/performance/int_performance__detector_metrics_agg_five_minutes.sql
intermediate/diagnostics/int_diagnostics__no_data_status.sql
intermediate/performance/int_performance__detector_metrics_agg_hourly.sql

Category 2: UNIQUE KEY = STATION_ID + TIME but model is at the station level -- no action needed

(The bottleneck models all seem to be based on station ID, not detector ID, so I would keep these as is):

intermediate/performance/int_performance__station_metrics_agg_five_minutes.sql
intermediate/performance/int_performance__station_metrics_agg_hourly.sql
intermediate/performance/int_performance__bottleneck_delay_metrics_agg_daily
intermediate/performance/int_performance__bottleneck_delay_metrics_agg_five_minutes.sql
intermediate/performance/int_performance__bottleneck_delay_metrics_agg_hourly.sql

note: this model has a unique key at the station_id, lane, sample_date level, and both upstream and downstream models are merged at the detector id level, so it's possible we want to expand this model to be at that level as well, but because of what it's doing, it didn't seem to me that this model needed a change -- curious what you think @JamesSLogan :

intermediate/diagnostics/int_diagnostics__constant_occupancy.sql

Category 3: UNIQUE KEY = STATION ID + TIME and model has upstream/downstream models with an incremental merge strategy that relies on detector ID -- UNIQUE KEY CONFIG NEEDS CHANGE

note: I think the whole imputation journey should be updated to merge uniquely on the detector_id level, just as the non-imputed journey is.

intermediate/diagnostics/int_diagnostics__samples_per_detector.sql
marts/imputation/imputation__detector_summary.sql 
intermediate/imputation/int_imputation__detector_agg_five_minutes.sql
intermediate/imputation/int_imputation__detector_imputed_agg_five_minutes.sql
intermediate/imputation/int_imputation__local_regional_regression_coefficients.sql

Category 4: UNIQUE KEY = STATION ID + LANE + ETC but code / output columns do not currently contain detector IDs BUT the data grain is at the detector level (using station id + lane) -- UNIQUE KEY SHOULD BE UPDATED but CODE NEEDS UPDATING FIRST

intermediate/db96/int_db96__detector_agg_five_minutes.sql
staging/db96/stg_db96__vds30sec.sql
intermediate/imputation/int_imputation__global_coefficients.sql

Category 5: MODELS ARE NOT INCREMENTAL but maybe they should be -- evaluate if unique key would be worth configuring

These are models that are not set to be incremental, probably because these are mostly aggregate models, so we expect them to be small enough where the unique key constraint isn't necessary. But if we are looking for performance gains in CI, it may be worth configuring these as incremental.

models/intermediate/performance/int_performance__detector_metrics_agg_monthly.sql
models/intermediate/performance/int_performance__detector_metrics_agg_weekly.sql
models/intermediate/performance/int_performance__station_aadt_with_K_value.sql
models/intermediate/performance/int_performance__station_metrics_agg_weekly.sql
models/intermediate/performance/int_performance__detector_metrics_agg_daily.sql
models/intermediate/performance/int_performance__station_metrics_agg_daily.sql
models/intermediate/performance/int_performance__station_metrics_agg_monthly.sql
models/intermediate/performance/int_performance__bottleneck_delay_metrics_agg_monthly.sql
models/marts/imputation/imputation__detector_imputed_agg_five_minutes.sql

I am open to any feedback on this analysis! In the meantime, I'll open some PRs to deal with each category where action is called for. @ian-r-rose @jkarpen

JamesSLogan commented 1 week ago

Thanks for the excellent, thorough analysis @summer-mothwood! Regarding int_diagnostics__constant_occupancy, I think it should be changed to use detector_id throughout instead of station_id+lane. This will bring the joins in its downstream model into alignment, as you alluded to. Maybe that puts it into category 4?

ian-r-rose commented 1 week ago

That all sounds correct to me @summer-mothwood.

One note on your category 4: there is a bit of history here, where many of the basal tables start out at the station+lane level, and don't have the metadata included. We've gone back and forth a bit on when we should try to join them so that these extra data are attached (see #214 and linked). One of the reasons it's a bit tricky is that with the large tables the join is quite expensive, so it wasn't always obvious at what stage we absolutely needed to do it.

All of this is to say, there may be good performance-related reasons to not join detector ID in for some of those large category 4 tables. Or maybe it's not a huge deal, but it's something to keep an eye out for.

cagov / caldata-mdsa-caltrans-pems