**Open** · JamesSLogan opened this issue 1 month ago
I agree, James. There are several incremental models whose unique keys are not set at the most unique / lowest level of grain, which leaves us open to potential CI issues with duplicates when detectors move to different lanes, etc.

However, defining the unique keys properly will still not completely eliminate the risk of failures if data comes in weird. For example, if the unique key is set to `detector_id` + `sample_date` + `sample_timestamp` but the detector moves lanes in the middle of a day, it might show up in one of the diagnostics tables with two different rows for the same date, with different timestamps, for the same time period. So in addition to the unique key changes listed below, we should also build uniqueness tests so that we are alerted properly when duplicates do slip through. There's already a ticket assigned to me for that: #340 https://github.com/cagov/data-infrastructure/issues/340

When reviewing all of our incremental models, I found they fell into the following categories:
- `intermediate/clearinghouse/int_clearinghouse__detector_agg_five_minutes_with_missing_rows.sql`
- `intermediate/clearinghouse/int_clearinghouse__detector_g_factor_based_speed.sql`
- `intermediate/clearinghouse/int_clearinghouse__detector_agg_five_minutes.sql`
- `intermediate/diagnostics/int_diagnostics__detector_status.sql`
- `intermediate/performance/int_performance__detector_metrics_agg_five_minutes.sql`
- `intermediate/diagnostics/int_diagnostics__no_data_status.sql`
- `intermediate/performance/int_performance__detector_metrics_agg_hourly.sql`
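For the models in this group, the fix is presumably to key the incremental merge on the detector grain. A sketch of the config change (the key columns here are an assumption based on the `detector_id` + `sample_date` + `sample_timestamp` example above; each model's true grain should be verified first):

```sql
{{
    config(
        materialized='incremental',
        -- merge on the detector-level grain rather than station_id + lane
        unique_key=['detector_id', 'sample_date', 'sample_timestamp'],
    )
}}
```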
(The bottleneck models all seem to be based on station ID, not detector ID, so I would keep these as is):
- `intermediate/performance/int_performance__station_metrics_agg_five_minutes.sql`
- `intermediate/performance/int_performance__station_metrics_agg_hourly.sql`
- `intermediate/performance/int_performance__bottleneck_delay_metrics_agg_daily`
- `intermediate/performance/int_performance__bottleneck_delay_metrics_agg_five_minutes.sql`
- `intermediate/performance/int_performance__bottleneck_delay_metrics_agg_hourly.sql`
Note: this model has a unique key at the `station_id`, `lane`, `sample_date` level, and both its upstream and downstream models merge at the detector ID level, so it's possible we want to expand this model to that level as well. Because of what it's doing, though, it didn't seem to me that this model needed a change. Curious what you think, @JamesSLogan:
- `intermediate/diagnostics/int_diagnostics__constant_occupancy.sql`
Note: I think the whole imputation journey should be updated to merge uniquely at the `detector_id` level, just as the non-imputed journey does.
- `intermediate/diagnostics/int_diagnostics__samples_per_detector.sql`
- `marts/imputation/imputation__detector_summary.sql`
- `intermediate/imputation/int_imputation__detector_agg_five_minutes.sql`
- `intermediate/imputation/int_imputation__detector_imputed_agg_five_minutes.sql`
- `intermediate/imputation/int_imputation__local_regional_regression_coefficients.sql`
- `intermediate/db96/int_db96__detector_agg_five_minutes.sql`
- `staging/db96/stg_db96__vds30sec.sql`
- `intermediate/imputation/int_imputation__global_coefficients.sql`
These models are not configured as incremental, probably because they are mostly aggregate models, so we expect them to stay small enough that the unique key constraint isn't necessary. But if we are looking for performance gains in CI, it may be worth configuring these as incremental as well.
- `models/intermediate/performance/int_performance__detector_metrics_agg_monthly.sql`
- `models/intermediate/performance/int_performance__detector_metrics_agg_weekly.sql`
- `models/intermediate/performance/int_performance__station_aadt_with_K_value.sql`
- `models/intermediate/performance/int_performance__station_metrics_agg_weekly.sql`
- `models/intermediate/performance/int_performance__detector_metrics_agg_daily.sql`
- `models/intermediate/performance/int_performance__station_metrics_agg_daily.sql`
- `models/intermediate/performance/int_performance__station_metrics_agg_monthly.sql`
- `models/intermediate/performance/int_performance__bottleneck_delay_metrics_agg_monthly.sql`
- `models/marts/imputation/imputation__detector_imputed_agg_five_minutes.sql`
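If we do chase those CI gains, converting one of these aggregates to incremental would look roughly like the sketch below (the column names, weekly grain, and one-week lookback are illustrative assumptions, not taken from the actual models):

```sql
{{
    config(
        materialized='incremental',
        unique_key=['detector_id', 'sample_week'],
    )
}}

select
    detector_id,
    date_trunc('week', sample_date) as sample_week,
    sum(daily_volume) as weekly_volume
from {{ ref('int_performance__detector_metrics_agg_daily') }}
{% if is_incremental() %}
-- on incremental runs, only rebuild weeks that may have received new daily rows
where sample_date >= (select dateadd('week', -1, max(sample_week)) from {{ this }})
{% endif %}
group by 1, 2
```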
I am open to any feedback on this analysis! In the meantime, I'll open some PRs to deal with each category where action is called for. @ian-r-rose @jkarpen
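As a concrete sketch of the uniqueness tests mentioned above, something like the following in a model's schema file would alert on duplicates at the intended grain (this assumes the project has the `dbt_utils` package installed; the model name and key columns are illustrative):

```yaml
version: 2

models:
  - name: int_clearinghouse__detector_agg_five_minutes
    tests:
      # fail if any combination of these columns appears more than once
      - dbt_utils.unique_combination_of_columns:
          combination_of_columns:
            - detector_id
            - sample_date
            - sample_timestamp
```

Unlike a single-column `unique` test, this checks uniqueness of the combination, so it matches the merge grain and would catch the mid-day lane-change scenario.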
Thanks for the excellent, thorough analysis @summer-mothwood! Regarding `int_diagnostics__constant_occupancy`, I think it should be changed to use `detector_id` throughout instead of `station_id` + `lane`. This will bring the joins in its downstream model into alignment, as you alluded to. Maybe that puts it into category 4?
That all sounds correct to me @summer-mothwood.
One note on your category 4: there is a bit of history here, where many of the basal tables start out at the station+lane level, and don't have the metadata included. We've gone back and forth a bit on when we should try to join them so that these extra data are attached (see #214 and linked). One of the reasons it's a bit tricky is that with the large tables the join is quite expensive, so it wasn't always obvious at what stage we absolutely needed to do it.
All of this is to say, there may be good performance-related reasons to not join detector ID in for some of those large category 4 tables. Or maybe it's not a huge deal, but it's something to keep an eye out for.
Many incremental models were developed before we had access to `detector_id`, so they were configured to use both `station_id` and `lane` to merge in new records. We now have some models using `detector_id` and some models using the prior method. This can lead to issues due to (potentially invalid) metadata updates. Wherever possible, these `unique_key` values should ideally be updated to use `detector_id`. Part of this task is determining if this strategy will work correctly.

List of files under `transform/models` currently specifying any `unique_key`:
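To make the before/after concrete, the change being proposed is roughly the following (a sketch; the exact key columns vary by model):

```sql
-- prior method: merge new records on station + lane
{{ config(materialized='incremental', unique_key=['station_id', 'lane', 'sample_date']) }}

-- proposed: merge on the detector grain
{{ config(materialized='incremental', unique_key=['detector_id', 'sample_date']) }}
```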