**jeancochrane** opened this issue 2 months ago
I did a little bit of research today to determine whether we could leverage dbt snapshots for the snapshotting system; my conclusion is that we currently can't, because they can't handle the "Old failures fixed / new failures" condition in the snapshot matrix (see the proposed refactor below).
More specifically, we could theoretically write a `check` snapshot for each test, using the test's generic macro to select failures. This would share one problem with the seed-based solution, namely the requirement to create a config file (in this case a snapshot rather than a seed) for each test, but it would fix the seed's problem of requiring manual management every time the universe of known failures changes. Then, to update the snapshots, we could just run `dbt snapshot` with a `--select` clause to pick out any tests that need an updated snapshot. However, this solution can't express the behavior "Create new snapshot without new failures," meaning we can't handle the "Old failures fixed / new failures" condition in the snapshot matrix.
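To make the idea concrete, here's a minimal sketch of what one of these snapshots might look like. Everything here is hypothetical: the snapshot name, the `row_id` key, and the `failing_rows_for_test` macro (a stand-in for the test's actual generic macro) are assumptions, not part of the current codebase.

```sql
-- snapshots/snap_test_vw_pin_sale_failures.sql (hypothetical)
{% snapshot snap_test_vw_pin_sale_failures %}

{{
    config(
        target_schema='qc',
        unique_key='row_id',
        strategy='check',
        check_cols='all'
    )
}}

-- failing_rows_for_test is a stand-in for the test's generic macro,
-- which selects the rows that currently fail the test
select * from ({{ failing_rows_for_test(ref('vw_pin_sale')) }}) as failing_rows

{% endsnapshot %}
```

Updating the known universe for this test would then be `dbt snapshot --select snap_test_vw_pin_sale_failures`. The catch, as noted above, is that this records fixed and new failures alike; there's no way to accept the fixes while leaving the new failures out of the snapshot.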
It's possible that the updates to snapshot configs in dbt Core 1.9 and 1.10 will make this easier, but I doubt it. Still, I'll be keeping an eye out in case that changes.
## Background
https://github.com/ccao-data/data-architecture/pull/579 introduces email notifications for failing tests, with a system to detect the state of test failures and only notify stakeholders for newly failing rows. This system works, but it also introduces some complexity that might make the system difficult to scale, namely:
- Each test that opts into the system must be configured with an `anti_join` argument with a complicated schema (see the hypothetical sketch after this list)
- `qc.test_run_failing_row` will no longer record rows that are part of the known universe of failing rows, so our analytics dashboards will not capture any work that we do to reduce the known universe of failing rows
- SNS topic overrides have to be passed manually via the `--vars` flag when running `run_iasworld_data_tests` in the `test-dbt-models` workflow

If the system is successful and the Data team ends up being responsible for configuring lots of tests to use it, these issues will make it much harder to maintain the system over time.
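For a sense of what the first pain point looks like in practice, here is a hypothetical sketch of an `anti_join` configuration. The test name, columns, and seed are all illustrative; the actual schema introduced in #579 may differ.

```yaml
# models/schema.yml (hypothetical illustration, not the actual #579 schema)
version: 2

models:
  - name: vw_pin_sale
    data_tests:
      - unique_combination_of_columns:
          combination_of_columns: [pin, sale_date]
          # Exclude rows that match the known universe of failures
          anti_join:
            source: ref('seed_known_test_failures')
            on: [pin, sale_date]
            filter: test_name = 'vw_pin_sale_unique'
```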
## Proposed refactor
Here is a sketch of a new iteration of the notification system that will be easier to scale:
- The universe of known failures for each test should live in a table called `qc.test_known_failure` and should be defined as a source in the dbt DAG (a sketch of the source definition follows this list)
- A `dbt/scripts/update_test_known_failures.py` script should expose an interface for managing this state file (a hypothetical CLI sketch follows this list)
- The `dbt/scripts/run_iasworld_data_tests.py` script should check whether each failing test has a universe of known failures in `qc.test_known_failure`, and use the most recent snapshot to exclude known failures from notifications
- The `dbt/scripts/run_iasworld_data_tests.py` script should check whether the state of each failing test has changed since the last snapshot, and should determine whether to update the snapshot using the same logic as the `update_test_known_failures` script. In particular, the script needs to handle the cases in the snapshot matrix: no change, old failures fixed, new failures introduced, and old failures fixed alongside new failures (a sketch of this decision logic follows this list)
- The `test-dbt-models` workflow should automatically search for SNS topic overrides based on variable names when running the `run_iasworld_data_tests` script (see here for original suggestion)
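A minimal sketch of the source definition for the state table, assuming it lives in a `qc` schema file. The file path and columns are assumptions; only the `qc.test_known_failure` name comes from this proposal, and the date-vs.-version question below would determine the exact key:

```yaml
# dbt/models/qc/schema.yml (hypothetical path)
version: 2

sources:
  - name: qc
    tables:
      - name: test_known_failure
        description: Known universe of failing rows for each test, by snapshot
        columns:
          - name: test_name      # assumed: which test the row belongs to
          - name: row_id         # assumed: identifier for the failing row
          - name: snapshot_date  # assumed: see the open question on date vs. version
```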
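A rough sketch of the interface `update_test_known_failures.py` might expose. The subcommands and flags are hypothetical, just one way to shape the script:

```python
# dbt/scripts/update_test_known_failures.py (hypothetical interface)
import argparse


def main() -> None:
    parser = argparse.ArgumentParser(
        description="Manage the universe of known test failures in qc.test_known_failure"
    )
    subparsers = parser.add_subparsers(dest="command", required=True)

    # Record the currently failing rows for one or more tests as a new snapshot
    update = subparsers.add_parser("update", help="Snapshot current failures")
    update.add_argument("--select", nargs="+", required=True, help="Tests to snapshot")

    # Inspect the most recent snapshot for a test
    show = subparsers.add_parser("show", help="Print the latest snapshot for a test")
    show.add_argument("test_name")

    args = parser.parse_args()
    # Dispatch to queries against qc.test_known_failure based on args.command
    ...


if __name__ == "__main__":
    main()
```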
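And a sketch of the decision logic for the snapshot matrix, written as plain set arithmetic over row identifiers. The `notify` and `update_snapshot` helpers are hypothetical, and whether snapshot updates happen automatically or only through the update script is left open here:

```python
# Hypothetical sketch of the snapshot-matrix logic in run_iasworld_data_tests.py.
# `current` holds the row IDs failing right now; `known` holds the row IDs in
# the most recent qc.test_known_failure snapshot for the same test.
def handle_test_failures(current: set[str], known: set[str]) -> None:
    new_failures = current - known    # failing rows outside the known universe
    fixed_failures = known - current  # known failures that no longer fail

    if not new_failures and not fixed_failures:
        return  # No change: nothing to notify, nothing to snapshot
    if fixed_failures and not new_failures:
        update_snapshot(current)  # Shrink the known universe; no notification
        return
    if new_failures and not fixed_failures:
        notify(new_failures)  # Alert stakeholders about the new rows only
        return
    # Old failures fixed AND new failures: the case dbt snapshots can't express.
    # Notify about the new rows, but snapshot only the still-failing known rows
    # so the new failures stay "unknown" until someone accepts them.
    notify(new_failures)
    update_snapshot(current & known)


def notify(rows: set[str]) -> None:
    """Hypothetical helper: send an SNS notification for newly failing rows."""
    ...


def update_snapshot(rows: set[str]) -> None:
    """Hypothetical helper: write a new snapshot to qc.test_known_failure."""
    ...
```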
## Open questions

- Should `qc.test_known_failure` be unique by snapshot date, or snapshot version?
- Should the `run_iasworld_data_tests` script compare failing tests to the most recent snapshot in `qc.test_known_failure`, or should tests have to be configured with the snapshot date/version that corresponds to their universe of known failures?

## Next steps
Since this refactor will require substantial engineering time, we should only undertake it if and when we determine that the notification system is capable of gaining traction with stakeholders. Until then, we should focus on building that traction and validating the usefulness of the system.