**jeancochrane** opened this issue 2 months ago
I did a little bit of research today to determine whether we could leverage dbt snapshots for the snapshotting system; my conclusion is that we currently can't, because they can't handle the "Old failures fixed / new failures" condition in the snapshot matrix (see the proposed refactor below).
More specifically, we could theoretically write a `check` snapshot for each test, using the test's generic macro to select failures. This would share one problem with the seed-based solution, namely the requirement to create a config file (in this case a snapshot rather than a seed) for each test, but it would fix the seed's problem of requiring manual management every time the universe of known failures changes. Then, to update the snapshots, we could just run `dbt snapshot` with a `--select` clause to pick out any tests that need an updated snapshot. However, this solution can't express the behavior "Create new snapshot without new failures," meaning we can't handle the "Old failures fixed / new failures" condition in the snapshot matrix.
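To make the idea concrete, here's a minimal sketch of what one of these snapshots might look like. Everything here is hypothetical: the snapshot name, the `row_id` key, and the `failing_rows_for_test` macro (a stand-in for the test's actual generic macro) are assumptions, not part of the current codebase.

```sql
-- snapshots/snap_test_vw_pin_sale_failures.sql (hypothetical)
{% snapshot snap_test_vw_pin_sale_failures %}

{{
    config(
        target_schema='qc',
        unique_key='row_id',
        strategy='check',
        check_cols='all'
    )
}}

-- failing_rows_for_test is a stand-in for the test's generic macro,
-- which selects the rows that currently fail the test
select * from ({{ failing_rows_for_test(ref('vw_pin_sale')) }}) as failing_rows

{% endsnapshot %}
```

Updating the known universe for this test would then be `dbt snapshot --select snap_test_vw_pin_sale_failures`. The catch, as noted above, is that this records fixed and new failures alike; there's no way to accept the fixes while leaving the new failures out of the snapshot.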
It's possible that the updates to snapshot configs in dbt Core 1.9 and 1.10 will make this easier, but I doubt it. Still, I'll be keeping an eye out in case that changes.
## Background
https://github.com/ccao-data/data-architecture/pull/579 introduces email notifications for failing tests, with a system to detect the state of test failures and only notify stakeholders for newly failing rows. This system works, but it also introduces some complexity that might make the system difficult to scale, namely:
- Each test that opts into the system must be configured with an `anti_join` argument with a complicated schema (see the hypothetical sketch after this list)
- `qc.test_run_failing_row` will no longer record rows that are part of the known universe of failing rows, so our analytics dashboards will not capture any work that we do to reduce the known universe of failing rows
- SNS topic overrides have to be passed manually via the `--vars` flag when running `run_iasworld_data_tests` in the `test-dbt-models` workflow

If the system is successful and the Data team ends up being responsible for configuring lots of tests to use it, these issues will make it much harder to maintain the system over time.
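For a sense of what the first pain point looks like in practice, here is a hypothetical sketch of an `anti_join` configuration. The test name, columns, and seed are all illustrative; the actual schema introduced in #579 may differ.

```yaml
# models/schema.yml (hypothetical illustration, not the actual #579 schema)
version: 2

models:
  - name: vw_pin_sale
    data_tests:
      - unique_combination_of_columns:
          combination_of_columns: [pin, sale_date]
          # Exclude rows that match the known universe of failures
          anti_join:
            source: ref('seed_known_test_failures')
            on: [pin, sale_date]
            filter: test_name = 'vw_pin_sale_unique'
```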
## Proposed refactor
Here is a sketch of a new iteration of the notification system that will be easier to scale:
- The universe of known failures for each test should live in a table called `qc.test_known_failure` and should be defined as a source in the dbt DAG (a sketch of the source definition follows this list)
- A `dbt/scripts/update_test_known_failures.py` script should expose an interface for managing this state file (a hypothetical CLI sketch follows this list)
- The `dbt/scripts/run_iasworld_data_tests.py` script should check whether each failing test has a universe of known failures in `qc.test_known_failure`, and use the most recent snapshot to exclude known failures from notifications
- The `dbt/scripts/run_iasworld_data_tests.py` script should check whether the state of each failing test has changed since the last snapshot, and should determine whether to update the snapshot using the same logic as the `update_test_known_failures` script. In particular, the script needs to handle the cases in the snapshot matrix: no change, old failures fixed, new failures introduced, and old failures fixed alongside new failures (a sketch of this decision logic follows this list)
- The `test-dbt-models` workflow should automatically search for SNS topic overrides based on variable names when running the `run_iasworld_data_tests` script (see here for original suggestion)
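A minimal sketch of the source definition for the state table, assuming it lives in a `qc` schema file. The file path and columns are assumptions; only the `qc.test_known_failure` name comes from this proposal, and the date-vs.-version question below would determine the exact key:

```yaml
# dbt/models/qc/schema.yml (hypothetical path)
version: 2

sources:
  - name: qc
    tables:
      - name: test_known_failure
        description: Known universe of failing rows for each test, by snapshot
        columns:
          - name: test_name      # assumed: which test the row belongs to
          - name: row_id         # assumed: identifier for the failing row
          - name: snapshot_date  # assumed: see the open question on date vs. version
```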
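A rough sketch of the interface `update_test_known_failures.py` might expose. The subcommands and flags are hypothetical, just one way to shape the script:

```python
# dbt/scripts/update_test_known_failures.py (hypothetical interface)
import argparse


def main() -> None:
    parser = argparse.ArgumentParser(
        description="Manage the universe of known test failures in qc.test_known_failure"
    )
    subparsers = parser.add_subparsers(dest="command", required=True)

    # Record the currently failing rows for one or more tests as a new snapshot
    update = subparsers.add_parser("update", help="Snapshot current failures")
    update.add_argument("--select", nargs="+", required=True, help="Tests to snapshot")

    # Inspect the most recent snapshot for a test
    show = subparsers.add_parser("show", help="Print the latest snapshot for a test")
    show.add_argument("test_name")

    args = parser.parse_args()
    # Dispatch to queries against qc.test_known_failure based on args.command
    ...


if __name__ == "__main__":
    main()
```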
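And a sketch of the decision logic for the snapshot matrix, written as plain set arithmetic over row identifiers. The `notify` and `update_snapshot` helpers are hypothetical, and whether snapshot updates happen automatically or only through the update script is left open here:

```python
# Hypothetical sketch of the snapshot-matrix logic in run_iasworld_data_tests.py.
# `current` holds the row IDs failing right now; `known` holds the row IDs in
# the most recent qc.test_known_failure snapshot for the same test.
def handle_test_failures(current: set[str], known: set[str]) -> None:
    new_failures = current - known    # failing rows outside the known universe
    fixed_failures = known - current  # known failures that no longer fail

    if not new_failures and not fixed_failures:
        return  # No change: nothing to notify, nothing to snapshot
    if fixed_failures and not new_failures:
        update_snapshot(current)  # Shrink the known universe; no notification
        return
    if new_failures and not fixed_failures:
        notify(new_failures)  # Alert stakeholders about the new rows only
        return
    # Old failures fixed AND new failures: the case dbt snapshots can't express.
    # Notify about the new rows, but snapshot only the still-failing known rows
    # so the new failures stay "unknown" until someone accepts them.
    notify(new_failures)
    update_snapshot(current & known)


def notify(rows: set[str]) -> None:
    """Hypothetical helper: send an SNS notification for newly failing rows."""
    ...


def update_snapshot(rows: set[str]) -> None:
    """Hypothetical helper: write a new snapshot to qc.test_known_failure."""
    ...
```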
## Open questions

- Should `qc.test_known_failure` be unique by snapshot date, or snapshot version?
- Should the `run_iasworld_data_tests` script compare failing tests to the most recent snapshot in `qc.test_known_failure`, or should tests have to be configured with the snapshot date/version that corresponds to their universe of known failures?

## Next steps
Since this refactor will require substantial engineering time, we should only undertake it if and when we determine that the notification system is capable of gaining traction with stakeholders. Until then, we should focus on building that traction and validating the usefulness of the system.