ccao-data / data-architecture

Codebase for CCAO data infrastructure construction and management
https://ccao-data.github.io/data-architecture/

Simplify design of failing test notification system #595

Open jeancochrane opened 2 months ago

jeancochrane commented 2 months ago

Background

https://github.com/ccao-data/data-architecture/pull/579 introduces email notifications for failing tests, with a system to detect the state of test failures and only notify stakeholders for newly failing rows. This system works, but it also introduces some complexity that might make the system difficult to scale, namely:

If the system is successful and the Data team ends up being responsible for configuring lots of tests to use it, these issues will make it much harder to maintain the system over time.

Proposed refactor

Here is a sketch of a new iteration of the notification system that will be easier to scale:

| | No new failures | New failures |
|---|---|---|
| No old failures fixed | Do nothing | Do nothing |
| Old failures fixed | Create new snapshot | Create new snapshot without new failures |
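The snapshot matrix above can be read as set logic over failing rows. The following is only a minimal sketch of that reading, not the implementation proposed in this issue: it assumes each failing row can be reduced to a stable string key, and the names `next_snapshot` and `new_failures` are hypothetical.

```python
from typing import FrozenSet, Optional


def new_failures(known: FrozenSet[str], current: FrozenSet[str]) -> FrozenSet[str]:
    """Rows failing now that are not in the snapshot of known failures."""
    return current - known


def next_snapshot(
    known: FrozenSet[str], current: FrozenSet[str]
) -> Optional[FrozenSet[str]]:
    """Return the updated snapshot, or None for the "Do nothing" cells.

    Assumption: `known` is the previous snapshot and `current` is the set of
    rows failing in the latest test run, both keyed by a stable row ID.
    """
    fixed = known - current  # old failures that no longer fail
    if not fixed:
        # Top row of the matrix: nothing was fixed, so leave the snapshot alone.
        return None
    # Bottom row: drop the fixed failures but never add new ones. With no new
    # failures this equals `current`; with new failures it excludes them.
    return known & current


known = frozenset({"row-a", "row-b"})
current = frozenset({"row-a", "row-c"})  # row-b was fixed, row-c is new
updated = next_snapshot(known, current)
to_notify = new_failures(known, current)
```

Under this reading, the intersection keeps new failures out of the snapshot in the "Old failures fixed / New failures" cell, so they still count as new on the next run.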

Open questions

Next steps

Since this refactor will require substantial engineering time, we should only undertake it if and when we determine that the notification system is capable of gaining traction with stakeholders. Until then, we should focus on getting traction and validating the usefulness of the system.

jeancochrane commented 2 months ago

I did a little bit of research today to determine if we could leverage dbt snapshots for the snapshotting system; my determination is that we currently can't, because they can't handle the "Old failures fixed / new failures" condition in the snapshot matrix listed above.

More specifically, we could theoretically write a check snapshot for each test, using the test's generic macro to select failures. This would share one problem with the seed-based solution, namely the requirement to create a config file for each test (in this case a snapshot rather than a seed). It would, however, fix the seed's other problem: the need for manual management every time the universe of known failures changes. To update the snapshots, we could then just run dbt snapshot with a --select clause to select any tests that need an updated snapshot. However, this solution can't handle the behavior "Create new snapshot without new failures," meaning we can't handle the "Old failures fixed / new failures" condition in the snapshot matrix.

It's possible that the updates to snapshot configs in dbt Core 1.9 and 1.10 will make this easier, but I doubt it. Still, I'll be keeping an eye out in case that changes.