ccao-data / data-architecture

Codebase for CCAO data infrastructure construction and management
https://ccao-data.github.io/data-architecture/

Update `test_dbt_models` workflow to format and upload run results to S3 #370

Closed. jeancochrane closed this 2 months ago

jeancochrane commented 3 months ago

This PR updates the `test_dbt_models` workflow to format run results and upload them to S3 for analysis. In the process, it refactors the `format_dbt_test_failures` script into a more general form (now called `transform_dbt_test_results`) that writes test result metadata to Parquet files in addition to producing the failures workbook.
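As a rough illustration of the kind of transformation described above, here is a minimal sketch that splits a dbt `run_results.json` payload into one run-level row and one row per test result. It is not the actual `transform_dbt_test_results` script: the function name `split_run_results`, the `test_name`, `generated_at`, and `elapsed_time` columns, and the sample payload are all assumptions made for illustration. Only `run_id`, `status`, and `num_failing_rows` are column names that appear in this PR, and the `category` column is left out because the description does not say how it is derived.

```python
from typing import Any


def split_run_results(
    run_results: dict[str, Any],
) -> tuple[dict[str, Any], list[dict[str, Any]]]:
    """Split a dbt run_results.json payload into a run row and result rows."""
    metadata = run_results["metadata"]
    # Assumption: dbt's invocation ID is used as the run identifier.
    run_id = metadata["invocation_id"]

    # One row describing the run as a whole (hypothetical layout).
    run_row = {
        "run_id": run_id,
        "generated_at": metadata.get("generated_at"),
        "elapsed_time": run_results.get("elapsed_time"),
    }

    # One row per test result (hypothetical layout).
    result_rows = [
        {
            "run_id": run_id,
            "test_name": result["unique_id"],
            "status": result["status"],
            # "failures" may be null (for example when a test errors out),
            # so fall back to 0 in that case.
            "num_failing_rows": result.get("failures") or 0,
        }
        for result in run_results["results"]
    ]
    return run_row, result_rows


# Invented sample payload, shaped like a trimmed-down run_results.json.
sample = {
    "metadata": {
        "invocation_id": "example-run-1",
        "generated_at": "2024-01-01T00:00:00Z",
    },
    "elapsed_time": 12.5,
    "results": [
        {"unique_id": "test.example.not_null_a", "status": "pass", "failures": 0},
        {"unique_id": "test.example.unique_b", "status": "fail", "failures": 3},
    ],
}

run_row, result_rows = split_run_results(sample)
print(run_row["run_id"], [row["num_failing_rows"] for row in result_rows])
```

Keeping the run row and the result rows separate mirrors the two tables mentioned in this PR (`qc.test_run` and `qc.test_run_result`). The rows could then be written out with a Parquet library, for example via pandas' `DataFrame.to_parquet`.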

Schema

Here's the schema of the newly generated Parquet files:

Testing

See here for a successful workflow run, and check the `qc.test_run` and `qc.test_run_result` tables in Athena to browse the metadata for that run. An example of the types of queries we can run using this schema:

SELECT run_id, SUM(num_failing_rows)
FROM qc.test_run_result
WHERE status = 'fail' AND category = 'class_mismatch_or_issue'
GROUP BY run_id
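To make concrete what the query above computes, here is a small sketch that runs the same SQL against an in-memory SQLite database instead of Athena. The rows, the run IDs, and the `other_category` value are invented for illustration, and the table contains only the four columns the query references; the real `qc.test_run_result` table likely has more.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Attach a second in-memory database named "qc" so the query can keep
# its schema-qualified table name.
conn.execute("ATTACH DATABASE ':memory:' AS qc")
conn.execute(
    """
    CREATE TABLE qc.test_run_result (
        run_id TEXT,
        status TEXT,
        category TEXT,
        num_failing_rows INTEGER
    )
    """
)

# Invented example rows, not real test results.
rows = [
    ("run-a", "fail", "class_mismatch_or_issue", 4),
    ("run-a", "fail", "class_mismatch_or_issue", 6),
    ("run-a", "pass", "class_mismatch_or_issue", 0),
    ("run-b", "fail", "class_mismatch_or_issue", 1),
    ("run-b", "fail", "other_category", 9),
]
conn.executemany("INSERT INTO qc.test_run_result VALUES (?, ?, ?, ?)", rows)

query = """
    SELECT run_id, SUM(num_failing_rows)
    FROM qc.test_run_result
    WHERE status = 'fail' AND category = 'class_mismatch_or_issue'
    GROUP BY run_id
"""
# Each result row is a (run_id, total) pair, so it converts directly to a dict.
totals = dict(conn.execute(query))
print(totals)
```

The result has one entry per run, giving the total number of failing rows across failed tests in that category, which makes it easy to compare a category's failure volume from one run to the next.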

jeancochrane commented 3 months ago

This is ready for another look @dfsnow! The tables in Athena should also be up to date from this latest run if you'd like to browse the updated schema.