This PR updates the `test_dbt_models` workflow to format run results and upload them to S3 for analysis. In the process, it refactors the `format_dbt_test_failures` script into a more general form (now called `transform_dbt_test_results`) that outputs test result metadata to Parquet files in addition to outputting the failures workbook.
## Schema
Here's the schema of the new generated parquet files:
- `metadata/test_run/run_id=*/*.parquet`: Metadata about the full test run, partitioned by run ID. Populates the `qc.test_run` table in Athena. Unique by `run_id`.
  - `run_id`: primary key string representing the unique ID of the test run (technically the invocation ID of the `dbt test` command)
  - `run_date`: string representing the UTC date that the test ran, in `YYYY-MM-DD` format
  - `elapsed_time`: decimal representing the total number of seconds that the tests took to execute
  - `run_by`: string representing the username of the person who ran the script. On a local machine, this will be the value of the Unix `$USER` variable set on login; on CI, this will be set to the `github.triggering_actor` variable.
  - `var_year_start`: string representing the value of `test_qc_year_start` for the test run, in `YYYY-MM-DD` format
  - `var_year_end`: string representing the value of `test_qc_year_end` for the test run, in `YYYY-MM-DD` format
  - `git_sha`: string representing the hash of the script's commit
  - `git_branch`: string representing the branch of the script's commit
  - `git_author`: string representing the name and email address of the author of the script's commit
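The `run_by` resolution described above can be sketched roughly as follows. This is a hypothetical helper, not the actual script's code; `GITHUB_ACTIONS` and `GITHUB_TRIGGERING_ACTOR` are the standard environment variables that GitHub Actions sets on runners.

```python
import getpass


def resolve_run_by(env: dict) -> str:
    """Resolve the username for the run_by field (illustrative sketch).

    On CI, GitHub Actions exposes the triggering actor via an environment
    variable; locally we fall back to the Unix login name.
    """
    if env.get("GITHUB_ACTIONS") == "true":
        return env["GITHUB_TRIGGERING_ACTOR"]
    # $USER is set on login; getpass.getuser() is a fallback for shells
    # where it is unset
    return env.get("USER") or getpass.getuser()


print(resolve_run_by({"GITHUB_ACTIONS": "true",
                      "GITHUB_TRIGGERING_ACTOR": "octocat"}))  # octocat
print(resolve_run_by({"USER": "jdoe"}))  # jdoe
```

Taking the environment as a parameter rather than reading `os.environ` directly keeps the resolution logic easy to unit test.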
- `metadata/test_run_result/run_id=*/*.parquet`: Metadata about individual test results (i.e. pass, fail, number of failing rows, etc.) in this run, partitioned by run ID. Populates the `qc.test_run_result` table in Athena. Unique by `run_id`, `test_name`, and `township_code`.
  - `run_id`: composite primary key string representing the unique ID of the test run. Foreign key to `qc.test_run.run_id`.
  - `test_name`: composite primary key string representing the name of the test, e.g. `iasworld_sales_cur_in_accepted_values`
  - `township_code`: composite primary key string representing the township of the test results; can be null if the failing rows can't be mapped to townships, or if the test passed
  - `table_name`: string representing the name of the iasWorld table that the test is defined on, e.g. `pardat`
  - `category`: string representing the category of the test, e.g. `class_mismatch_or_issue`
  - `description`: string representing the long description of the test, e.g. "at least one class should match pardat class"
  - `status`: string enum representing the status of the test, one of `pass` or `fail`
  - `elapsed_time`: decimal representing the number of seconds that the test took to execute
  - `num_failing_rows`: integer representing the number of rows that failed the test, or 0 if the test passed
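As a rough illustration of where these per-test fields come from, here is a hypothetical sketch of flattening dbt's `run_results.json` artifact into rows of this shape. The artifact keys used (`unique_id`, `status`, `execution_time`, `failures`) follow dbt's documented run-results format; the function itself is assumed, not the actual transform script, and omits fields like `township_code` that require joining against failing rows.

```python
def flatten_results(run_results: dict, run_id: str) -> list[dict]:
    """Flatten a dbt run_results.json dict into qc.test_run_result rows
    (illustrative sketch, not the real transform_dbt_test_results logic)."""
    rows = []
    for result in run_results["results"]:
        rows.append({
            "run_id": run_id,
            # unique_id looks like "test.<project>.<test_name>"
            "test_name": result["unique_id"].split(".")[-1],
            "status": result["status"],
            "elapsed_time": result["execution_time"],
            # "failures" may be absent or None for passing tests
            "num_failing_rows": result.get("failures") or 0,
        })
    return rows


raw = {"results": [
    {"unique_id": "test.dbt.iasworld_sales_cur_in_accepted_values",
     "status": "fail", "execution_time": 1.5, "failures": 12},
]}
rows = flatten_results(raw, "abc123")
print(rows[0]["num_failing_rows"])  # 12
```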
## Testing
See here for a successful workflow run, and check the `qc.test_run` and `qc.test_run_result` tables in Athena to browse the metadata for that run. An example of the types of queries we can run using this schema:
```sql
SELECT run_id, SUM(num_failing_rows)
FROM qc.test_run_result
WHERE status = 'fail' AND category = 'class_mismatch_or_issue'
GROUP BY run_id
```
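For context on the `run_id=*` path layout: the Hive-style `run_id=<value>` directory segment is what lets Athena expose `run_id` as a partition column on both tables. A minimal sketch of building such a key (the helper name and filename are illustrative assumptions):

```python
def parquet_key(prefix: str, run_id: str, filename: str) -> str:
    """Build a Hive-style partitioned object key (illustrative helper).

    Athena reads the run_id=<value> directory to populate the run_id
    partition column on qc.test_run and qc.test_run_result.
    """
    return f"{prefix}/run_id={run_id}/{filename}"


print(parquet_key("metadata/test_run", "abc123", "part-0.parquet"))
# metadata/test_run/run_id=abc123/part-0.parquet
```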
This is ready for another look @dfsnow! The tables in Athena should also be up to date from this latest run if you'd like to browse the updated schema.