This PR fixes our broken build-daily-dbt-models workflow by improving build performance for the reporting.res_report_summary table, which has been exhausting Athena query resources ever since we merged https://github.com/ccao-data/data-architecture/pull/641.
The approach in this PR is to take the CTEs that the query reuses for aggregation and split them out into input tables that we materialize on the same schedule as reporting.res_report_summary. Tables improve performance over CTEs in this case because Athena treats CTEs the same as subqueries and views, that is to say, it will rerun the underlying query code every time the CTE is referenced.
While materializing the input data to the model speeds it up and reduces its memory usage, it also requires that we remember to refresh the inputs any time we want to refresh the model. This should not be a problem for the daily build task, since we can just tag the inputs as daily and dbt will understand the correct order to run the models, but it might be a gotcha if someone is ever asked to refresh this table on command.
One idea to mitigate this downside might be to add a when-loaded column to res_report_summary and its inputs, then add a test to the model confirming that the when-loaded column for its inputs are within some small bound (say, 15 minutes). This could provide a layer of protection against the models getting out of sync since the build-and-test-dbt workflow that we use to rebuild production models also runs tests for any models it builds. If you think this is a good idea, I'll go ahead and set it up before merging.
Alternatives I considered
Other alternatives I considered was to improve the memory use of the query itself without factoring out any extra tables. The most memory-intensive part of the query is the repeated unions that aggregate from the source data at varying levels of granularity, and I spent a while brainstorming ways of doing this without having to aggregate different groups and union them, but I couldn't think of anything better.
Testing
See this workflow run for evidence that build-daily-dbt-models now runs successfully.
I checked to make sure row counts match in the two assets:
select
assessment_stage,
prod_count,
dev_count
from (
select
assessment_stage,
count(*) as prod_count
from reporting.res_report_summary
group by assessment_stage
) prod
left join (
select
assessment_stage,
count(*) as dev_count
from z_ci_jeancochrane_653_refactor_res_report_summary_to_fix_failing_daily_dbt_model_builds_reporting.res_report_summary
where assessment_stage in ('ASSESSOR CERTIFIED', 'MAILED', 'BOR CERTIFIED', 'model')
group by assessment_stage
) dev
using (assessment_stage)
#
assessment_stage
prod_count
dev_count
1
MAILED
12306
12306
2
BOR CERTIFIED
9329
9329
3
model
8920
8920
4
ASSESSOR CERTIFIED
11259
11562
Confirming the difference in certified values is entirely 2024 data:
select year, assessment_stage, count(*) as count
from (
select triad, geography_type, geography_id, property_group, assessment_stage, year
from z_ci_jeancochrane_653_refactor_res_report_summary_to_fix_failing_daily_dbt_model_builds_reporting.res_report_summary
except
select triad, geography_type, geography_id, property_group, assessment_stage, year from reporting.res_report_summary
) comparison
group by year, assessment_stage
#
year
assessment_stage
count
1
2024
ASSESSOR PRE-CERTIFIED
379
2
2024
ASSESSOR CERTIFIED
303
Making sure the difference is all new towns that have closed since we last built res_report_summary in prod:
select distinct geography_type, geography_id
from (
select triad, geography_type, geography_id, property_group, assessment_stage, year
from z_ci_jeancochrane_653_refactor_res_report_summary_to_fix_failing_daily_dbt_model_builds_reporting.res_report_summary
except
select triad, geography_type, geography_id, property_group, assessment_stage, year from reporting.res_report_summary
) comparison
where geography_type = 'Town' and assessment_stage = 'ASSESSOR CERTIFIED'
order by geography_id
#
assessment_stage
count
1
Town
37
2
Town
72
I also spot checked values, and noticed that there are a number of towns that have slightly different sale or PIN counts between the dev and prod versions of res_report_summary. This can throw off the reported distributions for other values, since the extra data can shift the distribution very slightly. See this query for an example that looks only at mailed and model assessment stages:
select
triad,
geography_type,
geography_id,
property_group,
assessment_stage,
year,
dev.pin_n as dev_pin_n,
prod.pin_n as prod_pin_n,
abs(coalesce(dev.pin_n, 0) - coalesce(prod.pin_n, 0)) as pin_n_mismatch,
dev.sale_n as dev_sale_n,
prod.sale_n as prod_sale_n,
abs(coalesce(dev.sale_n, 0) - coalesce(prod.sale_n, 0)) as sale_n_mismatch
from z_ci_jeancochrane_653_refactor_res_report_summary_to_fix_failing_daily_dbt_model_builds_reporting.res_report_summary dev
full outer join reporting.res_report_summary prod
using (
triad,
geography_type,
geography_id,
property_group,
assessment_stage,
year
)
where assessment_stage not in ('ASSESSOR PRE-CERTIFIED', 'ASSESSOR CERTIFIED')
and (
coalesce(dev.pin_n, 0) != coalesce(prod.pin_n, 0) or
coalesce(dev.sale_n, 0) != coalesce(prod.sale_n, 0)
)
I'm not sure how seriously to take this discrepancy. None of the differences are particularly huge, but there's enough of them that it bugs me that I can't account for them. I compared all values across prod and dev versions of assets in https://github.com/ccao-data/data-architecture/pull/641, so I don't think the issue is being caused by incorrect logic in the new reporting views, but I can't rule it out. Let me know what you think, I'm happy to do more investigation if you think it merits it.
Background
This PR fixes our broken
build-daily-dbt-models
workflow by improving build performance for thereporting.res_report_summary
table, which has been exhausting Athena query resources ever since we merged https://github.com/ccao-data/data-architecture/pull/641.The approach in this PR is to take the CTEs that the query reuses for aggregation and split them out into input tables that we materialize on the same schedule as
reporting.res_report_summary
. Tables improve performance over CTEs in this case because Athena treats CTEs the same as subqueries and views, that is to say, it will rerun the underlying query code every time the CTE is referenced.Closes https://github.com/ccao-data/data-architecture/issues/653.
Tradeoffs
While materializing the input data to the model speeds it up and reduces its memory usage, it also requires that we remember to refresh the inputs any time we want to refresh the model. This should not be a problem for the daily build task, since we can just tag the inputs as
daily
and dbt will understand the correct order to run the models, but it might be a gotcha if someone is ever asked to refresh this table on command.One idea to mitigate this downside might be to add a when-loaded column to
res_report_summary
and its inputs, then add a test to the model confirming that the when-loaded column for its inputs are within some small bound (say, 15 minutes). This could provide a layer of protection against the models getting out of sync since thebuild-and-test-dbt
workflow that we use to rebuild production models also runs tests for any models it builds. If you think this is a good idea, I'll go ahead and set it up before merging.Alternatives I considered
Other alternatives I considered was to improve the memory use of the query itself without factoring out any extra tables. The most memory-intensive part of the query is the repeated unions that aggregate from the source data at varying levels of granularity, and I spent a while brainstorming ways of doing this without having to aggregate different groups and union them, but I couldn't think of anything better.
Testing
See this workflow run for evidence that
build-daily-dbt-models
now runs successfully.I checked to make sure row counts match in the two assets:
Confirming the difference in certified values is entirely 2024 data:
Making sure the difference is all new towns that have closed since we last built
res_report_summary
in prod:I also spot checked values, and noticed that there are a number of towns that have slightly different sale or PIN counts between the dev and prod versions of
res_report_summary
. This can throw off the reported distributions for other values, since the extra data can shift the distribution very slightly. See this query for an example that looks only atmailed
andmodel
assessment stages:I'm not sure how seriously to take this discrepancy. None of the differences are particularly huge, but there's enough of them that it bugs me that I can't account for them. I compared all values across prod and dev versions of assets in https://github.com/ccao-data/data-architecture/pull/641, so I don't think the issue is being caused by incorrect logic in the new reporting views, but I can't rule it out. Let me know what you think, I'm happy to do more investigation if you think it merits it.