Fix `reporting.res_report_summary` builds by turning input CTEs to tables

Background

This PR fixes our broken build-daily-dbt-models workflow by improving build performance for the reporting.res_report_summary table, which has been exhausting Athena query resources ever since we merged https://github.com/ccao-data/data-architecture/pull/641.

The approach in this PR is to take the CTEs that the query reuses for aggregation and split them out into input tables that we materialize on the same schedule as reporting.res_report_summary. Tables improve performance over CTEs in this case because Athena treats CTEs the same as subqueries and views, that is to say, it will rerun the underlying query code every time the CTE is referenced.

Closes https://github.com/ccao-data/data-architecture/issues/653.

Tradeoffs

While materializing the input data to the model speeds it up and reduces its memory usage, it also requires that we remember to refresh the inputs any time we want to refresh the model. This should not be a problem for the daily build task, since we can just tag the inputs as daily and dbt will understand the correct order to run the models, but it might be a gotcha if someone is ever asked to refresh this table on command.

One idea to mitigate this downside might be to add a when-loaded column to res_report_summary and its inputs, then add a test to the model confirming that the when-loaded column for its inputs are within some small bound (say, 15 minutes). This could provide a layer of protection against the models getting out of sync since the build-and-test-dbt workflow that we use to rebuild production models also runs tests for any models it builds. If you think this is a good idea, I'll go ahead and set it up before merging.

Alternatives I considered

Other alternatives I considered was to improve the memory use of the query itself without factoring out any extra tables. The most memory-intensive part of the query is the repeated unions that aggregate from the source data at varying levels of granularity, and I spent a while brainstorming ways of doing this without having to aggregate different groups and union them, but I couldn't think of anything better.

Testing

See this workflow run for evidence that build-daily-dbt-models now runs successfully.

I checked to make sure row counts match in the two assets:

select
    assessment_stage,
    prod_count,
    dev_count
from (
    select
        assessment_stage,
        count(*) as prod_count
    from reporting.res_report_summary
    group by assessment_stage
) prod
left join (
    select
        assessment_stage,
        count(*) as dev_count
    from z_ci_jeancochrane_653_refactor_res_report_summary_to_fix_failing_daily_dbt_model_builds_reporting.res_report_summary
    where assessment_stage in ('ASSESSOR CERTIFIED', 'MAILED', 'BOR CERTIFIED', 'model')
    group by assessment_stage
) dev
using (assessment_stage)

#	assessment_stage	prod_count	dev_count
1	MAILED	12306	12306
2	BOR CERTIFIED	9329	9329
3	model	8920	8920
4	ASSESSOR CERTIFIED	11259	11562

Confirming the difference in certified values is entirely 2024 data:

select year, assessment_stage, count(*) as count
from (
    select triad, geography_type, geography_id, property_group, assessment_stage, year
    from z_ci_jeancochrane_653_refactor_res_report_summary_to_fix_failing_daily_dbt_model_builds_reporting.res_report_summary
    except 
    select triad, geography_type, geography_id, property_group, assessment_stage, year from reporting.res_report_summary
) comparison
group by year, assessment_stage

#	year	assessment_stage	count
1	2024	ASSESSOR PRE-CERTIFIED	379
2	2024	ASSESSOR CERTIFIED	303

Making sure the difference is all new towns that have closed since we last built res_report_summary in prod:

select distinct geography_type, geography_id
from (
    select triad, geography_type, geography_id, property_group, assessment_stage, year
    from z_ci_jeancochrane_653_refactor_res_report_summary_to_fix_failing_daily_dbt_model_builds_reporting.res_report_summary
    except 
    select triad, geography_type, geography_id, property_group, assessment_stage, year from reporting.res_report_summary
) comparison
where geography_type = 'Town' and assessment_stage = 'ASSESSOR CERTIFIED'
order by geography_id

#	assessment_stage	count
1	Town	37
2	Town	72

I also spot checked values, and noticed that there are a number of towns that have slightly different sale or PIN counts between the dev and prod versions of res_report_summary. This can throw off the reported distributions for other values, since the extra data can shift the distribution very slightly. See this query for an example that looks only at mailed and model assessment stages:

select
    triad,
    geography_type,
    geography_id,
    property_group,
    assessment_stage,
    year,
    dev.pin_n as dev_pin_n,
    prod.pin_n as prod_pin_n,
    abs(coalesce(dev.pin_n, 0) - coalesce(prod.pin_n, 0)) as pin_n_mismatch,
    dev.sale_n as dev_sale_n,
    prod.sale_n as prod_sale_n,
    abs(coalesce(dev.sale_n, 0) - coalesce(prod.sale_n, 0)) as sale_n_mismatch
from z_ci_jeancochrane_653_refactor_res_report_summary_to_fix_failing_daily_dbt_model_builds_reporting.res_report_summary dev
full outer join reporting.res_report_summary prod
using ( 
    triad,
    geography_type,
    geography_id,
    property_group,
    assessment_stage,
    year
)
where assessment_stage not in ('ASSESSOR PRE-CERTIFIED', 'ASSESSOR CERTIFIED')
and (
  coalesce(dev.pin_n, 0) != coalesce(prod.pin_n, 0) or
  coalesce(dev.sale_n, 0) != coalesce(prod.sale_n, 0)
)

I'm not sure how seriously to take this discrepancy. None of the differences are particularly huge, but there's enough of them that it bugs me that I can't account for them. I compared all values across prod and dev versions of assets in https://github.com/ccao-data/data-architecture/pull/641, so I don't think the issue is being caused by incorrect logic in the new reporting views, but I can't rule it out. Let me know what you think, I'm happy to do more investigation if you think it merits it.

ccao-data / data-architecture