ccao-data / data-architecture

Codebase for CCAO data infrastructure construction and management
https://ccao-data.github.io/data-architecture/
5 stars 3 forks source link

Decompose reporting views/ctas and implement dbt tests #367

Closed wrridgeway closed 2 months ago

wrridgeway commented 3 months ago

Some of the reporting views and CTAS are rather unwieldy - let's decompose their CTEs into views that can be tested using dbt. Other views can be refactored using views from our default athena database to decrease the amount of code used.

CTAS to refactor/decompose: reporting.res_report_summary

Views to refactor/decompose: reporting.vw_ratio_stats reporting.vw_assessment_roll reporting.vw_top_5 reporting.vw_town_sale_history

Views to add: reporting.vw_pin_value_long reporting.vw_pin_township_class

This PR primarily shifts construction of common reporting groups out of each reporting view/CTAS and into vw_pin_township_class, which can be used to attach all common reporting groups to any pin/year combo. It also constructs a long version of vw_pin_value since multiple reporting views construct output at a stage-level.

wrridgeway commented 2 months ago

vw_assessment_roll

WITH old AS (
    SELECT
    year,
    stage,
    class,
    count(*) AS count,
    avg(tot_sum) AS "avg"
    FROM reporting.vw_assessment_roll
    GROUP BY year, stage, class
),
new AS (
    SELECT
    year,
    stage,
    class,
    count(*) AS count,
    avg(tot_sum) AS "avg"
    FROM "z_ci_366-decompose-reporting-viewsctas-and-implmenet-dbt-tests_reporting".vw_assessment_roll
    GROUP BY year, stage, class
)

SELECT * FROM old
EXCEPT
SELECT * FROM new

Yields:

year stage class count avg
2023 assessor certified O 1 0
2023 mailed O 1 0
2023 bor certified O 1 0

Looking at 2023 more closely, we see the differences are cause by one parcel switching from class 'O' to '2'. This is caused by a disagreement between pardat (old source for classes) and asmt_all (new source for classes).

SELECT
old.year,
old.stage,
old.class,
old.township_name,
old.n AS old_n,
new.n AS new_n
FROM reporting.vw_assessment_roll old
LEFT JOIN  "z_ci_366-decompose-reporting-viewsctas-and-implmenet-dbt-tests_reporting".vw_assessment_roll new
ON old.year = new.year
    AND old.stage = new.stage
    AND old.class = new.class
    AND old.township_name = new.township_name
WHERE old.n != new.n
OR new.n IS NULL
year stage class township_name old_n new_n
2023 bor certified O Riverside 1  
2023 mailed 2 Riverside 5908 5909
2023 assessor certified 2 Riverside 5908 5909
2023 bor certified 2 Riverside 5908 5909
2023 mailed O Riverside 1  
2023 assessor certified O Riverside 1  

All AVs across the two views match, except for 2023 class 2 values because of this one parcel. EX AVs match because adding another exempt parcel to the group won't change sums, medians, etc.

wrridgeway commented 2 months ago

vw_top_5

Return any rows that don't join by pin and year across the old and new views.

SELECT * FROM reporting.vw_top_5 old
FULL OUTER JOIN "z_ci_366-decompose-reporting-viewsctas-and-implmenet-dbt-tests_reporting".vw_top_5 new
ON old.parid = new.pin
    AND old.year = new.year
WHERE old.parid IS NULL
    OR new.pin IS NULL

Doesn't return any rows that don't join. Are there any differences between the views even if they have the same pins by year?

WITH differences AS (
    SELECT * FROM reporting.vw_top_5 old
    EXCEPT
SELECT * FROM "z_ci_366-decompose-reporting-viewsctas-and-implmenet-dbt-tests_reporting".vw_top_5 new
)

SELECT
    old.total_av AS old_av,
    new.total_av AS new_av,
    old.address AS old_address,
    new.address AS new_address
FROM reporting.vw_top_5 old
FULL OUTER JOIN "z_ci_366-decompose-reporting-viewsctas-and-implmenet-dbt-tests_reporting".vw_top_5 new
ON old.parid = new.pin
    AND old.year = new.year
WHERE old.parid IN (SELECT parid FROM differences)

Yes. All of the values are the same, but addresses can vary. New addresses come from vw_pin_address rather than being constructed within vw_top_5 itself:

old_av new_av old_address new_address
7685953 7685953 0 UNKNOWN UNKNOWN
2725266 2725266 0 183RD ST 183RD ST
1640318 1640318 0 UNKNOWN UNKNOWN
5419988 5419988 0 UNKNOWN UNKNOWN
3134002 3134002 0 UNKNOWN UNKNOWN
7855492 7855492 0 W DUNDEE RD W DUNDEE RD
5735807 5735807 0 UNKNOWN UNKNOWN
5735807 5735807 0 UNKNOWN UNKNOWN
8744961 8744961 0 UNKNOWN UNKNOWN
7080684 7080684 0 UNKNOWN UNKNOWN
2999809 2999809 0 183RD ST 183RD ST
2725266 2725266 0 183RD ST 183RD ST
2355590 2355590 0 N OAK PARK AVE N OAK PARK AVE
4821389 4821389 0 RIDGE RD RIDGE RD
4005278 4005278 0 UNKNOWN UNKNOWN
66090511 66090511 0 UNKNOWN UNKNOWN
5165229 5165229 0 UNKNOWN UNKNOWN
1966845 1966845 0 UNKNOWN UNKNOWN
8828665 8828665 0 P O BOX A-3879 P O BOX A-3879
1966845 1966845 0 UNKNOWN UNKNOWN
8828665 8828665 0 P O BOX A-3879 P O BOX A-3879
7714497 7714497 0 P O BOX A-3879 P O BOX A-3879
7714497 7714497 0 P O BOX A-3879 P O BOX A-3879
2355590 2355590 0 N OAK PARK AVE N OAK PARK AVE
2626619 2626619 0 N OAK PARK AVE N OAK PARK AVE
2355590 2355590 0 N OAK PARK AVE N OAK PARK AVE
1894606 1894606 0 N OAK PARK AVE N OAK PARK AVE
26834144 26834144 0 LAKE COOK RD LAKE COOK RD
36691499 36691499 0 LAKE COOK RD LAKE COOK RD
35896968 35896968 0 LAKE COOK RD LAKE COOK RD
24035764 24035764 0 LAKE COOK RD LAKE COOK RD
36691499 36691499 0 LAKE COOK RD LAKE COOK RD
33978930 33978930 0 LAKE COOK RD LAKE COOK RD
42231002 42231002 0 LAKE COOK RD LAKE COOK RD
33978930 33978930 0 LAKE COOK RD LAKE COOK RD
26651297 26651297 0 LAKE COOK RD LAKE COOK RD
35896971 35896971 0 LAKE COOK RD LAKE COOK RD
38970697 38970697 0 LAKE COOK RD LAKE COOK RD
26210094 26210094 0 LAKE COOK RD LAKE COOK RD
24035765 24035765 0 LAKE COOK RD LAKE COOK RD
14401790 14401790 1601 S CICERO AVE 1601 S CICERO AVE
14932324 14932324 1601 S CICERO AVE 1601 S CICERO AVE
4343773 4343773 0 UNKNOWN UNKNOWN
5401103 5401103 0 CICERO CICERO
4343773 4343773 0 UNKNOWN UNKNOWN
wrridgeway commented 2 months ago

vw_town_sale_history

Aggregate old and new views to count number of rows and average sale_n column by year. Compare aggregations and return and rows that disagree or don't join.

WITH old AS (
    SELECT
        sale_year,
        count(*) AS old_count,
        avg(sale_n) AS old_sales
    FROM reporting.vw_town_sale_history
    GROUP BY sale_year
),

new AS (
    SELECT
        sale_year,
        count(*) AS new_count,
        avg(sale_n) AS new_sales
    FROM "z_ci_366-decompose-reporting-viewsctas-and-implmenet-dbt-tests_reporting".vw_town_sale_history
    GROUP BY sale_year
)

SELECT
    old.sale_year,
    old_count,
    new_count,
    old_sales,
    new_sales
FROM old
FULL OUTER JOIN new
    ON old.sale_year = new.sale_year
WHERE old_count != new_count
    OR old_sales != new_sales
    OR old_count IS NULL
    OR new_count IS NULL

Yields one row for 1973:

sale_year old_count new_count old_sales new_sales
1973 2   1  

This difference is cause by using vw_pin_township_class as the universe of pins by year for the new view rather than vw_pin_sale.

wrridgeway commented 2 months ago

vw_ratio_stats

Compare counts and average ratios across years

WITH new AS (
    SELECT
        year,
        count(*) AS new_n,
        avg(ratio) AS new_ratio
    FROM "z_ci_366-decompose-reporting-viewsctas-and-implmenet-dbt-tests_reporting".vw_ratio_stats
 GROUP BY year
),
OLD AS (
    SELECT
        year,
        count(*) AS old_n,
        avg(ratio) AS old_ratio
  FROM reporting.vw_ratio_stats
  GROUP BY year
)

SELECT
    new.year,
    new_n,
    old_n,
    new_ratio,
    old_ratio
FROM new
FULL OUTER JOIN old
    ON new.year = old.year
year new_n old_n new_ratio old_ratio
2024 57008 57008 1.087891 1.087891
2023 231949 231949 0.949966 0.949966
2022 300219 300219 0.940188 0.940188
2021 205368 205368 0.949351 0.949351
wrridgeway commented 2 months ago

res_report_summary

We can't compare medians very well since they aren't deterministic, but we can compare other non-median columsn across the new and old CTAS (yrblt is a median column which explains the very small diffs).

WITH new AS (
    SELECT
        year,
        LOWER(assessment_stage) AS assessment_stage,
        count(*) AS new_n,
        avg(pin_n) AS new_group_count,
        avg(sale_n) AS new_sale_count,
        avg(sale_max) AS new_sale_max,
        avg(CAST(sale_year AS decimal)) AS new_sale_year,
        avg(yrblt_median) as new_yrblt
    FROM "z_ci_366-decompose-reporting-viewsctas-and-implmenet-dbt-tests_reporting".res_report_summary
 GROUP BY year, assessment_stage
),
OLD AS (
    SELECT
        year,
        LOWER(assessment_stage) AS assessment_stage,
        count(*) AS old_n,
        avg(pin_n) AS old_group_count,
        avg(sale_n) AS old_sale_count,
        avg(sale_max) AS old_sale_max,
        avg(CAST(sale_year AS decimal)) AS old_sale_year,
        avg(yrblt_median) as old_yrblt
  FROM reporting.res_report_summary
  GROUP BY year, assessment_stage
)

SELECT
    new.year,
    new.assessment_stage,
    new_n,
    old_n,
    new_group_count,
    old_group_count,
    new_sale_count,
    old_sale_count,
    new_sale_max,
    old_sale_max,
    new_sale_year,
    old_sale_year,
    old_yrblt,
    new_yrblt
FROM new
FULL OUTER JOIN old
    ON new.year = old.year
    AND new.assessment_stage = old.assessment_stage
year assessment_stage new_n old_n new_group_count old_group_count new_sale_count old_sale_count new_sale_max old_sale_max new_sale_year old_sale_year old_yrblt new_yrblt
2021 bor certified 2950 2950 1987.445 1987.445 96.85544 96.85544 646079.2 646079.2 2020 2020 1959.09 1959.09
2024 model 2952 2952 1983.967 1983.966 75.84529 75.84529 791360.9 791360.9 2023 2023 1959.138 1959.127
2023 assessor certified 2952 2952 1984.211 1984.211 98.52804 98.52804 777477.1 777477.1 2022 2022 1959.097 1959.1
2021 mailed 2950 2950 1987.364 1987.364 96.85544 96.85544 646079.2 646079.2 2020 2020 1959.085 1959.092
2023 mailed 2952 2952 1984.211 1984.211 98.52804 98.52804 777477.1 777477.1 2022 2022 1959.1 1959.099
2024 mailed 268 268 1301.164 1301.164 52.29268 52.29268 920229.9 920229.9 2023 2023 1938.511 1938.501
2023 bor certified 1277 1277 1467.787 1467.787 72.10644 72.10644 777222.7 777222.7 2022 2022 1962.209 1962.205
2022 assessor certified 3352 3352 1748.9 1748.9 106.5958 106.5958 724392.8 724392.8 2021 2021 1957.638 1957.637
2023 model 2947 2947 1982.579 1982.579 98.52804 98.52804 777477.1 777477.1 2022 2022 1959.126 1959.116
2022 model 2947 2947 1984.309 1984.309 106.5958 106.5958 724392.8 724392.8 2021 2021 1959.097 1959.103
2021 assessor certified 2950 2950 1987.445 1987.445 96.85544 96.85544 646079.2 646079.2 2020 2020 1959.09 1959.091
2022 mailed 3352 3352 1748.9 1748.9 106.5958 106.5958 724392.8 724392.8 2021 2021 1957.636 1957.638
2022 bor certified 3352 3352 1747.823 1747.823 106.5958 106.5958 724392.8 724392.8 2021 2021 1957.632 1957.628
wrridgeway commented 2 months ago

@dfsnow As you mentioned, it would be a good idea to use ccao.class_dict in lieu of the manual coding in vw_pin_township_class. But ccao.class_dict needs some updating before that can happen - we need to make sure it incorporates all possible classes from both pardat and asmt_all (they don't overlap completely) and we need to update it to accommodate class codes having A and B endings now (should have anticipated this when adding OA classes).

I think updating ccao.class_dict and refactoring reporting.vw_pin_township_class to use it would be a good PR on its own (it'd have to be pretty fast on the heels of https://github.com/ccao-data/data-architecture/pull/353, or precede it).

dfsnow commented 2 months ago

@dfsnow As you mentioned, it would be a good idea to use ccao.class_dict in lieu of the manual coding in vw_pin_township_class. But ccao.class_dict needs some updating before that can happen - we need to make sure it incorporates all possible classes from both pardat and asmt_all (they don't overlap completely) and we need to update it to accommodate class codes having A and B endings now (should have anticipated this when adding OA classes).

I think updating ccao.class_dict and refactoring reporting.vw_pin_township_class to use it would be a good PR on its own (it'd have to be pretty fast on the heels of #353, or precede it).

Excellent! @wrridgeway why don't you make a separate fast follow issue for this once #353 and this PR are merged.