ccao-data / data-architecture

Codebase for CCAO data infrastructure construction and management
https://ccao-data.github.io/data-architecture/
5 stars 3 forks source link

Fix duplicates in PIN proximity view #302

Closed dfsnow closed 5 months ago

dfsnow commented 5 months ago

This PR fixes some subtle misbehavior in the PIN-to-PIN distance calculation CTAS queries.

Issue

proximity.dist_pin_to_pin contains occasional duplicate rows, such as PIN10 3120118009. This leads to duplicate rows in downstream views such as proximity.vw_pin10_proximity and the model views.

Cause

I initially thought the cause was somehow due to changing centroids over time. However, it seems that the duplicates are instead the result of some tricky anti-join behavior.

Let's look at 3120118009 as an example. This PIN exists in the results of the "first pass" CTAS with a buffer radius of 1km (proximity.dist_pin_to_pin_1km). However, it only has rows for 2004-2023 (even though the parcel data goes back to 2000). This is because prior to 2004, no PIN existed that was closer than 1km. Its nearest neighbor (3120118009) was created in 2004.

This becomes a problem due to the anti-join that feeds the "second stage" CTAS, as:

SELECT
    pcl.pin10,
    pcl.year,
    pcl.x_3435,
    pcl.y_3435,
    dist_pin_to_pin_1km.pin10 AS matching_pin10
FROM spatial.parcel AS pcl
LEFT JOIN proximity.dist_pin_to_pin_1km AS dist_pin_to_pin_1km
    ON pcl.pin10 = dist_pin_to_pin_1km.pin10
    AND pcl.year = dist_pin_to_pin_1km.year
WHERE dist_pin_to_pin_1km.pin10 IS NULL
AND pcl.pin10 = '3120102002'

returns 4 rows (2000-2003) because the target PIN doesn't have matches for those years. These 4 rows get fed to the nearest_pin_neighbors() macro. That macro searches for nearest neighbors using the most recent input year's X,Y data as an origin. In the case of our test PIN, the most recent input year is 2003, so it searches from that year and matches to all parcel years.

HOWEVER, the parcel location of this PIN changes microscopically in 2005, which results in two different sets of distances: one from the 2023 origin in proximity.dist_pin_to_pin_1km and one from the 2003 origin in proximity.dist_pin_to_pin_10km.

The resulting rows aren't distinct and therefore do not get de-duplicated by the UNION in the dist_pin_to_pin view. FIN.

Solution

I did I super quick refactor to simplify the nearest_pin_neighbors() macro. I used the query planner hack that I discovered while building dist_to_nearest_geometry(). The result is a nearest PIN set that uses the coords for every year, not just the most recent parcel coords. The new solution covers every PIN and runs in about 30 minutes total.

Row Counts

year parcel_count old_dist_1km_count old_dist_10km_count old_dist_total_count new_dist_100m_count new_dist_500m_count new_dist_10km_count new_dist_total_count
2023 1417110 1412213 2101 1414304 876066 532390 8654 1417110
2022 1416347 1411441 2097 1413529 875453 532212 8682 1416347
2021 1415410 1411079 2089 1413157 874689 532056 8665 1415410
2020 1415664 1410810 2076 1412865 875645 531354 8665 1415664
2019 1415193 1410353 2079 1412408 875273 531253 8667 1415193
2018 1414731 1410106 2050 1412132 875081 530988 8662 1414731
2017 1414896 1410312 2045 1412333 875456 530780 8660 1414896
2016 1414449 1410026 2026 1412026 875442 530346 8661 1414449
2015 1414155 1409565 2085 1411613 873923 531576 8656 1414155
2014 1413788 1409117 2084 1411165 873683 531464 8641 1413788
2013 1413759 1409044 2084 1411088 873705 531406 8648 1413759
2012 1413732 1408914 2078 1410949 873718 531380 8634 1413732
2011 1413698 1408981 2078 1411013 874020 531034 8644 1413698
2010 1413699 1408982 2078 1411014 874024 531031 8644 1413699
2009 1413740 1408992 2068 1411015 874301 530801 8638 1413740
2008 1413206 1408946 2063 1410960 874462 530101 8643 1413206
2007 1410605 1406484 2073 1408497 873262 528673 8670 1410605
2006 1408738 1404744 2047 1406729 872448 527585 8705 1408738
2005 1405047 1401268 2008 1403219 870787 525524 8736 1405047
2004 1401506 1397916 2008 1399855 869513 523200 8793 1401506
2003 1398092 1394607 2009 1396551 868510 520731 8851 1398092
2002 1395387 1391982 1994 1393919 867278 519238 8871 1395387
2001 1393174 1389845 1996 1391783 866783 517524 8867 1393174
2000 1391620 1388355 1986 1390284 866255 516449 8916 1391620