This PR fixes some subtle misbehavior in the PIN-to-PIN distance calculation CTAS queries.
Issue
proximity.dist_pin_to_pin contains occasional duplicate rows, such as PIN10 3120118009. This leads to duplicate rows in downstream views such as proximity.vw_pin10_proximity and the model views.
Cause
I initially thought the duplicates were caused by centroids changing over time. However, they are instead the result of some tricky anti-join behavior.
Let's look at 3120118009 as an example. This PIN exists in the results of the "first pass" CTAS with a buffer radius of 1km (proximity.dist_pin_to_pin_1km). However, it only has rows for 2004-2023, even though the parcel data goes back to 2000. This is because prior to 2004, no other PIN existed within 1km of it. Its nearest neighbor (3120118009) was created in 2004.
This becomes a problem because of the anti-join that feeds the "second stage" CTAS. The following query:
    SELECT
        pcl.pin10,
        pcl.year,
        pcl.x_3435,
        pcl.y_3435,
        dist_pin_to_pin_1km.pin10 AS matching_pin10
    FROM spatial.parcel AS pcl
    LEFT JOIN proximity.dist_pin_to_pin_1km AS dist_pin_to_pin_1km
        ON pcl.pin10 = dist_pin_to_pin_1km.pin10
        AND pcl.year = dist_pin_to_pin_1km.year
    WHERE dist_pin_to_pin_1km.pin10 IS NULL
        AND pcl.pin10 = '3120102002'
returns 4 rows (2000-2003) because the target PIN doesn't have matches for those years. These 4 rows get fed to the nearest_pin_neighbors() macro. That macro searches for nearest neighbors using the most recent input year's X,Y data as an origin. In the case of our test PIN, the most recent input year is 2003, so it searches from that year and matches to all parcel years.
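To make the anti-join behavior concrete, here is a minimal sketch using Python's built-in sqlite3 module. The tables `parcel` and `dist_1km`, the PIN "A", and the year range are toy stand-ins I made up for illustration, not the real Athena tables or data. It only shows that a PIN with first-pass matches for some years still leaks its unmatched years through the anti-join.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
    CREATE TABLE parcel (pin10 TEXT, year INTEGER);
    CREATE TABLE dist_1km (pin10 TEXT, year INTEGER);
""")

# Toy parcel history: PIN "A" exists in every year from 2000 through 2006.
con.executemany(
    "INSERT INTO parcel VALUES (?, ?)", [("A", y) for y in range(2000, 2007)]
)
# Toy first pass: a neighbor within the buffer only exists from 2004 onward.
con.executemany(
    "INSERT INTO dist_1km VALUES (?, ?)", [("A", y) for y in range(2004, 2007)]
)

# Same shape as the anti-join above: keep parcel rows with no first-pass match.
leftover = con.execute("""
    SELECT pcl.pin10, pcl.year
    FROM parcel AS pcl
    LEFT JOIN dist_1km AS d
        ON pcl.pin10 = d.pin10
        AND pcl.year = d.year
    WHERE d.pin10 IS NULL
    ORDER BY pcl.year
""").fetchall()

print(leftover)  # the 2000-2003 rows for "A" survive the anti-join
```

In this sketch, the PIN is neither fully matched nor fully unmatched, so part of its history is passed on to the second stage even though the first stage already produced rows for it.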
HOWEVER, the parcel location of this PIN changes microscopically in 2005, which results in two different sets of distances: one from the 2023 origin in proximity.dist_pin_to_pin_1km and one from the 2003 origin in proximity.dist_pin_to_pin_10km.
The resulting rows aren't distinct and therefore do not get de-duplicated by the UNION in the dist_pin_to_pin view. FIN.
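The UNION behavior can be reproduced with a small sketch, again using sqlite3 with made-up data. The column names `nearest_pin10` and `dist_ft` and all values are hypothetical. Two rows that share a PIN and year but differ by a tiny distance are distinct as far as UNION is concerned, so both survive.

```python
import sqlite3
from collections import Counter

con = sqlite3.connect(":memory:")
con.executescript("""
    CREATE TABLE dist_1km (pin10 TEXT, year INTEGER, nearest_pin10 TEXT, dist_ft REAL);
    CREATE TABLE dist_10km (pin10 TEXT, year INTEGER, nearest_pin10 TEXT, dist_ft REAL);

    -- PIN "A": same neighbor, but measured from two slightly different origins.
    INSERT INTO dist_1km VALUES ('A', 2010, 'B', 512.30);
    INSERT INTO dist_10km VALUES ('A', 2010, 'B', 512.31);

    -- PIN "C": identical rows in both tables, which UNION does collapse.
    INSERT INTO dist_1km VALUES ('C', 2010, 'D', 100.0);
    INSERT INTO dist_10km VALUES ('C', 2010, 'D', 100.0);
""")

rows = con.execute("""
    SELECT pin10, year, nearest_pin10, dist_ft FROM dist_1km
    UNION
    SELECT pin10, year, nearest_pin10, dist_ft FROM dist_10km
    ORDER BY pin10, dist_ft
""").fetchall()

# Count rows per (pin10, year) to find keys that UNION failed to de-duplicate.
per_key = Counter((pin, year) for pin, year, _, _ in rows)
duplicated = [key for key, n in per_key.items() if n > 1]

print(rows)
print(duplicated)
```

A check like the `Counter` step, grouped on PIN and year, is one way to spot this class of duplicate in the combined view.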
Solution
I did a super quick refactor to simplify the nearest_pin_neighbors() macro. I used the query planner hack that I discovered while building dist_to_nearest_geometry(). The result is a nearest PIN set that uses the coords for every year, not just the most recent parcel coords. The new solution covers every PIN and runs in about 30 minutes total.
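As a rough illustration of the "coords for every year" idea, here is a brute-force sketch in plain Python. It is not the macro itself and does not use the query planner hack. The function `nearest_by_year`, the toy coordinates, and the choice to match only against parcels from the same year are all my assumptions for the example.

```python
import math

# (pin10, year) -> (x, y). Toy coordinates, not real parcel data.
parcels = {
    ("A", 2003): (0.0, 0.0),
    ("A", 2005): (0.5, 0.0),  # "A" shifts slightly in 2005
    ("B", 2003): (100.0, 0.0),
    ("B", 2005): (100.0, 0.0),
    ("C", 2005): (0.0, 30.0),  # "C" is created in 2005
}


def nearest_by_year(parcels):
    """Return the nearest other PIN for each (pin10, year), using that year's coords."""
    result = {}
    for (pin, year), (x, y) in parcels.items():
        # Candidate neighbors: every other PIN present in the same year.
        candidates = [
            (math.hypot(x - ox, y - oy), other)
            for (other, other_year), (ox, oy) in parcels.items()
            if other_year == year and other != pin
        ]
        if candidates:
            dist, other = min(candidates)
            result[(pin, year)] = (other, dist)
    return result


nearest = nearest_by_year(parcels)
for key in sorted(nearest):
    print(key, nearest[key])
```

Because each (PIN, year) pair uses its own origin and appears exactly once in the output, there is no second set of distances from a stale origin to collide with.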
Row Counts