Closed SashaWeinstein closed 2 years ago
I think what we want is IS DISTINCT FROM
, instead of =
https://wiki.postgresql.org/wiki/Is_distinct_from
Next step is run this on one column and see if it displays expected behavior
I made some progress on this upgrade for categorical data. Will work on continuous fields next.
I looked at ownername to test out process for categorical data. The old version of the mismatch table identified 27904 mismatches in this field between versions 21v3 and 21v4.
Before changing the query to generate the mismatch table, I ran a set of test queries to get the expected behavior. So first I ran
select new.bbl, new.ownername as new_name, prev.ownername as prev_name from export_pluto new
inner join dcp_pluto prev on new.ownername != prev.ownername and
ROUND(new.bbl::float):: bigint = ROUND(prev.bbl::float)::bigint;
This returned 27904 records, as expected.
Then I updated the comparison operator to IS DISTINCT FROM
select new.bbl, new.ownername as new_name, prev.ownername as prev_name from export_pluto new
inner join dcp_pluto prev on new.ownername IS DISTINCT FROM prev.ownername and
ROUND(new.bbl::float):: bigint = ROUND(prev.bbl::float)::bigint;
This query returned 27976 records. This suggests there are 72 (27976 minus 27904) records where ownername is null in 21v3 or 21v4. To double check this I ran
select new.bbl, new.ownername as new_name, prev.ownername as prev_name from export_pluto new
inner join dcp_pluto prev on new.ownername IS DISTINCT FROM prev.ownername and
ROUND(new.bbl::float):: bigint = ROUND(prev.bbl::float)::bigint
WHERE new.ownername is NULL OR prev.ownername is NUll;
And sure enough it returned 72 records. As a final check I ran
select new.bbl, new.ownername as new_name, prev.ownername as prev_name from export_pluto new
inner join dcp_pluto prev on new.ownername != prev.ownername and
ROUND(new.bbl::float):: bigint = ROUND(prev.bbl::float)::bigint
WHERE new.ownername is NULL OR prev.ownername is NUll;
which as expected returned 0 records. (When either ownername is null the != operator always returns true)
So now I'm ready to implement IS DISTINCT FROM
in the qaqc_mismatch.sql
code
Next I looked at the code in qaqc_mismatch.sql
. The code to generate fields looks like
count(nullif(a.ownername = b.ownername, true)) as ownername,
I'm pretty sure the nullif
statement was implemented based on the assumption that the =
operator returns NULL
if either value is NULL
. When we switch to IS DISTINCT FROM
the nullif
can come out.
I found this stackoverflow post with some nice readable syntax I chose to implement. I changed the line to
count(*) FILTER (WHERE a.ownername IS DISTINCT FROM b.ownername) as ownername,
which returns 27976 ownername mismatches!
Success on the categorical changes. Will make this change for all categorical and worry about numeric next
There are type mismatches between the previous and current versions which I address with casting in the qaqc_mismatch.sql
code. There may be a better way to do this casting somewhere upstream, but seeing that we don't know what our final QAQC process will look like I think it makes sense to cast in the query itself for now
Some values for ct2010
in the pluto build on my local machine have decimals, values like 317.03 or 15.01. Not sure what's going on there
I have a roadblock that I need an answer to before moving forward. For continuous fields such as lotarea
, bldgarea
, numbldgs
, lotfront
, etc, do we count 0 as the same as null? That would make the logic much simpler.
There are many continuous fields and maybe we want different logic for each, a lot area of 0 is quite different from a commerical area of 0. Think it probably makes sense to talk about at standup today
I have sql code that works for the continuous comparison. My approach is this
count(*) FILTER (WHERE abs(a.lotarea::int - b.lotarea::int)>=5 OR
((a.lotarea IS NULL)::int + (b.lotarea IS NULL)::int) = 1) as lotarea
The good news is that it's readable, filter for records where there is a 5 or greater difference or where there is exactly one NULL between the two columns. The issue is that it's quite long and I think my lack of postgres skills is keeping me from finding a more succinct solution. I'm curious @AmandaDoyle if you know of a common postgres built-in or pattern that would clean this up?
@SashaWeinstein
Re ct2010
- yes, decimals are correct. There are 2 weird census tract fiends in PLUTO, which you can read about here. We're not exactly sure about the origin or use but didn't feel as if we could deprecate them.
Despite you're solution above feeling long it works and is the most concise. The alternative solutions I have are longer.
qaqc_mismatch.sql
doesn't catch situations where one value is null and the other is an actual value. We want to count these as a mismatch