NYCPlanning / db-pluto

🔭 PLUTO and MapPLUTO
https://nycplanning.github.io/db-pluto
38 stars 13 forks source link

QAQC: Count `null` as different from existing value #316

Closed SashaWeinstein closed 2 years ago

SashaWeinstein commented 2 years ago

qaqc_mismatch.sql doesn't catch situations where one value is null and the other is an actual value. We want to count these as a mismatch

SashaWeinstein commented 2 years ago

I think what we want is IS DISTINCT FROM, instead of = https://wiki.postgresql.org/wiki/Is_distinct_from Next step is run this on one column and see if it displays expected behavior

SashaWeinstein commented 2 years ago

I made some progress on this upgrade for categorical data. Will work on continuous fields next.

I looked at ownername to test out process for categorical data. The old version of the mismatch table identified 27904 mismatches in this field between versions 21v3 and 21v4.

Before changing the query to generate the mismatch table, I ran a set of test queries to get the expected behavior. So first I ran

select new.bbl, new.ownername as new_name, prev.ownername as prev_name from export_pluto new 
inner join dcp_pluto prev on new.ownername != prev.ownername and 
ROUND(new.bbl::float):: bigint = ROUND(prev.bbl::float)::bigint;

This returned 27904 records, as expected.

Then I updated the comparison operator to IS DISTINCT FROM

select new.bbl, new.ownername as new_name, prev.ownername as prev_name from export_pluto new 
inner join dcp_pluto prev on new.ownername IS DISTINCT FROM prev.ownername and 
ROUND(new.bbl::float):: bigint = ROUND(prev.bbl::float)::bigint;

This query returned 27976 records. This suggests there are 72 (27976 minus 27904) records where ownername is null in 21v3 or 21v4. To double check this I ran

select new.bbl, new.ownername as new_name, prev.ownername as prev_name from export_pluto new 
inner join dcp_pluto prev on new.ownername IS DISTINCT FROM prev.ownername and 
ROUND(new.bbl::float):: bigint = ROUND(prev.bbl::float)::bigint 
WHERE new.ownername is NULL OR prev.ownername is NUll;

And sure enough it returned 72 records. As a final check I ran

select new.bbl, new.ownername as new_name, prev.ownername as prev_name from export_pluto new 
inner join dcp_pluto prev on new.ownername != prev.ownername and
ROUND(new.bbl::float):: bigint = ROUND(prev.bbl::float)::bigint 
WHERE new.ownername is NULL OR prev.ownername is NUll;

which as expected returned 0 records. (When either ownername is null the != operator always returns true)

So now I'm ready to implement IS DISTINCT FROM in the qaqc_mismatch.sql code

SashaWeinstein commented 2 years ago

Next I looked at the code in qaqc_mismatch.sql. The code to generate fields looks like

count(nullif(a.ownername = b.ownername, true)) as ownername,

I'm pretty sure the nullif statement was implemented based on the assumption that the = operator returns NULL if either value is NULL. When we switch to IS DISTINCT FROM the nullif can come out.

I found this stackoverflow post with some nice readable syntax I chose to implement. I changed the line to

count(*) FILTER (WHERE a.ownername IS DISTINCT FROM b.ownername) as ownername,

which returns 27976 ownername mismatches!

Success on the categorical changes. Will make this change for all categorical and worry about numeric next

SashaWeinstein commented 2 years ago

There are type mismatches between the previous and current versions which I address with casting in the qaqc_mismatch.sql code. There may be a better way to do this casting somewhere upstream, but seeing that we don't know what our final QAQC process will look like I think it makes sense to cast in the query itself for now

SashaWeinstein commented 2 years ago

Some values for ct2010 in the pluto build on my local machine have decimals, values like 317.03 or 15.01. Not sure what's going on there

SashaWeinstein commented 2 years ago

I have a roadblock that I need an answer to before moving forward. For continuous fields such as lotarea, bldgarea, numbldgs, lotfront, etc, do we count 0 as the same as null? That would make the logic much simpler. There are many continuous fields and maybe we want different logic for each, a lot area of 0 is quite different from a commerical area of 0. Think it probably makes sense to talk about at standup today

SashaWeinstein commented 2 years ago

I have sql code that works for the continuous comparison. My approach is this

count(*) FILTER (WHERE abs(a.lotarea::int - b.lotarea::int)>=5 OR 
    ((a.lotarea IS NULL)::int + (b.lotarea IS NULL)::int) = 1) as lotarea

The good news is that it's readable, filter for records where there is a 5 or greater difference or where there is exactly one NULL between the two columns. The issue is that it's quite long and I think my lack of postgres skills is keeping me from finding a more succinct solution. I'm curious @AmandaDoyle if you know of a common postgres built-in or pattern that would clean this up?

AmandaDoyle commented 2 years ago

@SashaWeinstein Re ct2010 - yes, decimals are correct. There are 2 weird census tract fiends in PLUTO, which you can read about here. We're not exactly sure about the origin or use but didn't feel as if we could deprecate them.

Despite you're solution above feeling long it works and is the most concise. The alternative solutions I have are longer.