WITH tmp AS(
SELECT a.geotype, a.geoid, a.c, a.e, a.m, a.p, a.z, a.variable, b.c, b.e, b.m, b.p, b.z
FROM housing a
RIGHT JOIN staging.housing b
ON a.geotype = b.geotype
AND a.geogname = b.geogname
AND a.geoid = b.geoid
AND a.dataset = b.dataset
AND a.variable = b.variable
WHERE (b.e = b.p
AND b.e IS NOT NULL
AND a.p != b.p)
OR (b.m = b.z
AND b.m IS NOT NULL
AND a.z != b.z
)
)
SELECT geotype, variable, COUNT(*) FROM tmp
GROUP BY geotype, variable
ORDER BY COUNT(*) DESC;
After exploring the datasets, I noticed there are many records having the same value for e and p and only a couple of them have the same value for e and z.
Majority of these records come from geotype 'CT2010', some are from 'Boro2010' and a handful from 'City2010'
on average, only 5% of the total records are affected by this issue.
for the records having e = p, their p values are supposed to be '100'
geotype
'CT2010', some are from 'Boro2010' and a handful from 'City2010'