NYCPlanning / db-facilities

🏭 🏢 🏬 🏣 🏤 🏥 🏦 🏨 🏪 🏫 🏩
https://nycplanning.github.io/db-facilities
0 stars 0 forks source link

QAQC: Report the # of records in the source file vs the number of records from the source that ended up in FacDB #465

Closed AmandaDoyle closed 3 years ago

AmandaDoyle commented 3 years ago

Report the # of records in the source file vs the number of records from the source that ended up in FacDB

This sounds like a pain to build. Let's scope it out before diving in.

SPTKL commented 3 years ago

Why is this necessary? filtering is done both in python and in sql, we do filtering before data even gets into postgres. there are many different case by case scenarios, I don't see how getting this table is helpful because we can't compare anything

SPTKL commented 3 years ago
SELECT 
    a.datasource,
    a.raw_record_counts,
    b.final_record_counts,
    a.raw_record_counts-b.final_record_counts as diff
FROM (
    SELECT source as datasource, count(*) as raw_record_counts
    FROM facdb_base GROUP BY source
) a LEFT JOIN (
    SELECT datasource, count(*) as final_record_counts
    FROM facdb GROUP BY datasource
) b ON a.datasource=b.datasource
AmandaDoyle commented 3 years ago

@SPTKL Given the issues you flagged above of how filtering can happen even before the data are loaded into postgres let's go with comparing facdb_base to facdb, and we'll add documentation as to what it shows, and why it's limited.

My logic below and your logic above get the same result, so implement whichever you think is best.

SELECT a.source, COUNT(a.*), b.facdbcount,COUNT(a.*)-b.facdbcount as diff FROM facdb_base a 
LEFT JOIN (SELECT datasource, COUNT(*) as facdbcount FROM facdb GROUP BY datasource) b
 ON a.source=b.datasource
 GROUP BY a.source, b.facdbcount
 ORDER BY diff DESC;
SPTKL commented 3 years ago

incorporated