Closed AmandaDoyle closed 3 years ago
Why is this necessary? filtering is done both in python and in sql, we do filtering before data even gets into postgres. there are many different case by case scenarios, I don't see how getting this table is helpful because we can't compare anything
SELECT
a.datasource,
a.raw_record_counts,
b.final_record_counts,
a.raw_record_counts-b.final_record_counts as diff
FROM (
SELECT source as datasource, count(*) as raw_record_counts
FROM facdb_base GROUP BY source
) a LEFT JOIN (
SELECT datasource, count(*) as final_record_counts
FROM facdb GROUP BY datasource
) b ON a.datasource=b.datasource
@SPTKL Given the issues you flagged above of how filtering can happen even before the data are loaded into postgres let's go with comparing facdb_base to facdb, and we'll add documentation as to what it shows, and why it's limited.
My logic below and your logic above get the same result, so implement whichever you think is best.
SELECT a.source, COUNT(a.*), b.facdbcount,COUNT(a.*)-b.facdbcount as diff FROM facdb_base a
LEFT JOIN (SELECT datasource, COUNT(*) as facdbcount FROM facdb GROUP BY datasource) b
ON a.source=b.datasource
GROUP BY a.source, b.facdbcount
ORDER BY diff DESC;
incorporated
Report the # of records in the source file vs the number of records from the source that ended up in FacDB
This sounds like a pain to build. Let's scope it out before diving in.