censusreporter / census-postgres-scripts

Scripts used to set up census-postgres on an Amazon EC2 instance.
MIT License
65 stars 34 forks source link

Error assigning denominator_column_id in acs2017_1yr #19

Open JoeGermuska opened 6 years ago

JoeGermuska commented 6 years ago

Something about the load process led to the acs2017_1yr.census_table_metadata being off in a few cases.

Specifically, the value for denominator_column_id was incorrectly set. This column is used to indicate when it's sensible to represent values from the table as percentages of a whole -- its value should be null when values in a table represent medians, aggregates, means, or other values which don't work that way.

Its value should also be null for tables with only one column, such as B01003: Total Population among others.

I fixed this directly in the PostgreSQL database for the running system, but we should review, possibly implement some of the sanity checks below, and possibly create a new PSQL dump (lower priority)

Specifically, separate from the aggregate/mean/etc case, this query should, logically, always return zero rows:

SELECT tmd.* 
    FROM acs2017_1yr.census_table_metadata tmd, 
    (SELECT table_id, count(*) 
        FROM acs2017_1yr.census_column_metadata 
        GROUP BY table_id 
        HAVING count(*) = 1) as cmd 
    WHERE tmd.table_id = cmd.table_id 
      AND tmd.denominator_column_id IS NOT NULL;

This command will ensure that the above command returns zero rows. It could probably be included as part of the data process after census_table_metadata is populated:

UPDATE acs2017_1yr.census_table_metadata tmd
SET denominator_column_id = null 
FROM (SELECT table_id, count(*) 
        FROM acs2017_1yr.census_column_metadata 
        GROUP BY table_id 
        HAVING count(*) = 1) as cmd
    WHERE tmd.table_id = cmd.table_id
    AND denominator_column_id IS NOT NULL;

This command will show which tables don't have a denominator column, but have more than one row, and don't match the most obvious table naming conventions which are a sign that the data shouldn't be "ratio'd". It is likely to return a few rows, and for now all I can think is that a person can scan them and think about whether there's any likely issue...

SELECT tmd.* 
    FROM acs2017_1yr.census_table_metadata tmd, 
    (SELECT table_id, count(*) 
        FROM acs2017_1yr.census_column_metadata 
        GROUP BY table_id 
        HAVING count(*) > 1) as cmd 
    WHERE tmd.table_id =cmd.table_id 
      AND tmd.denominator_column_id IS NULL
      AND LOWER(table_title) NOT LIKE '%aggregate%'
      AND LOWER(table_title) NOT LIKE '%median%'
      AND LOWER(table_title) NOT LIKE '%mean%';