Something about the load process led to the acs2017_1yr.census_table_metadata being off in a few cases.
Specifically, the value for denominator_column_id was incorrectly set. This column is used to indicate when it's sensible to represent values from the table as percentages of a whole -- its value should be null when values in a table represent medians, aggregates, means, or other values which don't work that way.
Its value should also be null for tables with only one column, such as B01003: Total Population among others.
I fixed this directly in the PostgreSQL database for the running system, but we should review, possibly implement some of the sanity checks below, and possibly create a new PSQL dump (lower priority)
Specifically, separate from the aggregate/mean/etc case, this query should, logically, always return zero rows:
SELECT tmd.*
FROM acs2017_1yr.census_table_metadata tmd,
(SELECT table_id, count(*)
FROM acs2017_1yr.census_column_metadata
GROUP BY table_id
HAVING count(*) = 1) as cmd
WHERE tmd.table_id = cmd.table_id
AND tmd.denominator_column_id IS NOT NULL;
This command will ensure that the above command returns zero rows. It could probably be included as part of the data process after census_table_metadata is populated:
UPDATE acs2017_1yr.census_table_metadata tmd
SET denominator_column_id = null
FROM (SELECT table_id, count(*)
FROM acs2017_1yr.census_column_metadata
GROUP BY table_id
HAVING count(*) = 1) as cmd
WHERE tmd.table_id = cmd.table_id
AND denominator_column_id IS NOT NULL;
This command will show which tables don't have a denominator column, but have more than one row, and don't match the most obvious table naming conventions which are a sign that the data shouldn't be "ratio'd". It is likely to return a few rows, and for now all I can think is that a person can scan them and think about whether there's any likely issue...
SELECT tmd.*
FROM acs2017_1yr.census_table_metadata tmd,
(SELECT table_id, count(*)
FROM acs2017_1yr.census_column_metadata
GROUP BY table_id
HAVING count(*) > 1) as cmd
WHERE tmd.table_id =cmd.table_id
AND tmd.denominator_column_id IS NULL
AND LOWER(table_title) NOT LIKE '%aggregate%'
AND LOWER(table_title) NOT LIKE '%median%'
AND LOWER(table_title) NOT LIKE '%mean%';
Something about the load process led to the
acs2017_1yr.census_table_metadata
being off in a few cases.Specifically, the value for
denominator_column_id
was incorrectly set. This column is used to indicate when it's sensible to represent values from the table as percentages of a whole -- its value should be null when values in a table represent medians, aggregates, means, or other values which don't work that way.Its value should also be null for tables with only one column, such as
B01003: Total Population
among others.I fixed this directly in the PostgreSQL database for the running system, but we should review, possibly implement some of the sanity checks below, and possibly create a new PSQL dump (lower priority)
Specifically, separate from the aggregate/mean/etc case, this query should, logically, always return zero rows:
This command will ensure that the above command returns zero rows. It could probably be included as part of the data process after
census_table_metadata
is populated:This command will show which tables don't have a denominator column, but have more than one row, and don't match the most obvious table naming conventions which are a sign that the data shouldn't be "ratio'd". It is likely to return a few rows, and for now all I can think is that a person can scan them and think about whether there's any likely issue...