censusreporter / censusreporter

Census Reporter is a Knight News Challenge-funded project to make it easier for journalists to write stories using information from the U.S. Census bureau.
http://censusreporter.org
MIT License

More accurate names for block groups #58

Open JoeGermuska opened 10 years ago

JoeGermuska commented 10 years ago

On UserVoice, a user pointed out that block group numbers are not unique within counties. Therefore, our names are not unique (see screen shot).

[screenshot]

Ideally, as part of loading the ACS 2013 5-year data, we should change the name construction to include the Census tract number.

JoeGermuska commented 5 years ago

This came up again recently in a UserVoice comment about the data exports, and it remains an issue for the community analyzing block group data.

JoeGermuska commented 5 years ago

Proposed update, based on this example:

Tract 8067, Block Group 1, Cook County, IL

JoeGermuska commented 4 years ago

A bit harder than it looks, because the name is set via a SQL query against the TIGER block group table, which has only the six-digit `tractce` value. It would take clever PostgreSQL, a stored procedure, or a second pass over the data to handle the leading zeros and the fact that, in print, the last two digits of `tractce` appear after a decimal point, and only when they are non-zero.

https://github.com/censusreporter/census-postgres-scripts/blob/master/13_index_tiger_2018.sql#L281-L296

We could join with the tract table and use either NAME or NAMELSAD, plus some simple PostgreSQL string manipulation.
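A rough sketch of the join approach. This assumes a `tiger2018` schema with the standard TIGER columns (`statefp`, `countyfp`, `tractce`, `blkgrpce`, and `namelsad` on the tract table); the actual table and column names in our load may differ:

```sql
-- Sketch: name block groups via the tract table's NAMELSAD
-- ("Census Tract 8067"), avoiding tractce formatting entirely.
SELECT t.namelsad || ', Block Group ' || bg.blkgrpce AS bg_name
FROM tiger2018.bg AS bg
JOIN tiger2018.tract AS t
  ON bg.statefp = t.statefp
 AND bg.countyfp = t.countyfp
 AND bg.tractce = t.tractce;
```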

Alternatively, this stored procedure produces the exact same value as NAME without needing to join:

CREATE FUNCTION format_tract(tract_ce text)
  RETURNS text
AS $$
  # tractce is six digits; the last two are the decimal part of the
  # printed tract number and are shown only when they are non-zero.
  if tract_ce.endswith('00'):
    # No decimal part: strip leading zeros, then drop the trailing '00'.
    return tract_ce.lstrip('0')[:-2]
  # Decimal part: e.g. '806701' -> '8067.01'
  return tract_ce[:4].lstrip('0') + '.' + tract_ce[-2:]
$$ LANGUAGE plpythonu;
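The same logic in plain Python, handy for sanity-checking the formatting rule outside the database (the example `tractce` values are illustrative):

```python
def format_tract(tract_ce):
    """Format a six-digit TIGER tractce the way printed tract names do:
    the last two digits go after a decimal point, only if non-zero."""
    if tract_ce.endswith('00'):
        # No decimal part: strip leading zeros, then the trailing '00'.
        return tract_ce.lstrip('0')[:-2]
    # Decimal part: keep the last two digits after the point.
    return tract_ce[:4].lstrip('0') + '.' + tract_ce[-2:]

print(format_tract('806700'))  # 8067
print(format_tract('806701'))  # 8067.01
print(format_tract('000100'))  # 1
```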

@iandees do you have an opinion? Do you think we could close this with the December ACS 2015-2019 5-year data load?

iandees commented 4 years ago

I'd rather not use stored procedures, since they make the database harder to maintain. If we want to run Python, we can just run it over the CSV before we load it into Postgres.