censusreporter / census-postgres-scripts

Scripts used to set up census-postgres on an Amazon EC2 instance.
MIT License

geoheader for 2009_5yr has an invalid character encoding #15

Open peterwang998 opened 8 years ago

peterwang998 commented 8 years ago

After unzipping the 2009_5yr data manually, it turns out one of the geoheader files has an invalid character encoding:

(I also removed the -q flag to see where it failed for importing geoheader)

psql:create_geoheader.sql:1: NOTICE:  table "geoheader" does not exist, skipping
Creating geoheader comments
Creating temp geoheader
psql:create_tmp_geoheader.sql:1: NOTICE:  table "tmp_geoheader" does not exist, skipping
Creating temp import tables
Importing geoheader
COPY 4946
COPY 7117
COPY 14397
COPY 32732
COPY 2707
COPY 12598
COPY 1899
COPY 15952
COPY 11356
psql:import_geoheader.sql:10: ERROR:  invalid byte sequence for encoding "UTF8": 0xf1 0x6f 0x6e 0x63
CONTEXT:  COPY tmp_geoheader, line 965
Failed importing geoheader.
iandees commented 8 years ago

I'm pretty sure these files are all latin-1.
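That matches the failing bytes. As a quick sanity check (a sketch, not from the thread): the sequence `0xf1 0x6f 0x6e 0x63` from the COPY error is invalid UTF-8 but decodes cleanly as Latin-1, which `iconv` can confirm:

```shell
# 0xf1 (octal 0361) is the first byte from the COPY error.
# In UTF-8, 0xf1 starts a four-byte sequence, but the following 'o'
# is not a valid continuation byte, so the input is rejected:
printf '\361onc' | iconv -f utf8 -t utf8 || echo "invalid UTF-8"
# As Latin-1 the same bytes decode cleanly (0xf1 = "ñ"):
printf '\361onc' | iconv -f latin1 -t utf8
```

The Latin-1 pass produces "ñonc", consistent with a Spanish-language place name being truncated mid-word in the geoheader file.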

peterwang998 commented 8 years ago

Good call!

WARNING: CHANGING ENCODING PREVENTS THE LOADING OF METADATA (which is encoded in UTF-8)

Just found this for future reference: https://www.census.gov/geo/maps-data/data/tiger/char_encoding.html

Until 2014, the character encoding was Latin-1. Starting from 2015, the character encoding changed to UTF-8.

For future reference, if anyone else is having this issue, the safe way to resolve it is to drop the database and then create a new database with the correct encoding (LATIN1),

or use

sudo -u postgres psql -c "update pg_database set encoding = pg_char_to_encoding('LATIN1') where datname = 'census'"

if you're OK with an ad hoc solution.
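For the drop-and-recreate route, a sketch (database name `census` assumed to match the command above; this is DDL against a live cluster, so adjust to your setup):

```shell
# Drop and recreate the database with Latin-1 encoding.
# TEMPLATE template0 is required because template1 is typically
# already UTF-8 encoded, and the locale must be compatible with
# LATIN1 ('C' works everywhere):
sudo -u postgres psql -c "DROP DATABASE IF EXISTS census;"
sudo -u postgres psql -c "CREATE DATABASE census ENCODING 'LATIN1' TEMPLATE template0 LC_COLLATE 'C' LC_CTYPE 'C';"
```

Unlike the `pg_database` update above, this keeps the catalog consistent with how the data is actually stored.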

iandees commented 8 years ago

If you're trying to get at the data, you can just download the SQL dumps here: http://censusreporter.tumblr.com/post/73727555158/easier-access-to-acs-data

peterwang998 commented 8 years ago

Unfortunately, not all the tables are available as a snapshot.

peterwang998 commented 8 years ago

Now I'm getting a duplicate key value error:

COPY 12012
COPY 47601
COPY 6296
COPY 5721
COPY 9645
COPY 2541
COPY 23230
COPY 1571
COPY 29396
COPY 14580
COPY 34314
COPY 17149
COPY 9524
COPY 19446
COPY 12131
COPY 15782
COPY 10689
COPY 12362
COPY 12445
COPY 37573
COPY 27692
Parsing geoheader
psql:parse_tmp_geoheader.sql:55: ERROR:  duplicate key value violates unique constraint "geoheader_pkey"
DETAIL:  Key (geoid)=(04000US49) already exists.
Failed parsing geoheader.
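One way to see where the duplicates come from before the parse step (a sketch: `tmp_geoheader` is the table named in the log, but the `acs2009_5yr` schema name is an assumption based on how census-postgres names its schemas):

```shell
# List every geoid that appears more than once in the temp table,
# with a count of how many copies exist:
sudo -u postgres psql -d census -c "
  SELECT geoid, COUNT(*)
  FROM acs2009_5yr.tmp_geoheader
  GROUP BY geoid
  HAVING COUNT(*) > 1;"
```

If `04000US49` (Utah's state-level geoid) shows up twice, two of the imported state files likely contain overlapping rows.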
iandees commented 8 years ago

What tables aren't available in the snapshot?

peterwang998 commented 8 years ago

acs2009_5yr

iandees commented 8 years ago

I think you're running into the problems I ran into and didn't have time to work through earlier (since Census Reporter wasn't going to use 2009 data).

I would be extremely happy to review a pull request that fixes this, though!

peterwang998 commented 8 years ago

I'll try my best, though I'm not sure I can find the duplicate key...