CreatingData / Historical-Populations

Historical US City populations
http://creatingdata.us/datasets/US-cities

Potential Wikipedia City Population Typos #1

Open jamesfeigenbaum opened 6 years ago

jamesfeigenbaum commented 6 years ago

Not sure if this is the right place for this, Ben, but over the years I've collected (from PDFs of census city population tables) city populations for 1890 to 1940 for a decent set of cities. Of the 24,711 city × year observations that both of us have data for, only 1,325 disagree. This is some combination of data entry errors on my part, data entry errors in Wikipedia, and just weird or bad merging of city names (I did this quick and dirty, so city names showing up multiple times in a state might be an issue). Plus, a bunch of CT cities are listed as both cities and towns in the raw PDF, and I think I punched in a different row than what is on Wikipedia.
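A minimal sketch of that comparison, using toy stand-ins for the two datasets (these particular city names and values are illustrative, not the actual spreadsheet contents): index each dataset by (city, state, year), intersect the keys, and count where the recorded populations differ.

```python
# Toy stand-ins for the two datasets; the real inputs are the keyed
# census spreadsheets and the site's merged series.
mine = {
    ("manhattan", "KS", 1900): 3438,
    ("mcpherson", "KS", 1900): 2996,
    ("topeka",    "KS", 1900): 33608,
}
wiki = {
    ("manhattan", "KS", 1900): 2996,  # swapped with McPherson on this side
    ("mcpherson", "KS", 1900): 3438,
    ("topeka",    "KS", 1900): 33608,
}

# Compare only city-year observations present in both datasets.
shared = mine.keys() & wiki.keys()
disagreements = sorted(k for k in shared if mine[k] != wiki[k])
print(len(disagreements), "of", len(shared), "city-year observations disagree")
```

The same key-intersection approach scales to the full 24,711 shared observations once both files are loaded into dicts keyed the same way.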

The list of disagreements is here: https://www.dropbox.com/s/w8wisqt27mir2hh/wiki_edits.csv?dl=0 and links to the raw PDFs are below. I wonder if we could get some interested (and compulsive) Wikipedia editors to correct as many of the Wikipedia city tables as possible (some fraction of the 1,325, but not sure what %). My understanding of this project is that any edits on Wikipedia will eventually flow through to this data, right?

Raw PDFs

1910 pdf with city populations: https://www.dropbox.com/s/4vfuwzkh3hmysfp/census_1910.pdf?dl=0

1930 and 1940 pdfs with city populations: https://www.dropbox.com/sh/ia56uz1bs13oaep/AADjHoxKJ1N3WS5vkNEGGoRla?dl=0

sergiocorreia commented 6 years ago

Hi James,

I'm not Ben, but I suspect it's not entirely trivial to propagate changes from Wikipedia to here. Also, Jacob Alperin-Sheriff has a bot/script that populates Wikipedia from an external source, so an alternative might be to explore fixing the data here and then running the script on the changes (we might want to ask him).

bmschmidt commented 6 years ago

Thanks James! A 5% error rate is just slightly higher than I'd have expected, but that sounds right given name-merging issues. The Connecticut city-town distinction is something that shows up in the wiki-CESTA diff as well, and in New York and possibly other states.

Re propagating from Wikipedia: it will happen whenever I re-download Wikipedia and run the scripts. It's possible I could patch the code together to do this automatically once a month or so, but so far I've only done it twice, because downloading all of Wikipedia is a pain. My impression from the last week is that there's some talk in the wikiverse about having Wikidata drive the data tables on Wikipedia, which would certainly make this kind of work easier than parsing the variety of population tables online.

The data you're comparing to is not solely the Wikipedia data: it also includes some from CESTA and from Alperin-Sheriff that is not included in Wikipedia. (My impression is that A-S was careful about not overwriting existing populations, which sometimes means that wiki has, say, the populations of an Indiana township instead of the municipality of the same name even when A-S typed up better data.) The raw Wikipedia data is available in the CSVs under the column wiki_pops: the full wiki series is a single cell going backward from 2010. So Abingdon, Virginia is listed in Wikipedia as "8191,7780,7003,4318,4376,4758,4709,3158,2877,2532,1757,1306,1674,1064,715,0,0,0,0,0,0,0,0", meaning its 2010 pop is 8191. (There are equivalent columns for 'alperin_pops' and 'cesta_pops'.)
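As a sketch of reading that cell format: the Abingdon series has 23 values, which fits one value per decennial census from 2010 back to 1790 (an assumption on my part; the comment only says the series runs backward from 2010, with zeros apparently marking years with no data).

```python
# The wiki_pops cell for Abingdon, Virginia, quoted from the comment above.
wiki_pops = "8191,7780,7003,4318,4376,4758,4709,3158,2877,2532,1757,1306,1674,1064,715,0,0,0,0,0,0,0,0"

# Assumption: one value per decennial census, newest first (2010, 2000, ...).
pops = [int(p) for p in wiki_pops.split(",")]
series = {2010 - 10 * i: pop for i, pop in enumerate(pops)}

print(series[2010])  # 8191, matching the 2010 population in the comment
print(series[1890])  # 1674
```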

If you want to promote Wikipedia edits, we'd do better to compare against the actual Wikipedia data rather than my merged series. Happy to help with this.

Did you have this data keyed yourself, or use some other source? If the former, it would be useful as a tie-breaker when some of the existing sources disagree.

bmschmidt commented 6 years ago

I just want to note a couple of classes of error I see on skimming the wiki edits. For example:

Manhatta KS manhatta 1900 2996 3438
McPherso KS mcpherso 1900 3438 2996

jamesfeigenbaum commented 6 years ago

A few quick answers, but let me preface that my understanding of the wiki v A-S v CESTA data is way more limited than everyone else's! Thanks to both of you for the feedback.

If we want to compare the "raw-er" Excel files from my data entry to Wikipedia, I think I now see which variables in the merged.csv file will let me do that. But unless there's a more clever way I haven't thought of, I'll still have to merge on city or town name and state.

My data entry for 1920 to 1940 is here: https://www.dropbox.com/s/mbt7lzdx206kk8n/Combined%20Population%20Data%2C%201920-1940.xlsx?dl=0

And my data entry for 1890 to 1910 is here: https://www.dropbox.com/s/3izvk59ymg8961w/census_1910_upwork.csv?dl=0

(based on the pdfs in my first comment)

bmschmidt commented 6 years ago

On merging: what I've done, and I believe @sergiocorreia has done as well, is to primarily identify a city by its census populations over time. Matching code here. So if two cities in the same state have the same populations in 1980, 1990, and 2000, I presume they're the same in both the Wiki and CESTA sets regardless of their names. Any city that makes more than three population matches with another (or always matches, as with towns with only one or two appearances) is assumed to be the same; I only use string distance after the fact, to look for mistakes or break ties. With just the years 1890 to 1940 for cities over 2,500, I think this works as a merge method when there are only individual-year errors. It would fail in areas where years or columns are shifted off by rows.
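A rough sketch of that matching rule (the helper name, thresholds as defaults, and the toy population values are mine, not the actual matching code linked above):

```python
def same_city(series_a, series_b, min_matches=3):
    """Treat two records as the same city when enough year -> population
    pairs agree; each series maps census year to population."""
    shared_years = series_a.keys() & series_b.keys()
    matches = sum(1 for y in shared_years if series_a[y] == series_b[y])
    if len(shared_years) <= 2:
        # Towns with only one or two appearances must agree everywhere.
        return matches == len(shared_years) and matches > 0
    return matches > min_matches  # "more than three population matches"

# Toy records: the same city, with a one-year typo on one side.
wiki_record  = {1970: 27575, 1980: 32644, 1990: 37712, 2000: 44831, 2010: 52281}
cesta_record = {1970: 27575, 1980: 32644, 1990: 37712, 2000: 44831, 2010: 52284}
print(same_city(wiki_record, cesta_record))  # True: four years still match
```

The point of matching on populations rather than names is that a single-year data entry error leaves plenty of agreeing years, while name-based merges stumble on duplicated or truncated names.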

Do your identifiers line up between the two spreadsheets? If so, I can try dropping this data into the existing merge code when I get a chance.

On getting to correct data: yeah, that's why I'm trying to keep this raw data in the set somewhere. I want some merged estimates just to start working, but there does need to be some way of flagging mistakes. The last panel of the data visualization I put online does that for 1890: it's easy enough to produce the same map for any year.