CreatingData / Historical-Populations

Historical US City populations
http://creatingdata.us/datasets/US-cities

Potential Wikipedia City Population Typos #1

Open jamesfeigenbaum opened 6 years ago

jamesfeigenbaum commented 6 years ago

Not sure if this is the right place for this, Ben, but over the years I've collected (from PDFs of census city population tables) city populations for 1890 to 1940 for a decent set of cities. Of the 24,711 city × year observations that both of us have data for, only 1,325 disagree. This is some combination of data entry errors on my part, data entry errors in Wikipedia, and just weird or bad merging of city names (I did this quick and dirty, so city names showing up multiple times in a state might be an issue). Plus, a bunch of CT cities are listed as both cities and towns in the raw PDF, and I think I punched in a different row than what is on Wikipedia.
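A minimal sketch of that comparison, using toy stand-ins for the two datasets (these particular city names and values are illustrative, not the actual spreadsheet contents): index each dataset by (city, state, year), intersect the keys, and count where the recorded populations differ.

```python
# Toy stand-ins for the two datasets; the real inputs are the keyed
# census spreadsheets and the site's merged series.
mine = {
    ("manhattan", "KS", 1900): 3438,
    ("mcpherson", "KS", 1900): 2996,
    ("topeka",    "KS", 1900): 33608,
}
wiki = {
    ("manhattan", "KS", 1900): 2996,  # swapped with McPherson on this side
    ("mcpherson", "KS", 1900): 3438,
    ("topeka",    "KS", 1900): 33608,
}

# Compare only city-year observations present in both datasets.
shared = mine.keys() & wiki.keys()
disagreements = sorted(k for k in shared if mine[k] != wiki[k])
print(len(disagreements), "of", len(shared), "city-year observations disagree")
```

The same key-intersection approach scales to the full 24,711 shared observations once both files are loaded into dicts keyed the same way.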

The list of disagreements is here: https://www.dropbox.com/s/w8wisqt27mir2hh/wiki_edits.csv?dl=0 and links to the raw PDFs are below. I wonder if we could get some interested (and compulsive) Wikipedia editors to correct as many of the Wikipedia city tables as possible (some fraction of the 1,325, but not sure what %). My understanding of this project is that any edits on Wikipedia will eventually flow through to this data, right?

Raw PDFs

1910 pdf with city populations: https://www.dropbox.com/s/4vfuwzkh3hmysfp/census_1910.pdf?dl=0

1930 and 1940 pdfs with city populations: https://www.dropbox.com/sh/ia56uz1bs13oaep/AADjHoxKJ1N3WS5vkNEGGoRla?dl=0

sergiocorreia commented 6 years ago

Hi James,

I'm not Ben, but I suspect it's not entirely trivial to propagate changes from Wikipedia to here. Also, Jacob Alperin-Sheriff has a bot/script that populates Wikipedia from an external source, so an alternative might be to explore fixing the data here and then running the script on the changes (we might want to ask him).

bmschmidt commented 6 years ago

Thanks James! A 5% error rate is just slightly higher than I'd have expected, but that sounds right given name-merging issues. The Connecticut city-town distinction is something that shows up in the wiki-CESTA diff as well, and in New York and possibly other states.

Re propagating from Wikipedia: it will happen whenever I re-download Wikipedia and run the scripts. It's possible I could patch the code together to do this automatically once a month or so, but so far I've only done it twice, because downloading all of Wikipedia is a pain. My impression from the last week is that there's some talk in the wikiverse about having Wikidata drive the data tables on Wikipedia, which would certainly make this kind of work easier than parsing the variety of population tables online.

The data you're comparing to is not solely the Wikipedia data: it also includes some from CESTA and from Alperin-Sheriff that is not included in Wikipedia. (My impression is that A-S was careful about not overwriting existing populations, which sometimes means that wiki has, say, the populations of an Indiana township instead of the municipality of the same name even when A-S typed up better data.) The raw Wikipedia data is available in the CSVs under the column wiki_pops: the full wiki series is a single cell going backward from 2010. So Abingdon, Virginia is listed in Wikipedia as "8191,7780,7003,4318,4376,4758,4709,3158,2877,2532,1757,1306,1674,1064,715,0,0,0,0,0,0,0,0", meaning its 2010 pop is 8191. (There are equivalent columns for 'alperin_pops' and 'cesta_pops'.)
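As a sketch of reading that cell format: the Abingdon series has 23 values, which fits one value per decennial census from 2010 back to 1790 (an assumption on my part; the comment only says the series runs backward from 2010, with zeros apparently marking years with no data).

```python
# The wiki_pops cell for Abingdon, Virginia, quoted from the comment above.
wiki_pops = "8191,7780,7003,4318,4376,4758,4709,3158,2877,2532,1757,1306,1674,1064,715,0,0,0,0,0,0,0,0"

# Assumption: one value per decennial census, newest first (2010, 2000, ...).
pops = [int(p) for p in wiki_pops.split(",")]
series = {2010 - 10 * i: pop for i, pop in enumerate(pops)}

print(series[2010])  # 8191, matching the 2010 population in the comment
print(series[1890])  # 1674
```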

If you want to promote Wikipedia edits, we'd do better to compare against the actual Wikipedia data rather than my merged series. Happy to help with this.

Did you have this data keyed yourself, or use some other source? If the former, it would be useful as a tie-breaker when some of the existing sources disagree.

bmschmidt commented 6 years ago

I just want to note a couple of classes of error I see on skimming the wiki edits. For example:

Manhatta KS manhatta 1900 2996 3438
McPherso KS mcpherso 1900 3438 2996

jamesfeigenbaum commented 6 years ago

A few quick answers, but let me preface that my understanding of the wiki v A-S v CESTA data is way more limited than everyone else's! Thanks to both of you for the feedback.

If we want to compare the "raw-er" Excel files from my data entry to Wikipedia, I think I now see which variables in the merged.csv file will let me do that. But unless there's a more clever way I haven't thought of, I'll still have to merge on city or town name and state.

My data entry for 1920 to 1940 is here: https://www.dropbox.com/s/mbt7lzdx206kk8n/Combined%20Population%20Data%2C%201920-1940.xlsx?dl=0

And my data entry for 1890 to 1910 is here: https://www.dropbox.com/s/3izvk59ymg8961w/census_1910_upwork.csv?dl=0

(based on the pdfs in my first comment)

bmschmidt commented 6 years ago

On merging: what I've done, and I believe @sergiocorreia has done as well, is to primarily identify a city by its census populations over time. Matching code here. So if two cities in the same state have the same populations in 1980, 1990, and 2000, I presume they're the same in both the Wiki and CESTA sets regardless of their names. Any city that makes more than three population matches with another (or always matches, as with towns with only one or two appearances) is assumed to be the same; I only use string distance after the fact, to look for mistakes or break ties. With just the years 1890 to 1940 for cities over 2,500, I think this works as a merge method when there are only individual-year errors. It would fail in areas where years or columns are shifted off by rows.
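A rough sketch of that matching rule (the helper name, thresholds as defaults, and the toy population values are mine, not the actual matching code linked above):

```python
def same_city(series_a, series_b, min_matches=3):
    """Treat two records as the same city when enough year -> population
    pairs agree; each series maps census year to population."""
    shared_years = series_a.keys() & series_b.keys()
    matches = sum(1 for y in shared_years if series_a[y] == series_b[y])
    if len(shared_years) <= 2:
        # Towns with only one or two appearances must agree everywhere.
        return matches == len(shared_years) and matches > 0
    return matches > min_matches  # "more than three population matches"

# Toy records: the same city, with a one-year typo on one side.
wiki_record  = {1970: 27575, 1980: 32644, 1990: 37712, 2000: 44831, 2010: 52281}
cesta_record = {1970: 27575, 1980: 32644, 1990: 37712, 2000: 44831, 2010: 52284}
print(same_city(wiki_record, cesta_record))  # True: four years still match
```

The point of matching on populations rather than names is that a single-year data entry error leaves plenty of agreeing years, while name-based merges stumble on duplicated or truncated names.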

Do your identifiers line up between the two spreadsheets? If so, I can try dropping this data into the existing merge code when I get a chance.

On getting to correct data: yeah, that's why I'm trying to keep this raw data in the set somewhere. I want some merged estimates just to start working, but there does need to be some way of flagging mistakes. The last panel of the data visualization I put online does that for 1890: it's easy enough to produce the same map for any year.