datamade / probablepeople

:family: a python library for parsing unstructured western names into name components.
http://parserator.datamade.us/probablepeople
MIT License

Sources for names #1

Open fgregg opened 9 years ago

fgregg commented 9 years ago

http://www.cs.cmu.edu/~einat/datasets.html

cathydeng commented 9 years ago

Our current training data consists of:

Right now the parser is pretty ok at figuring out names like "Mr. Bob A. Smith Jr" but hasn't seen name formats outside of the GA campaign finance data - namely, common surname-first formats like "Smith, Bob A".

@waldoj & @jernsthausen - do either of you have good sources of messy names (e.g. names w/ varying formats/structures) to train this name parser on?

waldoj commented 9 years ago

Absolutely. My plane is about to take off, but I'll get you a few hundred thousand.

waldoj commented 9 years ago

At http://vabusinesses.org/2_corporate.csv, you can extract column 17. That gets you 180,428 unique entries, although now that I look at the list, there is a certain percentage that are business names, instead of human names. So I can see how that might be problematic. In http://vabusinesses.org/9_llc.csv, they're in column 16, and there are 268,833 unique names there (many of which are surely also present in 2_corporate.csv), with the same caveat that some are business names. Finally, there's http://vabusinesses.org/5_officers.csv, from which you'd want columns 2–4, which provides 517,197 unique names. They're separated into first, middle, and last, but of course you could just concatenate them. Every one of those should be actual humans.

So that's 966,458 names, more or less.
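The column extraction and concatenation described above could be sketched roughly as follows. This is only an illustration: it assumes the column numbers are 1-indexed and that each file has a header row, neither of which is confirmed here, and the helper names `unique_names` and `surname_first` are made up for this example. The sample rows are invented, not taken from the real files.

```python
import csv
import io


def unique_names(csv_file, columns, has_header=True):
    """Collect unique, non-empty names built from 1-indexed CSV columns.

    `columns` is a tuple such as (17,) for a single full-name column or
    (2, 3, 4) for first/middle/last columns that get joined with spaces.
    """
    reader = csv.reader(csv_file)
    if has_header:
        next(reader, None)  # assumption: the first row is a header
    names = set()
    for row in reader:
        # Skip column numbers that fall outside a short row.
        parts = [row[c - 1].strip() for c in columns if c - 1 < len(row)]
        name = " ".join(p for p in parts if p)
        if name:
            names.add(name)
    return names


def surname_first(first, middle, last):
    """Build a surname-first variant such as 'Smith, Bob A'."""
    given = " ".join(p for p in (first, middle) if p)
    return f"{last}, {given}" if given else last


# Invented sample data standing in for a file like 5_officers.csv.
sample = io.StringIO(
    "id,first,middle,last\n"
    "1,Bob,A,Smith\n"
    "2,Ann,,Lee\n"
    "3,Bob,A,Smith\n"
)
names = unique_names(sample, columns=(2, 3, 4))
print(sorted(names))
print(surname_first("Bob", "A", "Smith"))
```

Because the officers file keeps first, middle, and last names in separate columns, the same rows could in principle be emitted in both given-name-first and surname-first order, which is the kind of format variation asked for earlier in the thread.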

waldoj commented 9 years ago

(I can get you more, BTW, if you like.)

cathydeng commented 9 years ago

@waldoj this is wonderful - thanks!!

martinburch commented 9 years ago

Here's a big list of drug and medical device manufacturer corporation names: https://openpaymentsdata.cms.gov/dataset/AMGPO-Lookup-File/2j4a-fwnz

After @derekeder's NICAR-L announcement, I ran all the names starting with "A" through the bulk parser and found a 6% error rate. (Companies tagged as people.) One of the bigger contributors to the error rate was single tokens tagged as people. For example, AXOGEN was tagged as a person but AXOGEN CORPORATION was tagged as a corporation. If we are evaluating only western names, I think single tokens have a high probability of being corporation names.
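A rough sketch of how that kind of check could be tallied, assuming you already have each name paired with the type the parser assigned to it. The `error_summary` helper and the input shape are hypothetical, and apart from the two AXOGEN rows the sample labels are invented for illustration, not real parser output.

```python
def error_summary(predictions):
    """Summarize mis-tagged names in a list known to contain only corporations.

    `predictions` is a list of (name, predicted_type) pairs, where
    predicted_type is a label such as "Person" or "Corporation".
    """
    errors = [name for name, kind in predictions if kind != "Corporation"]
    single_token = [name for name in errors if len(name.split()) == 1]
    total = len(predictions)
    return {
        "total": total,
        "errors": len(errors),
        "error_rate": len(errors) / total if total else 0.0,
        "single_token_errors": len(single_token),
    }


# The AXOGEN rows mirror the example above; the other rows are made up.
sample = [
    ("AXOGEN", "Person"),
    ("AXOGEN CORPORATION", "Corporation"),
    ("ABBOTT LABORATORIES", "Corporation"),
    ("ACME MEDICAL", "Person"),
]
summary = error_summary(sample)
print(summary)
```

Splitting the errors into single-token and multi-token cases makes it easier to see whether a single-token rule would actually remove most of the failures.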

cathydeng commented 9 years ago

looks like 85/1148 were parsed incorrectly, & most failures were more than one token, but had some tricky words that the parser hasn't seen before. these failures are great training examples, thanks @martinburch! I'll work on adding some of these to the training data.

Downchuck commented 9 years ago

Surname counts from 2000 US Census http://www.census.gov/topics/population/genealogy/data/2000_surnames.html

Given name counts from Social Security, born 1800 - present. http://www.ssa.gov/oact/babynames/decades/century.html

Death master file - deaths in the US, mostly around 1970+: http://cancelthesefunerals.com/

Common names and surnames with aliases -- when checked against the death master file, some are not that common at all: https://bitbucket.org/openesb/mdm-legacy/src/master/open-dm-mi/solutions/aus-patient/AUSPatient/src/StandardizationEngine/PersonName/instance/AU/resource/

cathydeng commented 9 years ago

hey @Downchuck - we already have the most common names from the census. do any of these datasets contain full name strings, instead of name components that are already split out?

Downchuck commented 9 years ago

No, far as I understand they're already split up; the death index has pairs (full names); I don't know that they'd be of use though.