Open fgregg opened 9 years ago
Our current training data consists of:
Right now the parser is pretty ok at figuring out names like "Mr. Bob A. Smith Jr" but hasn't seen name formats outside of the GA campaign finance data - namely, common surname-first formats like "Smith, Bob A".
@waldoj & @jernsthausen - do either of you have good sources of messy names (e.g. names w/ varying formats/structures) to train this name parser on?
Absolutely. My plane is about to take off, but I'll get you a few hundred thousand.
At http://vabusinesses.org/2_corporate.csv, you can extract column 17. That gets you 180,428 unique entries, although now that I look at the list, there is a certain percentage that are business names, instead of human names. So I can see how that might be problematic. In http://vabusinesses.org/9_llc.csv, they're in column 16, and there are 268,833 unique names there (many of which are surely also present in 2_corporate.csv), with the same caveat that some are business names. Finally, there's http://vabusinesses.org/5_officers.csv, from which you'd want columns 2–4, which provides 517,197 unique names. They're separated into first, middle, and last, but of course you could just concatenate them. Every one of those should be actual humans.
So that's 966,458 names, more or less.
(I can get you more, BTW, if you like.)
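For anyone following along, here is a minimal sketch of the column extraction described above. It is not code from this project. It assumes the column numbers are 1-indexed and that the files parse as plain comma-separated text, neither of which is confirmed in this thread. The function names and the in-memory sample rows are made up for illustration.

```python
import csv
import io


def unique_column_values(rows, column_number):
    """Collect distinct non-empty values from one column.

    Assumption: column_number is 1-indexed (column 17 -> row[16]).
    """
    seen = set()
    for row in rows:
        if len(row) >= column_number:
            value = row[column_number - 1].strip()
            if value:
                seen.add(value)
    return seen


def unique_joined_names(rows, first_column, last_column):
    """Concatenate a run of columns (e.g. first, middle, last) into one
    name string per row, skipping empty parts, and deduplicate.

    Assumption: both bounds are 1-indexed and inclusive.
    """
    seen = set()
    for row in rows:
        parts = [part.strip() for part in row[first_column - 1:last_column]]
        full_name = " ".join(part for part in parts if part)
        if full_name:
            seen.add(full_name)
    return seen


# Hypothetical rows standing in for 5_officers.csv (columns 2-4 hold the name).
sample = io.StringIO("1,JOHN,Q,PUBLIC\n2,JANE,,DOE\n3,JOHN,Q,PUBLIC\n")
rows = list(csv.reader(sample))

print(sorted(unique_joined_names(rows, 2, 4)))
# ['JANE DOE', 'JOHN Q PUBLIC']
```

For the real files you would pass `csv.reader(open(path, newline=""))` instead of the sample, and skip a header row if there is one.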
@waldoj this is wonderful - thanks!!
Here's a big list of drug and medical device manufacturer corporation names: https://openpaymentsdata.cms.gov/dataset/AMGPO-Lookup-File/2j4a-fwnz
After @derekeder 's NICAR-L announcement, I ran all the names starting with "A" through the bulk parser and found a 6% error rate. (Companies tagged as people.) One of the bigger contributors to the error rate was single tokens tagged as people. For example, AXOGEN was tagged as a person but AXOGEN CORPORATION was tagged as a corporation. If we are evaluating only western names, I think single tokens have a high probability of being corporation names.
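A rough sketch of the single-token idea, in case it is useful. This is not part of the parser and does not call it. It only picks out names made of a single whitespace-separated token so they can be reviewed or treated as likely corporations. The function names are hypothetical, and the heuristic assumes western-style names, as noted above.

```python
def is_single_token(name):
    """Return True when the name is exactly one whitespace-separated token."""
    return len(name.split()) == 1


def single_token_names(names):
    """Return the names worth a second look under the single-token heuristic."""
    return [name for name in names if is_single_token(name)]


examples = ["AXOGEN", "AXOGEN CORPORATION", "Mr. Bob A. Smith Jr"]
print(single_token_names(examples))
# ['AXOGEN']
```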
looks like 85/1148 were parsed incorrectly, & most failures were more than one token, but had some tricky words that the parser hasn't seen before. these failures are great training examples, thanks @martinburch! I'll work on adding some of these to the training data.
Surname counts from 2000 US Census http://www.census.gov/topics/population/genealogy/data/2000_surnames.html
Given name counts from Social Security, born 1800 - present. http://www.ssa.gov/oact/babynames/decades/century.html
Death master file - deaths in the US, mostly around 1970+: http://cancelthesefunerals.com/
Common names and surnames with aliases -- when checked against the death master file, some are not that common at all: https://bitbucket.org/openesb/mdm-legacy/src/master/open-dm-mi/solutions/aus-patient/AUSPatient/src/StandardizationEngine/PersonName/instance/AU/resource/
hey @Downchuck - we already have the most common names from the census. do any of these datasets contain full name strings, instead of name components that are already split out?
No, far as I understand they're already split up; the death index has pairs (full names); I don't know that they'd be of use though.
http://www.cs.cmu.edu/~einat/datasets.html