datamade / probablepeople

:family: a python library for parsing unstructured western names into name components.
http://parserator.datamade.us/probablepeople
MIT License

space-delimited name fields with last name first are hard #9

Closed mattkiefer closed 9 years ago

mattkiefer commented 9 years ago

There's one naming convention that's really tricky. It's pretty much the worst thing ever and I have no idea how to solve it: space-delimited name fields with last name first.

In [7]: probablepeople.parse('Woodward Robert U')
Out[7]: [('Woodward', 'GivenName'), ('Robert', 'MiddleName'), ('U', 'Surname')]

In [8]: probablepeople.parse('Bernstein Carl')
Out[8]: [('Bernstein', 'GivenName'), ('Carl', 'Surname')]

fgregg commented 9 years ago

Here's what we could do:

From a list of names: http://www.census.gov/topics/population/genealogy/data/1990_census/1990_census_namefiles.html

We can calculate the proportion of times that a given string is a first name or a last name. Then we can use that proportion as an additional feature in the model.

We would probably want this to be a special mode because it may really slow down the parsing.

Sound interesting?
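A minimal sketch of the proportion idea, for illustration only. The counts below are made up rather than taken from the census files, and `name_role_proportions` and `given_name_feature` are hypothetical helper names, not part of probablepeople:

```python
def name_role_proportions(given_counts, surname_counts):
    """Map each token to the share of its occurrences that are given names."""
    proportions = {}
    for token in set(given_counts) | set(surname_counts):
        as_given = given_counts.get(token, 0)
        as_surname = surname_counts.get(token, 0)
        total = as_given + as_surname
        if total:  # skip tokens listed with zero occurrences
            proportions[token] = as_given / total
    return proportions


# Made-up counts for illustration only, not real census figures.
given_counts = {"robert": 900, "carl": 300, "martin": 50}
surname_counts = {"woodward": 80, "bernstein": 40, "martin": 150}

proportions = name_role_proportions(given_counts, surname_counts)


def given_name_feature(token, default=0.5):
    """Feature value for one token; unseen tokens get an uninformative 0.5."""
    return proportions.get(token.lower().strip("."), default)
```

A value near 1 suggests a given name and a value near 0 suggests a surname. How such a value would be wired into the model's feature set is not shown here.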

mattkiefer commented 9 years ago

Yeah, that sounds very interesting. That said, there's also potential for a learning element where the name-ordering convention - if one exists - is inferred at the level of the data source instead of at the individual record.

For instance, after analyzing a sufficient sample, the peopleparser could first figure out the defined source has a last-first-mi, first-mi-last or some such ordering and then, if there's a clear conclusion, just apply that to the entire set. This could speed up processing and reduce potential for error on the inevitably iffy names (e.g. the Michael Jordans, etc.).

Maybe it's up to the user whether to activate source-level inferences and, if so, how to parcel those out.
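One way the source-level idea might look, as a rough sketch rather than anything probablepeople provides. It assumes a lookup of known given names is available (the small `KNOWN_GIVEN_NAMES` set here is a stand-in), and the function names and the 0.8 threshold are hypothetical:

```python
from collections import Counter

# Hypothetical lookup; in practice this might be built from a name list.
KNOWN_GIVEN_NAMES = {"mark", "david", "lawrence", "john", "aaron", "robert", "carl"}


def guess_record_order(name):
    """Vote 'last-first', 'first-last', or None for one space-delimited name."""
    tokens = [t.strip(".").lower() for t in name.split()]
    if len(tokens) < 2:
        return None
    first_is_given = tokens[0] in KNOWN_GIVEN_NAMES
    second_is_given = tokens[1] in KNOWN_GIVEN_NAMES
    if second_is_given and not first_is_given:
        return "last-first"
    if first_is_given and not second_is_given:
        return "first-last"
    return None  # ambiguous: both or neither token looks like a given name


def infer_source_order(names, min_share=0.8):
    """Return the dominant ordering for a source, or None without a clear winner."""
    votes = Counter(v for v in map(guess_record_order, names) if v is not None)
    if not votes:
        return None
    order, count = votes.most_common(1)[0]
    return order if count / sum(votes.values()) >= min_share else None


sample = ["Brines Mark R", "Macaluso David P", "Martin Lawrence", "Walsh John B", "Cook Aaron"]
source_order = infer_source_order(sample)
```

Ambiguous records abstain from the vote, so a handful of iffy names does not block a conclusion for the source as a whole.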

fgregg commented 9 years ago

Interesting. For your case, you could just blow away the existing training data and just train on your data. This would effectively make a parser just for this case.

fgregg commented 9 years ago

We could also use those lists to guess gender.

fgregg commented 9 years ago

@mattkiefer can you share some examples for training?

mattkiefer commented 9 years ago

The thing about my data is they're a few hundred sources with a few hundred records each, on average. Is that useful?

fgregg commented 9 years ago

20 records would be useful. Share what you can.

mattkiefer commented 9 years ago

Here's 20 ... I have at least 100 if you need more:

Name,Start Date,Department,Title
Brines Mark R,12/20/1982,07Police,Lieutenant
Macaluso David P,3/31/1994,07Police,Lieutenant
Martin Lawrence,1/19/1990,07Police,Lieutenant
Rathmell Randall J,5/23/1989,07Police,Lieutenant
Walsh John B,4/20/1990,07Police,Deputy Chief
Cook Aaron,5/19/2009,05Bldg,Development Manager
Meyer Charles Lowell,12/13/2012,03Admin,Asst. to the VM
Letson Andrew,1/26/2015,09PWAdmi,Assistant to the PWD
Clarke Timothy M,1/30/1995,06EconDe,Comm Dev Director
Merkel Robert,4/9/2003,04Fin,Finance Director
Petroshius Douglas Joseph,8/23/2004,03Admin,Asst Village Manager
Hincapie Janice,12/26/2006,14PRAdm,Parks & Rec Director
LaMantia Robert M.,10/1/2006,07Police,Police Chief
Engelmann Ashley,4/7/2008,09PWAdmi,Public Works Director
Glowacki Julie,6/30/2014,14PRAdm,Clerk / Receptionist
Braovac Jozefina,8/26/2002,05Bldg,Account Clerk
Padron Andrea,11/20/2005,04Fin,Account Clerk
Ramos Jissenia,4/7/2008,04Fin,Account Clerk
Swanson Peter A,8/1/1975,07Police,Records Clerk
Weidner Mark S,7/15/2013,07Police,CSO

jernsthausen commented 9 years ago

@fgregg re: gender, if it's useful, we have voter registration data for GA that is parsed and has a field for gender, among other variables. Could probably get the same for some other states relatively easily as well.

fgregg commented 9 years ago

@mattkiefer could you try the newest version of probablepeople and see if it works for your data?

stevevance commented 9 years ago

If @mattkiefer's data is always in that order, couldn't you split a string into array items and re-order so that the [0] item becomes the last item?
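For reference, the reordering described above could be sketched like this. The function name `move_first_token_last` is hypothetical, and the approach assumes every record in the source is last-name-first with a single-token surname:

```python
def move_first_token_last(name):
    """Move the first whitespace-delimited token to the end of the name."""
    tokens = name.split()
    if len(tokens) < 2:
        return name  # nothing to reorder
    return " ".join(tokens[1:] + tokens[:1])


reordered = move_first_token_last("Woodward Robert U")
```

This turns 'Woodward Robert U' into 'Robert U Woodward', but it would misplace multi-word surnames and would scramble any record that is already first-name-first.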

fgregg commented 9 years ago

Closing for now. @mattkiefer, I would still like your feedback.

mattkiefer commented 9 years ago

Hey, @fgregg et al, sorry for the late reply -- out of town last week. Everything looks great after upgrading probablepeople with pip install --upgrade and rerunning the parser.

Here are the results ... thanks again for jumping on the issue:

|---------------+---------------------+-------------+------------+------------|
| attachment_id | processed_timestamp | agency      | last_name  | first_name |
|---------------+---------------------+-------------+------------+------------|
| 373           | 2015-04-07-17:54    | Lincolnwood | Brines     | Mark R     |
| 373           | 2015-04-07-17:54    | Lincolnwood | Macaluso   | David P    |
| 373           | 2015-04-07-17:54    | Lincolnwood | Martin     | Lawrence   |
| 373           | 2015-04-07-17:54    | Lincolnwood | Rathmell   | Randall J  |
| 373           | 2015-04-07-17:54    | Lincolnwood | Walsh      | John B     |
| 373           | 2015-04-07-17:54    | Lincolnwood | Cook       | Aaron      |
| 373           | 2015-04-07-17:54    | Lincolnwood | Meyer      | Charles    |
| 373           | 2015-04-07-17:54    | Lincolnwood | Letson     | Andrew     |
| 373           | 2015-04-07-17:54    | Lincolnwood | Clarke     | Timothy M  |
| 373           | 2015-04-07-17:54    | Lincolnwood | Merkel     | Robert     |
| 373           | 2015-04-07-17:54    | Lincolnwood | Petroshius | Douglas    |
| 373           | 2015-04-07-17:54    | Lincolnwood | Hincapie   | Janice     |
| 373           | 2015-04-07-17:54    | Lincolnwood | LaMantia   | Robert M.  |
| 373           | 2015-04-07-17:54    | Lincolnwood | Engelmann  | Ashley     |
| 373           | 2015-04-07-17:54    | Lincolnwood | Glowacki   | Julie      |
| 373           | 2015-04-07-17:54    | Lincolnwood | Braovac    | Jozefina   |
| 373           | 2015-04-07-17:54    | Lincolnwood | Padron     | Andrea     |
| 373           | 2015-04-07-17:54    | Lincolnwood | Ramos      | Jissenia   |
| 373           | 2015-04-07-17:54    | Lincolnwood | Swanson    | Peter A    |
| 373           | 2015-04-07-17:54    | Lincolnwood | Weidner    | Mark S     |

mattkiefer commented 9 years ago

If I understand @stevevance 's suggestion, I was doing the explicit re-ordering before but that sort of thing works best when all the data have the same ordering.

In my case, I have close to 500 data sources and am trying to infer ordering on a case-by-case basis, doing the best I can to avoid hard-coding special cases into my transformations. (e.g., it's relatively simple when, say, an incoming csv has field headers like 'First Name', 'Last Name' etc. but difficult from my standpoint when it's a generic concatenated 'Name' field -- and more so when the ordering is backwards without a comma delimiter.)

I will keep a look out for other incoming data with this convention and keep an eye on things. Thanks again! This is great!