datamade / probablepeople

:family: a python library for parsing unstructured western names into name components.
http://parserator.datamade.us/probablepeople
MIT License

Can this be trained for political groups/committees? #14

Closed rkiddy closed 9 years ago

rkiddy commented 9 years ago

It may be that this is not quite the right tool for us, but I wanted to check.

I have a list of ~60,000,000 names from a campaign financing and lobbying activity database from the state of California. There are many, many duplicates, but there are also many malformed rows. The names are supposed to be divided up into "name-last", "name-first", "name-suffix" and "name-title", but most often they are not. I think we want to do a few things with pp:

1) put the names together (most people just use the names this way and ignore any weirdness), and then use pp to separate them and see if they separate out the same, or fix the data with the separated forms.

2) separate out fields when the "name-last" is used for the entire name, as is extremely common.

3) identify names when they are part of an organization name.
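Steps 1 and 2 above could be sketched roughly as follows. This is only an illustration: the dict-per-row layout and the helper names (`join_name_fields`, `group_by_label`, `reparse_row`) are hypothetical, and the parser is passed in as a function so that `probablepeople.parse` (or anything returning the same list of `(token, label)` pairs) can be plugged in.

```python
from collections import OrderedDict

# Column names are taken from the description above; storing each record
# as a dict keyed by column name is an assumption made for illustration.
NAME_FIELDS = ("name-title", "name-first", "name-last", "name-suffix")


def join_name_fields(row):
    """Join the non-empty name columns into one string, in reading order."""
    parts = [(row.get(field) or "").strip() for field in NAME_FIELDS]
    return " ".join(part for part in parts if part)


def group_by_label(parsed):
    """Collapse [(token, label), ...] into {label: "token token ..."}."""
    grouped = OrderedDict()
    for token, label in parsed:
        grouped.setdefault(label, []).append(token)
    return OrderedDict(
        (label, " ".join(tokens)) for label, tokens in grouped.items()
    )


def reparse_row(row, parse):
    """Rejoin a row's name columns and re-split them with `parse`.

    `parse` is any callable returning [(token, label), ...],
    for example probablepeople.parse.
    """
    full_name = join_name_fields(row)
    return full_name, group_by_label(parse(full_name))
```

Calling `reparse_row(row, probablepeople.parse)` on a row whose whole name was stuffed into "name-last" would then give the joined string plus one entry per label, which can be compared against the original columns.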

An example of what I am seeing now is this:

>>> import probablepeople
>>> 
>>> probablepeople.parse('Aaron Starr for Oxnard City Council')
[('Aaron', 'CorporationName'), ('Starr', 'CorporationName'), ('for', 'CorporationName'), ('Oxnard', 'CorporationName'), ('City', 'CorporationName'), ('Council', 'CorporationNameOrganization')]
>>> 
>>> probablepeople.parse('Aaron Read & Associates')
[('Aaron', 'CorporationName'), ('Read', 'CorporationName'), ('&', 'CorporationName'), ('Associates', 'CorporationNameOrganization')]
>>> 
>>> probablepeople.parse('Aaron Klein')
[('Aaron', 'GivenName'), ('Klein', 'Surname')]
>>>

It would have been really nice to see the following, if I could have my wish. Is there any way to get this behavior or something similar? And is there a reason "for" and "&" are 'CorporationName' and not something like a connector or conjunction? Any suggestions would be appreciated.

>>> import probablepeople
>>> 
>>> probablepeople.parse('Aaron Starr for Oxnard City Council')
[('Aaron', 'GivenName'), ('Starr', 'Surname'), ('for', 'Connector'), ('Oxnard', 'CorporationName'), ('City', 'CorporationName'), ('Council', 'CorporationNameOrganization')]
>>> 
>>> probablepeople.parse('Aaron Read & Associates')
[('Aaron', 'GivenName'), ('Read', 'Surname'), ('&', 'Connector'), ('Associates', 'CorporationNameOrganization')]
>>> 
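Short of retraining, one workaround for the "<someone> for <something>" case is to split on the word "for" before parsing and re-parse only the left-hand side. The sketch below is a heuristic and not a probablepeople feature: `split_on_for` and `candidate_person` are hypothetical helpers, the parser is passed in as a function, and treating any label that does not start with "Corporation" as a person label is an assumption based on the labels shown above.

```python
import re

# Heuristic only: committee names follow no fixed rules, so this can
# misfire on names such as "Californians for Clean Water". The re-parse
# of the left-hand side below is what is meant to catch those cases.
_FOR_PATTERN = re.compile(
    r"^(?P<left>.+?)\s+for\s+(?P<right>.+)$", re.IGNORECASE
)


def split_on_for(name):
    """Return (left, right) around the first standalone 'for', else None."""
    match = _FOR_PATTERN.match(name.strip())
    if match is None:
        return None
    return match.group("left"), match.group("right")


def candidate_person(name, parse):
    """Re-parse the text before 'for'.

    Returns (labelled_tokens, right_hand_text) when every token on the
    left gets a non-corporation label, otherwise None. `parse` is any
    callable returning [(token, label), ...], e.g. probablepeople.parse.
    """
    parts = split_on_for(name)
    if parts is None:
        return None
    left, right = parts
    labelled = parse(left)
    if labelled and all(
        not label.startswith("Corporation") for _, label in labelled
    ):
        return labelled, right
    return None
```

This keeps the full string available for parsing as an organization while still surfacing the person it may be named after.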

FYI, I am also looking at cjdd3b/fec-standardizer. Perhaps that might be easier to use. We will see.

znmeb commented 9 years ago

@rkiddy If you get this / these working, we can probably use it with the Oregon ORESTAR database.

fgregg commented 9 years ago

@rkiddy definitely! We've actually been using some lobbyist data already to train probablepeople. In our data, we see companies acting on behalf of other companies, but we haven't seen an individual doing so, so we aren't handling that case well.

We would label your first example as

probablepeople.parse('Aaron Starr for Oxnard City Council')
[('Aaron', 'GivenName'), ('Starr', 'Surname'), ('for', 'ProxyFor'), ('Oxnard', 'CorporationName'), ('City', 'CorporationName'), ('Council', 'CorporationNameOrganization')]

For your second example, I personally think we are labeling it correctly. The name of the business is "Aaron Read & Associates" and it's a different type of entity than "Aaron Read" the person.

But, the beauty of probablepeople is that you don't have to agree with us. Fork it, and train the parser as you see fit. Make a PR, and we can see if we line up.

Do you need any help figuring out how to add your own training data using parserator?
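For anyone preparing their own examples, hand-labelled `(token, label)` pairs like the ones in this thread can be turned into tagged XML with the standard library. This is a sketch under assumptions: the `NameCollection` and `Name` element names are assumed here and should be checked against the labelled data files in the fork, and `labelled_name_to_element` / `build_collection` are hypothetical helpers, not part of parserator.

```python
import xml.etree.ElementTree as ET


def labelled_name_to_element(labelled):
    """Build one <Name> element with a child element per (token, label)."""
    name = ET.Element("Name")  # element name is an assumption
    previous = None
    for token, label in labelled:
        child = ET.SubElement(name, label)
        child.text = token
        if previous is not None:
            # A single space between sibling elements preserves the
            # original token spacing when the XML is read back.
            previous.tail = " "
        previous = child
    return name


def build_collection(examples):
    """Wrap several labelled names in one root element, as a string."""
    root = ET.Element("NameCollection")  # element name is an assumption
    for labelled in examples:
        root.append(labelled_name_to_element(labelled))
    return ET.tostring(root, encoding="unicode")
```

Feeding it the `ProxyFor` labelling above would yield one `<Name>` entry with `GivenName`, `Surname`, `ProxyFor` and the corporation labels as child elements.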

rkiddy commented 9 years ago

Hm. Well, @fgregg, I think you are right that "Aaron Read & Associates" is a different kind of entity than "Aaron Read" the person. But the "Aaron Starr for Oxnard City Council" example is a different kind of entity from the person for exactly the same reason. Anyone can set up a committee called "<someone> for <something>". There are no rules about the naming of these committees. The point is that I think we want to determine that there is some kind of relationship between these three entities. So....

I am not sure where that leaves me. Must cogitate.

Btw, I also looked at dedupe but that does quite a bit more and seems very complicated to use....

fgregg commented 9 years ago

Oh. I see. It's not an agent-client relation, it's a political committee. In that case, I like our labeling of it all as a corporation.

So records can have a number of different types of relationships. We have been focused on "do these records refer to the same entity". But, we might be interested in "do these records refer to people in the same household" or "do these records refer to a candidate and her political committee."

Those latter relationships are all important, but we want to tackle the entity resolution relationship first.

Again, this is just our preference, you can get probablepeople to tag these examples differently.

Sorry to hear about Dedupe. We have tried to make the Python api as easy as possible, but we have plenty of room to improve I'm sure. Did you see the example projects?

rkiddy commented 9 years ago

Duh. I did not see that you are the same group that did dedupe. I am glad I did not disparage it overmuch. :-)

Part of the problem I am having is just that the list of names is huge. I need to chop out a smaller set to experiment with.
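One way to chop out a manageable subset without loading all the names into memory is reservoir sampling, which keeps a uniform random sample of fixed size from a stream of unknown length. A minimal sketch (the function name and the `seed` parameter are illustrative choices, not anything from probablepeople or dedupe):

```python
import random


def reservoir_sample(names, k, seed=None):
    """Return up to k items chosen uniformly at random from an iterable.

    Uses the classic reservoir algorithm, so `names` can be a database
    cursor or file iterator that is only read once.
    """
    rng = random.Random(seed)
    sample = []
    for index, name in enumerate(names):
        if index < k:
            # Fill the reservoir with the first k items.
            sample.append(name)
        else:
            # Replace an existing item with probability k / (index + 1).
            j = rng.randint(0, index)
            if j < k:
                sample[j] = name
    return sample
```

Passing a fixed `seed` makes the experiment repeatable; deduplicating the names before sampling would stop very common names from dominating the subset.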

Also, I have been using MySQL for a while, but my native language has been Java and my Python skills are not where they should be.

Your comments about this project, vis a vis its purpose and how it should be used, make sense. I will have to see whether I should do some processing with this, and then use dedupe, or figure out how to use dedupe properly and have that do the work. I am not as sure of the set of problems dedupe can handle. We will see. I will let you know how things progress.

fgregg commented 9 years ago

Okay, let us know if we can help further. Closing this issue.