datamade / probablepeople

:family: a python library for parsing unstructured western names into name components.
http://parserator.datamade.us/probablepeople
MIT License

Best Practice for training data? #19

Open Shotgunosine opened 9 years ago

Shotgunosine commented 9 years ago

I'm trying to improve the parser's performance on a fairly messy list containing individuals, households, and corporations. For individuals and households, the parser works great. For corporations, I see lots of listings like: Acme LLC, A Delaware Limited Liability Company

Currently, that is tagged as:

| Token     | Label                       |
|-----------|-----------------------------|
| ACME      | CorporationName             |
| LLC.,     | CorporationLegalType        |
| A         | CorporationName             |
| DELAWARE  | CorporationName             |
| LIMITED   | CorporationName             |
| LIABILITY | CorporationName             |
| COMPANY   | CorporationNameOrganization |

I think ideally the result would be something like:

| Token     | Label                       |
|-----------|-----------------------------|
| ACME      | CorporationName             |
| LLC.,     | CorporationLegalType        |
| A         | Article                     |
| DELAWARE  | Location                    |
| LIMITED   | CorporationNameOrganization |
| LIABILITY | CorporationNameOrganization |
| COMPANY   | CorporationNameOrganization |

In addition to adding "Article" and "Location" labels, I was thinking of adding each token's edit distance to the nearest state name as a feature.
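A rough sketch of what such an edit-distance feature might look like, using only the standard library. The names `STATE_NAMES`, `edit_distance`, and `state_distance_feature` are hypothetical and not part of probablepeople. The state list is a small illustrative subset, and wiring the value into the parser's feature extraction is not shown.

```python
# Illustrative subset of single-word state names (an assumption for this sketch);
# a real feature would need the full list and a plan for multi-word states.
STATE_NAMES = ["alabama", "california", "delaware", "florida", "nevada", "texas"]


def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance, computed with the standard two-row dynamic program."""
    if len(a) < len(b):
        a, b = b, a  # keep the shorter string as the row to save memory
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(
                previous[j] + 1,         # delete char_a
                current[j - 1] + 1,      # insert char_b
                previous[j - 1] + cost,  # substitute (or match)
            ))
        previous = current
    return previous[-1]


def state_distance_feature(token: str) -> int:
    """Hypothetical feature: smallest edit distance from a token to any state name."""
    cleaned = token.lower().strip(".,")  # ignore case and trailing punctuation
    return min(edit_distance(cleaned, state) for state in STATE_NAMES)


if __name__ == "__main__":
    for token in ["DELAWARE", "Deleware", "LIABILITY"]:
        print(token, state_distance_feature(token))
```

With this sketch, an exact match such as DELAWARE scores 0 and a one-letter misspelling such as Deleware scores 1, so a small threshold on the value could flag likely state names even in messy input.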

My question is about how much training data I should use. Is it purely a situation where more examples will be better? Or should I add a few core examples and then augment those with problem cases as they come up?

fgregg commented 9 years ago

That's a pretty strange one.

So right now we are using CorporationNameOrganization for things like

This is not really what's going on with "A Delaware Limited Liability Company". I would say that it is not really part of the name at all.