Best Practice for training data?

I'm trying to improve the performance of the parser on a fairly messy list containing individuals, households, and corporations. For individuals and households the parser works great. For corporations I see lots of listings like `Acme LLC, A Delaware Limited Liability Company`.

Currently that gets tagged as:

| Token     | Label                       |
|-----------|-----------------------------|
| ACME      | CorporationName             |
| LLC.,     | CorporationLegalType        |
| A         | CorporationName             |
| DELAWARE  | CorporationName             |
| LIMITED   | CorporationName             |
| LIABILITY | CorporationName             |
| COMPANY   | CorporationNameOrganization |

I think ideally the result would be something like:

| Token     | Label                       |
|-----------|-----------------------------|
| ACME      | CorporationName             |
| LLC.,     | CorporationLegalType        |
| A         | Article                     |
| DELAWARE  | Location                    |
| LIMITED   | CorporationNameOrganization |
| LIABILITY | CorporationNameOrganization |
| COMPANY   | CorporationNameOrganization |

In addition to adding "Article" and "Location" labels, I was thinking I would add edit distance to a state name as a feature.
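To make the edit-distance idea concrete, here is a rough sketch of the kind of per-token feature I have in mind. Everything in it is an assumption for illustration: `STATE_NAMES` is a deliberately short, hypothetical list, `state_distance_feature` and its feature keys are made-up names, and the threshold of 2 is arbitrary. It is not how probablepeople currently builds its features.

```python
# Hypothetical subset of state names; a real feature would list all of them.
STATE_NAMES = ("DELAWARE", "NEVADA", "TEXAS", "CALIFORNIA", "NEW YORK")


def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance, keeping only two rows of the DP table."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            current.append(min(
                previous[j] + 1,                       # delete char_a
                current[j - 1] + 1,                    # insert char_b
                previous[j - 1] + (char_a != char_b),  # substitute
            ))
        previous = current
    return previous[-1]


def state_distance_feature(token: str, max_distance: int = 2) -> dict:
    """Return boolean features for how close a token is to a state name.

    The feature names and the default threshold are illustrative assumptions.
    """
    cleaned = token.upper().strip(".,")
    nearest = min(edit_distance(cleaned, state) for state in STATE_NAMES)
    return {
        "state.exact": nearest == 0,
        "state.near": nearest <= max_distance,
    }


print(state_distance_feature("Delaware,"))
print(state_distance_feature("DELEWARE"))
print(state_distance_feature("LIMITED"))
```

Boolean buckets rather than the raw distance seem safer here, since very short tokens can otherwise look deceptively close to short state names, but that is a guess rather than something I have measured.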

My question is about how much training data I should use. Is it purely a situation where more examples will be better? Or should I add a few core examples and then augment those with problem cases as they come up?

datamade / probablepeople

Best Practice for training data? #19