datamade / probablepeople

:family: a Python library for parsing unstructured Western names into name components.
http://parserator.datamade.us/probablepeople
MIT License

churches recognized as person instead of corporation #52

Open az0 opened 7 years ago

az0 commented 7 years ago

I extracted 8964 entities from Wikidata related to the Wikipedia category Churches in the United States, and then I manually filtered the list to 7000 records that really look like churches. I pulled a simple random sample of 100 records, and I fed them into the probablepeople bulk parsing service.

The following 7 records were incorrectly recognized as people instead of churches.

St. Augustine Church
Bethany Lutheran Church
Augustus Lutheran Church
Grace Episcopal Church
St. Joseph Church
Grace Church

Would you like a PR with this as training data? Is 7 items of training data enough?

Because this was a simple random sample, we can estimate a confidence interval for the error rate of roughly 3% to 14%, which on the population of 7000 records corresponds to 217 to 1006 errors.
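The comment does not say how the interval was computed. As a sketch, one method that reproduces the quoted figures for 7 errors in a sample of 100 is a 95% Wilson score interval with continuity correction. The function name `wilson_cc_interval` and the 95% confidence level are assumptions made for illustration, not something stated in the thread.

```python
from statistics import NormalDist


def wilson_cc_interval(errors, n, confidence=0.95):
    """Wilson score interval with continuity correction for a proportion.

    Assumption: the thread does not name a method; this one is used here
    only because it matches the quoted 217 to 1006 range.
    """
    p = errors / n
    z = NormalDist().inv_cdf(1 - (1 - confidence) / 2)
    z2 = z * z
    denom = 2 * (n + z2)
    lower = (2 * n * p + z2 - 1
             - z * (z2 - 2 - 1 / n + 4 * p * (n * (1 - p) + 1)) ** 0.5) / denom
    upper = (2 * n * p + z2 + 1
             + z * (z2 + 2 - 1 / n + 4 * p * (n * (1 - p) - 1)) ** 0.5) / denom
    # Clamp to the valid range for a proportion.
    return max(0.0, lower), min(1.0, upper)


low, high = wilson_cc_interval(errors=7, n=100)
population = 7000
print(f"error rate: {low:.1%} to {high:.1%}")
print(f"errors in population: {round(low * population)} to {round(high * population)}")
```

Scaling the interval bounds by the population size assumes the 100 sampled records are representative of all 7000, which is what a simple random sample is meant to provide.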

By the way, this can be complicated because Church is also a surname. In the list above, Bethany and Grace can each be either a given name or part of a church name.

fgregg commented 7 years ago

Yes, please make a PR!

az0 commented 7 years ago

I'm guessing I need to use labels 12 or 13 for most church names. Would you please clarify the difference between 12 and 13?

0 : PrefixMarital
1 : PrefixOther
2 : GivenName
3 : FirstInitial
4 : MiddleName
5 : MiddleInitial
6 : Surname
7 : LastInitial
8 : SuffixGenerational
9 : SuffixOther
10 : Nickname
11 : And
12 : CorporationName
13 : CorporationNameOrganization
14 : CorporationNameAndCompany
15 : CorporationNameBranchType
16 : CorporationNameBranchIdentifier
17 : CorporationCommitteeType
18 : CorporationLegalType
19 : ShortForm
20 : ProxyFor
21 : AKA

az0 commented 7 years ago

Looking more closely at the labeled data, it looks like CorporationNameOrganization marks the type of organization in the name, such as church, school, law firm, company, or corporation, while CorporationName is used for the specific part of the name, such as St. Joseph.

However, there are only three churches in the labeled data, and only one of them follows this rule, so it would help if you could clarify how I should label the churches.
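To make the tentative reading of the two labels concrete, here is a minimal sketch that labels tokens under that reading. It does not call probablepeople, and the maintainers have not confirmed the rule. The `ORG_TYPE_WORDS` set and the `label_tokens` helper are hypothetical names introduced only for illustration.

```python
# Hypothetical list of organization-type words; illustrative only,
# not taken from probablepeople or its training data.
ORG_TYPE_WORDS = {"church", "school", "company", "corporation"}


def label_tokens(name):
    """Label whitespace-separated tokens under the reading described above:
    organization-type words get CorporationNameOrganization, and every
    other token gets CorporationName."""
    labeled = []
    for token in name.split():
        if token.strip(".,").lower() in ORG_TYPE_WORDS:
            labeled.append((token, "CorporationNameOrganization"))
        else:
            labeled.append((token, "CorporationName"))
    return labeled


for name in ["St. Joseph Church", "Bethany Lutheran Church", "Grace Church"]:
    print(label_tokens(name))
```

Under this reading, "St." and "Joseph" are both CorporationName and "Church" is CorporationNameOrganization. Whether a word such as "Lutheran" belongs with the specific name or with the organization type is exactly the kind of question the labeling guidance would need to settle.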

az0 commented 7 years ago

I opened a PR to close this issue: https://github.com/datamade/probablepeople/pull/55