Closed az0 closed 5 years ago
This is great, @az0
The most useful thing would be to label a small batch of the error cases, retrain the model with them, and retest.
Repeat until there are no more errors. Am I making sense?
We'd welcome that same approach for businesses and churches.
Yes, this makes sense. I will give that a try.
OK, I labeled some examples, trained the model, and I retested. I started by retesting the exact cases that I just labeled, but they still have errors. Is this normal?
Example labeled data
<Name><PrefixMarital>Miss</PrefixMarital> <GivenName>JEWELL</GivenName> <MiddleName>ANN</MiddleName> <Surname>CRAWFORD</Surname></Name>
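As a sanity check, an entry in that format can be read back with the standard library. A minimal sketch, assuming only the tag names shown in the example above (real parserator training files wrap many such <Name> entries in a parent collection):

```python
import xml.etree.ElementTree as ET

# One labeled entry, exactly as in the example above.
labeled = (
    '<Name><PrefixMarital>Miss</PrefixMarital> '
    '<GivenName>JEWELL</GivenName> <MiddleName>ANN</MiddleName> '
    '<Surname>CRAWFORD</Surname></Name>'
)

# Each child element is one labeled token: (text, label).
root = ET.fromstring(labeled)
tokens = [(child.text, child.tag) for child in root]
print(tokens)
# [('Miss', 'PrefixMarital'), ('JEWELL', 'GivenName'),
#  ('ANN', 'MiddleName'), ('CRAWFORD', 'Surname')]
```

This also makes it easy to diff the labels you assigned against what the tagger returns for the same string.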
Example error
>>> pp.tag('Miss JEWELL ANN CRAWFORD')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "c:\Python27\lib\site-packages\probablepeople\__init__.py", line 132, in tag
    raise RepeatedLabelError(raw_string, parse(raw_string), label)
probablepeople.RepeatedLabelError:
ERROR: Unable to tag this string because more than one area of the string has the same label
ORIGINAL STRING: Miss JEWELL ANN CRAWFORD
PARSED TOKENS: [('Miss', 'GivenName'), ('JEWELL', 'Surname'), ('ANN', 'GivenName'), ('CRAWFORD', 'Surname')]
UNCERTAIN LABEL: GivenName
It can be. Are you also using the rest of the existing training data?
Yes, I used a command like this
parserator train name_data\labeled\person_labeled.xml,phys.xml probablepeople --modelfile=person
The person_learned_settings.crfsuite file grew from 206 KB to 213 KB, and then I moved the .crfsuite file from my project folder to C:\Python27\Lib\site-packages\probablepeople so that the retrained model takes effect.
In my first round I labeled 30 cases, so it sounds like I should just continue labeling.
yeah, just keep labeling
The CMS Physician Compare National data set is a clean, structured data set covering 1M people in the US. I ran it through common mutations, including omitting or adding prefixes (Dr, Miss, etc.), using commas or not (for suffixes and credentials), using periods or not (e.g., M.D. vs MD), and including the credentials (MD, DDS) or not.
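That mutation step could look roughly like this. A minimal sketch: the field names (`first`, `last`, `credential`), the prefix list, and the probabilities are made up for illustration, not the actual script:

```python
import random

# Hypothetical prefix pool; '' means the prefix is omitted.
PREFIXES = ['', 'Dr ', 'Miss ', 'Mrs ', 'Mr ']

def mutate(record, rng):
    """Render one structured record as a randomly varied display string."""
    prefix = rng.choice(PREFIXES)
    name = f"{record['first']} {record['last']}"
    cred = record['credential']                    # e.g. 'MD'
    if rng.random() < 0.5:                         # periods or not: M.D. vs MD
        cred = '.'.join(cred) + '.'
    if rng.random() < 0.5:                         # include the credential or not
        sep = ', ' if rng.random() < 0.5 else ' '  # comma or not before it
        return f'{prefix}{name}{sep}{cred}'
    return f'{prefix}{name}'

rng = random.Random(0)  # seeded for reproducibility
rec = {'first': 'JEWELL', 'last': 'CRAWFORD', 'credential': 'MD'}
for _ in range(3):
    print(mutate(rec, rng))
```

Because the prefix is drawn independently of the rest of the record, comparing error rates across prefixes (as below) isolates the prefix's own effect.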
Then I tagged it with probablepeople, and it passed on 92% of records. The most common error category was a missing surname (sometimes it did not parse at all or thought the doctor was a corporation) or the wrong surname (such as the given name in the surname field).
Prefixes significantly affected tagging. It struggled with the prefix Miss, with a 14% error rate compared to a 4% error rate for Mrs. Keep in mind these prefixes were randomly assigned in the mutation step, so it seems the prefix itself caused the 10 point gap.
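For reference, per-prefix rates like those come from a simple tally. An illustrative sketch with made-up numbers (each entry pairs a string's assigned prefix with whether tagging failed):

```python
from collections import Counter

# Illustrative results, not the real CMS run:
# (prefix used in the mutation, did probablepeople fail on that string?)
results = [('Miss', True), ('Miss', False), ('Mrs', False), ('Mrs', False)]

attempts, errors = Counter(), Counter()
for prefix, failed in results:
    attempts[prefix] += 1
    if failed:
        errors[prefix] += 1

# Error rate per prefix = failures / attempts.
rates = {p: errors[p] / attempts[p] for p in attempts}
print(rates)  # {'Miss': 0.5, 'Mrs': 0.0}
```

Since prefixes were assigned uniformly at random, any gap between these rates can be attributed to the prefix rather than to the underlying names.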
What would be the best way to add the errors as training data to the upstream project? It would be easy to automate a large training set from the original CMS Physicians data set because it is structured, but I saw your advice to "start with a few (<5) examples." I may be willing to create a pull request.
My code and a sample of 1K errors are in a new GitHub repository.
This is beyond the specific scope of this issue ticket, but FYI: I would like to continue this cooperation after the CMS Physicians data set. I am also working on data sets of businesses and churches, and I can open separate issues/PRs for those.
P.S., thank you for the useful library.