datamade / probablepeople

:family: a python library for parsing unstructured western names into name components.
http://parserator.datamade.us/probablepeople
MIT License

CMS Physicians data set #50

Closed: az0 closed this issue 5 years ago

az0 commented 7 years ago

The CMS Physician Compare National data set is clean and structured, covering 1M people in the US. I ran it through common mutations, including omitting or adding prefixes (Dr, Miss, etc.), using commas or not (for suffixes and credentials), using periods or not (e.g., M.D. vs. MD), and including credentials (MD, DDS) or not.
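
Roughly, the mutation step looks like the sketch below. This is illustrative only: the prefix/credential variants and field names (first_name, etc.) are stand-ins for the CMS columns, and the real script differs in detail.

```python
import random

# Illustrative sketch of the mutation step; the variant lists and field
# names are stand-ins, not the exact ones used.
PREFIXES = ['', 'Dr ', 'Dr. ', 'Mr ', 'Mrs ', 'Miss ', 'Ms ']
CREDENTIALS = ['', ' MD', ' M.D.', ', MD', ', M.D.', ', DDS']

def mutate(record):
    """Build one unstructured name string from a structured record."""
    parts = [record['first_name'], record.get('middle_name', ''), record['last_name']]
    name = ' '.join(p for p in parts if p)
    return random.choice(PREFIXES) + name + random.choice(CREDENTIALS)
```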

Then I tagged it with probablepeople, and it passed on 92% of records. The most common error category was a missing surname (sometimes it did not parse at all, or it thought the doctor was a corporation) or the wrong surname (such as the given name landing in the surname field).
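
The pass/fail check was essentially the comparison sketched below (simplified; the expected surname comes from the structured columns of the original data set):

```python
import probablepeople as pp

def surname_ok(raw_string, expected_surname):
    """Return True if the tagger recovers the expected surname (sketch only)."""
    try:
        tagged, name_type = pp.tag(raw_string)
    except pp.RepeatedLabelError:
        return False  # no single consistent tagging was produced
    if name_type != 'Person':
        return False  # e.g. the doctor was tagged as a corporation
    return tagged.get('Surname', '').upper() == expected_surname.upper()
```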

Prefixes significantly affected tagging. It struggled with the prefix Miss, with a 14% error rate compared to 4% for Mrs. Please keep in mind that these prefixes were randomly assigned in the mutation step, so it seems the prefix itself caused roughly a 13-point decline in accuracy.

What would be the best way to add the errors as training data to the upstream project? It would be easy to automate a large training set from the original CMS Physicians data set because it is structured, but I saw your advice to "start with a few (<5) examples." I may be willing to create a pull request.
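
To give a sense of what automating it could look like, here is a rough sketch that writes examples in the same XML layout as person_labeled.xml. The NameCollection root tag and the whitespace handling are my assumptions from eyeballing that file; the token labels are the ones probablepeople already uses.

```python
from xml.etree import ElementTree as ET

def write_labeled_xml(examples, path):
    """Rough sketch: turn structured records into labeled training XML."""
    collection = ET.Element('NameCollection')  # assumed root tag
    for record in examples:
        name = ET.SubElement(collection, 'Name')
        for label, token in [('PrefixMarital', record.get('prefix')),
                             ('GivenName', record.get('first_name')),
                             ('MiddleName', record.get('middle_name')),
                             ('Surname', record.get('last_name'))]:
            if token:
                element = ET.SubElement(name, label)
                element.text = token
                element.tail = ' '  # tokens are whitespace-separated
    ET.ElementTree(collection).write(path)
```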

My code and a sample of 1K errors are in a new GitHub repository.

This is beyond the specific scope of this issue, but FYI: I would like to continue this cooperation after the CMS Physicians data set. I am also working on data sets of businesses and churches, and I can open separate issues/PRs for those.

P.S., thank you for the useful library.

fgregg commented 7 years ago

This is great, @az0

The most useful thing would be to:

  1. Take a sample of incorrect records (say 30), label those, and add them to the training set: https://github.com/datamade/probablepeople/blob/master/name_data/labeled/person_labeled.xml
  2. Retrain the model with this additional data.
  3. Take a sample of errors produced by the new model.

Repeat until there are no more errors. Am I making sense?
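
For the sampling part of steps 1 and 3, something like this untested sketch would do (error_strings stands for whatever list of mis-tagged raw strings you collect):

```python
import random

def write_sample_for_labeling(error_strings, path='unlabeled_sample.csv', k=30):
    """Write ~k mis-tagged strings, one per line, for interactive labeling
    (e.g. with parserator's label command)."""
    sample = random.sample(error_strings, min(k, len(error_strings)))
    with open(path, 'w') as f:
        for raw_string in sample:
            f.write(raw_string + '\n')
```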

We'd welcome that same approach for businesses and churches.

az0 commented 7 years ago

Yes, this makes sense. I will give that a try.

az0 commented 7 years ago

OK, I labeled some examples, trained the model, and retested. I started by retesting the exact cases I had just labeled, but they still produce errors. Is this normal?

Example labeled data:

<Name><PrefixMarital>Miss</PrefixMarital> <GivenName>JEWELL</GivenName> <MiddleName>ANN</MiddleName> <Surname>CRAWFORD</Surname></Name>

Example error

>>> pp.tag('Miss JEWELL ANN CRAWFORD')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "c:\Python27\lib\site-packages\probablepeople\__init__.py", line 132, in tag
    raise RepeatedLabelError(raw_string, parse(raw_string), label)
probablepeople.RepeatedLabelError:
ERROR: Unable to tag this string because more than one area of the string has the same label

ORIGINAL STRING:  Miss JEWELL ANN CRAWFORD
PARSED TOKENS:    [('Miss', 'GivenName'), ('JEWELL', 'Surname'), ('ANN', 'GivenName'), ('CRAWFORD', 'Surname')]
UNCERTAIN LABEL:  GivenName
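
In the meantime, parse() still shows what the model predicted for each token instead of raising, so I use it to inspect these cases (a workaround, not a fix):

```python
import probablepeople as pp

raw_string = 'Miss JEWELL ANN CRAWFORD'
try:
    tagged, name_type = pp.tag(raw_string)
except pp.RepeatedLabelError:
    # tag() refuses to build a dict when the same label covers more than one
    # area of the string; parse() returns the raw (token, label) pairs.
    print(pp.parse(raw_string))
```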

fgregg commented 7 years ago

It can be. Are you also using the rest of the existing training data?

az0 commented 7 years ago

Yes, I used a command like this:

parserator train name_data\labeled\person_labeled.xml,phys.xml probablepeople --modelfile=person

The person_learned_settings.crfsuite file grew from 206 KB to 213 KB, and I then moved the .crfsuite file from my project folder to C:\Python27\Lib\site-packages\probablepeople so that it takes effect.
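
As a sanity check that the retrained model is the copy actually being loaded, I print the package location before testing:

```python
import probablepeople
# Shows which installed copy of the package (and therefore which .crfsuite
# model file sitting next to it) gets imported.
print(probablepeople.__file__)
```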

My first round was labeling 30 cases, so it sounds like I should just continue labeling.

fgregg commented 7 years ago

yeah, just keep labeling