derek73 / python-nameparser

A simple Python module for parsing human names into their individual components
http://nameparser.readthedocs.org/en/latest/

Judge-related titles not parsing #9

Closed end0 closed 10 years ago

end0 commented 10 years ago

Hey -

First off, awesome package. I've been working with a dataset of ~3000 judges and associated titles, and noticed nameparser doesn't pick most (well, any) of them up. Below is the filtered list with at least a few examples/variations on each. I'm happy to do the changes if you'd like. Let me know.

common

Magistrate Judge John F. Forster, Jr
Magistrate Judge Joaquin V.E. Manibusan, Jr
Magistrate-Judge Elizabeth Todd Campbell
Mag-Judge Harwell G Davis, III
Mag. Judge Byron G. Cudmore
Chief Judge J. Leon Holmes
Chief Judge Sharon Lovelace Blackburn
Judge James M. Moody
Judge G. Thomas Eisele
Judge Callie V. S. Granade
Judge C Lynwood Smith, Jr
Senior Judge Charles R. Butler, Jr
Senior Judge Harold D. Vietor
Senior Judge Virgil Pittman
Honorable Terry F. Moorer
Honorable W. Harold Albritton, III
Honorable Judge W. Harold Albritton, III
Honorable Judge Terry F. Moorer
Honorable Judge Susan Russ Walker
Hon. Marian W. Payson
Hon. Charles J. Siragusa

rare

US Magistrate Judge T Michael Putnam
Designated Judge David A. Ezra
Sr US District Judge Richard G Kopf

end0 commented 10 years ago

Playing with it a bit. It seems that the set below, added to TITLES, solves the common cases, and even some of the rarer ones.

['senior', 'magistrate', 'judge', 'mag', 'judge', 'magistrate-judge', 'mag-judge', 'honorable', 'hon', 'designated']

I'm happy to donate the full names dataset to you for testing purposes. Just let me know.

derek73 commented 10 years ago

Wow, I'm surprised that some of those titles aren't already in there. Thanks very much for passing those along. I'll add them to the project's constant and reopen this issue to track it.

I would gladly accept your real world dataset to test against. I haven't gotten to test it against much real world data.

I'm surprised by the last two you mention. "Hon." should be correctly parsed, e.g.:

Sprout:python-nameparser derek$ ./tests.py "Hon. Charles J. Siragusa"
<HumanName : [
    Title: 'Hon.' 
    First: 'Charles' 
    Middle: 'J.' 
    Last: 'Siragusa' 
    Suffix: ''
    Nickname: ''
]>

Are you using version 0.2.9 (the latest)?

If you happen to be dealing with a number of first names and titles like "Sir Gerald", there are some changes in master that might be helpful. See #7

end0 commented 10 years ago

I have a dataset ready for you (though there are some non-conforming names). I'm not sure how to upload directly to the issue tracker, so let me know where to send or how to upload directly. Alternatively, I can paste here, but don't want to pollute the issue tracker.

[EDIT] Just realized I can create a gist which should be just as good. https://gist.github.com/end0/daa8378d06642b69db77 [/EDIT]

I've been playing with it a bit more, and it looks like this set takes care of most of the issues, though I worry about over-specifying for this dataset (especially the first half of the set).

['us', 'sr judge', 'special', 'senior-judge', 'pslc', 'pro se', 'law clerk', 'docket', 'mag/judge', 'federal', 'edmi', 'discovery', 'senior', 'magistrate', 'judge', 'mag', 'judge', 'magistrate-judge', 'mag-judge', 'honorable', 'hon', 'designated', 'district']

Just FYI, these are the names / titles of federal US judges. It may not encompass state or other courts properly, but hopefully a good start.

derek73 commented 10 years ago

Thanks a bunch. That's really helpful. I'll get those added to the project.

Titles can be chained, so adding "sr" and "judge" should also take care of "sr judge". The same is not true of "sr-judge", though I'm wondering whether it might be a good idea to count "-" or "/" as a space, so that joined titles like that can chain too.
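
To make the chaining idea concrete, here is a minimal standalone sketch. It is not nameparser's actual implementation: the `split_titles` function, the `joiners` parameter, and the small `TITLES` set are all hypothetical, made up for illustration. It consumes leading tokens as titles and, optionally, treats "-" and "/" inside a token as if they were spaces.

```python
import re

# Hypothetical title set for illustration only; not the library's constant.
TITLES = {
    "sr", "senior", "mag", "magistrate", "judge", "chief",
    "hon", "honorable", "us", "district", "designated",
}


def split_titles(full_name, titles=TITLES, joiners="-/"):
    """Return (title_tokens, remaining_tokens) for a name string.

    Leading tokens are consumed as titles for as long as every piece of
    the token is a known title. A token is cut into pieces on the
    characters in `joiners`, as if those characters were spaces.
    """
    tokens = full_name.split()
    title_tokens = []
    for i, token in enumerate(tokens):
        if joiners:
            pieces = re.split("[" + re.escape(joiners) + "]", token)
        else:
            pieces = [token]
        # Ignore case and a trailing period ("Mag." -> "mag").
        normalized = [p.rstrip(".").lower() for p in pieces if p]
        if normalized and all(p in titles for p in normalized):
            title_tokens.append(token)
        else:
            # First non-title token: everything from here on is the name.
            return title_tokens, tokens[i:]
    return title_tokens, []


print(split_titles("Sr US District Judge Richard G Kopf"))
# (['Sr', 'US', 'District', 'Judge'], ['Richard', 'G', 'Kopf'])
print(split_titles("Mag-Judge Harwell G Davis, III"))
# (['Mag-Judge'], ['Harwell', 'G', 'Davis,', 'III'])
print(split_titles("Mag-Judge Harwell G Davis, III", joiners=""))
# ([], ['Mag-Judge', 'Harwell', 'G', 'Davis,', 'III'])
```

With `joiners=""` the hyphenated form is only recognized if "mag-judge" itself is listed, which is the situation described above.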

The only reason that a potential title should be omitted from the titles constant is if it could also be a first name. Any strings in the titles constant will never be considered a first name. Other than that, I don't see any reason to not include every possible title.
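
One rough way to apply that rule before adding a batch of titles is to check the candidates against a list of known first names. The sketch below is only an illustration: `partition_titles` is a hypothetical helper, not part of nameparser, and the tiny first-name sample is made up.

```python
def partition_titles(candidates, first_names):
    """Split candidate titles into (safe, risky).

    A candidate is "risky" if it also appears in `first_names`, since a
    string listed as a title would never be treated as a first name.
    Comparison ignores case; input order is preserved.
    """
    known = {name.lower() for name in first_names}
    safe, risky = [], []
    for candidate in candidates:
        if candidate.lower() in known:
            risky.append(candidate)
        else:
            safe.append(candidate)
    return safe, risky


# Hypothetical sample of first names that are also used as titles.
sample_first_names = {"Dean", "Earl", "Major"}

safe, risky = partition_titles(
    ["judge", "magistrate", "dean", "hon"], sample_first_names
)
print(safe)   # ['judge', 'magistrate', 'hon']
print(risky)  # ['dean']
```

Anything that lands in `risky` would need a judgment call rather than being added automatically.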

derek73 commented 10 years ago

I think I got all these added, as well as a few that some quick googling turned up. The data you provided made me notice a few other things we might be able to handle better too. Thanks for providing it.