derek73 / python-nameparser

A simple Python module for parsing human names into their individual components
http://nameparser.readthedocs.org/en/latest/

Check for known suffixes when processing nicknames #111

Open aikimark opened 4 years ago

aikimark commented 4 years ago

In my names, I have 'complicated' text. There are situations where there might be both a delimited nickname and a delimited suffix. The suffix text is being added to the nickname.

Since you have a list of known suffixes, I would hope it is easy to check the parsed nickname against that list and add any matching text to the suffixes instead. I haven't looked closely at this part of the code yet.

Another possible logic addition might be to check the location of the delimited nickname and add any additional (beyond the first) items to the suffix.

I would rather discuss this idea with you before looking at the code. You might know, in advance, that I'll break some important functionality.

derek73 commented 4 years ago

Currently the way the parser identifies nicknames is to strip out anything that is inside of parentheses or double quotes and stick it in the nickname bucket before parsing the name string. If I understand correctly, you have instances where a suffix is included in parentheses or double quotes? Just curious, could you provide an example or two?

It does not seem like it would be too hard to check whether the text inside parentheses or double quotes is in the suffix list and, if it is, add it to the suffix bucket instead of the nickname bucket.
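A rough sketch of that check, written as standalone code rather than against the library's internals. The suffix set, the regex, and the function name `split_delimited` are all assumptions made for illustration; the parser's real suffix list and nickname handling may differ.

```python
import re

# Hypothetical, abbreviated suffix list (NOT the library's actual list).
KNOWN_SUFFIXES = {"jr", "sr", "ii", "iii", "iv", "ret", "md", "phd"}

# Matches text inside parentheses or inside double quotes.
DELIMITED = re.compile(r'\(([^)]*)\)|"([^"]*)"')


def split_delimited(name):
    """Pull delimited text out of `name` and sort it into two buckets.

    Returns (remainder, nicknames, suffixes). Delimited text that matches
    a known suffix (ignoring case and periods) goes to `suffixes`;
    everything else goes to `nicknames`.
    """
    nicknames, suffixes = [], []

    def collect(match):
        text = (match.group(1) or match.group(2) or "").strip()
        key = text.replace(".", "").lower()
        if key in KNOWN_SUFFIXES:
            suffixes.append(text)
        elif text:
            nicknames.append(text)
        return " "  # leave a space where the delimited text was

    remainder = DELIMITED.sub(collect, name)
    remainder = " ".join(remainder.split())  # collapse leftover whitespace
    return remainder, nicknames, suffixes
```

With this sketch, `split_delimited("Lon (Jr.) Williams")` would put `Jr.` in the suffix bucket, while delimited text that is not in the assumed suffix set would still land in the nickname bucket.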

aikimark commented 4 years ago

Here are two names where the parenthesized text is part or all of the suffix:

Andrew Perkins, Jr., Col. (Ret)
Lon (Jr.) Williams

Here are two names where the delimited text is most likely a genuine nickname:

JEFFREY (JD) BRICKEN
JEFFREY D 'JD' KEISTER

Here are two names where multiple delimiters are used. The double apostrophe was probably used in place of a quote character. Since this data was imported from CSV, I assume that the CSV format output process, or some process/software upstream of that, prevents quote characters inside fields.

WILLIAM A (''DREW'') MARSH III
S.E. ''ED'' WHITE
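One way to handle the doubled apostrophes would be to normalize them before parsing. This is only a sketch: the function name `normalize_quotes` is made up here, and it assumes that two consecutive apostrophes always stand in for a single double-quote character, which may not hold for every data set.

```python
def normalize_quotes(name):
    """Replace each pair of consecutive apostrophes with one double quote.

    Assumption: '' in the input is a stand-in for " (e.g. introduced by
    a CSV export), never two intentional apostrophes.
    """
    return name.replace("''", '"')


for raw in ["WILLIAM A (''DREW'') MARSH III", "S.E. ''ED'' WHITE"]:
    print(normalize_quotes(raw))
```

After this step the first name still has a quoted string nested inside parentheses, so whatever extracts the nickname would need to strip both layers of delimiters.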

I also have a mixture of "MD", "M.D.", "M.D", "J.D.", "J.D", and "JD" titles that should parse correctly. I haven't checked whether your patterns cover these variants. I thought, incorrectly, that some of these were delimited in such a way as to be interpreted as nicknames.
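For the punctuation variants, a simple comparison key might be enough. The helper below is a hypothetical illustration, not the library's matching logic: it drops periods and lowercases the token so that all spellings of the same credential compare equal.

```python
def credential_key(token):
    """Return a comparison key that ignores periods and letter case."""
    return token.replace(".", "").strip().lower()


variants = ["MD", "M.D.", "M.D", "J.D.", "J.D", "JD"]
keys = [credential_key(v) for v in variants]
```

A key like this cannot tell a credential from a nickname by itself: the "JD" in JEFFREY (JD) BRICKEN produces the same key as "J.D.", so position or delimiters would still have to decide which bucket it belongs in.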