derek73 / python-nameparser

A simple Python module for parsing human names into their individual components
http://nameparser.readthedocs.org/en/latest/

First name Van (which is also sometimes a prefix) not handled correctly #24

Closed htoothrot closed 9 years ago

htoothrot commented 9 years ago
>>> from nameparser import HumanName
>>> HumanName('Van Nguyen')
<HumanName : [
    title: ''
    first: 'Van Nguyen'
    middle: ''
    last: ''
    suffix: ''
    nickname: ''
]>
>>> import nameparser
>>> nameparser.VERSION
(0, 3, 3)

derek73 commented 9 years ago

This would be considered a bug if "van" is more common as a first name than as a prefix to last names, e.g. "Vincent Van Gogh". I'm not actually sure which is more common; it probably depends on whether you're dealing with more Asian or more European names.

You can work around it for your dataset by adjusting the prefixes constant to remove "van", e.g.:

>>> from nameparser.config import CONSTANTS
>>> CONSTANTS.prefixes.remove('van')

It does seem that when there are only 2 name parts, as in your example, it would be correct more often to assume the first part is a first name. I guess that's probably true for all the prefixes, so that may be a change we could make to the parser.
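
To make the idea concrete, here is a minimal sketch of that rule in plain Python. It is not the parser's actual code: `PREFIXES` and `split_name` are hypothetical names made up for illustration, and the prefix list is a small sample rather than the library's real constant.

```python
# Hypothetical sketch of the "two parts means first + last" rule.
# PREFIXES and split_name are illustrative names, not nameparser APIs.
PREFIXES = {"van", "von", "de", "del", "la"}


def split_name(full_name):
    """Split a name into first/middle/last parts."""
    parts = full_name.split()
    if len(parts) <= 2:
        # With at most two parts, never treat the first word as a prefix.
        first = parts[0] if parts else ""
        last = parts[1] if len(parts) == 2 else ""
        return {"first": first, "middle": "", "last": last}

    first, rest = parts[0], parts[1:]
    # With three or more parts, a prefix that is not the final word
    # starts the last name (e.g. "Van Gogh").
    for i, word in enumerate(rest[:-1]):
        if word.lower() in PREFIXES:
            return {
                "first": first,
                "middle": " ".join(rest[:i]),
                "last": " ".join(rest[i:]),
            }
    return {"first": first, "middle": " ".join(rest[:-1]), "last": rest[-1]}
```

Under these assumptions, `split_name('Van Nguyen')` gives first `'Van'` and last `'Nguyen'`, while `split_name('Vincent Van Gogh')` still keeps `'Van Gogh'` together as the last name.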

htoothrot commented 9 years ago

I'm of the opinion that assuming it's a first name when there are only 2 parts is an improvement: even when the input is only a last name like "Van Gogh", the current result is still wrong, because it says all of "Van Gogh" is the first name.

I did consider removing that prefix, but the names I'm handling aren't primarily Asian, so it would just move the problem.

derek73 commented 9 years ago

This might do it: cc982310ff7ec69515b18bd371e3a68c25c2813c

htoothrot commented 9 years ago

@derek73 I tested this branch and it seems to work as intended. Thank you for the work you have done.

I also noticed this while reviewing some of the names (the issue was already present before this change, so I should probably open a new issue):

>>> HumanName('Van Nguyen III')
<HumanName : [
    title: ''
    first: 'Van Nguyen'
    middle: ''
    last: 'III'
    suffix: ''
    nickname: ''
]>

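One possible way to handle this case, sketched in plain Python, is to peel trailing suffixes off before counting name parts, so that "Van Nguyen III" is treated as a two-part name plus a suffix. This is only an illustration of the idea, not how the library does it: `SUFFIXES` and `strip_suffixes` are hypothetical names, and the suffix list is a small sample.

```python
# Hypothetical sketch: separate trailing suffixes before counting parts.
# SUFFIXES and strip_suffixes are illustrative names, not nameparser APIs.
SUFFIXES = {"jr", "sr", "ii", "iii", "iv"}


def strip_suffixes(full_name):
    """Return (name_parts, suffix_parts) for a whitespace-separated name."""
    parts = full_name.split()
    suffixes = []
    # Pop suffixes from the end, but always leave at least one name part.
    while len(parts) > 1 and parts[-1].lower().rstrip(".") in SUFFIXES:
        suffixes.insert(0, parts.pop())
    return parts, suffixes
```

With this sketch, `strip_suffixes('Van Nguyen III')` returns `(['Van', 'Nguyen'], ['III'])`, so a two-part check would see only "Van" and "Nguyen".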
jonathanmorgan commented 9 years ago

Any idea when this will make it into a release? I am running into this particular problem as well, and am trying to figure out what default behavior I should plan for with two-word names whose first word could be either a first name or a prefix.

Thanks!

Jon Morgan

jonathanmorgan commented 9 years ago

Thank you very much! If there is ever anything I can do to help with this project, let me know. I work with news content, and having a concrete, standardized way of parsing names and then using the parsed result to generate full name strings has been a substantial help.