Closed htoothrot closed 9 years ago
This would be considered a bug if "van" is more common as a first name than it is as a prefix to last names, e.g. "Vincent Van Gogh". I'm not actually sure which is more common, probably depends more on wether you're dealing with more Asian vs European names.
You can work around it for your dataset by adjusting the prefixes
constant to remove "van", e.g.:
>>> from nameparser.config import CONSTANTS
>>> CONSTANTS.prefixes.remove('van')
It does seem if there are only 2 name parts like in your example, it would be correct more often if we assume it is a first name. I guess that's probably true for all the prefixes, so maybe a change we could make to the parser.
I'm of the opinion that assuming it's a first name when there are only 2 parts is an improvement, as even in the case where it is only a last name like "Van Gogh", it's still wrong in saying all of "Van Gogh" is the first name.
I did consider removing that prefix, but the names I'm handling aren't primarily Asian, so it would just move the problem.
this might do it cc982310ff7ec69515b18bd371e3a68c25c2813c
@derek73 I tested this branch and it seems to work as intended. Thank you for the work you have done.
I also noticed this when reviewing some of the names (this issue was already present before this change though, so I might ought to open a new issue):
>>> HumanName('Van Nguyen III')
14: <HumanName : [
title: ''
first: 'Van Nguyen'
middle: ''
last: 'III'
suffix: ''
nickname: ''
]>
>>>
Any idea when this will make it into a release? I am running into this particular problem as well, trying to figure out what default behavior for "two word names with first word that could be first name or prefix" I should plan for.
Thanks!
Jon Morgan
Thank you very much! If there is ever anything I can do to help with this project, let me know. I work with news content, and having a concrete, standardized way of parsing names, then using the parsed result to generate full name strings has been a substantial help.