Does not work for chinese name

pickfire commented 5 years ago

The parser seems to parse Chinese names written in English incorrectly. (The examples below use a Malaysian Chinese name.)

Names without nickname. Current:

>>> name = HumanName('Tham Jun Hoe')
>>> name
<HumanName : [
        title: ''
        first: 'Tham'
        middle: 'Jun'
        last: 'Hoe'
        suffix: ''
        nickname: ''
]>

Expected:

>>> name = HumanName('Tham Jun Hoe')
>>> name
<HumanName : [
        title: ''
        first: 'Jun Hoe'
        middle: ''
        last: 'Tham'
        suffix: ''
        nickname: ''
]>

Names with nickname. Current:

>>> name = HumanName('Ivan Tham Jun Hoe')
>>> name
<HumanName : [
        title: ''
        first: 'Ivan'
        middle: 'Tham Jun'
        last: 'Hoe'
        suffix: ''
        nickname: ''
]>

Expected:

>>> name = HumanName('Ivan Tham Jun Hoe')
>>> name
<HumanName : [
        title: ''
        first: 'Jun Hoe'
        middle: ''
        last: 'Tham'
        suffix: ''
        nickname: 'Ivan'
]>

Name in Chinese characters (it may be possible to use jieba to split the name into first and last). Current:

>>> name = HumanName('谭俊浩')
>>> name
<HumanName : [
        title: ''
        first: '谭俊浩'
        middle: ''
        last: ''
        suffix: ''
        nickname: ''
]>

Expected:

>>> name = HumanName('谭俊浩')
>>> name
<HumanName : [
        title: ''
        first: '俊浩'
        middle: ''
        last: '谭'
        suffix: ''
        nickname: ''
]>

Names from China are usually written a bit differently in English: there is no space within the first name. For example, Tham Junhoe.

derek73 commented 5 years ago

You are correct, this parser does not parse Chinese names correctly.

I know very little about Chinese names, but as far as I can tell from your description Chinese names are in a reverse order to English or Latin/Germanic names. If that's the case, I don't see any way for the parser to know if it's parsing an English name or a Chinese name written in English/Pinyin.

This parser basically splits strings on spaces and then sticks the first name in the first slot, and so on down the list until there are no more names, so the final one must be the last name. It seems like Chinese names follow a reverse order, and there are different markers for different name parts. So some parts of that seem familiar from how this parser works, but because they are in reverse order it does not seem compatible with also parsing English names without knowing which type you are parsing.

Obviously with Chinese characters you could look at the characters and know that it's Chinese, but with Pinyin I am not aware of a way to know. Even if you did know it was Chinese, you would still potentially write an entirely different parse tree for it that wouldn't be that related to the English one. My lack of knowledge of the structure of Chinese names means I don't know the best way to handle Chinese names and I'm probably not the right person to write a parser for them. I am happy to take pull requests (with tests) and advise on integrating it into this parser project. Let me know if you have any ideas.
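To illustrate the point about looking at the characters, here is a minimal sketch of such a check. It is not part of nameparser; the function name `contains_han` is made up for this example, and it only tests the main CJK Unified Ideographs block, so rarer characters in the extension blocks would be missed.

```python
def contains_han(text: str) -> bool:
    """Return True if any character is in the CJK Unified Ideographs block."""
    # U+4E00..U+9FFF holds the commonly used Han characters. The extension
    # blocks are deliberately ignored to keep this sketch short.
    return any("\u4e00" <= ch <= "\u9fff" for ch in text)


print(contains_han("谭俊浩"))        # True
print(contains_han("Tham Jun Hoe"))  # False
```

As noted above, a check like this says nothing about a Chinese name written in Pinyin or another romanization.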

derek73 commented 5 years ago

One idea, though it's probably not a good one: we could introduce a parameter that reverses the initial list (of words split on spaces) before handing it off to the parse tree. I sorta doubt it will end up being that simple, but you could test how this would work by just reversing your names and seeing what it spits out. Conceptually it could also be used for right-to-left languages, but Arabic names also have a fundamentally different structure that's not well supported by this parser except for simple 2-word pairs.
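The manual test suggested here can be sketched in a few lines. The helper name `reverse_name_parts` is hypothetical and nothing below touches the parser itself; it only produces the reversed string that one would then feed to `HumanName`.

```python
def reverse_name_parts(full_name: str) -> str:
    """Reverse the whitespace-separated pieces of a name."""
    return " ".join(reversed(full_name.split()))


print(reverse_name_parts("Tham Jun Hoe"))  # Hoe Jun Tham
```

Going by the slot-filling behaviour described earlier, the reversed string would presumably put 'Tham' in the last slot but still spread 'Hoe' and 'Jun' over first and middle, and in the wrong order, which supports the doubt that reversal alone is enough.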

pickfire commented 5 years ago

@derek73 I think the list in https://en.wikipedia.org/wiki/List_of_common_Chinese_surnames is pretty comprehensive, but I am not sure whether we could use it to split off the family name (last name). I am also not sure the list covers everything, though most of the rare surnames I have seen are there.

Otherwise, if we know the characters are Chinese, maybe we could take the first name as the slice [n/2]:[n] (with n/2 rounded down) and the family name as whatever comes before it. People usually have only one family name, and it usually has fewer characters than the first name.

In my entire life, I have only seen the following patterns (for how the character count splits between last name and first name), so maybe we could use this? (The third name is censored.)

林丹 - 林 (last name) and 丹 (first name)
王小明 - 王 (last name) and 小明 (first name)
陈刘叉叉 - 陈刘 (last name) and 叉叉 (first name)

I have never seen anyone whose first name is shorter than their last name.
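The three patterns above can be written as a small sketch. This is only an illustration of the character-count heuristic, with a made-up function name `split_han_name`; it assumes the input is already known to consist of Chinese characters with the family name first.

```python
from typing import Tuple


def split_han_name(name: str) -> Tuple[str, str]:
    """Split a Chinese-character name into (last, first) by character count."""
    # The family name takes the first len(name) // 2 characters:
    # 2 chars -> 1 + 1, 3 chars -> 1 + 2, 4 chars -> 2 + 2.
    cut = max(1, len(name) // 2)
    return name[:cut], name[cut:]


for example in ("林丹", "王小明", "陈刘叉叉"):
    print(split_han_name(example))
```

A three-character name made of a two-character family name and a one-character first name would be split wrongly by this rule, so a surname list would still be needed for those cases.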

datatalking commented 4 years ago

With this enhancement, is there a label that the user could add, say language = Mandarin? The user would be responsible for indicating the language, but then we would know which morphology rules to use.
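A user-supplied hint could look roughly like the sketch below. Everything here is hypothetical: `SimpleName`, `split_with_hint` and the `family_first` flag are invented for illustration and are not nameparser options, and the function ignores titles, suffixes and nicknames entirely.

```python
from typing import NamedTuple


class SimpleName(NamedTuple):
    first: str
    last: str


def split_with_hint(full_name: str, family_first: bool = False) -> SimpleName:
    """Split a name into first and last, using a caller-supplied ordering hint."""
    parts = full_name.split()
    if len(parts) < 2:
        # A single word goes into the first slot, like the existing behaviour.
        return SimpleName(first=full_name.strip(), last="")
    if family_first:
        # Chinese ordering: the family name is the leading word.
        return SimpleName(first=" ".join(parts[1:]), last=parts[0])
    # Western ordering: the family name is the trailing word.
    return SimpleName(first=" ".join(parts[:-1]), last=parts[-1])


print(split_with_hint("Tham Jun Hoe", family_first=True))
# SimpleName(first='Jun Hoe', last='Tham')
```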

weegolo commented 3 years ago

Just to complicate matters, I commonly see CJK names written as either "Tham Jun Ho" or "Jun Ho Tham" interchangeably, as individuals shift from the Chinese (Familyname Firstnames) to Western (Firstname Familyname) approaches.

This, plus all of the complexity around the nickname and the spacing (or lack thereof) between the two halves of the first name ("Jun Ho" vs "Junho"), makes parsing based on position in the string difficult. This means we'd have to guess which of the names is a Family name.

The Wikipedia list is not comprehensive: it only lists 30 of the 1,500 Taiwanese Familynames, for example. The Faker package does generate random names, has localisation packages that include lists of surnames in each locale, and is actively maintained - that might be a useful resource?
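Guessing the family name from a surname list, regardless of position, might be sketched as follows. The set `KNOWN_SURNAMES` is a tiny illustrative placeholder rather than real data, and `guess_family_name` is a hypothetical helper; a real list would have to come from a larger source such as the ones discussed above.

```python
from typing import Optional, Tuple

# Illustrative placeholder only; a usable list would be far larger.
KNOWN_SURNAMES = {"tham", "tan", "lim", "wong"}


def guess_family_name(full_name: str) -> Optional[Tuple[str, str]]:
    """Return (family, given) if exactly one end of the name is a known surname."""
    parts = full_name.split()
    if len(parts) < 2:
        return None
    first_hit = parts[0].lower() in KNOWN_SURNAMES
    last_hit = parts[-1].lower() in KNOWN_SURNAMES
    if first_hit and not last_hit:
        return parts[0], " ".join(parts[1:])
    if last_hit and not first_hit:
        return parts[-1], " ".join(parts[:-1])
    return None  # neither end or both ends match, so no safe guess


print(guess_family_name("Tham Jun Ho"))  # ('Tham', 'Jun Ho')
print(guess_family_name("Jun Ho Tham"))  # ('Tham', 'Jun Ho')
```

The weakness shows up as soon as the list grows: if "ho" were also in the set, "Tham Jun Ho" would match at both ends and the function would give up.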

The CJK diasporas mean an approach based on language or locale is unlikely to work. You're just as likely to find Tham Jun Ho in Sydney as in Seoul.

I'd be happy to help on this but don't have the experience to lead it.

pickfire commented 3 years ago

I think something harder to solve is that some people have a two-character family name, such as Tan Liu. So 3 words may be parsed in either of the two ways below:

Tan | Liu Ho
Tan Liu | Ho (since some people may have a single word as their first name)

With 4 words, though, it is likely 2 words plus 2 words. Maybe this could be hard-coded, since family names that contain two words are rare?
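The hard-coded approach could be sketched like this. The set `COMPOUND_FAMILY_NAMES` holds only the "Tan Liu" example from this comment and is not a real list; `candidate_splits` is a hypothetical helper that assumes the family name comes first and returns every split it considers plausible instead of picking one.

```python
from typing import List, Tuple

# Hard-coded two-word family names; only the example from this thread.
COMPOUND_FAMILY_NAMES = {("tan", "liu")}


def candidate_splits(full_name: str) -> List[Tuple[str, str]]:
    """List plausible (family, given) splits for a family-name-first string."""
    parts = full_name.split()
    candidates = []
    if len(parts) >= 2:
        # Default reading: a one-word family name.
        candidates.append((parts[0], " ".join(parts[1:])))
    if len(parts) >= 3 and tuple(p.lower() for p in parts[:2]) in COMPOUND_FAMILY_NAMES:
        # Alternative reading: a known two-word family name.
        candidates.append((" ".join(parts[:2]), " ".join(parts[2:])))
    return candidates


print(candidate_splits("Tan Liu Ho"))
# [('Tan', 'Liu Ho'), ('Tan Liu', 'Ho')]
```

Returning both readings keeps the ambiguity visible to the caller, which seems safer than silently choosing one for the 3-word case.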

derek73 / python-nameparser

Does not work for chinese name #83