berkmancenter / namae

Namae (名前) parses personal names and splits them into their component parts.

Suffixes of V (fifth) or X (tenth) are not parsed correctly #35

Open a-maas opened 4 years ago

a-maas commented 4 years ago

If a name has a suffix of V (fifth), the parser treats it as the family name rather than as a suffix:

>> Namae.parse("Adam Burren V")
=> [#<Name family="V" given="Adam Burren">]

This is because the suffix regex is:

/\s*\b(JR|Jr|jr|SR|Sr|sr|[IVX]{2,})(\.|\b)/

The part that accounts for roman numeral suffixes is [IVX]{2,}, which requires two or more characters, whereas V or X is only one.
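To make the behaviour concrete, here is a minimal sketch that applies the quoted pattern directly with plain Ruby, without going through Namae. The helper name `suffix_match` is made up for illustration, and the unanchored `match` call is a simplification of however the parser actually applies the pattern.

```ruby
# The suffix pattern quoted above, copied verbatim.
SUFFIX = /\s*\b(JR|Jr|jr|SR|Sr|sr|[IVX]{2,})(\.|\b)/

# Hypothetical helper: returns the captured suffix, or nil if none is found.
def suffix_match(name)
  m = SUFFIX.match(name)
  m && m[1]
end

suffix_match("Adam Burren III") # => "III"
suffix_match("Adam Burren Jr.") # => "Jr"
suffix_match("Adam Burren V")   # => nil, a single numeral fails {2,}
suffix_match("Adam Burren X")   # => nil
```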

Perhaps this is intentional, because looking for a single character may be problematic and cause a lot of false positives, but I wanted to create an issue for it and see.

inukshuk commented 4 years ago

If I remember correctly, your guess is right: we chose this default pattern to avoid false positives from matching initials. All of these patterns are configurable (e.g., via Namae::Parser.instance.options for the singleton instance), so if you expect those kinds of suffixes (or do not expect initials), you can set a more suitable pattern.
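As a rough sketch of what such an override could look like: the pattern below adds a single V or X to the default, and the helper assigns it to a parser's options. The `:suffix` key, the `RELAXED_SUFFIX` constant, and both helper names are assumptions for illustration, not confirmed in this thread, so check them against the version of Namae you are running.

```ruby
# Assumed variant of the default pattern that also accepts a lone V or X.
# [IVX]{2,} stays first so longer numerals such as "VI" are matched whole.
RELAXED_SUFFIX = /\s*\b(JR|Jr|jr|SR|Sr|sr|[IVX]{2,}|[VX])(\.|\b)/

# Hypothetical helper: returns the captured suffix, or nil if none is found.
def relaxed_suffix_match(name)
  m = RELAXED_SUFFIX.match(name)
  m && m[1]
end

# Hypothetical helper: installs the pattern on any object exposing an
# options hash. The :suffix key is an assumption about Namae's option names.
def allow_single_numeral_suffixes!(parser)
  parser.options[:suffix] = RELAXED_SUFFIX
  parser
end

relaxed_suffix_match("Adam Burren V")  # => "V"
relaxed_suffix_match("Adam Burren VI") # => "VI"
relaxed_suffix_match("V. S. Naipaul")  # => "V", the initial is now a false positive
```

With the gem loaded, the call would be `allow_single_numeral_suffixes!(Namae::Parser.instance)`. The last line shows the trade-off mentioned above: once single letters are allowed, an initial V or X can be mistaken for a suffix, so this only makes sense for data that does not contain such initials.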