Greater BCP47 compatibility?

despresc commented 3 years ago

Are there plans for greater BCP47 compatibility? The tags es-419, en-gb, en-TP, and cmn, for instance, are not recognized by the library. That last tag is even the preferred encoding of Mandarin Chinese (over the currently recognized zh-guoyu, and others) according to the IANA registry.

The Region component in particular seems odd, since it will fail to recognize any of the three-digit region codes (like 419) registered with the IANA, but it will recognize tags with three-digit country codes that already have two-letter codes, like in zh-012. (Such codes must not be used in tags, according to https://tools.ietf.org/html/rfc5646#section-2.2.4, item 4.D).
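To make the syntax point concrete, here is a minimal Haskell sketch of the region subtag shape from RFC 5646 (two ASCII letters or three ASCII digits, compared case-insensitively). The names `RegionShape` and `classifyRegion` are hypothetical and not part of this library. It checks well-formedness only; deciding whether a well-formed code such as 419 or 012 is actually valid would still require a lookup in the IANA registry.

```haskell
import Data.Char (isAsciiLower, isAsciiUpper, isDigit, toUpper)

-- | Syntactic shape of a region subtag (hypothetical type, for illustration).
data RegionShape
  = Alpha2 String   -- ^ two letters, stored upper-case
  | Numeric3 String -- ^ three digits
  deriving (Eq, Show)

isAsciiAlpha :: Char -> Bool
isAsciiAlpha c = isAsciiLower c || isAsciiUpper c

-- | Classify a candidate region subtag by shape alone.
-- This does NOT consult the IANA registry, so it cannot reject
-- well-formed but unregistered codes.
classifyRegion :: String -> Maybe RegionShape
classifyRegion s
  | length s == 2, all isAsciiAlpha s = Just (Alpha2 (map toUpper s))
  | length s == 3, all isDigit s = Just (Numeric3 s)
  | otherwise = Nothing
```

Under these assumptions both `gb` and `GB` classify as the same `Alpha2` value, and `419` is accepted as a `Numeric3` candidate.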

There is also a small issue that may become a problem in the future, though it is probably unlikely: the set of primary language subtags is not guaranteed to include all of ISO 639-1. The IANA will not register any future ISO 639-1 code for a language that is already covered by a three-letter code (see https://tools.ietf.org/html/rfc5646#page-11), so this library would have to switch away from the iso639 package in that event (assuming that package is updated to include the code) to remain correct.

despresc commented 3 years ago

Granted, this would represent a pretty significant rewrite of the package. If there is no appetite for that, I may write one that provides full BCP47 coverage.

pbrisbin commented 3 years ago

Sorry for the delayed reply! For some reason I wasn't notified of this Issue. :thinking:

@eborden, do you have any thoughts here?

eborden commented 3 years ago

@despresc I certainly see no problem with continuing to expand coverage of this library. Some decisions such as leveraging ISO639_1 from the iso639 package or Country from the country package were made for expediency and not full compliance. I'd actually love to see those packages improved to be in greater compliance, but I'd gladly welcome pull requests to this repository to continue expanding coverage.

despresc commented 3 years ago

Thanks for the reply! Actually, between opening this issue and your reply, I started work on my own package. I'm not sure how well it can be ported over, since it happens to represent BCP47 tags and subtags differently than how they are in this package, and the parsing and analysis flow is also a bit different. I think I may just continue working on it, since it's fairly well developed at this point. Sorry for the duplicated effort.

cdparks commented 3 years ago

@pbrisbin @eborden this has actually bitten us now: our region parser is overly case-sensitive. We accept en-GB (correctly) but not en-gb (incorrectly).
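One possible workaround, sketched here under assumptions rather than taken from this library, is to normalize a tag's casing before handing it to a case-sensitive parser. RFC 5646 section 2.1.1 describes a registry-free rule: lowercase everything, then uppercase two-letter subtags and title-case four-letter subtags, unless the subtag is first or follows a singleton. The function name `canonicalCase` is hypothetical.

```haskell
import Data.Char (toLower, toUpper)
import Data.List (intercalate)

-- | Split a tag on '-', keeping empty pieces as they are.
splitOnHyphen :: String -> [String]
splitOnHyphen s = case break (== '-') s of
  (piece, []) -> [piece]
  (piece, _ : rest) -> piece : splitOnHyphen rest

-- | Apply the casing convention of RFC 5646 section 2.1.1.
-- Only casing is changed; the tag is not validated in any way.
canonicalCase :: String -> String
canonicalCase tag =
  intercalate "-" (go True (map (map toLower) (splitOnHyphen tag)))
  where
    -- The flag is True for the first subtag and for any subtag
    -- that directly follows a singleton (a one-character subtag).
    go _ [] = []
    go keepLower (st : rest) = recase keepLower st : go (length st == 1) rest

    recase True st = st
    recase False [a, b] = [toUpper a, toUpper b]
    recase False (a : bcd@[_, _, _]) = toUpper a : bcd
    recase False st = st
```

With this sketch, `canonicalCase "en-gb"` yields `"en-GB"`, which the existing parser already accepts. Making the parser itself case-insensitive would be the more direct fix.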

cdparks commented 3 years ago

@despresc if you have time, can you verify that I've characterized these issues correctly in this comment? See also these pending tests.

despresc commented 3 years ago

Yes, the es-419 and en-TP issues stem from country, which encodes a distinct but overlapping set of country codes from those in the IANA registry.

The cmn issue is due to iso639, which doesn't support the other standards in the ISO 639 series. I should mention that the primary language subtags are currently a strict subset of the ISO 639 codes, from what I recall. The registry doesn't contain every ISO 639-3 code, at least.
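For illustration, a primary language subtag could be held as opaque, syntax-checked text instead of an ISO 639-1 enumeration. The sketch below follows the RFC 5646 grammar, where a primary language subtag is two to eight ASCII letters (four-letter subtags are reserved for future use). The type and function names are hypothetical, and a real implementation would also check the result against the IANA registry.

```haskell
import Data.Char (isAsciiLower, isAsciiUpper, toLower)

-- | A well-formed (not necessarily registered) primary language subtag,
-- stored in lower case. Hypothetical type, for illustration only.
newtype PrimaryLanguage = PrimaryLanguage String
  deriving (Eq, Show)

isAsciiAlpha :: Char -> Bool
isAsciiAlpha c = isAsciiLower c || isAsciiUpper c

-- | Accept two to eight ASCII letters, ignoring case.
-- No registry lookup is performed.
parsePrimaryLanguage :: String -> Maybe PrimaryLanguage
parsePrimaryLanguage s
  | n >= 2 && n <= 8 && all isAsciiAlpha s =
      Just (PrimaryLanguage (map toLower s))
  | otherwise = Nothing
  where
    n = length s
```

Under these assumptions `cmn` parses just as `zh` does, since neither depends on membership in ISO 639-1.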

freckle / bcp47

Greater BCP47 compatibility? #21