bible-technology / scripture-burrito

Scripture Burrito Schema & Docs 🌯
http://docs.burrito.bible/
MIT License

Should we require BCP 47 tags to be in canonical form #161

Closed: rdb closed this issue 4 years ago

rdb commented 4 years ago

In #160, I added a regex that validates BCP 47 tags more completely. It is, however, case-insensitive, following section 2.1.1 of BCP 47 (RFC 5646):

At all times, language tags and their subtags, including private use and extensions, are to be treated as case insensitive: there exist conventions for the capitalization of some of the subtags, but these MUST NOT be taken to carry meaning.

Thus, the tag "mn-Cyrl-MN" is not distinct from "MN-cYRL-mn" or "mN-cYrL-Mn" (or any other combination), and each of these variations conveys the same meaning: Mongolian written in the Cyrillic script as used in Mongolia.

The ABNF syntax also does not distinguish between upper- and lowercase: the uppercase US-ASCII letters in the range 'A' through 'Z' are always considered equivalent and mapped directly to their US-ASCII lowercase equivalents in the range 'a' through 'z'. So the tag "I-AMI" is considered equivalent to that value "i-ami" in the 'irregular' production.

Although case distinctions do not carry meaning in language tags, consistent formatting and presentation of language tags will aid users. The format of subtags in the registry is RECOMMENDED as the form to use in language tags. This format generally corresponds to the common conventions for the various ISO standards from which the subtags are derived.

These conventions include:

  • [ISO639-1] recommends that language codes be written in lowercase ('mn' Mongolian).

  • [ISO15924] recommends that script codes use lowercase with the initial letter capitalized ('Cyrl' Cyrillic).

  • [ISO3166-1] recommends that country codes be capitalized ('MN' Mongolia).

Despite this, I would personally favour being a little stricter: requiring tags to be given in their "canonical" form and validating that each subtag type is cased appropriately. Requiring readers to handle language codes case-insensitively might invite bugs, since the vast majority of writers will probably emit canonical form and non-canonical tags will be rare in practice, so the case-insensitive path would seldom be exercised. It also adds an extra implementation burden.

The canonical form, which we are using today in our systems, looks like this:

mn-Cyrl-MN
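For illustration, here is a minimal sketch of how a tag could be normalised to that casing. It follows the rule in RFC 5646 section 2.1.1 (everything lowercase, except two-letter subtags become uppercase and four-letter subtags become titlecase when they are neither at the start of the tag nor after a singleton). The function name `canonicalize_case` is hypothetical, and treating every subtag after the first singleton as lowercase is my reading of the rule, not something taken from our schema. It only adjusts case; it does not validate the tag or replace deprecated subtags.

```python
def canonicalize_case(tag: str) -> str:
    """Return `tag` with the casing conventions recommended by BCP 47.

    Hypothetical helper: it changes case only and performs no validation.
    """
    result = []
    seen_singleton = False  # assumption: everything after a singleton stays lowercase
    for index, subtag in enumerate(tag.split("-")):
        if index > 0 and not seen_singleton and len(subtag) == 2:
            result.append(subtag.upper())        # region, e.g. "MN"
        elif index > 0 and not seen_singleton and len(subtag) == 4:
            result.append(subtag.capitalize())   # script, e.g. "Cyrl"
        else:
            result.append(subtag.lower())        # language, variants, extensions
        if len(subtag) == 1:
            seen_singleton = True
    return "-".join(result)


print(canonicalize_case("MN-cYRL-mn"))
```

A strict reader could then reject any tag for which `canonicalize_case(tag) != tag`, while a lenient one could simply store the normalised value.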

Thoughts?

jag3773 commented 4 years ago

I'm happy to standardize on the conventional capitalization form as it aids readability. Otherwise, I think we should force lowercase to avoid the "I'm not sure if it matters" conundrum.

mvahowe commented 4 years ago

I don't care which way we swing on this, but I suspect that enforcing mixed case is going to be harder than enforcing lower case (or upper case, although that would be obtuse).

rdb commented 4 years ago

I prefer canonical (mixed) case, because:

Also, if we force a particular casing, we are already ignoring one rule from BCP 47. If we then go for a casing that is different from the form that is "RECOMMENDED" by the standard, we are essentially ignoring a second one.

mvahowe commented 4 years ago

Could we call our version BCP48 to avoid confusion? Ok, maybe not, let's do it your way.

jag3773 commented 4 years ago

Agreed to NOT enforce capitalization, but we do want to recommend that the capitalization scheme recommended by BCP 47 be used.

rdb commented 4 years ago

I'm OK with that, as long as we can have at least one document in our validation suite that has funky-cased tags, to make sure people don't rely on the "recommended" behaviour.

The Registry will probably end up canonicalising any tags that pass through it.
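As a rough sketch of what such a validation-suite check might exercise, a reader that follows the spec has to compare tags without regard to case. The names `same_language_tag` and `FUNKY_VARIANTS` below are hypothetical and are not part of the schema or any existing test suite; the variants are the ones BCP 47 itself gives as equivalent.

```python
def same_language_tag(a: str, b: str) -> bool:
    """Compare two BCP 47 tags case-insensitively.

    Tags are restricted to US-ASCII, so lowercasing both sides is enough.
    """
    return a.lower() == b.lower()


# Equivalent spellings of the same tag, taken from the BCP 47 text.
FUNKY_VARIANTS = ["mn-Cyrl-MN", "MN-cYRL-mn", "mN-cYrL-Mn"]

# A reader that relies on the recommended casing would fail this check.
assert all(same_language_tag("mn-Cyrl-MN", variant) for variant in FUNKY_VARIANTS)
```

A reader that uses tags as dictionary keys would need to apply the same normalisation before lookup.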

mvahowe commented 4 years ago

This appears to be a no-op, i.e. what we have in the schema now is good, so let's close this.

mvahowe commented 4 years ago

Sorry, we still need to document it, although I'm not even sure about that since we're following the spec.

jag3773 commented 4 years ago

I think this is covered by https://github.com/bible-technology/scripture-burrito/blob/develop/schema/common.schema.json#L52.