edmcouncil / rdf-toolkit

RDF Serializer, to be used in a git commit-hook to force automatic correct rewrite of every OWL ontology
MIT License

Serializer should not override letter casing in language and script tags #75

Closed rjyounes closed 2 weeks ago

rjyounes commented 4 months ago

While lettercase is not significant in IANA language and script tags, the following conventions typically apply:

The serializer uppercases the script portion, thus converting "mn-Cyrl" to "mn-CYRL". Again, since case is not significant, this doesn't produce processing errors, but there is really no reason for the serializer to modify the input.

Note that there was a related issue that was fixed.

mereolog commented 4 months ago

This feature was requested in https://github.com/edmcouncil/rdf-toolkit/issues/32 by @ElisaKendall as a remedy to the "opposite" behaviour of Protege.

I think we can add a new parameter, e.g., capitalize_suffix_in_language_tags, with the default value True, to handle the problem you indicated.
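A minimal sketch of how such a switch might behave, for illustration only. The function name serialize_language_tag is hypothetical and this is not the toolkit's actual code; it only assumes the behaviour described in this issue, namely that everything after the primary language subtag is currently uppercased.

```python
def serialize_language_tag(tag: str,
                           capitalize_suffix_in_language_tags: bool = True) -> str:
    """Hypothetical helper: optionally uppercase everything after the first hyphen."""
    if not capitalize_suffix_in_language_tags:
        # Leave the tag exactly as the author wrote it.
        return tag
    # Split off the primary language subtag; uppercase whatever follows it.
    language, separator, suffix = tag.partition("-")
    return language + separator + suffix.upper()


print(serialize_language_tag("mn-Cyrl"))
print(serialize_language_tag("mn-Cyrl", capitalize_suffix_in_language_tags=False))
```

With the default of True the existing output is unchanged, so users relying on the behaviour from issue #32 would see no difference; passing False returns the tag as written.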

rjyounes commented 4 months ago

There are a number of subtag standards included in the IETF standard, and letter casing depends on the standard. The most common subtags are for region, which are two uppercase letters, as in issue #32, and script, which are four title case letters (i.e., initial upper and remainder lower, as in "Cyrl"). The private use subtag x (from my issue #60) is lowercase. You can thus use the number of characters to determine the appropriate letter casing. Note that the region and script subtags can stack up.

There are some tags that won't be caught in this net, such as gsw-u-sd-chzh for Zürich German, but these are quite specialized. The distinction between two- and four-letter subtags will catch most uses.
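The length-based rule could be sketched roughly as follows. The function name normalize_language_tag is made up for illustration and is not part of the toolkit. The sketch goes one step beyond the proposal above: it lowercases every subtag that follows a single-character subtag (such as x or u), which is the formatting recipe given in RFC 5646, section 2.1.1, and which happens to cover tags like gsw-u-sd-chzh as well.

```python
def normalize_language_tag(tag: str) -> str:
    """Apply conventional casing to a language tag using subtag length only.

    Assumption: the tag is well-formed; no registry lookup or validation is done.
    """
    result = []
    after_singleton = False
    for index, subtag in enumerate(tag.split("-")):
        if index == 0 or after_singleton:
            # Primary language subtag, or anything after a singleton: lowercase.
            result.append(subtag.lower())
        elif len(subtag) == 2:
            # Two-letter subtag: treated as a region, uppercase.
            result.append(subtag.upper())
        elif len(subtag) == 4:
            # Four-letter subtag: treated as a script, title case.
            result.append(subtag.capitalize())
        else:
            result.append(subtag.lower())
        if len(subtag) == 1:
            # Singletons such as "x" or "u" start an extension or private-use part.
            after_singleton = True
    return "-".join(result)


for example in ("mn-CYRL", "en-us", "zh-hant-tw", "gsw-u-sd-chzh"):
    print(example, "->", normalize_language_tag(example))
```

Because the rule looks only at length and position, it needs no copy of the subtag registry, at the cost of not detecting invalid tags.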

mereolog commented 4 months ago

Thanks for this insight.

Just to make sure that I got you right - do you suggest that we:

rjyounes commented 4 months ago

Yes, that's my proposal. You may want to get a second opinion because I'm not an expert in this area.

mereolog commented 4 months ago

@ElisaKendall could you advise?

rjyounes commented 1 month ago

I see there's a fix ready to go for this issue, and it doesn't seem you're going to get a response from @ElisaKendall. Do you feel confident enough in the fix to go ahead and merge the PR? My judgement is that it still may not be correct in all cases, but it will fix some errors and not create any new ones - thus you'll be ahead overall by including it.