SAP / abap-file-formats

File formats that define and specify the file representation for ABAP development objects

MIT License

ISO 639-1 is not sufficient for language fields #582

Closed schneidermic0 closed 6 months ago

schneidermic0 commented 9 months ago

Currently, it is specified that language fields follow ISO 639-1.

See:

SAP language codes differentiate not only by language but also by country. ISO 639-1 does not encode any country information in the language code.

For example, SAP systems distinguish between several country variants of English: SAP language code "EN" represents English (United States), while SAP language code "6N" represents English (United Kingdom). There are further country-specific SAP language codes for English, but all of them are represented by the ISO 639-1 language code "en".

The same issue exists for other languages (Arabic, Chinese, Dutch, French, German, and Spanish).

See SAP Note https://launchpad.support.sap.com/#/notes/73606.

When an SAP language is serialised and transformed into an ISO 639-1 code, the country information is lost (or the wrong language code might be stored in the system).
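
To illustrate the loss, here is a minimal Python sketch. Only the two codes named above ("EN" and "6N") are taken from this issue; the table and function names are hypothetical, and this is not how the serializer is actually implemented.

```python
# Hypothetical lookup table: only "EN" and "6N" come from the issue text.
SAP_TO_LANGUAGE_AND_COUNTRY = {
    "EN": ("en", "US"),  # English (United States)
    "6N": ("en", "GB"),  # English (United Kingdom)
}


def to_iso_639_1(sap_code: str) -> str:
    """Keep only the language part, as a pure ISO 639-1 field would."""
    language, _country = SAP_TO_LANGUAGE_AND_COUNTRY[sap_code]
    return language


# Both SAP codes collapse to the same value, so the country cannot be recovered.
serialized = {code: to_iso_639_1(code) for code in SAP_TO_LANGUAGE_AND_COUNTRY}
print(serialized)
```

Because both entries serialize to the same string, a deserializer has no way to tell which SAP language was meant.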

schneidermic0 commented 9 months ago

Instead of ISO 639-1, we could use one of the following options to represent the SAP language:

SAP language code like "EN", "6N", ...
Locale like "en_US", "en_GB", ...: a combination of an ISO 639-1 language code and an ISO 3166-1 country code, separated by an underscore*

*) Remark: I saw many examples where the locale is rendered with a hyphen instead of an underscore (e.g., "en-US", "en-GB", ...). SAP's APIs serialize it with an underscore.
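
The two renderings differ only in the separator, so converting between them is mechanical. A minimal Python sketch, assuming simple language/country locales without script or variant parts (the function names are hypothetical):

```python
def locale_to_tag(locale: str) -> str:
    """Turn an underscore-separated locale ("en_GB") into a hyphenated tag ("en-GB")."""
    return locale.replace("_", "-")


def tag_to_locale(tag: str) -> str:
    """Inverse direction: a hyphenated tag ("en-GB") becomes a locale ("en_GB")."""
    return tag.replace("-", "_")


print(locale_to_tag("en_US"))
print(tag_to_locale("en-GB"))
```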

wurzka commented 7 months ago

A similar topic was also discussed in https://github.com/SAP/abap-file-formats/issues/34

schneidermic0 commented 7 months ago

https://en.wikipedia.org/wiki/IETF_language_tag

schneidermic0 commented 7 months ago

https://www.w3.org/International/questions/qa-choosing-language-tags

schneidermic0 commented 7 months ago

Besides the options I mentioned above, we could also use BCP47 language tags.

See also:

RFC5646, section 4.1, states the following:

Use as precise a tag as possible, but no more specific than is justified. Avoid using subtags that are not important for distinguishing content in an application.

As far as I understand this section, we could stay with our existing language tags (e.g., "en" for SAP language "EN" representing English (US)), but add further information as soon as it is needed. I.e., if it should be English (Great Britain), we could use the language tag "en-GB". The same would apply to any other region or script of the English language.

schneidermic0 commented 7 months ago

I have also checked how SAP's I18N converter classes (cl_i18n_languages) work for BCP47:

If you convert "en" or "en-US" to a SAP1 language, both return the same SAP1 language. If you do the same for "en-GB", it returns a different language.

If you convert from a SAP1 language to a BCP47 language tag, it always returns the full tag (e.g., "E" is converted to "en-US"). However, here we could (not sure yet whether we should) shorten the tag to "en".

I tested the behavior described above for English with several other languages as well, such as German and Chinese. It was the same.

schneidermic0 commented 7 months ago

Necessary steps to address this issue:

[x] Decision which approach to follow
[x] Adapt schema generator in the tools repository (see https://github.com/SAP/abap-file-formats-tools/pull/310)
[x] Adapt all schemas in this repository (see #607)
[x] Update documentation in this repository (see #611)
[x] Add list of all supported languages based on SAP Note https://launchpad.support.sap.com/#/notes/73606 (see #611)

schneidermic0 commented 6 months ago

Decision: We plan to follow the approach of BCP47 language tags (see above). Whenever possible, we stick to short language tags that use the main language only.
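
A minimal sketch of what "short whenever possible" could mean in practice, written in Python for illustration. It assumes that the region is dropped only when it is the default region of the main language. The only default taken from this issue is "US" for "en"; the table and function names are hypothetical.

```python
# Default region per main language. Only the "en" entry is taken from this
# issue ("en" and "en-US" map to the same SAP language); further entries
# would have to be filled in from the list of supported languages.
DEFAULT_REGION = {"en": "US"}


def shorten(tag: str) -> str:
    """Drop the region subtag when it is the default for the main language."""
    parts = tag.split("-")
    if len(parts) == 2 and DEFAULT_REGION.get(parts[0]) == parts[1]:
        return parts[0]
    return tag


for tag in ("en-US", "en-GB", "en"):
    print(tag, "->", shorten(tag))
```

Tags with a non-default region, such as "en-GB", are left untouched, so no information is lost by the shortening.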

schneidermic0 commented 6 months ago

Theoretically, we could replace the existing pattern ("^[a-z]+$") in the schema with "^[a-z]{2,3}(?:-[A-Z][a-z]{3})?(?:-[A-Z]{2})?$" to cover all languages supported by SAP (which are a subset of BCP47 language tags).

We think this would be somewhat over-engineered. We don't have patterns for other fields so far. Any objections?

This means the only constraint left in the schema will be "minLength": 2.

Old code for original language

        "originalLanguage": {
          "title": "Original Language",
          "description": "Original language of the ABAP object",
          "type": "string",
          "minLength": 2,
          "maxLength": 2,
          "pattern": "^[a-z]+$"
        },

New code for original language

        "originalLanguage": {
          "title": "Original Language",
          "description": "Original language of the ABAP object",
          "type": "string",
          "minLength": 2
        },
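
For comparison, the candidate pattern quoted above can be tried against a few sample tags. This is a sketch using Python's re module rather than a JSON Schema validator, and the sample tags and function names are chosen for illustration only:

```python
import re

# Candidate pattern quoted earlier in this comment.
CANDIDATE = re.compile(r"^[a-z]{2,3}(?:-[A-Z][a-z]{3})?(?:-[A-Z]{2})?$")


def passes_candidate_pattern(tag: str) -> bool:
    """True if the tag has the shape language[-Script][-REGION]."""
    return CANDIDATE.match(tag) is not None


def passes_new_schema(tag: str) -> bool:
    """The constraint that remains in the new schema: "minLength": 2."""
    return len(tag) >= 2


for tag in ("en", "en-GB", "zh-Hans", "en_US", "EN", "e"):
    print(f"{tag!r}: pattern={passes_candidate_pattern(tag)}, "
          f"minLength={passes_new_schema(tag)}")
```

The length check alone accepts values such as "en_US" or "EN" that the pattern would reject, which is the trade-off of leaving the pattern out.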

schneidermic0 commented 6 months ago

Maybe it is more helpful if we list all supported languages, based on SAP Note https://launchpad.support.sap.com/#/notes/73606, in our documentation.

schneidermic0 commented 6 months ago

I think all necessary steps for the repository are done. @Markus1812 Thanks for your contributions.

I'm closing this issue :)