Open bertsky opened 3 years ago
(Likewise, IIUC, only the first part of the `ScriptSimpleType` enums is actually ISO 15924, so these would have to be split at ` - `.)
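To make the splitting concrete, here is a minimal Python sketch. The sample strings are only assumed to look like the `ScriptSimpleType` enumeration values ("code - description"), and `split_script_value` is a hypothetical helper name, not something from the schema or an existing library.

```python
from typing import Tuple


def split_script_value(value: str) -> Tuple[str, str]:
    """Split a 'code - description' string into (code, description).

    Only the first ' - ' counts as the separator, so descriptions that
    contain further hyphens stay intact.
    """
    code, sep, description = value.partition(" - ")
    if not sep:
        # No separator found: there is no code part to extract.
        return "", value.strip()
    return code.strip(), description.strip()


# Illustrative values; check them against the actual enumeration.
for value in ["Latn - Latin", "Latf - Latin (Fraktur variant)", "other"]:
    print(split_script_value(value))
```

Values without a separator come back with an empty code, so a caller can decide how to treat them.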
So IMO what needs to be done is:
- change `ScriptSimpleType` to conform to ISO 15924 and `LanguageSimpleType` to conform to ISO 639.
- provide a stylesheet that maps the old `xs:restriction` strings to the new, standard ones. (That stylesheet can then be used by applications/users to update from the 2019 schema, or independently to interoperate with language and script values for PAGE-XML files up to 2019.)

This is @kba's workaround for the ISO 639 codes in Python (using https://github.com/LuminosoInsight/langcodes): https://github.com/kba/page-to-alto/blob/f1b67bdf70b24e6d6904ad4ba4e83ce276923aca/ocrd_page_to_alto/utils.py#L29
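For illustration, the migration that such a stylesheet would perform could look roughly like this in Python, using only the standard library. This is a sketch under assumptions: `LANGUAGE_MAP` is a hypothetical three-entry excerpt (a real table would have to cover the whole enumeration), the attribute names handled are `language`, `primaryLanguage` and `secondaryLanguage`, and the sample document is a stripped-down fragment rather than a valid PAGE-XML file.

```python
import xml.etree.ElementTree as ET

# Hypothetical excerpt of an old-string -> ISO 639-3 mapping.
LANGUAGE_MAP = {"German": "deu", "English": "eng", "French": "fra"}
# Attributes assumed to carry language values.
LANGUAGE_ATTRS = ("language", "primaryLanguage", "secondaryLanguage")


def migrate_languages(root, mapping=LANGUAGE_MAP):
    """Rewrite language attributes in place; return the values left unmapped."""
    unmapped = set()
    for elem in root.iter():
        for attr in LANGUAGE_ATTRS:
            old = elem.get(attr)
            if old is None:
                continue
            if old in mapping:
                elem.set(attr, mapping[old])
            else:
                # Keep the original string and report it to the caller.
                unmapped.add(old)
    return unmapped


# Stripped-down fragment; "Klingon" is deliberately not in the mapping.
sample = """<PcGts xmlns="http://schema.primaresearch.org/PAGE/gts/pagecontent/2019-07-15">
  <Page primaryLanguage="German">
    <TextRegion id="r1" primaryLanguage="English" secondaryLanguage="Klingon"/>
  </Page>
</PcGts>"""

root = ET.fromstring(sample)
left_over = migrate_languages(root)
print(ET.tostring(root, encoding="unicode"))
print("unmapped:", left_over)
```

Returning the unmapped strings instead of guessing keeps the gaps in the table visible, which is the part that currently has no authoritative source.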
Oh, and there's a file here, `documentation/Language List (from ISO).xlsx`, but it does not contain a complete mapping of all language strings against their 639 codes.
In PAGE-XML there's `@language`/`@primaryLanguage` of type `pc:LanguageSimpleType` to identify the natural language of segments. Its documentation refers to `ISO 639.x 2016-07-14`, which I cannot make sense of. There's 639-1, 639-2 and 639-3, but AFAICT no standard that allows strings of arbitrary length (as in the PAGE-XML enumeration), and nothing shows up for `2016-07-14`. This is problematic because exact 639 mappings are needed for software implementation and interoperability.

Take Norwegian for example:
According to 639 these could be named no/nb/nn or nor/nob/nno, but how do we map that automatically, and where do the strings in PAGE-XML derive from?
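Absent an authoritative source, the only thing software can do today is keep a hand-written table. A minimal sketch for the Norwegian case, where the dictionary keys are my assumption of how the enumeration spells these values and `lookup` is a hypothetical helper:

```python
from typing import NamedTuple, Optional


class Iso639(NamedTuple):
    alpha2: str  # ISO 639-1 two-letter code
    alpha3: str  # ISO 639-2 / 639-3 three-letter code


# Keys are assumed spellings of the enumeration strings.
NORWEGIAN = {
    "Norwegian": Iso639("no", "nor"),
    "Norwegian Bokmål": Iso639("nb", "nob"),
    "Norwegian Nynorsk": Iso639("nn", "nno"),
}


def lookup(name: str) -> Optional[Iso639]:
    """Return the ISO 639 codes for an enumeration string, or None if unknown."""
    return NORWEGIAN.get(name.strip())


print(lookup("Norwegian Nynorsk"))
print(lookup("Norsk"))  # not in the table
```

Every implementation has to maintain such a table on its own, and any difference in spelling between the table and the schema silently yields `None`.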