PRImA-Research-Lab / PAGE-XML

PAGE XML format collection for document image page content and more
Apache License 2.0

standard/norm for LanguageSimpleType #27

Open bertsky opened 3 years ago

bertsky commented 3 years ago

In PAGE-XML, the attributes @language and @primaryLanguage have the type pc:LanguageSimpleType and identify the natural language of segments. The type's documentation refers to "ISO 639.x 2016-07-14", which I cannot make sense of. There are ISO 639-1, 639-2 and 639-3, but AFAICT no standard that allows strings of arbitrary length (as in the PAGE-XML enumeration), and nothing shows up for 2016-07-14. This is problematic because exact ISO 639 mappings are needed for software implementation and interoperability.

Take Norwegian for example:

    <enumeration value="Norwegian"/>
    <enumeration value="Norwegian Bokmål"/>
    <enumeration value="Norwegian Nynorsk"/>

According to ISO 639, these could be coded as no/nb/nn (639-1) or nor/nob/nno (639-2 and 639-3). But how do we map that automatically? And where do the strings in PAGE-XML derive from?
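
To make the mapping problem concrete, here is a minimal Python sketch of an explicit lookup table for the three values quoted above. The table is written by hand for illustration and is not derived from the schema or from any official list; the names `Iso639`, `PAGE_LANGUAGE_TO_ISO639` and `lookup_language` are hypothetical.

```python
from typing import NamedTuple, Optional


class Iso639(NamedTuple):
    part1: str  # two-letter ISO 639-1 code
    part3: str  # three-letter code (identical in ISO 639-2 and 639-3 for these entries)


# Hand-written for the three enumeration values quoted above (an assumption,
# not taken from the PAGE-XML schema documentation).
PAGE_LANGUAGE_TO_ISO639 = {
    "Norwegian": Iso639("no", "nor"),
    "Norwegian Bokmål": Iso639("nb", "nob"),
    "Norwegian Nynorsk": Iso639("nn", "nno"),
}


def lookup_language(page_value: str) -> Optional[Iso639]:
    """Return the ISO 639 codes for a PAGE-XML language string, or None if unknown."""
    return PAGE_LANGUAGE_TO_ISO639.get(page_value)
```

A table like this only works if someone maintains it for every enumeration value, which is exactly the gap described here.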

bertsky commented 3 years ago

(Likewise, IIUC, only the first part of each ScriptSimpleType enum value is actually an ISO 15924 code, so these values would have to be split at the hyphen.)
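
A minimal Python sketch of that split, assuming the enum values look like "Latn - Latin" (a four-letter code, then " - ", then a display name). The function name `script_code` is hypothetical, and the regular expression only checks the shape of an ISO 15924 code, not whether the code is actually registered.

```python
import re
from typing import Optional

# Shape of an ISO 15924 alphabetic code: one uppercase letter, three lowercase.
ISO_15924_SHAPE = re.compile(r"[A-Z][a-z]{3}")


def script_code(page_value: str) -> Optional[str]:
    """Return the leading code of a ScriptSimpleType value, or None if there is none."""
    # Everything before the first " - " separator (assumed value layout).
    code = page_value.partition(" - ")[0].strip()
    return code if ISO_15924_SHAPE.fullmatch(code) else None
```

Values without a leading code yield None, so callers have to decide how to treat them.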

bertsky commented 3 years ago

So IMO what needs to be done is:

  1. In the next namespace version of PAGE-XML, change ScriptSimpleType to conform to ISO 15924 and LanguageSimpleType to conform to ISO 639.
  2. Provide a (manually crafted) transformation stylesheet mapping the existing, non-standardized xs:restriction strings to the new, standard ones. (That stylesheet can then be used by applications/users to update from the 2019 schema, or independently to interoperate with language and script values for PAGE-XML files up to 2019.)
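
The second step could also be prototyped outside XSLT. The following is only a sketch in Python with `xml.etree.ElementTree`, not the proposed stylesheet. It assumes that language values appear in unqualified `language` and `primaryLanguage` attributes, and its mapping table is a hand-written excerpt. The names `LEGACY_TO_ISO639_3` and `migrate_languages` are hypothetical.

```python
import xml.etree.ElementTree as ET

# Hand-written excerpt (assumption): legacy PAGE-XML string -> three-letter ISO 639 code.
LEGACY_TO_ISO639_3 = {
    "Norwegian": "nor",
    "Norwegian Bokmål": "nob",
    "Norwegian Nynorsk": "nno",
}

# Attributes assumed to carry pc:LanguageSimpleType values.
LANGUAGE_ATTRIBUTES = ("language", "primaryLanguage")


def migrate_languages(root, mapping=LEGACY_TO_ISO639_3):
    """Rewrite language attributes in place and return the values without a mapping."""
    unmapped = set()
    for element in root.iter():
        for attr in LANGUAGE_ATTRIBUTES:
            value = element.get(attr)
            if value is None:
                continue
            if value in mapping:
                element.set(attr, mapping[value])
            else:
                # Left untouched so that nothing is silently lost.
                unmapped.add(value)
    return unmapped
```

Returning the unmapped values makes gaps in the table visible instead of hiding them. Note that running the function a second time would report the already converted codes as unmapped.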

bertsky commented 3 years ago

This is @kba's workaround for the ISO 639 codes in Python (using https://github.com/LuminosoInsight/langcodes): https://github.com/kba/page-to-alto/blob/f1b67bdf70b24e6d6904ad4ba4e83ce276923aca/ocrd_page_to_alto/utils.py#L29

bertsky commented 3 years ago

Oh, and there's a file in this repository, documentation/Language List (from ISO).xlsx, but it does not contain a complete mapping of all language strings to their 639 codes.