OCR-D / page-to-alto

Convert PAGE (v. 2019) to ALTO (v. 2.0 - 4.2)
Apache License 2.0

script and language attributes #3

Open bertsky opened 3 years ago

bertsky commented 3 years ago

For @primaryLanguage, @secondaryLanguage and @language, ALTO has @LANG, but that is of type xsd:language, i.e. supposed to be an ISO 639-2 or -3 code, so one would still have to enumerate the mapping due to https://github.com/PRImA-Research-Lab/PAGE-XML/issues/27

For @primaryScript, @secondaryScript and @script, I don't see any way to represent them in ALTO.

kba commented 3 years ago

I've used https://github.com/LuminosoInsight/langcodes to implement the mapping from the long-form ISO 639 language names PAGE-XML defines for LanguageSimpleType to ISO 639-3 representation. That works pretty well with all the languages I've tried.
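For readers who want to try this, here is a minimal sketch of such a lookup. It is not the code used in page-to-alto: the helper name `page_language_to_iso639_3` and the small fallback table are hypothetical, and it assumes that `langcodes` is installed together with its `language_data` companion package, which name lookups require.

```python
# Sketch only: map a PAGE-XML LanguageSimpleType name (e.g. "German")
# to an ISO 639-3 code suitable for ALTO's @LANG.
# The helper name and the fallback table are hypothetical illustrations.

# Tiny hand-enumerated subset, used only when langcodes is unavailable.
FALLBACK = {"German": "deu", "English": "eng", "French": "fra"}


def page_language_to_iso639_3(name):
    """Return the three-letter code for a language name, or None if unknown."""
    try:
        import langcodes  # optional third-party dependency

        # find() looks a language up by its name;
        # to_alpha3() returns the three-letter (terminology) code.
        return langcodes.find(name).to_alpha3()
    except ImportError:
        # langcodes (or its language_data name tables) is not installed.
        return FALLBACK.get(name)
    except LookupError:
        # The name is not known, or it has no three-letter code.
        return None


if __name__ == "__main__":
    for page_name in ("German", "English", "French"):
        print(page_name, "->", page_language_to_iso639_3(page_name))
```

Names for which no code can be found come back as `None`, so a caller can decide whether to omit @LANG or flag the element for manual review.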

bertsky commented 3 years ago

> I've used https://github.com/LuminosoInsight/langcodes to implement the mapping from the long-form ISO 639 language names PAGE-XML defines for LanguageSimpleType to ISO 639-3 representation. That works pretty well with all the languages I've tried.

Awesome!