PRImA-Research-Lab / prima-page-converter

Command-line tool to convert page layout files to the latest PAGE XML format. It supports all previous versions of the PAGE format as well as ALTO XML, FineReader XML, and HOCR.
Apache License 2.0

fails to convert from older PAGE versions, fails to show cause #22

Open bertsky opened 2 years ago

bertsky commented 2 years ago

I am trying to make use of the PRImA Layout Dataset, which is still in namespace version 2010-01-12, and thus needs to be converted to 2019.

However, the converter fails to run on some files, e.g. XML/00000122.xml, just saying:

Error writing target PAGE XML file

This, obviously, is not helpful at all.

I managed to get the program running in debug mode with Eclipse (by importing PrimaText.jar from the release archive, since the prima-text source is closed).

Eclipse had trouble identifying the writer source file, but I was able to point it manually to XmlPageWriter_2019_07_15.java and set a breakpoint there.

It turns out that schema validation failed because the value of @primaryScript does not follow the enum restriction:

cvc-enumeration-valid: Value 'Latin' is not facet-valid with respect to enumeration '[Adlm - Adlam, Afak - Afaka, Aghb - Caucasian Albanian, Ahom - Ahom, Tai Ahom, Arab - Arabic, Aran - Arabic (Nastaliq variant), Armi - Imperial Aramaic, Armn - Armenian, Avst - Avestan, Bali - Balinese, Bamu - Bamum, Bass - Bassa Vah, Batk - Batak, Beng - Bengali, Bhks - Bhaiksuki, Blis - Blissymbols, Bopo - Bopomofo, Brah - Brahmi, Brai - Braille, Bugi - Buginese, Buhd - Buhid, Cakm - Chakma, Cans - Unified Canadian Aboriginal Syllabics, Cari - Carian, Cham - Cham, Cher - Cherokee, Cirt - Cirth, Copt - Coptic, Cprt - Cypriot, Cyrl - Cyrillic, Cyrs - Cyrillic (Old Church Slavonic variant), Deva - Devanagari (Nagari), Dsrt - Deseret (Mormon), Dupl - Duployan shorthand, Duployan stenography, Egyd - Egyptian demotic, Egyh - Egyptian hieratic, Egyp - Egyptian hieroglyphs, Elba - Elbasan, Ethi - Ethiopic, Geok - Khutsuri (Asomtavruli and Nuskhuri), Geor - Georgian (Mkhedruli), Glag - Glagolitic, Goth - Gothic, Gran - Grantha, Grek - Greek, Gujr - Gujarati, Guru - Gurmukhi, Hanb - Han with Bopomofo, Hang - Hangul, Hani - Han (Hanzi, Kanji, Hanja), Hano - Hanunoo (Hanunóo), Hans - Han (Simplified variant), Hant - Han (Traditional variant), Hatr - Hatran, Hebr - Hebrew, Hira - Hiragana, Hluw - Anatolian Hieroglyphs, Hmng - Pahawh Hmong, Hrkt - Japanese syllabaries, Hung - Old Hungarian (Hungarian Runic), Inds - Indus (Harappan), Ital - Old Italic (Etruscan, Oscan etc.), Jamo - Jamo, Java - Javanese, Jpan - Japanese, Jurc - Jurchen, Kali - Kayah Li, Kana - Katakana, Khar - Kharoshthi, Khmr - Khmer, Khoj - Khojki, Kitl - Khitan large script, Kits - Khitan small script, Knda - Kannada, Kore - Korean (alias for Hangul + Han), Kpel - Kpelle, Kthi - Kaithi, Lana - Tai Tham (Lanna), Laoo - Lao, Latf - Latin (Fraktur variant), Latg - Latin (Gaelic variant), Latn - Latin, Leke - Leke, Lepc - Lepcha (Róng), Limb - Limbu, Lina - Linear A, Linb - Linear B, Lisu - Lisu (Fraser), Loma - Loma, Lyci - 
Lycian, Lydi - Lydian, Mahj - Mahajani, Mand - Mandaic, Mandaean, Mani - Manichaean, Marc - Marchen, Maya - Mayan hieroglyphs, Mend - Mende Kikakui, Merc - Meroitic Cursive, Mero - Meroitic Hieroglyphs, Mlym - Malayalam, Modi - Modi, Moḍī, Mong - Mongolian, Moon - Moon (Moon code, Moon script, Moon type), Mroo - Mro, Mru, Mtei - Meitei Mayek (Meithei, Meetei), Mult - Multani, Mymr - Myanmar (Burmese), Narb - Old North Arabian (Ancient North Arabian), Nbat - Nabataean, Newa - Newa, Newar, Newari, Nkgb - Nakhi Geba, Nkoo - N’Ko, Nshu - Nüshu, Ogam - Ogham, Olck - Ol Chiki (Ol Cemet’, Ol, Santali), Orkh - Old Turkic, Orkhon Runic, Orya - Oriya, Osge - Osage, Osma - Osmanya, Palm - Palmyrene, Pauc - Pau Cin Hau, Perm - Old Permic, Phag - Phags-pa, Phli - Inscriptional Pahlavi, Phlp - Psalter Pahlavi, Phlv - Book Pahlavi, Phnx - Phoenician, Piqd - Klingon (KLI pIqaD), Plrd - Miao (Pollard), Prti - Inscriptional Parthian, Rjng - Rejang (Redjang, Kaganga), Roro - Rongorongo, Runr - Runic, Samr - Samaritan, Sara - Sarati, Sarb - Old South Arabian, Saur - Saurashtra, Sgnw - SignWriting, Shaw - Shavian (Shaw), Shrd - Sharada, Śāradā, Sidd - Siddham, Sind - Khudawadi, Sindhi, Sinh - Sinhala, Sora - Sora Sompeng, Sund - Sundanese, Sylo - Syloti Nagri, Syrc - Syriac, Syre - Syriac (Estrangelo variant), Syrj - Syriac (Western variant), Syrn - Syriac (Eastern variant), Tagb - Tagbanwa, Takr - Takri, Tale - Tai Le, Talu - New Tai Lue, Taml - Tamil, Tang - Tangut, Tavt - Tai Viet, Telu - Telugu, Teng - Tengwar, Tfng - Tifinagh (Berber), Tglg - Tagalog (Baybayin, Alibata), Thaa - Thaana, Thai - Thai, Tibt - Tibetan, Tirh - Tirhuta, Ugar - Ugaritic, Vaii - Vai, Visp - Visible Speech, Wara - Warang Citi (Varang Kshiti), Wole - Woleai, Xpeo - Old Persian, Xsux - Cuneiform, Sumero-Akkadian, Yiii - Yi, Zinh - Code for inherited script, Zmth - Mathematical notation, Zsye - Symbols (Emoji variant), Zsym - Symbols, Zxxx - Code for unwritten documents, Zyyy - Code for undetermined script, 
Zzzz - Code for uncoded script, other]'. It must be a value from the enumeration.

Mapping such values during conversion should be trivial.

So I wonder:

bertsky commented 2 years ago

Also, is there any workaround for this? It seems to be a catch-22 situation: Latin is not allowed on the output side, but Latn - Latin is not allowed on the input side (where it does show a decent error message BTW). I guess I'll just have to delete all @script, @primaryScript and @secondaryScript values, right?
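A minimal sketch of that workaround in Python, assuming the offending values only occur in attributes named script, primaryScript and secondaryScript. The function names and the idea of preprocessing the files before running the converter are illustrative assumptions, not part of the converter itself:

```python
import xml.etree.ElementTree as ET

# Attribute names mentioned above; assumed to be the only ones affected.
SCRIPT_ATTRS = ("script", "primaryScript", "secondaryScript")


def strip_script_attributes(root):
    """Remove script-related attributes from every element.

    Returns the number of attributes removed.
    """
    removed = 0
    for elem in root.iter():
        for name in SCRIPT_ATTRS:
            if name in elem.attrib:
                del elem.attrib[name]
                removed += 1
    return removed


def strip_file(src_path, dst_path):
    """Hypothetical helper: write a copy of src_path without script attributes."""
    tree = ET.parse(src_path)
    root = tree.getroot()
    # Keep the document's own namespace as the default namespace on output,
    # so elements are not rewritten with an "ns0:" prefix.
    if root.tag.startswith("{"):
        ET.register_namespace("", root.tag[1:].split("}")[0])
    count = strip_script_attributes(root)
    tree.write(dst_path, encoding="utf-8", xml_declaration=True)
    return count
```

Something like `strip_file("XML/00000122.xml", "stripped/00000122.xml")` would then produce a copy to feed to the converter. The script information is lost this way, so this only helps if it is not needed downstream.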

bertsky commented 2 years ago

For the sake of completeness: the PRImA Layout Dataset has further problems. Some regions have no Coords/Point at all.
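To find the affected regions, something along these lines might do. It is a rough sketch that assumes regions are the elements whose name ends in "Region" and that their outline is given as Point children of a direct Coords child; the function names are made up for illustration:

```python
import xml.etree.ElementTree as ET


def local_name(tag):
    """Strip the '{namespace}' prefix from an ElementTree tag."""
    return tag.rsplit("}", 1)[-1]


def regions_without_points(root):
    """Return the ids of all *Region elements lacking any Coords/Point."""
    missing = []
    for elem in root.iter():
        if not local_name(elem.tag).endswith("Region"):
            continue
        has_point = any(
            local_name(child.tag) == "Coords"
            and any(local_name(p.tag) == "Point" for p in child)
            for child in elem
        )
        if not has_point:
            missing.append(elem.get("id"))
    return missing
```

Calling `regions_without_points(ET.parse(path).getroot())` for each file in the dataset would list the regions to repair or drop before conversion.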