Transkribus / TranskribusCore

Note: the repo has been moved to https://gitlab.com/readcoop/Transkribus/TranskribusCore
GNU General Public License v3.0

Invalid ALTO and PAGE export #38

Open stweil opened 5 years ago

stweil commented 5 years ago

The exported ALTO and PAGE files are not schema-valid XML. Validators report errors, and the PRIMA PageViewer refuses to load such files. Tested with an example from the ÖNB GT data set:

$ ocr-validate alto-2-0 ONB_aze_18950706_1.alto 
mXSDFilename: /usr/local/share/ocr-fileformat/xsd/alto-2-0.xsd
mXMLFilename: ONB_aze_18950706_1.alto
ONB_aze_18950706_1.alto fails to validate because: 

cvc-id.1: There is no ID/IDREF binding for IDREF 'Times_New_Roman_4.5_______'.
At: 1:103402
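
For readers who want to locate this kind of error without a full XSD validator, here is a minimal Python sketch that lists reference tokens with no matching ID. It assumes the dangling reference sits in a `STYLEREFS` attribute, which the validator output does not confirm. The function name `dangling_idrefs` and the sample fragment are hypothetical, and the fragment is not a complete ALTO file.

```python
import xml.etree.ElementTree as ET


def dangling_idrefs(alto_xml):
    """Return the STYLEREFS tokens that match no ID attribute in the document."""
    root = ET.fromstring(alto_xml)
    ids = set()
    refs = set()
    for elem in root.iter():
        if "ID" in elem.attrib:
            ids.add(elem.attrib["ID"])
        # STYLEREFS is a whitespace-separated list of references (assumed attribute).
        refs.update(elem.attrib.get("STYLEREFS", "").split())
    return refs - ids


# Hypothetical fragment: one style is declared, two are referenced.
SAMPLE = """<alto xmlns="http://www.loc.gov/standards/alto/ns-v2#">
  <Styles>
    <TextStyle ID="TXT_0" FONTSIZE="10"/>
  </Styles>
  <Layout>
    <Page ID="P1">
      <PrintSpace>
        <TextBlock ID="B1" STYLEREFS="TXT_0 Times_New_Roman_4.5_______"/>
      </PrintSpace>
    </Page>
  </Layout>
</alto>"""

print(dangling_idrefs(SAMPLE))
```

Any token returned here would trigger a `cvc-id.1` error during schema validation, because every IDREF must point to an ID declared somewhere in the same document.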

$ ocr-validate page-2013-07-15 ONB_aze_18950706_1.xml 
mXSDFilename: /usr/local/share/ocr-fileformat/xsd/page-2013-07-15.xsd
mXMLFilename: /tmp/ONB_aze_18950706_1.xml
ONB_aze_18950706_1.xml fails to validate because: 

cvc-complex-type.2.4.d: Invalid content was found starting with element 'TranskribusMetadata'. No child element is expected at this point.
At: 12:290
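
The PAGE error can be narrowed down in a similar way. The Python sketch below reports the parent of every element with a given local name, which shows where the extension element sits in the tree. Placing `TranskribusMetadata` under `Metadata` in the sample is an assumption, since the validator output only gives a line and column. The helper names `local_name` and `locate` are hypothetical.

```python
import xml.etree.ElementTree as ET


def local_name(tag):
    """Strip the '{namespace}' prefix that ElementTree puts in front of tag names."""
    return tag.rsplit("}", 1)[-1]


def locate(xml_text, wanted):
    """Return (parent, child) local-name pairs for every element named `wanted`."""
    root = ET.fromstring(xml_text)
    hits = []
    for parent in root.iter():
        for child in parent:
            if local_name(child.tag) == wanted:
                hits.append((local_name(parent.tag), local_name(child.tag)))
    return hits


# Hypothetical fragment; the position of TranskribusMetadata is assumed.
SAMPLE = """<PcGts xmlns="http://schema.primaresearch.org/PAGE/gts/pagecontent/2013-07-15">
  <Metadata>
    <Creator>example</Creator>
    <Created>2019-01-01T00:00:00</Created>
    <LastChange>2019-01-01T00:00:00</LastChange>
    <TranskribusMetadata/>
  </Metadata>
  <Page imageFilename="example.jpg" imageWidth="100" imageHeight="100"/>
</PcGts>"""

print(locate(SAMPLE, "TranskribusMetadata"))
```

An element that the XSD does not declare at that position produces a `cvc-complex-type.2.4.d` error, so stripping such elements before validation (or validating against a schema that declares them) would be the two obvious workarounds.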

hackmanschorsch commented 4 years ago

One of the two formats is fixed now.

For PAGE XML we need to publish a new XSD, but that does not guarantee that the PRIMA PageViewer will be able to load the files.