Page-xml files fail at validation

ghost commented 5 years ago

@NesbiDevelopment @chreul @b-eyselein @ChWick Thank you for your hard work,

The PAGE XML files created with Larex fail schema validation, and as a result PageViewer will not open them. Sample files: example.zip. Validation matters because it ensures the files won't cause errors in P2PaLA.

wget https://www.primaresearch.org/schema/PAGE/gts/pagecontent/2017-07-15/pagecontent.xsd
xmllint --noout --schema pagecontent.xsd 0011.xml

Validation results:

0011.xml:1: element TextLine: Schemas validity error : Element '{http://schema.primaresearch.org/PAGE/gts/pagecontent/2017-07-15}TextLine': This element is not expected. Expected is one of ( {http://schema.primaresearch.org/PAGE/gts/pagecontent/2017-07-15}TextEquiv, {http://schema.primaresearch.org/PAGE/gts/pagecontent/2017-07-15}TextStyle ).
0011.xml:1: element TextLine: Schemas validity error : Element '{http://schema.primaresearch.org/PAGE/gts/pagecontent/2017-07-15}TextLine': This element is not expected. Expected is one of ( {http://schema.primaresearch.org/PAGE/gts/pagecontent/2017-07-15}TextEquiv, {http://schema.primaresearch.org/PAGE/gts/pagecontent/2017-07-15}TextStyle ).

Nesbi commented 5 years ago

This is strange, since we are using the official PRImA-Research-Lab/prima-core-libs for reading and creating our PAGE xmls.

Could you give me more information on how this XML was created? Did you use the newest version of Larex? Was this with standalone Larex or within OCR4all?

I'm asking because I'm unable to create such an XML myself. Opening and saving your XML file with Larex creates a valid PAGE XML file.

Edit: The problem seems to be that an empty TextEquiv is outside of the TextLine. Not sure how that happened.
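
For anyone who wants to check other files for the same pattern, here is a minimal sketch. It assumes the diagnosis above is right, i.e. that the offending files have a TextEquiv sitting next to a TextLine instead of inside it, so that a TextLine follows a sibling TextEquiv. The function name misplaced_text_lines and the SAMPLE document are made up for illustration and are not taken from example.zip; this is a quick heuristic, not a replacement for the xmllint schema check.

```python
import xml.etree.ElementTree as ET

# Namespace as it appears in the xmllint error messages in this thread.
PAGE_NS = "http://schema.primaresearch.org/PAGE/gts/pagecontent/2017-07-15"


def misplaced_text_lines(xml_text):
    """Return the ids of TextLine elements that follow a sibling TextEquiv.

    Heuristic only: it looks for the ordering problem described above and
    does not perform full schema validation.
    """
    root = ET.fromstring(xml_text)
    text_line = f"{{{PAGE_NS}}}TextLine"
    text_equiv = f"{{{PAGE_NS}}}TextEquiv"
    found = []
    for parent in root.iter():
        seen_equiv = False
        for child in parent:
            if child.tag == text_equiv:
                seen_equiv = True
            elif child.tag == text_line and seen_equiv:
                found.append(child.get("id", "<no id>"))
    return found


# Hypothetical, hand-written sample (not one of the reported files).
SAMPLE = f"""<PcGts xmlns="{PAGE_NS}">
  <Page imageFilename="example.png" imageWidth="100" imageHeight="100">
    <TextRegion id="r0">
      <Coords points="0,0 10,0 10,10 0,10"/>
      <TextEquiv><Unicode></Unicode></TextEquiv>
      <TextLine id="l0"><Coords points="0,0 10,0 10,5 0,5"/></TextLine>
    </TextRegion>
  </Page>
</PcGts>"""

if __name__ == "__main__":
    print(misplaced_text_lines(SAMPLE))
```

To check a real file, read it with open(path, encoding="utf-8").read() and pass the text to the function; an empty list means this particular ordering problem was not found.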

ghost commented 5 years ago

Hmm... I installed it using docker pull ls6uniwue/ocr4all and it validates now. Thanks, man.

OCR4all / LAREX

Page-xml files fail at validation #141