UB-Mannheim / ocr-fileformat

Validate and transform various OCR file formats (hOCR, ALTO, PAGE, FineReader)
https://digi.bib.uni-mannheim.de/ocr-fileformat/
MIT License

Replace broken Travis CI by GitHub action #168

Closed: stweil closed this 11 months ago

stweil commented 12 months ago

The CI currently fails for no obvious reason when running PageConverter.jar to convert an ALTO file (which looks fine) to PAGE XML. The failure can be reproduced manually with this command:

LANG=C.UTF-8 java -jar ../vendor/JPageConverter/PageConverter.jar -source-xml wetzel_reisebegleiter_1901_0021.alto -target-xml out -convert-to LATEST
org.xml.sax.SAXParseException; lineNumber: 1; columnNumber: 1; Premature end of file.
    at java.xml/com.sun.org.apache.xerces.internal.util.ErrorHandlerWrapper.createSAXParseException(ErrorHandlerWrapper.java:204)
    at java.xml/com.sun.org.apache.xerces.internal.util.ErrorHandlerWrapper.fatalError(ErrorHandlerWrapper.java:178)
    at java.xml/com.sun.org.apache.xerces.internal.impl.XMLErrorReporter.reportError(XMLErrorReporter.java:400)
    at java.xml/com.sun.org.apache.xerces.internal.impl.XMLErrorReporter.reportError(XMLErrorReporter.java:327)
    at java.xml/com.sun.org.apache.xerces.internal.impl.XMLScanner.reportFatalError(XMLScanner.java:1465)
    at java.xml/com.sun.org.apache.xerces.internal.impl.XMLDocumentScannerImpl$PrologDriver.next(XMLDocumentScannerImpl.java:1013)
    at java.xml/com.sun.org.apache.xerces.internal.impl.XMLDocumentScannerImpl.next(XMLDocumentScannerImpl.java:605)
    at java.xml/com.sun.org.apache.xerces.internal.impl.XMLNSDocumentScannerImpl.next(XMLNSDocumentScannerImpl.java:112)
    at java.xml/com.sun.org.apache.xerces.internal.impl.XMLDocumentFragmentScannerImpl.scanDocument(XMLDocumentFragmentScannerImpl.java:542)
    at java.xml/com.sun.org.apache.xerces.internal.impl.xs.opti.SchemaParsingConfig.parse(SchemaParsingConfig.java:640)
    at java.xml/com.sun.org.apache.xerces.internal.impl.xs.opti.SchemaParsingConfig.parse(SchemaParsingConfig.java:696)
    at java.xml/com.sun.org.apache.xerces.internal.impl.xs.opti.SchemaDOMParser.parse(SchemaDOMParser.java:530)
    at java.xml/com.sun.org.apache.xerces.internal.impl.xs.traversers.XSDHandler.getSchemaDocument(XSDHandler.java:2227)
    at java.xml/com.sun.org.apache.xerces.internal.impl.xs.traversers.XSDHandler.parseSchema(XSDHandler.java:589)
    at java.xml/com.sun.org.apache.xerces.internal.impl.xs.XMLSchemaLoader.loadSchema(XMLSchemaLoader.java:618)
    at java.xml/com.sun.org.apache.xerces.internal.impl.xs.XMLSchemaLoader.loadGrammar(XMLSchemaLoader.java:577)
    at java.xml/com.sun.org.apache.xerces.internal.impl.xs.XMLSchemaLoader.loadGrammar(XMLSchemaLoader.java:543)
    at java.xml/com.sun.org.apache.xerces.internal.jaxp.validation.XMLSchemaFactory.newSchema(XMLSchemaFactory.java:281)
    at java.xml/javax.xml.validation.SchemaFactory.newSchema(SchemaFactory.java:612)
    at org.primaresearch.io.xml.XmlValidator.getSchema(XmlValidator.java:55)
    at org.primaresearch.dla.page.io.xml.XmlPageReader.createMainParser(XmlPageReader.java:82)
    at org.primaresearch.dla.page.io.xml.XmlPageReader.parse(XmlPageReader.java:176)
    at org.primaresearch.dla.page.io.xml.XmlPageReader.read(XmlPageReader.java:130)
    at org.primaresearch.dla.page.io.xml.PageXmlInputOutput.readPage(PageXmlInputOutput.java:212)
    at org.primaresearch.dla.page.converter.PageConverter.run(PageConverter.java:230)
    at org.primaresearch.dla.page.converter.PageConverter.main(PageConverter.java:161)

kba commented 11 months ago

I also tried the call directly, but for me it fails with a different error:

java.io.IOException: Server returned HTTP response code: 403 for URL: http://www.loc.gov/standards/alto/alto.xsd

Apparently the LoC is using Cloudflare, which rejects requests sent with the user agent of PageConverter.jar.

If you have an idea how to get past that, I can investigate further. For now I am stuck.
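
To check whether the 403 really depends on the user agent, a small probe could request the schema URL once with Java's default User-Agent and once with an explicit one. This is only a diagnostic sketch: the class name `SchemaFetchProbe` and the user agent string `ocr-fileformat-probe/0.1` are made up for illustration, and nothing here reflects how PageConverter.jar itself fetches the schema.

```java
import java.io.IOException;
import java.net.HttpURLConnection;
import java.net.URI;

public class SchemaFetchProbe {
    // Schema URL copied from the IOException message above.
    static final String ALTO_XSD = "http://www.loc.gov/standards/alto/alto.xsd";

    // Prepares a GET request without sending it.
    // userAgent == null keeps the JDK's default User-Agent header.
    static HttpURLConnection prepare(String url, String userAgent) throws IOException {
        HttpURLConnection conn = (HttpURLConnection) URI.create(url).toURL().openConnection();
        conn.setRequestMethod("GET");
        if (userAgent != null) {
            conn.setRequestProperty("User-Agent", userAgent);
        }
        return conn;
    }

    // Sends the request and returns the HTTP status code.
    // Note: HttpURLConnection does not follow redirects from http to https.
    static int status(String url, String userAgent) throws IOException {
        HttpURLConnection conn = prepare(url, userAgent);
        try {
            return conn.getResponseCode();
        } finally {
            conn.disconnect();
        }
    }

    public static void main(String[] args) throws IOException {
        System.out.println("default User-Agent: " + status(ALTO_XSD, null));
        // Hypothetical user agent string, chosen only for this probe.
        System.out.println("custom User-Agent:  " + status(ALTO_XSD, "ocr-fileformat-probe/0.1"));
    }
}
```

If both requests return the same status, the user agent is probably not the deciding factor and the block would have to be explained some other way.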

One guess: the <?xml version="1.0" encoding="UTF-8"?> declaration might be the reason for failing on the first character of the first line.

stweil commented 11 months ago

Strange. Why do you see an IOException for an http URL when the ALTO file uses an https URL?

The error message that I get, with "lineNumber: 1; columnNumber: 1", is misleading. The ALTO input is processed and converted to a PAGE XML file that looks correct, so the SAXParseException occurs after the conversion. Nothing changes if I remove line 1 from the ALTO file.
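
The stack trace in the first comment passes through SchemaFactory.newSchema (called from XmlValidator.getSchema), so the position in the message refers to whatever was being read as a schema, not to the ALTO input. As a minimal sketch, assuming the schema source happens to be empty, the same kind of exception can be produced with the JDK alone. Whether PageConverter.jar actually receives an empty schema is not confirmed in this thread, and the class name `EmptySchemaDemo` is hypothetical.

```java
import java.io.ByteArrayInputStream;
import javax.xml.XMLConstants;
import javax.xml.transform.stream.StreamSource;
import javax.xml.validation.SchemaFactory;
import org.xml.sax.SAXException;
import org.xml.sax.SAXParseException;

public class EmptySchemaDemo {
    // Tries to compile an XSD from raw bytes.
    // Returns the parse error, or null if the schema compiled.
    static SAXParseException loadSchema(byte[] xsdBytes) {
        SchemaFactory factory = SchemaFactory.newInstance(XMLConstants.W3C_XML_SCHEMA_NS_URI);
        try {
            factory.newSchema(new StreamSource(new ByteArrayInputStream(xsdBytes)));
            return null;
        } catch (SAXParseException e) {
            return e;
        } catch (SAXException e) {
            throw new IllegalStateException("unexpected non-parse failure", e);
        }
    }

    public static void main(String[] args) {
        // Zero bytes stand in for a schema download that delivered no content.
        SAXParseException e = loadSchema(new byte[0]);
        System.out.println(e);
    }
}
```

This would also be consistent with the observation that removing line 1 from the ALTO file changes nothing, since the ALTO file is not the document being parsed at that point.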

stweil commented 11 months ago

I'll merge this pull request. The PageConverter issue will be handled separately.