OCR-D / ocrd_fileformat

OCR-D wrapper for ocr-fileformat
Apache License 2.0

Fix error handling #10

Closed wrznr closed 3 years ago

wrznr commented 4 years ago

Conversion bails out with the following error while converting PAGE to ALTO:

+ ocr-transform page alto TEXT/FILE_0001_TEXT.xml ALTO/FILE_0001_ALTO.xml --
org.xml.sax.SAXParseException; lineNumber: 1; columnNumber: 1; Premature end of file.
    at java.xml/com.sun.org.apache.xerces.internal.util.ErrorHandlerWrapper.createSAXParseException(ErrorHandlerWrapper.java:204)
    at java.xml/com.sun.org.apache.xerces.internal.util.ErrorHandlerWrapper.fatalError(ErrorHandlerWrapper.java:178)
    at java.xml/com.sun.org.apache.xerces.internal.impl.XMLErrorReporter.reportError(XMLErrorReporter.java:400)
    at java.xml/com.sun.org.apache.xerces.internal.impl.XMLErrorReporter.reportError(XMLErrorReporter.java:327)
    at java.xml/com.sun.org.apache.xerces.internal.impl.XMLScanner.reportFatalError(XMLScanner.java:1471)
    at java.xml/com.sun.org.apache.xerces.internal.impl.XMLDocumentScannerImpl$PrologDriver.next(XMLDocumentScannerImpl.java:1013)
    at java.xml/com.sun.org.apache.xerces.internal.impl.XMLDocumentScannerImpl.next(XMLDocumentScannerImpl.java:605)
    at java.xml/com.sun.org.apache.xerces.internal.impl.XMLNSDocumentScannerImpl.next(XMLNSDocumentScannerImpl.java:112)
    at java.xml/com.sun.org.apache.xerces.internal.impl.XMLDocumentFragmentScannerImpl.scanDocument(XMLDocumentFragmentScannerImpl.java:534)
    at java.xml/com.sun.org.apache.xerces.internal.impl.xs.opti.SchemaParsingConfig.parse(SchemaParsingConfig.java:640)
    at java.xml/com.sun.org.apache.xerces.internal.impl.xs.opti.SchemaParsingConfig.parse(SchemaParsingConfig.java:696)
    at java.xml/com.sun.org.apache.xerces.internal.impl.xs.opti.SchemaDOMParser.parse(SchemaDOMParser.java:530)
    at java.xml/com.sun.org.apache.xerces.internal.impl.xs.traversers.XSDHandler.getSchemaDocument(XSDHandler.java:2226)
    at java.xml/com.sun.org.apache.xerces.internal.impl.xs.traversers.XSDHandler.parseSchema(XSDHandler.java:588)
    at java.xml/com.sun.org.apache.xerces.internal.impl.xs.XMLSchemaLoader.loadSchema(XMLSchemaLoader.java:617)
    at java.xml/com.sun.org.apache.xerces.internal.impl.xs.XMLSchemaLoader.loadGrammar(XMLSchemaLoader.java:576)
    at java.xml/com.sun.org.apache.xerces.internal.impl.xs.XMLSchemaLoader.loadGrammar(XMLSchemaLoader.java:542)
    at java.xml/com.sun.org.apache.xerces.internal.jaxp.validation.XMLSchemaFactory.newSchema(XMLSchemaFactory.java:276)
    at java.xml/javax.xml.validation.SchemaFactory.newSchema(SchemaFactory.java:669)
    at org.primaresearch.io.xml.XmlValidator.getSchema(XmlValidator.java:55)
    at org.primaresearch.dla.page.io.xml.XmlPageWriter_Alto.run(XmlPageWriter_Alto.java:186)
    at org.primaresearch.dla.page.io.xml.XmlPageWriter_Alto.write(XmlPageWriter_Alto.java:101)
    at org.primaresearch.dla.page.converter.PageConverter.run(PageConverter.java:232)
    at org.primaresearch.dla.page.converter.PageConverter.main(PageConverter.java:130)
Could not initialise ALTO XML writer
java.lang.NullPointerException
    at org.primaresearch.dla.page.io.xml.XmlPageWriter_Alto.run(XmlPageWriter_Alto.java:186)
    at org.primaresearch.dla.page.io.xml.XmlPageWriter_Alto.write(XmlPageWriter_Alto.java:101)
    at org.primaresearch.dla.page.converter.PageConverter.run(PageConverter.java:232)
    at org.primaresearch.dla.page.converter.PageConverter.main(PageConverter.java:130)

Consequently, no ALTO file is created. However, an entry in the METS file is created nonetheless, so rerunning the conversion fails when the same file ID is added again:

+ declare -a options
+ '[' -n PHYS_0001 ']'
+ options=(-g $pageid)
+ options+=(-G $out_file_grp -m "$output_mimetype" -i "$out_id" "$out_file")
+ ocrd workspace add -g PHYS_0001 -G ALTO -m application/alto+xml -i FILE_0001_ALTO ALTO/FILE_0001_ALTO.xml
Traceback (most recent call last):
  File "/home/kmw/OCR-D/env/bin/ocrd", line 8, in <module>
    sys.exit(cli())
  File "/home/kmw/OCR-D/env/lib/python3.7/site-packages/click/core.py", line 764, in __call__
    return self.main(*args, **kwargs)
  File "/home/kmw/OCR-D/env/lib/python3.7/site-packages/click/core.py", line 717, in main
    rv = self.invoke(ctx)
  File "/home/kmw/OCR-D/env/lib/python3.7/site-packages/click/core.py", line 1137, in invoke
    return _process_result(sub_ctx.command.invoke(sub_ctx))
  File "/home/kmw/OCR-D/env/lib/python3.7/site-packages/click/core.py", line 1137, in invoke
    return _process_result(sub_ctx.command.invoke(sub_ctx))
  File "/home/kmw/OCR-D/env/lib/python3.7/site-packages/click/core.py", line 956, in invoke
    return ctx.invoke(self.callback, **ctx.params)
  File "/home/kmw/OCR-D/env/lib/python3.7/site-packages/click/core.py", line 555, in invoke
    return callback(*args, **kwargs)
  File "/home/kmw/OCR-D/env/lib/python3.7/site-packages/click/decorators.py", line 64, in new_func
    return ctx.invoke(f, obj, *args, **kwargs)
  File "/home/kmw/OCR-D/env/lib/python3.7/site-packages/click/core.py", line 555, in invoke
    return callback(*args, **kwargs)
  File "/home/kmw/OCR-D/env/lib/python3.7/site-packages/ocrd/cli/workspace.py", line 178, in workspace_add_file
    workspace.mets.add_file(**kwargs)
  File "/home/kmw/OCR-D/env/lib/python3.7/site-packages/ocrd_models/ocrd_mets.py", line 261, in add_file
    raise Exception("File with ID='%s' already exists" % ID)
Exception: File with ID='FILE_0001_ALTO' already exists

FILE_0001_TEXT.xml.zip

VolkerHartmann commented 4 years ago

I had a similar problem. XML validation fails because another schema is imported via http, e.g. in mets.xsd.

The website redirects to https, which is not handled correctly by the XML validator. Workaround: check whether the import is available via both http and https. If so, replace http with https.
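
The replacement step of this workaround could be sketched as follows. This is only an illustration, not what any OCR-D tool does: the function name `https_schema_locations` is hypothetical, it assumes the import is written as `schemaLocation="http://..."` with a single URL, and it leaves out the check that the schema is actually reachable via https.

```shell
#!/usr/bin/env sh
# Hypothetical helper: read an XSD on stdin, write it to stdout with
# http:// replaced by https:// inside schemaLocation attributes only.
# Namespace URIs (namespace="http://...") are identifiers, not download
# locations, so they are deliberately left untouched.
https_schema_locations() {
    sed -E 's#(schemaLocation=")http://#\1https://#g'
}
```

It would be used on a local copy of the schema, for example `https_schema_locations < mets.xsd > mets.https.xsd` (the output file name is made up here).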

kba commented 4 years ago

@wrznr The XSD files in ocr-fileformat are probably empty. Can you try reinstalling ocrd-fileformat:

make uninstall install-fileformat install
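
Before reinstalling, one way to confirm the suspicion is to look for zero-byte schema files, which would explain the "Premature end of file" at line 1, column 1. A minimal sketch, assuming you know the directory where ocr-fileformat keeps its XSD files (the thread does not give the path, so it is passed as an argument); the function name `find_empty_xsd` is hypothetical:

```shell
#!/usr/bin/env sh
# Hypothetical helper: list all empty *.xsd files below a directory.
# "$1" is the directory to search (defaults to the current directory).
find_empty_xsd() {
    # -size 0 matches only files with no content at all
    find "${1:-.}" -type f -name '*.xsd' -size 0
}
```

If this prints any paths, the schemas were not installed completely and a reinstall is the likely fix.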

kba commented 4 years ago

> I had a similar problem. XML validation fails because another schema is imported via http, e.g. in mets.xsd. The website redirects to https, which is not handled correctly by the XML validator. Workaround: check whether the import is available via both http and https. If so, replace http with https.

This is unrelated and was fixed in core in https://github.com/OCR-D/core/releases/v2.10.5.

The problem @wrznr describes is most likely due to an incomplete installation of ocr-fileformat. However, the script should indeed not add files to the METS if the conversion fails.
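
The guard described here could look roughly like the sketch below. It is not the actual fix: the function name `convert_then_register` and its calling convention are hypothetical. It checks both the exit status and that the output file is non-empty, since a converter may print a stack trace without signalling failure.

```shell
#!/usr/bin/env sh
# Hypothetical helper: run a conversion and register the result only on success.
#   $1     output file the conversion is expected to write
#   $2     command or function that registers the file (receives the path)
#   $3...  the conversion command and its arguments
convert_then_register() {
    local out_file="$1" register="$2"
    shift 2
    if ! "$@"; then
        echo "conversion failed: $*" >&2
        rm -f -- "$out_file"   # do not leave a partial result behind
        return 1
    fi
    # Also treat a missing or empty output file as a failed conversion
    if [ ! -s "$out_file" ]; then
        echo "conversion produced no output: $out_file" >&2
        return 1
    fi
    "$register" "$out_file"
}

# Usage with the commands from the log above:
# register_alto() {
#     ocrd workspace add -g PHYS_0001 -G ALTO -m application/alto+xml -i FILE_0001_ALTO "$1"
# }
# convert_then_register ALTO/FILE_0001_ALTO.xml register_alto \
#     ocr-transform page alto TEXT/FILE_0001_TEXT.xml ALTO/FILE_0001_ALTO.xml
```

With this ordering, a failed conversion never reaches `ocrd workspace add`, so a rerun does not hit the "already exists" exception.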

bertsky commented 3 years ago

Fixed by 1f33f1f, plus make install has a correct dependency on ocr-fileformat – @wrznr can this be closed?

kba commented 3 years ago

The original error is fixed upstream: failing conversions won't be added to METS and will be correctly signalled as failures to ocrd_fileformat, which will just log the error and continue.