Mewel / abbyy-to-alto

Converts FineReader abbyy.xml to alto.xml.
MIT License

NullPointerException at com.sun.xml.internal.bind.v2.runtime.JAXBContextImpl.getBeanInfo(JAXBContextImpl.java:540) #3

Closed ponchofiesta closed 5 years ago

ponchofiesta commented 5 years ago

As this exception is thrown in a Java XML class, I assume it's hard to fix, but maybe it is caused by your library. When trying to convert this file, a NullPointerException is thrown.

Using OpenJDK-8 on Ubuntu 16.04

java.lang.NullPointerException: null
    at com.sun.xml.internal.bind.v2.runtime.JAXBContextImpl.getBeanInfo(JAXBContextImpl.java:540)
    at com.sun.xml.internal.bind.v2.runtime.JAXBContextImpl.getBeanInfo(JAXBContextImpl.java:562)
    at com.sun.xml.internal.bind.v2.runtime.reflect.Lister$IDREFSIterator.next(Lister.java:442)
    at com.sun.xml.internal.bind.v2.runtime.reflect.Lister$IDREFSIterator.next(Lister.java:419)
    at com.sun.xml.internal.bind.v2.runtime.reflect.ListTransducedAccessorImpl.print(ListTransducedAccessorImpl.java:100)
    at com.sun.xml.internal.bind.v2.runtime.reflect.ListTransducedAccessorImpl.print(ListTransducedAccessorImpl.java:42)
    at com.sun.xml.internal.bind.v2.runtime.property.AttributeProperty.serializeAttributes(AttributeProperty.java:86)
    at com.sun.xml.internal.bind.v2.runtime.ClassBeanInfoImpl.serializeAttributes(ClassBeanInfoImpl.java:368)
    at com.sun.xml.internal.bind.v2.runtime.XMLSerializer.childAsXsiType(XMLSerializer.java:674)
    at com.sun.xml.internal.bind.v2.runtime.property.ArrayElementNodeProperty.serializeItem(ArrayElementNodeProperty.java:54)
    at com.sun.xml.internal.bind.v2.runtime.property.ArrayElementProperty.serializeListBody(ArrayElementProperty.java:157)
    at com.sun.xml.internal.bind.v2.runtime.property.ArrayERProperty.serializeBody(ArrayERProperty.java:144)
    at com.sun.xml.internal.bind.v2.runtime.ClassBeanInfoImpl.serializeBody(ClassBeanInfoImpl.java:345)
    at com.sun.xml.internal.bind.v2.runtime.XMLSerializer.childAsXsiType(XMLSerializer.java:681)
    at com.sun.xml.internal.bind.v2.runtime.property.ArrayElementNodeProperty.serializeItem(ArrayElementNodeProperty.java:54)
    at com.sun.xml.internal.bind.v2.runtime.property.ArrayElementProperty.serializeListBody(ArrayElementProperty.java:157)
    at com.sun.xml.internal.bind.v2.runtime.property.ArrayERProperty.serializeBody(ArrayERProperty.java:144)
    at com.sun.xml.internal.bind.v2.runtime.ClassBeanInfoImpl.serializeBody(ClassBeanInfoImpl.java:345)
    at com.sun.xml.internal.bind.v2.runtime.XMLSerializer.childAsXsiType(XMLSerializer.java:681)
    at com.sun.xml.internal.bind.v2.runtime.property.ArrayElementNodeProperty.serializeItem(ArrayElementNodeProperty.java:54)
    at com.sun.xml.internal.bind.v2.runtime.property.ArrayElementProperty.serializeListBody(ArrayElementProperty.java:157)
    at com.sun.xml.internal.bind.v2.runtime.property.ArrayERProperty.serializeBody(ArrayERProperty.java:144)
    at com.sun.xml.internal.bind.v2.runtime.ClassBeanInfoImpl.serializeBody(ClassBeanInfoImpl.java:345)
    at com.sun.xml.internal.bind.v2.runtime.XMLSerializer.childAsXsiType(XMLSerializer.java:681)
    at com.sun.xml.internal.bind.v2.runtime.property.SingleElementNodeProperty.serializeBody(SingleElementNodeProperty.java:143)
    at com.sun.xml.internal.bind.v2.runtime.ClassBeanInfoImpl.serializeBody(ClassBeanInfoImpl.java:345)
    at com.sun.xml.internal.bind.v2.runtime.XMLSerializer.childAsXsiType(XMLSerializer.java:681)
    at com.sun.xml.internal.bind.v2.runtime.property.ArrayElementNodeProperty.serializeItem(ArrayElementNodeProperty.java:54)
    at com.sun.xml.internal.bind.v2.runtime.property.ArrayElementProperty.serializeListBody(ArrayElementProperty.java:157)
    at com.sun.xml.internal.bind.v2.runtime.property.ArrayERProperty.serializeBody(ArrayERProperty.java:144)
    at com.sun.xml.internal.bind.v2.runtime.ClassBeanInfoImpl.serializeBody(ClassBeanInfoImpl.java:345)
    at com.sun.xml.internal.bind.v2.runtime.XMLSerializer.childAsXsiType(XMLSerializer.java:681)
    at com.sun.xml.internal.bind.v2.runtime.property.SingleElementNodeProperty.serializeBody(SingleElementNodeProperty.java:143)
    at com.sun.xml.internal.bind.v2.runtime.ClassBeanInfoImpl.serializeBody(ClassBeanInfoImpl.java:345)
    at com.sun.xml.internal.bind.v2.runtime.XMLSerializer.childAsSoleContent(XMLSerializer.java:578)
    at com.sun.xml.internal.bind.v2.runtime.ClassBeanInfoImpl.serializeRoot(ClassBeanInfoImpl.java:326)
    at com.sun.xml.internal.bind.v2.runtime.XMLSerializer.childAsRoot(XMLSerializer.java:479)
    at com.sun.xml.internal.bind.v2.runtime.MarshallerImpl.write(MarshallerImpl.java:308)
    at com.sun.xml.internal.bind.v2.runtime.MarshallerImpl.marshal(MarshallerImpl.java:236)
    at javax.xml.bind.helpers.AbstractMarshallerImpl.marshal(AbstractMarshallerImpl.java:95)
    at org.mycore.xml.JAXBUtil.marshalAlto(JAXBUtil.java:68)
    at org.kitodo.mediaserver.core.conversion.ocr.AbbyyToAltoOcrConverter.convert(AbbyyToAltoOcrConverter.java:68)
    at org.kitodo.mediaserver.core.actions.AbbyyToAltoOcrConvertAction.lambda$perform$0(AbbyyToAltoOcrConvertAction.java:87)
    at java.lang.Iterable.forEach(Iterable.java:75)
    at org.kitodo.mediaserver.core.actions.AbbyyToAltoOcrConvertAction.perform(AbbyyToAltoOcrConvertAction.java:82)
    at org.kitodo.mediaserver.core.services.ActionService.performImmediately(ActionService.java:149)
    at org.kitodo.mediaserver.core.services.ActionService$$FastClassBySpringCGLIB$$6de31ad7.invoke(<generated>)
    at org.springframework.cglib.proxy.MethodProxy.invoke(MethodProxy.java:204)
    at org.springframework.aop.framework.CglibAopProxy$DynamicAdvisedInterceptor.intercept(CglibAopProxy.java:685)
    at org.kitodo.mediaserver.core.services.ActionService$$EnhancerBySpringCGLIB$$ef80a675.performImmediately(<generated>)
    at org.kitodo.mediaserver.importer.control.ImporterFlowControl.performActions(ImporterFlowControl.java:386)
    at org.kitodo.mediaserver.importer.control.ImporterFlowControl.importWorks(ImporterFlowControl.java:332)
    at org.kitodo.mediaserver.cli.commands.ImportCommand.importWorks(ImportCommand.java:126)
    at org.kitodo.mediaserver.cli.commands.ImportCommand.call(ImportCommand.java:168)
    at org.kitodo.mediaserver.cli.Terminal.run(Terminal.java:84)
    at org.springframework.boot.SpringApplication.callRunner(SpringApplication.java:790)
    at org.springframework.boot.SpringApplication.callRunners(SpringApplication.java:774)
    at org.springframework.boot.SpringApplication.run(SpringApplication.java:335)
    at org.kitodo.mediaserver.cli.CliApplication.main(CliApplication.java:42)
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.lang.reflect.Method.invoke(Method.java:498)
    at org.springframework.boot.maven.AbstractRunMojo$LaunchRunner.run(AbstractRunMojo.java:496)
    at java.lang.Thread.run(Thread.java:748)
Could not convert OCR file '/srv/kitodo/mediaserver/files/BV040105361/serlarch_BV040105361_xml/serlarch_bv040105361_0303.xml'.
java.lang.NullPointerException: null
    at com.sun.xml.internal.bind.v2.runtime.JAXBContextImpl.getBeanInfo(JAXBContextImpl.java:540) ~[na:1.8.0_191]
    at com.sun.xml.internal.bind.v2.runtime.JAXBContextImpl.getBeanInfo(JAXBContextImpl.java:562) ~[na:1.8.0_191]
    at com.sun.xml.internal.bind.v2.runtime.reflect.Lister$IDREFSIterator.next(Lister.java:442) ~[na:1.8.0_191]
    at com.sun.xml.internal.bind.v2.runtime.reflect.Lister$IDREFSIterator.next(Lister.java:419) ~[na:1.8.0_191]
    at com.sun.xml.internal.bind.v2.runtime.reflect.ListTransducedAccessorImpl.print(ListTransducedAccessorImpl.java:100) ~[na:1.8.0_191]
    at com.sun.xml.internal.bind.v2.runtime.reflect.ListTransducedAccessorImpl.print(ListTransducedAccessorImpl.java:42) ~[na:1.8.0_191]
    at com.sun.xml.internal.bind.v2.runtime.property.AttributeProperty.serializeAttributes(AttributeProperty.java:86) ~[na:1.8.0_191]
    at com.sun.xml.internal.bind.v2.runtime.ClassBeanInfoImpl.serializeAttributes(ClassBeanInfoImpl.java:368) ~[na:1.8.0_191]
    at com.sun.xml.internal.bind.v2.runtime.XMLSerializer.childAsXsiType(XMLSerializer.java:674) ~[na:1.8.0_191]
    at com.sun.xml.internal.bind.v2.runtime.property.ArrayElementNodeProperty.serializeItem(ArrayElementNodeProperty.java:54) ~[na:1.8.0_191]
    at com.sun.xml.internal.bind.v2.runtime.property.ArrayElementProperty.serializeListBody(ArrayElementProperty.java:157) ~[na:1.8.0_191]
    at com.sun.xml.internal.bind.v2.runtime.property.ArrayERProperty.serializeBody(ArrayERProperty.java:144) ~[na:1.8.0_191]
    at com.sun.xml.internal.bind.v2.runtime.ClassBeanInfoImpl.serializeBody(ClassBeanInfoImpl.java:345) ~[na:1.8.0_191]
    at com.sun.xml.internal.bind.v2.runtime.XMLSerializer.childAsXsiType(XMLSerializer.java:681) ~[na:1.8.0_191]
    at com.sun.xml.internal.bind.v2.runtime.property.ArrayElementNodeProperty.serializeItem(ArrayElementNodeProperty.java:54) ~[na:1.8.0_191]
    at com.sun.xml.internal.bind.v2.runtime.property.ArrayElementProperty.serializeListBody(ArrayElementProperty.java:157) ~[na:1.8.0_191]
    at com.sun.xml.internal.bind.v2.runtime.property.ArrayERProperty.serializeBody(ArrayERProperty.java:144) ~[na:1.8.0_191]
    at com.sun.xml.internal.bind.v2.runtime.ClassBeanInfoImpl.serializeBody(ClassBeanInfoImpl.java:345) ~[na:1.8.0_191]
    at com.sun.xml.internal.bind.v2.runtime.XMLSerializer.childAsXsiType(XMLSerializer.java:681) ~[na:1.8.0_191]
    at com.sun.xml.internal.bind.v2.runtime.property.ArrayElementNodeProperty.serializeItem(ArrayElementNodeProperty.java:54) ~[na:1.8.0_191]
    at com.sun.xml.internal.bind.v2.runtime.property.ArrayElementProperty.serializeListBody(ArrayElementProperty.java:157) ~[na:1.8.0_191]
    at com.sun.xml.internal.bind.v2.runtime.property.ArrayERProperty.serializeBody(ArrayERProperty.java:144) ~[na:1.8.0_191]
    at com.sun.xml.internal.bind.v2.runtime.ClassBeanInfoImpl.serializeBody(ClassBeanInfoImpl.java:345) ~[na:1.8.0_191]
    at com.sun.xml.internal.bind.v2.runtime.XMLSerializer.childAsXsiType(XMLSerializer.java:681) ~[na:1.8.0_191]
    at com.sun.xml.internal.bind.v2.runtime.property.SingleElementNodeProperty.serializeBody(SingleElementNodeProperty.java:143) ~[na:1.8.0_191]
    at com.sun.xml.internal.bind.v2.runtime.ClassBeanInfoImpl.serializeBody(ClassBeanInfoImpl.java:345) ~[na:1.8.0_191]
    at com.sun.xml.internal.bind.v2.runtime.XMLSerializer.childAsXsiType(XMLSerializer.java:681) ~[na:1.8.0_191]
    at com.sun.xml.internal.bind.v2.runtime.property.ArrayElementNodeProperty.serializeItem(ArrayElementNodeProperty.java:54) ~[na:1.8.0_191]
    at com.sun.xml.internal.bind.v2.runtime.property.ArrayElementProperty.serializeListBody(ArrayElementProperty.java:157) ~[na:1.8.0_191]
    at com.sun.xml.internal.bind.v2.runtime.property.ArrayERProperty.serializeBody(ArrayERProperty.java:144) ~[na:1.8.0_191]
    at com.sun.xml.internal.bind.v2.runtime.ClassBeanInfoImpl.serializeBody(ClassBeanInfoImpl.java:345) ~[na:1.8.0_191]
    at com.sun.xml.internal.bind.v2.runtime.XMLSerializer.childAsXsiType(XMLSerializer.java:681) ~[na:1.8.0_191]
    at com.sun.xml.internal.bind.v2.runtime.property.SingleElementNodeProperty.serializeBody(SingleElementNodeProperty.java:143) ~[na:1.8.0_191]
    at com.sun.xml.internal.bind.v2.runtime.ClassBeanInfoImpl.serializeBody(ClassBeanInfoImpl.java:345) ~[na:1.8.0_191]
    at com.sun.xml.internal.bind.v2.runtime.XMLSerializer.childAsSoleContent(XMLSerializer.java:578) ~[na:1.8.0_191]
    at com.sun.xml.internal.bind.v2.runtime.ClassBeanInfoImpl.serializeRoot(ClassBeanInfoImpl.java:326) ~[na:1.8.0_191]
    at com.sun.xml.internal.bind.v2.runtime.XMLSerializer.childAsRoot(XMLSerializer.java:479) ~[na:1.8.0_191]
    at com.sun.xml.internal.bind.v2.runtime.MarshallerImpl.write(MarshallerImpl.java:308) ~[na:1.8.0_191]
    at com.sun.xml.internal.bind.v2.runtime.MarshallerImpl.marshal(MarshallerImpl.java:236) ~[na:1.8.0_191]
    at javax.xml.bind.helpers.AbstractMarshallerImpl.marshal(AbstractMarshallerImpl.java:95) ~[na:1.8.0_191]
    at org.mycore.xml.JAXBUtil.marshalAlto(JAXBUtil.java:68) ~[abbyy-to-alto-8c784b5b6f03d94e5e9771940a8bec9015a9c210.jar:na]
    at org.kitodo.mediaserver.core.conversion.ocr.AbbyyToAltoOcrConverter.convert(AbbyyToAltoOcrConverter.java:68) ~[kitodo-mediaserver-core-1.0-SNAPSHOT.jar:1.0-SNAPSHOT]
    at org.kitodo.mediaserver.core.actions.AbbyyToAltoOcrConvertAction.lambda$perform$0(AbbyyToAltoOcrConvertAction.java:87) ~[kitodo-mediaserver-core-1.0-SNAPSHOT.jar:1.0-SNAPSHOT]
    at java.lang.Iterable.forEach(Iterable.java:75) ~[na:1.8.0_191]
    at org.kitodo.mediaserver.core.actions.AbbyyToAltoOcrConvertAction.perform(AbbyyToAltoOcrConvertAction.java:82) ~[kitodo-mediaserver-core-1.0-SNAPSHOT.jar:1.0-SNAPSHOT]
    at org.kitodo.mediaserver.core.services.ActionService.performImmediately(ActionService.java:149) ~[kitodo-mediaserver-core-1.0-SNAPSHOT.jar:1.0-SNAPSHOT]
    at org.kitodo.mediaserver.core.services.ActionService$$FastClassBySpringCGLIB$$6de31ad7.invoke(<generated>) ~[kitodo-mediaserver-core-1.0-SNAPSHOT.jar:1.0-SNAPSHOT]
    at org.springframework.cglib.proxy.MethodProxy.invoke(MethodProxy.java:204) ~[spring-core-5.0.4.RELEASE.jar:5.0.4.RELEASE]
    at org.springframework.aop.framework.CglibAopProxy$DynamicAdvisedInterceptor.intercept(CglibAopProxy.java:685) ~[spring-aop-5.0.4.RELEASE.jar:5.0.4.RELEASE]
    at org.kitodo.mediaserver.core.services.ActionService$$EnhancerBySpringCGLIB$$ef80a675.performImmediately(<generated>) ~[kitodo-mediaserver-core-1.0-SNAPSHOT.jar:1.0-SNAPSHOT]
    at org.kitodo.mediaserver.importer.control.ImporterFlowControl.performActions(ImporterFlowControl.java:386) ~[kitodo-mediaserver-importer-1.0-SNAPSHOT.jar:1.0-SNAPSHOT]
    at org.kitodo.mediaserver.importer.control.ImporterFlowControl.importWorks(ImporterFlowControl.java:332) ~[kitodo-mediaserver-importer-1.0-SNAPSHOT.jar:1.0-SNAPSHOT]
    at org.kitodo.mediaserver.cli.commands.ImportCommand.importWorks(ImportCommand.java:126) ~[classes/:na]
    at org.kitodo.mediaserver.cli.commands.ImportCommand.call(ImportCommand.java:168) ~[classes/:na]
    at org.kitodo.mediaserver.cli.Terminal.run(Terminal.java:84) ~[classes/:na]
    at org.springframework.boot.SpringApplication.callRunner(SpringApplication.java:790) ~[spring-boot-2.0.0.RELEASE.jar:2.0.0.RELEASE]
    at org.springframework.boot.SpringApplication.callRunners(SpringApplication.java:774) ~[spring-boot-2.0.0.RELEASE.jar:2.0.0.RELEASE]
    at org.springframework.boot.SpringApplication.run(SpringApplication.java:335) ~[spring-boot-2.0.0.RELEASE.jar:2.0.0.RELEASE]
    at org.kitodo.mediaserver.cli.CliApplication.main(CliApplication.java:42) ~[classes/:na]
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) ~[na:1.8.0_191]
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62) ~[na:1.8.0_191]
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) ~[na:1.8.0_191]
    at java.lang.reflect.Method.invoke(Method.java:498) ~[na:1.8.0_191]
    at org.springframework.boot.maven.AbstractRunMojo$LaunchRunner.run(AbstractRunMojo.java:496) ~[na:na]
    at java.lang.Thread.run(Thread.java:748) ~[na:1.8.0_191]

This is the code I used to convert the file:

        // Read ABBYY file
        Document abbyyDocument;
        try (InputStream inputStream = Files.newInputStream(sourceFile, StandardOpenOption.READ)) {
            abbyyDocument = JAXBUtil.unmarshalAbbyyDocument(inputStream);
        }

        // Convert to ALTO
        Alto alto;
        try {
            AbbyyToAltoConverter converter = new AbbyyToAltoConverter();
            converter.setEnableConfidence(false);
            alto = converter.convert(abbyyDocument);
        } catch (Exception ex) {
            throw new Exception("Could not convert OCR file '" + sourceFile + "'", ex);
        }

        // Write ALTO file
        if (!Files.exists(destFile.getParent())) {
            Files.createDirectories(destFile.getParent());
        }
        try (OutputStream outStream = Files.newOutputStream(destFile)) {
            JAXBUtil.marshalAlto(alto, outStream);
        }

Mewel commented 5 years ago

The code is OK, I guess. I just looked at your XML, and it seems that the part between lines 5824 and 5826 causes the error.

`

`

If you remove it, everything works fine. Maybe you can look deeper into this.

**EDIT:** The charParams is missing the content.

ponchofiesta commented 5 years ago

OK but AFAIK this is valid. A space character is a character too. Shouldn't this be converted to SP in Alto? Or is the problem that this is the only char in this line/paragraph? I think this should be caught by the library then(?)

EDIT: OK, XML ignores white space characters. Maybe there is a way to enforce white-space chars.

EDIT: I don't think this is the problem. There are several chars that contain only a whitespace character. They all are converted to CONTENT="&nbsp;". But when there is one single whitespace char in a line and paragraph, it crashes.
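
As a rough check of the whitespace question, here is a minimal sketch that parses a space-only element with the JDK's default DOM parser. It assumes a simplified `charParams` element without the real ABBYY attributes, and the class and method names are made up for illustration. It only shows generic XML parsing and says nothing about how `JAXBUtil` unmarshals the file.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;

public class WhitespaceTextSketch {

    /** Parses the XML string and returns the text content of its root element. */
    static String rootText(String xml) {
        try {
            DocumentBuilder builder = DocumentBuilderFactory.newInstance().newDocumentBuilder();
            Document doc = builder.parse(
                    new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)));
            return doc.getDocumentElement().getTextContent();
        } catch (Exception ex) {
            // Wrap checked parser exceptions so callers stay simple.
            throw new IllegalStateException("Could not parse XML", ex);
        }
    }

    public static void main(String[] args) {
        // Simplified stand-ins for a charParams element (assumption: no attributes, no schema).
        String space = rootText("<charParams> </charParams>");
        String empty = rootText("<charParams></charParams>");
        System.out.println("space-only element, text length: " + space.length());
        System.out.println("empty element, text length: " + empty.length());
    }
}
```

With no DTD or schema involved, the parser keeps the single space as text, which fits the observation that whitespace-only chars are normally converted without trouble.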

Mewel commented 5 years ago

Hm, what I can do is ignore it. That would be a fast fix. Something like this (I would make it a bit prettier, because I'm not sure what happens if there are multiple spaces but no other content):

        int contentSize = altoLine.getStringAndSP().size();
        if (contentSize == 0 || (contentSize <= 1 && altoLine.getStringAndSP().get(0) instanceof SP)) {
            textBlock.getTextLine().remove(altoLine);
        }

That would leave an empty TextBlock. To get rid of this I would also add an additional "if":

        if (!paragraphBlock.getTextLine().isEmpty()) {
            composedBlock.getContent().add(paragraphBlock);
        }

For me that would be fine and I can easily add it. And I don't think you really lose any information. What do you think?
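
A self-contained sketch of the idea above, which also covers the case of several spaces with no other content. `LineContent`, `Word`, `Space`, `isBlank` and `dropBlankLines` are hypothetical stand-ins for illustration; they are not the library's generated ALTO classes, and this is not necessarily how the fix was implemented.

```java
import java.util.ArrayList;
import java.util.List;

public class BlankLineFilterSketch {

    // Hypothetical stand-ins for the ALTO line content types (String and SP).
    interface LineContent {}

    static final class Word implements LineContent {
        final String content;
        Word(String content) { this.content = content; }
    }

    static final class Space implements LineContent {}

    /** True when the line has no items, or nothing but spaces (one or many). */
    static boolean isBlank(List<? extends LineContent> line) {
        for (LineContent item : line) {
            if (!(item instanceof Space)) {
                return false;
            }
        }
        return true;
    }

    /** Returns a copy of the block without blank lines; the input is not modified. */
    static List<List<LineContent>> dropBlankLines(List<List<LineContent>> block) {
        List<List<LineContent>> kept = new ArrayList<>();
        for (List<LineContent> line : block) {
            if (!isBlank(line)) {
                kept.add(line);
            }
        }
        return kept;
    }

    public static void main(String[] args) {
        List<LineContent> text = new ArrayList<>();
        text.add(new Word("Hello"));
        text.add(new Space());
        text.add(new Word("world"));

        List<LineContent> onlySpace = new ArrayList<>();
        onlySpace.add(new Space());

        List<List<LineContent>> block = new ArrayList<>();
        block.add(text);
        block.add(onlySpace);

        List<List<LineContent>> kept = dropBlankLines(block);
        System.out.println("lines kept: " + kept.size());
        // A block left with no lines would simply not be added to its parent.
        System.out.println("add block to parent: " + !kept.isEmpty());
    }
}
```

Checking every item instead of only the first one means a line with two or more spaces and nothing else is dropped as well.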

ponchofiesta commented 5 years ago

In my code it would be OK. My ALTO reader skips SP too. In my workflow the OCR data is written to PDFs with the scanned images. So I don't need whitespaces. Not sure if someone would need those data. But I don't want to modify our "raw" XMLs. If it throws exceptions, nobody needs it :)

Mewel commented 5 years ago

I hope not :). Should be fixed now.

ponchofiesta commented 5 years ago

I'll test it on Monday. Thank you!

ponchofiesta commented 5 years ago

I tested it with one problematic file and it worked.