PRImA-Research-Lab / prima-page-converter

Command line tool to convert page layout files to the latest PAGE XML format. It supports all previous versions of the PAGE format as well as ALTO XML, FineReader XML, and hOCR.
Apache License 2.0

java.lang.NullPointerException with negative coordinates #15

Open stweil opened 3 years ago

stweil commented 3 years ago

PageConverter crashes when given a negative coordinate even with -neg-coords toZero:

java -jar JPageConverter/PageConverter.jar -source-xml in.xml -target-xml out.xml -convert-to ALTO -neg-coords toZero
Exception in thread "main" java.lang.NullPointerException
    at org.primaresearch.dla.page.converter.PageConverter.handleNegativeCoordinates(PageConverter.java:389)
    at org.primaresearch.dla.page.converter.PageConverter.run(PageConverter.java:216)
    at org.primaresearch.dla.page.converter.PageConverter.main(PageConverter.java:130)

The same exception also occurs with PAGE XML input which has no TextRegion but an empty ReadingOrder. That is not valid PAGE XML, but could perhaps be tolerated, too.
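For reference, the behaviour expected from `-neg-coords toZero` can be sketched as follows. This is only an illustration, assuming that "toZero" means every negative x or y value is replaced by 0 and that coordinates are given as a PAGE-style points string (`x1,y1 x2,y2 ...`). The class and method names are hypothetical and are not taken from the converter's source.

```java
import java.util.StringJoiner;

public class NegativeCoordsSketch {

    /**
     * Clamps negative values in a points string ("x1,y1 x2,y2 ...") to zero.
     * Hypothetical helper, not part of PageConverter.
     */
    public static String clampToZero(String points) {
        StringJoiner out = new StringJoiner(" ");
        for (String pair : points.trim().split("\\s+")) {
            String[] xy = pair.split(",");
            // Math.max leaves non-negative values untouched
            int x = Math.max(0, Integer.parseInt(xy[0].trim()));
            int y = Math.max(0, Integer.parseInt(xy[1].trim()));
            out.add(x + "," + y);
        }
        return out.toString();
    }

    public static void main(String[] args) {
        // prints "0,10 20,10 20,0 0,30"
        System.out.println(clampToZero("-5,10 20,10 20,-3 -5,30"));
    }
}
```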

chris1010010 commented 3 years ago

Hi, Thanks for pointing this out. Do you have example files handy?

stweil commented 3 years ago

Sure, here is an example: https://ub-backup.bib.uni-mannheim.de/~stweil/prima-page-converter-issue-15/.

I just added a minus to one of the coordinates to make the conversion fail, even with the latest release.

chris1010010 commented 3 years ago

This is because the converter doesn't load invalid XMLs. The exception is thrown because the page object is null.
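If the page object can be null after a failed load, a guard before the coordinate handling would turn the bare NullPointerException into a readable error. The sketch below assumes a simplified, hypothetical `Page` class and method signature; it does not reproduce the actual code at `PageConverter.handleNegativeCoordinates`.

```java
import java.util.Arrays;

public class NullGuardSketch {

    /** Hypothetical stand-in for the loaded page object. */
    public static class Page {
        public final int[] coords;
        public Page(int[] coords) { this.coords = coords; }
    }

    /**
     * Fails with a descriptive message when loading produced no page,
     * otherwise sets negative coordinates to zero.
     */
    public static void handleNegativeCoordinates(Page page, String sourceName) {
        if (page == null) {
            throw new IllegalStateException(
                "Could not load page from '" + sourceName + "'; nothing to convert.");
        }
        for (int i = 0; i < page.coords.length; i++) {
            if (page.coords[i] < 0) {
                page.coords[i] = 0;
            }
        }
    }

    public static void main(String[] args) {
        Page ok = new Page(new int[] {-5, 10, 20, -3});
        handleNegativeCoordinates(ok, "in.xml");
        System.out.println(Arrays.toString(ok.coords));

        try {
            handleNegativeCoordinates(null, "broken.xml");
        } catch (IllegalStateException e) {
            System.out.println(e.getMessage());
        }
    }
}
```

With a check like this, the user would see which input failed to load instead of a stack trace.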

kba commented 3 years ago

"This is because the converter doesn't load invalid XMLs"

The samples @stweil posted are valid PAGE 2019.

stweil commented 3 years ago

Yes, but I explained above how to make them invalid by adding a minus sign, which triggers the crash. We had negative coordinates in earlier releases of OCR-D.

kba commented 3 years ago

I mean that https://ub-backup.bib.uni-mannheim.de/~stweil/prima-page-converter-issue-15/FILE_0006_OCR-D-OCR-TESS-bad.xml does have a negative coordinate in region region0003_line0001_word0003 but is still valid according to the schema, so "the converter doesn't load invalid XMLs" does not seem to answer the question.
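To check independently which elements carry a negative coordinate, something like the following sketch could be used. It relies only on the JDK's DOM parser and assumes that coordinates live in a `points` attribute of `Coords` elements and that the parent element has an `id`. The embedded sample document is a made-up fragment for illustration, not the file linked above, and the class name is hypothetical.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;

public class FindNegativeCoords {

    /** Returns the id of every element whose Coords child has a negative value in its points attribute. */
    public static List<String> idsWithNegativeCoords(String xml) throws Exception {
        DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
        factory.setNamespaceAware(true);
        Document doc = factory.newDocumentBuilder()
                .parse(new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)));

        // "*" matches Coords elements in any namespace
        NodeList coords = doc.getElementsByTagNameNS("*", "Coords");
        List<String> ids = new ArrayList<>();
        for (int i = 0; i < coords.getLength(); i++) {
            Element c = (Element) coords.item(i);
            Node parent = c.getParentNode();
            // a minus sign anywhere in the points string marks a negative value
            if (c.getAttribute("points").contains("-") && parent instanceof Element) {
                ids.add(((Element) parent).getAttribute("id"));
            }
        }
        return ids;
    }

    public static void main(String[] args) throws Exception {
        // Made-up fragment, not a complete PAGE document
        String xml = "<PcGts xmlns=\"http://schema.primaresearch.org/PAGE/gts/pagecontent/2019-07-15\">"
                + "<Page><TextRegion id=\"r1\"><Coords points=\"0,0 10,0 10,10\"/>"
                + "<Word id=\"w1\"><Coords points=\"-2,5 8,5 8,9\"/></Word>"
                + "</TextRegion></Page></PcGts>";
        System.out.println(idsWithNegativeCoords(xml));
    }
}
```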

jbarth-ubhd commented 2 years ago

Here is an example created with the ABBYY FineReader SDK that also triggers the NullPointerException:

https://digi.ub.uni-heidelberg.de/diglitData/v/justinian1627bd1_-_0009.abbyy.xml

> java -jar ~/ocr-fileformat/vendor/JPageConverter/PageConverter.jar -source-xml 0009.line.xml -target-xml 0009.page.xml -neg-coords toZero
Exception in thread "main" java.lang.NullPointerException
    at org.primaresearch.dla.page.converter.PageConverter.handleNegativeCoordinates(PageConverter.java:449)
    at org.primaresearch.dla.page.converter.PageConverter.run(PageConverter.java:266)
    at org.primaresearch.dla.page.converter.PageConverter.main(PageConverter.java:161)

jbarth-ubhd commented 2 years ago

Here is a minimalistic, text-only tool of mine: https://gist.github.com/jbarth-ubhd/4826031b9de3b9c394be0da40bee14b6