UB-Mannheim / ocr-fileformat

Validate and transform various OCR file formats (hOCR, ALTO, PAGE, FineReader)
https://digi.bib.uni-mannheim.de/ocr-fileformat/
MIT License

Fix conversion from ALTO to PAGE and vice versa #106

Closed stweil closed 4 years ago

stweil commented 4 years ago

The prima-page-converter does not format its output file. Should we add a post-processing step that formats it nicely? Otherwise the web interface shows only a single (very long) line.

zuphilip commented 4 years ago

Yes, pretty print would be good :+1: Can we use Saxon for that? Would it maybe even make sense to have a CLI option for that in ocr-transform?

I tried some examples in the Web GUI and see an error for alto__page transformation with https://rawgit.com/kba/ocr-fileformat-samples/master/samples/alto/2.0/wetzel_reisebegleiter_1901_0021.alto as well as with http://chroniclingamerica.loc.gov/lccn/sn86069133/1910-10-31/ed-1/seq-1/ocr.xml . But the third example works fine. Are these upstream problems?

zuphilip commented 4 years ago

There seems to be a side effect in the abbyy2hocr transformation. In the Web GUI this transformation outputs nothing with the input https://digi.bib.uni-mannheim.de/~stweil/ocr-praxis/Testseiten/abbyy/417576986_0031.xml . However, it works in the docker run on the master branch. (Sorry for the previous message, which was unrelated to this issue.)

stweil commented 4 years ago

There seems to be a side effect in the abbyy2hocr transformation.

That's strange because ABBYY to hOCR does not use a transformer script (and this PR does not change anything else).

zuphilip commented 4 years ago

Okay, I did another test with this branch here and can confirm that abbyy2hocr works fine in my docker container. There seems to be a problem with the instance on digi.

stweil commented 4 years ago

Yes, pretty print would be good. Can we use Saxon for that? Would it maybe even make sense to have a CLI option for that in ocr-transform?

There are many ways to implement pretty printing. I added a commit which uses Saxon, so the usual command line argument can be used to enable it (currently only implemented for output to STDOUT). The web interface now uses pretty printing by default for all PAGE-related conversions.
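For readers who want to see what such a post-processing step does, here is a minimal sketch in Python. It is not the Saxon-based implementation from the commit; it swaps in the standard library's `xml.dom.minidom`, and the function name `pretty_print_xml` and the sample PAGE-like snippet are made up for illustration.

```python
from xml.dom import minidom


def pretty_print_xml(xml_text: str, indent: str = "  ") -> str:
    """Re-indent an XML document that arrives as one long line.

    Hypothetical helper for illustration only; the actual commit
    relies on Saxon rather than on Python.
    """
    document = minidom.parseString(xml_text)
    pretty = document.toprettyxml(indent=indent)
    # Drop blank lines that appear when the input already had line breaks.
    return "\n".join(line for line in pretty.splitlines() if line.strip())


# Made-up, heavily simplified PAGE-like input on a single line.
sample = (
    '<PcGts><Page imageFilename="p.png"><TextRegion id="r1">'
    "<TextEquiv><Unicode>Hello</Unicode></TextEquiv>"
    "</TextRegion></Page></PcGts>"
)

print(pretty_print_xml(sample))
```

Each nested element ends up on its own indented line, while an element that holds only text stays on a single line, which is roughly what the web interface needs to display the result readably.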

zuphilip commented 4 years ago

Is this ready to merge? It looks good to me!

Do you agree that the problems with the alto files I described above are upstream problems?

stweil commented 4 years ago

There seems to be a side effect in the abbyy2hocr transformation. In the Web GUI this transformation output nothing with the input https://digi.bib.uni-mannheim.de/~stweil/ocr-praxis/Testseiten/abbyy/417576986_0031.xml .

Commit https://github.com/OCR-D/format-converters/commit/5b9568fd2b6dbfe891ef81826b7fffea7d21d814 was missing in our installation (fixed now). I noticed that just running make all or make install does not update any of the existing cloned code from external git repositories. I think we should start using git submodules to get explicit dependencies.
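To illustrate what "explicit dependencies" could look like in practice, here is a rough Python sketch that compares the commit each cloned repository is on with a pinned commit, which is the information a git submodule would record. The directory name `vendor/format-converters`, the `PINNED` table, and both function names are assumptions for illustration and are not part of ocr-fileformat; only the commit hash is taken from the comment above.

```python
import subprocess
from typing import Dict, List


def current_commit(checkout_dir: str) -> str:
    """Return the commit hash that HEAD points to in checkout_dir."""
    result = subprocess.run(
        ["git", "-C", checkout_dir, "rev-parse", "HEAD"],
        check=True,
        capture_output=True,
        text=True,
    )
    return result.stdout.strip()


def stale_checkouts(pinned: Dict[str, str], actual: Dict[str, str]) -> List[str]:
    """List the checkouts whose actual commit differs from the pinned one."""
    return sorted(
        name for name, commit in pinned.items() if actual.get(name) != commit
    )


# Hypothetical pin table; the path is an assumed location of the clone.
PINNED = {
    "vendor/format-converters": "5b9568fd2b6dbfe891ef81826b7fffea7d21d814",
}
```

A caller could build the `actual` mapping with `{name: current_commit(name) for name in PINNED}` and fail the build when `stale_checkouts(PINNED, actual)` is not empty. Git submodules make this check unnecessary, because the pinned commit is stored in the parent repository itself.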

stweil commented 4 years ago

I tried some examples in the Web GUI and see an error for alto__page transformation with https://rawgit.com/kba/ocr-fileformat-samples/master/samples/alto/2.0/wetzel_reisebegleiter_1901_0021.alto as well as with http://chroniclingamerica.loc.gov/lccn/sn86069133/1910-10-31/ed-1/seq-1/ocr.xml.

The first one throws a Java NullPointerException; the second one looks like a vendor-specific variant of ALTO. So neither problem is caused by the ocr-fileformat code.

zuphilip commented 4 years ago

The first one throws a Java nullpointer exception, the second one looks like a vendor specific variant of ALTO. So both problems are not caused by the ocr-fileformat code.

Okay, we should report them upstream so that they will hopefully be fixed there in the future.

Moreover, I created an issue about a better update mechanism.

So, let me ask again: Is this ready to merge? It looks good to me! :+1:

zuphilip commented 4 years ago

Thank you @stweil for all the work on this! :bowing_man: