Closed stweil closed 4 years ago
The prima-page-converter does not format the output file. Should we add a post-processing step that formats it nicely? Otherwise the web interface only shows a single (very lengthy) line.
Yes, pretty print would be good :+1: Can we use Saxon for that? Would it maybe even make sense to have a CLI option for that in ocr-transform?
I tried some examples in the Web GUI and see an error for alto__page transformation with https://rawgit.com/kba/ocr-fileformat-samples/master/samples/alto/2.0/wetzel_reisebegleiter_1901_0021.alto as well as with http://chroniclingamerica.loc.gov/lccn/sn86069133/1910-10-31/ed-1/seq-1/ocr.xml . But the third example works fine. Are these upstream problems?
There seems to be a side effect in the abbyy2hocr transformation. In the Web GUI this transformation outputs nothing with the input https://digi.bib.uni-mannheim.de/~stweil/ocr-praxis/Testseiten/abbyy/417576986_0031.xml . However, it works in the docker run on the master branch. (Sorry for the previous message, which was unrelated to this issue.)
> There seems to be a side effect in the abbyy2hocr transformation.
That's strange because ABBYY to hOCR does not use a transformer script (and this PR does not change anything else).
Okay, I did another test with this branch here and can confirm that abbyy2hocr works fine in my docker container. There seems to be a problem with the instance on digi.
> Yes, pretty print would be good. Can we use Saxon for that? Would it maybe even make sense to have a CLI option for that in ocr-transform?
There are many ways to implement pretty printing. I added a commit which uses Saxon, so the usual command line argument can be used to enable it (currently only implemented for output to STDOUT). The web interface now uses pretty printing for all PAGE related conversions by default.
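For readers who want to see what the pretty-printing step does to a single-line file, here is a minimal sketch. It does not use Saxon; it swaps in Python's standard `xml.dom.minidom` purely for illustration, and the function name `pretty_print_xml` and the sample element names are assumptions, not part of ocr-fileformat.

```python
import xml.dom.minidom


def pretty_print_xml(xml_text: str, indent: str = "  ") -> str:
    """Return an indented copy of an XML document (hypothetical helper)."""
    dom = xml.dom.minidom.parseString(xml_text)
    pretty = dom.toprettyxml(indent=indent)
    # toprettyxml can emit whitespace-only lines if the input already
    # contained line breaks between elements; drop those lines.
    return "\n".join(line for line in pretty.splitlines() if line.strip())


# A made-up single-line document standing in for converter output.
single_line = '<PcGts><Page imageFilename="a.tif"><TextRegion id="r1"/></Page></PcGts>'
print(pretty_print_xml(single_line))
```

Note that re-indenting can change whitespace inside mixed-content elements, so a real implementation should be checked against actual PAGE files.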
Is this ready to merge? It looks good to me!
Do you agree that the problems with the alto files I described above are upstream problems?
> There seems to be a side effect in the abbyy2hocr transformation. In the Web GUI this transformation outputs nothing with the input https://digi.bib.uni-mannheim.de/~stweil/ocr-praxis/Testseiten/abbyy/417576986_0031.xml .
Commit https://github.com/OCR-D/format-converters/commit/5b9568fd2b6dbfe891ef81826b7fffea7d21d814 was missing in our installation (fixed now). I noticed that just running `make all` or `make install` does not update any of the existing cloned code from external git repositories. I think we should start using git submodules to get explicit dependencies.
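To illustrate the gap described here, the following sketch shows a clone-or-update decision for one external repository. It is not how the Makefile works and it is not the git submodule approach proposed above; it is a simpler stand-in, and the function name `sync_command`, the URL, and the `vendor/dependency` path are all hypothetical.

```python
from pathlib import Path


def sync_command(repo_url: str, target: Path) -> list[str]:
    """Build the git command that brings `target` up to date (hypothetical helper).

    An existing clone is fast-forwarded; a missing one is cloned.
    """
    if (target / ".git").exists():
        # Existing checkout: a clone-only build step would skip this case.
        return ["git", "-C", str(target), "pull", "--ff-only"]
    return ["git", "clone", repo_url, str(target)]


# Placeholder URL and path; only the command is printed, nothing is executed.
print(sync_command("https://example.org/some/dependency.git", Path("vendor/dependency")))
```

Submodules would additionally pin each dependency to an exact commit, which this sketch does not do.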
> I tried some examples in the Web GUI and see an error for alto__page transformation with https://rawgit.com/kba/ocr-fileformat-samples/master/samples/alto/2.0/wetzel_reisebegleiter_1901_0021.alto as well as with http://chroniclingamerica.loc.gov/lccn/sn86069133/1910-10-31/ed-1/seq-1/ocr.xml.
The first one throws a Java NullPointerException, the second one looks like a vendor-specific variant of ALTO. So neither problem is caused by the ocr-fileformat code.
> The first one throws a Java NullPointerException, the second one looks like a vendor-specific variant of ALTO. So neither problem is caused by the ocr-fileformat code.
Okay, we should report them upstream so that they will hopefully be fixed there in the future.
Moreover, I created an issue about a better update mechanism.
So, let me ask again: Is this ready to merge? It looks good to me! :+1:
Thank you @stweil for all the work on this! :bowing_man:
`-convert-to ALTO` argument needed for conversion from PAGE to ALTO