UB-Mannheim / ocr-fileformat

Validate and transform various OCR file formats (hOCR, ALTO, PAGE, FineReader)
https://digi.bib.uni-mannheim.de/ocr-fileformat/
MIT License
176 stars, 23 forks

Integrate PRIMA Labs PageConverter #97

Closed: kba closed this 4 years ago

kba commented 4 years ago

Integrates https://github.com/PRImA-Research-Lab/prima-page-converter. It currently supports only ALTO -> PAGE conversion but could be extended: the converter also accepts Google Cloud Vision, hOCR, older PAGE versions and FRXML.

@wrznr @maxnth @chreul

zuphilip commented 4 years ago

Can we use https://github.com/PRImA-Research-Lab/PAGE-XML/blob/master/documentation/example/SimplePage.xml as the example mentioned on the website? Does it work fine, or would it be better to take another one?

kba commented 4 years ago

Travis is still broken, but the setup is right. Can somebody debug this?

Processing triggers for libc-bin (2.19-0ubuntu6.13) ...
tesseract -l Fraktur wetzel_reisebegleiter_1901_0021_800px.jpg stdout hocr | xmllint --format - > wetzel_reisebegleiter_1901_0021.hocr
../bin/ocr-transform.sh hocr alto2.0 wetzel_reisebegleiter_1901_0021.hocr | xmllint --format - > wetzel_reisebegleiter_1901_0021.alto
Stylesheet file /home/travis/build/UB-Mannheim/ocr-fileformat/xslt/hocr__alto2.0.xsl does not exist
-:1: parser error : Document is empty
make: *** [wetzel_reisebegleiter_1901_0021.alto] Error 1
The command "cd example && make deps roundtrip diff" exited with 2.

https://travis-ci.org/UB-Mannheim/ocr-fileformat/builds/626339369

zuphilip commented 4 years ago

The error is completely unrelated to your changes here. However, there were some changes in hOCR-to-ALTO today that renamed some of the scripts: https://github.com/filak/hOCR-to-ALTO/commit/5122b72ed1c6c9a6a5582a0554e45ddc658b68df . We use the most current version (master branch) of that repo, and therefore Travis is complaining.
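
For illustration, a minimal sketch of a guard that fails early when an expected stylesheet is missing, so the failure is not masked by a later "Document is empty" parser error. The `xslt/<from>__<to>.xsl` layout is inferred from the Travis log above; the function name `find_stylesheet` and the `XSLT_DIR` variable are hypothetical and not part of ocr-fileformat.

```shell
#!/bin/sh
# Hypothetical helper: resolve the stylesheet for a FROM -> TO conversion.
# Assumption: stylesheets are named <from>__<to>.xsl, as in the Travis log.
find_stylesheet() {
    from=$1
    to=$2
    dir=${XSLT_DIR:-xslt}            # illustrative variable, defaults to ./xslt
    sheet="$dir/${from}__${to}.xsl"
    if [ ! -f "$sheet" ]; then
        # Report the missing file on stderr and signal failure to the caller.
        echo "No stylesheet for $from -> $to (expected $sheet)" >&2
        return 1
    fi
    printf '%s\n' "$sheet"
}
```

A caller could then stop before piping anything into `xmllint`, for example with `sheet=$(find_stylesheet hocr alto2.0) || exit 1`.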

zuphilip commented 4 years ago

I am not sure whether this is the most elegant fix, but Travis is now happy again.

kba commented 4 years ago

Looks good, thanks for fixing @zuphilip

We could add another symlink, page__page2019, to upgrade PAGE files; otherwise, I think this is ready to merge.
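
As a rough sketch of why a symlink can be enough to register a conversion: a wrapper script can derive the source and target formats from the name it was invoked under. This is an assumption about the dispatch mechanism, not a copy of the project's code; `convert.sh` and the temporary demo directory are hypothetical.

```shell
#!/bin/sh
# Sketch only: a converter that reads FROM and TO from its invocation name.
# convert.sh and demo_dir are hypothetical, not files from ocr-fileformat.
demo_dir=$(mktemp -d)

cat > "$demo_dir/convert.sh" <<'EOF'
#!/bin/sh
name=$(basename "$0")   # e.g. page__page2019
from=${name%%__*}       # text before the "__" separator
to=${name#*__}          # text after the "__" separator
echo "would convert $from -> $to"
EOF
chmod +x "$demo_dir/convert.sh"

# Registering the new conversion is a single symlink to the same script.
ln -s convert.sh "$demo_dir/page__page2019"

# Invoking the symlink makes the script see "page__page2019" as its name.
"$demo_dir/page__page2019"
```

With this layout, supporting a further target version would only need one more symlink, provided the underlying converter accepts that target.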

zuphilip commented 4 years ago

LGTM. @kba Let me know when this is ready from your side.

kba commented 4 years ago

> We could add another symlink page__page2019

Done. I think this can be merged. :shipit:

stweil commented 4 years ago

> LGTM. @kba Let me know when this is ready from your side.

@kba, is it ready now?

stweil commented 4 years ago

Thank you, @kba.

zuphilip commented 4 years ago

Thank you very much @kba !