UB-Mannheim / ocr-fileformat

Validate and transform various OCR file formats (hOCR, ALTO, PAGE, FineReader)
https://digi.bib.uni-mannheim.de/ocr-fileformat/
MIT License

Google Cloud Vision to PAGE-XML #125

Open kba opened 4 years ago

kba commented 4 years ago

It was mentioned before but @cneud just reminded me of https://github.com/PRImA-Research-Lab/cloud-vision-ocr-to-page . Should not be too hard to integrate and would allow using GCV results in OCR-D/Transkribus/OCR4all.

BTW: Does anyone have experience with the Azure Computer Vision API in the context of OCR? As a sign of goodwill in times of Covid-19, they are currently offering a generous free tier including access to the vision API. It would be interesting to compare.

bertsky commented 1 year ago

BTW the existing integration of GCV as part of the PRImA converter (transform gcv page, which links to alto__page) is broken: it delegates to java -jar PageConverter.jar -source-xml $INFILE instead of java -jar PageConverter.jar -source-json $INFILE:

https://github.com/UB-Mannheim/ocr-fileformat/blob/8878b8aaed919f500e7ad0d33e881c9d872c4fb6/script/transform/alto__page#L19
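To illustrate the bug described above, here is a minimal Python sketch of choosing the source flag by input format. It is not the repository's actual transform script (which is a shell script); the function name, the `SOURCE_FLAGS` mapping, and the default jar path are hypothetical. Only the flags `-source-xml`, `-source-json` and `-target-xml` are taken from the commands quoted in this thread.

```python
from typing import List

# Hypothetical mapping: formats whose input is JSON rather than XML.
# Anything not listed here falls back to -source-xml.
SOURCE_FLAGS = {"gcv": "-source-json"}


def build_pageconverter_cmd(
    infile: str,
    outfile: str,
    input_format: str,
    jar: str = "PageConverter.jar",  # assumed location of the PRImA jar
) -> List[str]:
    """Build the argument list for a PageConverter call.

    The point is only that GCV input must be passed with -source-json,
    while XML-based formats keep using -source-xml.
    """
    source_flag = SOURCE_FLAGS.get(input_format, "-source-xml")
    return ["java", "-jar", jar, source_flag, infile, "-target-xml", outfile]
```

The returned list could be passed to `subprocess.run(cmd, check=True)`; keeping the arguments as a list avoids shell quoting problems with file names.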

stweil commented 1 year ago

Thanks. So it was broken right from the beginning (commit 73328691c466057566db62d8cdbea8b26823bdbb).

bertsky commented 1 year ago

> So it was broken right from the beginning (commit 7332869).

I'm not sure. Perhaps the PRImA converter was capable of detecting the format automatically before. But it does not look like it.

Anyway, here is a fix: https://github.com/UB-Mannheim/ocr-fileformat/pull/156

stweil commented 1 year ago

I tried it with fixed arguments, and it fails:

java -jar vendor/JPageConverter/PageConverter.jar -neg-coords toZero -source-json 1850-Baptis-EMU-0204.txt -target-xml 1850-Baptis-EMU-0204.xml -convert-to LATEST
null
Exception in thread "main" java.lang.NullPointerException: Cannot invoke "org.primaresearch.dla.page.Page.getLayout()" because "page" is null
    at org.primaresearch.dla.page.converter.PageConverter.handleNegativeCoordinates(PageConverter.java:449)
    at org.primaresearch.dla.page.converter.PageConverter.run(PageConverter.java:266)
    at org.primaresearch.dla.page.converter.PageConverter.main(PageConverter.java:161)

bertsky commented 1 year ago

> I tried it with fixed arguments, and it fails:

I know. That's because in this example, the input data is incomplete. See here
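What exactly was missing from the example input is not stated here, so the following is only a sketch of a pre-flight check one might run before handing a GCV JSON file to the converter. It assumes the usual Google Cloud Vision response layout (an optional `responses` wrapper, then `fullTextAnnotation` with `pages`, each page carrying `width`, `height` and `blocks`). Whether these are the fields the PRImA converter actually requires is an assumption, and `find_gcv_problems` is a hypothetical helper.

```python
from typing import Any, Dict, List


def find_gcv_problems(doc: Dict[str, Any]) -> List[str]:
    """Return human-readable problems; an empty list means all checks passed."""
    # Batch results wrap the per-image response in a "responses" list.
    responses = doc.get("responses")
    response = responses[0] if isinstance(responses, list) and responses else doc

    full = response.get("fullTextAnnotation")
    if not isinstance(full, dict):
        return ["missing fullTextAnnotation"]

    pages = full.get("pages")
    if not pages:
        return ["fullTextAnnotation has no pages"]

    problems: List[str] = []
    for i, page in enumerate(pages):
        if not page.get("width") or not page.get("height"):
            problems.append(f"page {i} lacks width/height")
        if not page.get("blocks"):
            problems.append(f"page {i} has no blocks")
    return problems
```

Calling it on the result of `json.load` and aborting on a non-empty list would give a clearer message than a NullPointerException from inside the converter.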

bertsky commented 1 year ago

Since #156 we do have a working GCV converter here based on https://github.com/PRImA-Research-Lab/prima-page-converter, so there is no actual need for https://github.com/PRImA-Research-Lab/cloud-vision-ocr-to-page.

Comparing both implementations, IIUC we have:

| implementation | cloud-vision-ocr-to-page | prima-page-converter with json input |
| --- | --- | --- |
| external dependencies | GCV (Java API) | none (standalone) |
| usage | online (network API) | offline (JSON) |
| can also output ALTO | no | yes |
| yields @imageFilename | yes | no |
| yields width and height | yes | yes |
| coordinates | bbox | bbox |
| paragraphs | recursive TextRegion | recursive TextRegion |
| other region types | Image+Separator+Graphic+Table | Image+Separator+Graphic+Table |
| aggregate words to lines | yes | yes |
| confidence | yes | no |

kba commented 1 year ago

Thanks for the comparison, very helpful.

> | implementation | cloud-vision-ocr-to-page | prima-page-converter with json input |
> | --- | --- | --- |
> | external dependencies | GCV (Java API) | none (standalone) |
> | usage | online (network API) | offline (JSON) |

IMHO these are the strongest reasons against the cloud-vision-ocr-to-page approach.

It's unfortunate, though, that the confidences aren't serialized the way gcv2hocr does with x_wconf for hOCR. But with development largely stalled, there is not much we can do except rewrite it ourselves.
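For reference, a rough Python sketch of what carrying the confidences over could look like. It assumes the standard GCV hierarchy (`fullTextAnnotation` → `pages` → `blocks` → `paragraphs` → `words` → `symbols`) with an optional `confidence` between 0 and 1 on each word, and it scales that to the 0-100 range commonly used for hOCR `x_wconf`. This is not how gcv2hocr or the PRImA converter is implemented; both function names are hypothetical.

```python
from typing import Any, Dict, Iterator, Optional, Tuple


def iter_word_confidences(
    response: Dict[str, Any],
) -> Iterator[Tuple[str, Optional[float]]]:
    """Yield (word text, confidence or None) for every word in a GCV response."""
    full = response.get("fullTextAnnotation", {})
    for page in full.get("pages", []):
        for block in page.get("blocks", []):
            for paragraph in block.get("paragraphs", []):
                for word in paragraph.get("words", []):
                    # GCV stores the text per symbol; join symbols into a word.
                    text = "".join(s.get("text", "") for s in word.get("symbols", []))
                    yield text, word.get("confidence")


def hocr_wconf(confidence: Optional[float]) -> str:
    """Format a 0..1 confidence as an hOCR title fragment, or '' if absent."""
    if confidence is None:
        return ""
    return f"x_wconf {round(confidence * 100)}"
```

The same iteration could feed a confidence attribute on PAGE word elements instead of hOCR; only the final formatting step would change.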

bertsky commented 1 year ago

> It's unfortunate, though, that the confidences aren't serialized the way gcv2hocr does with x_wconf for hOCR. But with development largely stalled, there is not much we can do except rewrite it ourselves.

We can (fix ourselves and) ship our own builds. I have successfully set up Eclipse and can compile most of the modules (e.g. libs, PageViewer, PageConverter).

(I have done that with PageViewer including validator error messages.)