Open kba opened 4 years ago
BTW the existing integration of GCV as part of the PRImA converter (transform gcv page linking to alto page) is broken: it delegates to java -jar PageConverter.jar -source-xml $INFILE instead of java -jar PageConverter.jar -source-json $INFILE.
Thanks. So it was broken right from the beginning (commit 73328691c466057566db62d8cdbea8b26823bdbb).
> So it was broken right from the beginning (commit 7332869).
I'm not sure. Perhaps the PRImA converter was capable of detecting the format automatically before. But it does not look like it.
Anyway, here is a fix: https://github.com/UB-Mannheim/ocr-fileformat/pull/156
I tried it with fixed arguments, and it fails:
java -jar vendor/JPageConverter/PageConverter.jar -neg-coords toZero -source-json 1850-Baptis-EMU-0204.txt -target-xml 1850-Baptis-EMU-0204.xml -convert-to LATEST
null
Exception in thread "main" java.lang.NullPointerException: Cannot invoke "org.primaresearch.dla.page.Page.getLayout()" because "page" is null
at org.primaresearch.dla.page.converter.PageConverter.handleNegativeCoordinates(PageConverter.java:449)
at org.primaresearch.dla.page.converter.PageConverter.run(PageConverter.java:266)
at org.primaresearch.dla.page.converter.PageConverter.main(PageConverter.java:161)
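To make the flag mix-up concrete, here is a minimal Python sketch of a wrapper that picks -source-json or -source-xml by looking at the input file. The helper name `build_converter_command` and the content-sniffing rule are assumptions for illustration, not part of ocr-fileformat; the flags and the jar path are copied from the command line above.

```python
from pathlib import Path


def build_converter_command(infile, outfile,
                            jar="vendor/JPageConverter/PageConverter.jar"):
    """Build the PageConverter argv, choosing the source flag from the file content.

    Hypothetical helper: only the flags themselves come from the thread.
    """
    # utf-8-sig drops a leading byte order mark, if there is one
    text = Path(infile).read_text(encoding="utf-8-sig").lstrip()
    # Assumption: GCV output is a JSON document, anything else is treated as XML
    source_flag = "-source-json" if text.startswith(("{", "[")) else "-source-xml"
    return [
        "java", "-jar", jar,
        "-neg-coords", "toZero",
        source_flag, str(infile),
        "-target-xml", str(outfile),
        "-convert-to", "LATEST",
    ]
```

The returned list can be passed to `subprocess.run(cmd, check=True)`. Note that the NullPointerException above was followed by the JVM's usual non-zero exit, so `check=True` would surface that kind of failure as an exception.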
> I tried it with fixed arguments, and it fails:
I know. That's because in this example, the input data is incomplete. See here
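The thread does not say which part of the input was missing. As a sketch only, assuming the converter needs the `fullTextAnnotation` object with at least one page (as found in GCV document text detection responses), a pre-flight check in Python could look like this. The function names are hypothetical.

```python
import json


def find_full_text_annotation(data):
    """Return the fullTextAnnotation dict from parsed GCV JSON, or None."""
    if not isinstance(data, dict):
        return None
    if isinstance(data.get("fullTextAnnotation"), dict):
        return data["fullTextAnnotation"]
    # Batch responses wrap the per-image results in a "responses" list
    for response in data.get("responses", []):
        if isinstance(response, dict) and isinstance(
                response.get("fullTextAnnotation"), dict):
            return response["fullTextAnnotation"]
    return None


def looks_complete(data):
    """Assumption: input is usable only if it has at least one page."""
    annotation = find_full_text_annotation(data)
    return bool(annotation and annotation.get("pages"))


def file_looks_complete(path):
    """Load a GCV JSON file and apply the check above."""
    with open(path, encoding="utf-8") as fh:
        return looks_complete(json.load(fh))
```

Running such a check before calling the converter would turn the NullPointerException into a readable error message.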
Since #156 we do have a working GCV converter here based on https://github.com/PRImA-Research-Lab/prima-page-converter, so there is no actual need for https://github.com/PRImA-Research-Lab/cloud-vision-ocr-to-page.
Comparing both implementations, IIUC we have:
implementation | cloud-vision-ocr-to-page | prima-page-converter with json input |
external dependencies | GCV (Java API) | none (standalone) |
usage | online (network API) | offline (JSON) |
can also output ALTO | no | yes |
yields @imageFilename | yes | no |
yields width and height | yes | yes |
coordinates | bbox | bbox |
paragraphs | recursive TextRegion | recursive TextRegion |
other region types | Image+Separator+Graphic+Table | Image+Separator+Graphic+Table |
aggregate words to lines | yes | yes |
confidence | yes | no |
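Regarding the "coordinates | bbox" row: GCV reports each element as a `boundingBox` with a list of `vertices`, so both converters have to reduce that polygon to a rectangle. A minimal Python sketch of that reduction follows; the function name is hypothetical, and treating a missing `x` or `y` as 0 is an assumption based on GCV's JSON leaving out zero-valued fields.

```python
def vertices_to_bbox(vertices):
    """Reduce GCV boundingBox vertices to (x_min, y_min, x_max, y_max).

    Assumption: a vertex without "x" or "y" means that coordinate is 0.
    """
    xs = [vertex.get("x", 0) for vertex in vertices]
    ys = [vertex.get("y", 0) for vertex in vertices]
    return min(xs), min(ys), max(xs), max(ys)
```

For a rotated or skewed word this rectangle is larger than the original polygon, which is the information lost when only a bbox is written.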
Thanks for the comparison, very helpful.
> implementation | cloud-vision-ocr-to-page | prima-page-converter with json input |
> external dependencies | GCV (Java API) | none (standalone) |
> usage | online (network API) | offline (JSON) |
IMHO these are the strongest reasons against the cloud-vision-ocr-to-page approach.
It's unfortunate that the confidences aren't serialized, the way gcv2hocr does with x_wconf for hOCR. But with development largely stalled, there is not much we can do except rewrite it ourselves.
> It's unfortunate that the confidences aren't serialized, like gcv2hocr does with x_wconf for hOCR though, but with development largely stalled, nothing much we can do except rewrite ourselves.
We can (fix ourselves and) ship our own builds. I have successfully set up Eclipse and can compile most of the modules (e.g. libs, PageViewer, PageConverter).
(I have done that with PageViewer including validator error messages.)
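For anyone attempting such a fix, the word confidences discussed above are available in the GCV JSON itself. The following Python sketch walks the `fullTextAnnotation` hierarchy (pages, blocks, paragraphs, words, symbols) and yields each word with its confidence. The function names are hypothetical, and the scaling to a 0-100 value in the style of hOCR's x_wconf is an assumption about the target format, not a description of what gcv2hocr does.

```python
def iter_word_confidences(full_text_annotation):
    """Yield (word_text, confidence) pairs from a GCV fullTextAnnotation dict.

    confidence is None when GCV did not report one for the word.
    """
    for page in full_text_annotation.get("pages", []):
        for block in page.get("blocks", []):
            for paragraph in block.get("paragraphs", []):
                for word in paragraph.get("words", []):
                    # A word's text is the concatenation of its symbols
                    text = "".join(symbol.get("text", "")
                                   for symbol in word.get("symbols", []))
                    yield text, word.get("confidence")


def to_percent(confidence):
    """Assumption: map GCV's 0..1 confidence to an integer 0..100."""
    return None if confidence is None else round(confidence * 100)
```

A patched converter would attach these values to the corresponding word elements in the output instead of dropping them.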
It was mentioned before, but @cneud just reminded me of https://github.com/PRImA-Research-Lab/cloud-vision-ocr-to-page. It should not be too hard to integrate and would allow using GCV results in OCR-D/Transkribus/OCR4all.
BTW: Has anyone experience with the Azure Computer Vision API in the context of OCR? As a sign of goodwill in times of Covid-19, they are currently offering a generous free tier including access to the vision API. Would be interesting to compare.