UB-Mannheim / ocr-fileformat

Validate and transform various OCR file formats (hOCR, ALTO, PAGE, FineReader)
https://digi.bib.uni-mannheim.de/ocr-fileformat/
MIT License

"ocr-transform page alto ... ...": losing text #123

Closed jbarth-ubhd closed 1 year ago

jbarth-ubhd commented 4 years ago

Example page generated with OCR-D's ocrd-calamari-recognize: OCR_0007.zip

Running ocr-transform page hocr ... ... && ocr-transform hocr alto2.0 ... ... instead loses the page size.

jbarth-ubhd commented 4 years ago

strace -f shows no open() syscall on any /usr/local/share/ocr-fileformat/xslt/* file.

Instead, it calls execve("/usr/bin/java", ["java", "-jar", "/usr/local/share/ocr-fileformat/vendor/JPageConverter/PageConverter.jar", "-neg-coords", "toZero", "-source-xml", "OCR_0007.xml", "-target-xml", "xxx", "-convert-to", "ALTO"], 0x5614283d4a10 /* 24 vars */) = 0

jbarth-ubhd commented 4 years ago

I've checked the docs of the most recent JPageConverter: -convert-to available versions:

jbarth-ubhd commented 4 years ago

Perhaps duplicate of https://github.com/PRImA-Research-Lab/prima-page-converter/issues/13

kba commented 4 years ago

Perhaps duplicate of PRImA-Research-Lab/prima-page-converter#13

Indeed, PAGE-ALTO conversion requires word segmentation. @maxnth Can you think of any sensible workaround?

jbarth-ubhd commented 4 years ago

Did a quick-and-dirty script: https://gist.github.com/jbarth-ubhd/0e867c20008639145386a7978fdb27a4

kba commented 4 years ago

Great, but maybe we can integrate on-the-fly pseudo-word creation directly into the converter, behind a command-line flag.
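As a rough illustration of what pseudo-word creation could look like, here is a minimal Python sketch. It is not taken from the gist above or from any converter; the function name `pseudo_words` and the proportional-width heuristic are assumptions made purely for illustration. It splits a line's text on spaces and gives each word a horizontal slice of the line's bounding box proportional to its character count.

```python
def pseudo_words(text, x, y, width, height):
    """Return (word, (x, y, w, h)) pairs for a text line.

    Hypothetical heuristic: every character (including spaces) is assumed
    to occupy the same horizontal share of the line's bounding box.
    """
    boxes = []
    if not text:
        return boxes
    char_w = width / len(text)  # assumed uniform width per character
    pos = 0                     # character offset of the current word
    for word in text.split(" "):
        if word:  # skip empty tokens produced by consecutive spaces
            boxes.append((word, (x + pos * char_w, y, len(word) * char_w, height)))
        pos += len(word) + 1    # +1 for the separating space
    return boxes


if __name__ == "__main__":
    for word, box in pseudo_words("Lorem ipsum dolor", 0, 0, 170, 20):
        print(word, box)
```

Boxes produced this way are only approximations; they would be good enough for text extraction, but not for anything that depends on exact word positions.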

maxnth commented 4 years ago

Word-level PAGE XML output for calamari has been planned for some time, but sadly we haven't gotten around to implementing it yet. It's one of my next tasks, though, and will hopefully be included in calamari within the next month. I don't know whether that's too late for this specific case, but maybe knowing that the feature is being worked on helps anyway.

jbarth-ubhd commented 3 years ago

seems not to be fixed in v0.4.0.

kba commented 3 years ago

seems not to be fixed in v0.4.0.

ocrd_calamari is at 1.0.0 and calamari at 1.0.5, but word-level PAGE output is indeed not implemented in calamari yet, AFAICT.

mikegerber commented 3 years ago

ocrd_calamari (but AFAIK not Calamari yet) has been able to produce word- and glyph-level segmentation for about a year now; it just does not do so by default. Sorry I didn't speak up earlier, I just didn't know about this issue.

@jbarth-ubhd You need to set ocrd_calamari's parameter -P textequiv_level word.

Quoting ocrd_calamari's README:

In addition to the line text it may also output word and glyph segmentation including per-glyph confidence values and per-glyph alternative predictions as provided by the Calamari OCR engine, using a textequiv_level of word or glyph. Note that while Calamari does not provide word segmentation, this processor produces word segmentation inferred from text segmentation and the glyph positions. The provided glyph and word segmentation can be used for text extraction and highlighting, but is probably not useful for further image-based processing.

ocrd_calamari does more than Calamari here because we wanted to include Calamari's glyph-level information, i.e. character positions and alternative (less probable) character predictions. Since PAGE XML has a strict line>word>glyph hierarchy, we also needed to include a word segmentation. This word segmentation is inferred from the text, e.g. "Lorem ipsum dolor sit amet" becomes "Lorem| |ipsum| |dolor| |sit| |amet", split strictly on spaces as expected by OCR-D's validation.
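The splitting on spaces described above can be sketched in a few lines of Python. This is only an illustration of the idea, not ocrd_calamari's actual code; the function name `split_words_and_spaces` is made up for this example.

```python
import re


def split_words_and_spaces(line_text):
    """Split a line into word tokens and single-space tokens, keeping both."""
    # The capturing group makes re.split() return the separators as well.
    return [tok for tok in re.split(r"( )", line_text) if tok != ""]


if __name__ == "__main__":
    tokens = split_words_and_spaces("Lorem ipsum dolor sit amet")
    print("|".join(tokens))  # Lorem| |ipsum| |dolor| |sit| |amet
```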

mikegerber commented 3 years ago

Indeed, PAGE-ALTO conversion requires word segmentation.

I wasn't aware of that until now, good to know! And good it's already in ocrd_calamari, albeit originally for an entirely different reason. 😀

mikegerber commented 3 years ago

What prima-page-converter/ocr-fileformat could do, as far as I can tell from this issue: give a user-friendly warning that the PAGE document contains no words and that ALTO conversion is therefore not possible.
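Such a check could look roughly like the following Python sketch. It is not part of ocr-fileformat or prima-page-converter; the function names, the warning text, and the sample document are hypothetical. It relies only on the PAGE element names TextLine and Word, and matches them by local name so that it works across PAGE namespace versions.

```python
import xml.etree.ElementTree as ET


def local_name(tag):
    """Strip the '{namespace}' prefix that ElementTree puts on tags."""
    return tag.rsplit("}", 1)[-1]


def count_lines_and_words(page_xml):
    """Count TextLine and Word elements in a PAGE XML string."""
    root = ET.fromstring(page_xml)
    lines = words = 0
    for elem in root.iter():
        name = local_name(elem.tag)
        if name == "TextLine":
            lines += 1
        elif name == "Word":
            words += 1
    return lines, words


def warn_if_no_words(page_xml):
    """Return a warning message if there are text lines but no words, else None."""
    lines, words = count_lines_and_words(page_xml)
    if lines > 0 and words == 0:
        return ("warning: PAGE document has %d text line(s) but no Word elements; "
                "text would be lost in the ALTO conversion" % lines)
    return None


# Hypothetical minimal PAGE document with line-level text only.
SAMPLE = """<PcGts xmlns="http://schema.primaresearch.org/PAGE/gts/pagecontent/2019-07-15">
  <Page imageFilename="example.png" imageWidth="100" imageHeight="50">
    <TextRegion id="r1">
      <TextLine id="l1">
        <TextEquiv><Unicode>Lorem ipsum</Unicode></TextEquiv>
      </TextLine>
    </TextRegion>
  </Page>
</PcGts>"""

if __name__ == "__main__":
    print(warn_if_no_words(SAMPLE))
```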

bertsky commented 1 year ago

There is no longer any need for any of this, since we have been using https://github.com/kba/page-to-alto for this purpose since https://github.com/UB-Mannheim/ocr-fileformat/pull/134.

I suggest closing (cannot do it myself).