OCR-D / ocrd_calamari

Recognize text using Calamari OCR and the OCR-D framework
Apache License 2.0

Handle empty images gracefully #48

Closed kba closed 3 years ago

kba commented 3 years ago

OK, I've installed OCR-D for the first time; most of it worked out of the box, and I was able to reproduce the problem. Your errors seem to be caused by OCR-D processors, not by Calamari. Somehow the line segmentation produces empty lines or lines that lie outside of text regions. When the empty images are converted to NumPy arrays (by ocrd_calamari, not by Calamari), NumPy throws an uncaught exception. You could fix it by inserting something like `line_image = line_image if all(line_image.size) else [[0]]` before line 77 in ocrd_calamari/recognize.py, but that's only a temporary hack to avoid the error. I'm also not sure whether their workspace.image_from_segment or even the line segmentation processor is supposed to produce empty lines at all, so the real problem may be somewhere deeper in the guts of the OCR-D machinery.

Originally posted by @andbue in https://github.com/Calamari-OCR/calamari/issues/193#issuecomment-746979800
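The guard suggested in the quoted comment can be sketched in isolation. This is only an illustration, not the fix that was eventually merged: it assumes the line image is a PIL-style object whose `size` attribute is a `(width, height)` tuple, and `StubImage`, `is_empty` and `guard_line_image` are hypothetical names introduced here so the snippet runs without OCR-D or Pillow installed.

```python
from dataclasses import dataclass
from typing import Any, Tuple


@dataclass
class StubImage:
    """Hypothetical stand-in for a PIL-style image; only `size` matters here."""
    size: Tuple[int, int]  # (width, height)


def is_empty(image: Any) -> bool:
    """Return True if either dimension of the image is zero."""
    return not all(image.size)


def guard_line_image(image: Any) -> Any:
    """Return the image unchanged, or a 1x1 placeholder if it is empty.

    Mirrors the one-line hack from the comment: the placeholder [[0]]
    is a nested list that can still be converted to a 2-D array.
    """
    return [[0]] if is_empty(image) else image


if __name__ == "__main__":
    for size in [(120, 32), (0, 32), (120, 0)]:
        image = StubImage(size)
        result = guard_line_image(image)
        print(size, "->", "placeholder" if result == [[0]] else "kept")
```

Substituting a placeholder keeps the number of lines stable, so each text line still gets a (probably empty) recognition result. Skipping empty lines entirely would be an alternative design, at the cost of having to handle lines without any result downstream.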

mikegerber commented 3 years ago

Workflow by @jbarth-ubhd that should produce the problem:

image: https://digi.ub.uni-heidelberg.de/diglitData/v/ocrd/hdz1886a_-_248_4.tif

workflow:

ocrd-sbb-binarize -I OCR-D-IMG -O OCR-D-001 -P model $HOME/ocrd_models/sbb/binarization/models
ocrd-cis-ocropy-deskew -I OCR-D-001 -O OCR-D-002
ocrd-sbb-textline-detector -I OCR-D-002 -O OCR-D-003 -P model $HOME/ocrd_models/sbb/textline
ocrd-calamari-recognize -I OCR-D-003 -O OCR-D-OCR -P checkpoint "$HOME/ocrd_models/calamari/calamari_models/gt4histocr/*.ckpt.json"

https://github.com/Calamari-OCR/calamari/issues/193#issuecomment-746798065

mikegerber commented 3 years ago

To reproduce, use this workspace: https://qurator-data.de/~mike.gerber/2021-01%20ocrd_calamari-issue-48/workspace.zip (created with the image and the commands above) and

ocrd-calamari-recognize -I OCR-D-003 -O OCR-D-OCR -P checkpoint ".../path/to/gt4histocr/*.ckpt.json"

mikegerber commented 3 years ago

Alright, the bug is fixed by #49, which has been merged.