OCR-D / ocrd_tesserocr

Run tesseract with the tesserocr bindings with @OCR-D's interfaces
MIT License

Missing text with OCR-D #217

Open jbarth-ubhd opened 2 months ago

jbarth-ubhd commented 2 months ago

After updating to ocrd/all:maximum (2024-07-10 15:00 CEST),

when OCR'ing https://digi.ub.uni-heidelberg.de/diglitData/v/faber1566_-_0075r.tif


with this workflow:

ocrd workspace init
ocrd workspace add -g P_00001 -G OCR-D-IMG -i OCR-D-IMG_00001 -m image/tiff faber1566_-_0075r.tif 
ocrd-olena-binarize -P impl wolf -P k 0.10 -I OCR-D-IMG -O OCR-D-002
ocrd-tesserocr-crop -I OCR-D-002 -O OCR-D-003
ocrd-olena-binarize -P impl wolf -P k 0.10 -I OCR-D-003 -O OCR-D-004
ocrd-cis-ocropy-deskew -P level-of-operation page -I OCR-D-004 -O OCR-D-005
ocrd-tesserocr-recognize -P find_tables true -P segmentation_level region -P textequiv_level word -P model frak2021 -I OCR-D-005 -O OCR-D-OCR

I get this text:

15 1 4 x 8
— — * 5
ö S *
*
*
— * E
—
2

365 achie würſt nu
* 41 vnd

wo voiſ deinc Bict ö ne O odt wo iſ
+ ‚ G5 w

ſte tatt te 400 voꝛ de m hralen wege ge⸗
warnet / vnnd jmmer auff dem en⸗
gen pfade ö da zu einer ſeiten Waſſer /

When running tesseract -l frak2021 OCR-D-005/OCR-D-005_00001.IMG-DESKEW.png pure-tesseract (on the image after deskewing), I get this text:

geäc achie ö 6 wuſt nu
vnd ꝛder C Galle gezelet / vnd
wuſt den n Gorioſen geſang 8
Die feind * aant 0 0 +

And when running tesseract -l frak2021 faber1566_-_0075r.tif pure-tesseract-from-original (on the original image), I get:

Jetzt wirſtu ruhe finden dei⸗
ner Seele. Jetzt wird dir ein Tiſch

gedecket / an welchem CH RIſtus
ſelbs der Haußknecht ſein wil. Der⸗
halben verſuch alles / wag alles / ver⸗
dag nicht / kempff hindurch / Laß die

erſchlagenen hinder dir liegen / denn

hie werden deine ihrenen alle ab⸗

getruͤcknet / deine arbeit belohnet / der
du zuuor / ein weil ein

*
*

vnder die kinder Goͤttes gezelet / vnd

wirſt den Glorioſen geſang ſingen:

Die feind ſind vberwunden. O Hell
wo iſt deine Victori? O Codt wo iſt

dein Stachel / Ey jhr ſeid alle ver⸗

ſchlungen im ſieg / 7.
Vnnd auff daß diß alles alſo an
allen Himliſchen Landferern / er⸗

ſtattet / ſie voꝛ dem breiten wege ge⸗

warnet / vnnd jmmer auff dem en⸗
gen pfade / da zu einer ſeiten Waſſer /

Ketzer vñ Sab
bath der Welt geachtet biſt / wirſt nu

This problem, a lot of text missing in OCR-D, occurs on roughly 70-80% of all pages, depending on the book.

stweil commented 2 months ago

Does this workflow work better: https://github.com/slub/ocrd_manager/blob/main/workflows/ocr-workflow-default.sh?

jbarth-ubhd commented 2 months ago

Yes, a pure Tesseract-only workflow,

ocrd-tesserocr-recognize -P segmentation_level region -P model frak2021 -I OCR-D-IMG -O OCR-D-OCR3

gives

Jetzt wirſtu ruhe finden dei⸗
ner Seele. Jetzt wird dir ein Tiſch

gedecket / an welchem CH RIſtus
ſelbs der Haußknecht ſein wil. Der⸗
halben verſuch alles / wag alles / ver⸗
dag nicht / kempff hindurch / Laß

erſchlagenen hinder dir liegen / denn

hie werden deine ihrenen alle ab⸗

getruͤcknet / deine arbeit belohnet / der
du zuuor / ein weil ein

vnder die kinder Goͤttes gezelet / vnd

wirſt den Glorioſen geſang ſingen:

Die feind ſind vberwunden. O Hell
wo iſt deine Victori? O Codt wo iſt

dein Stachel / Ey jhr ſeid alle ver⸗

ſchlungen im ſieg / 7.
Vnnd auff daß diß alles alſo an
allen Himliſchen Landferern / er⸗

ſtattet / ſie voꝛ dem breiten wege ge⸗

warnet / vnnd jmmer auff dem en⸗
gen pfade / da zu einer ſeiten Waſſer /

Ketzer vñ Sab
bath der Welt geachtet biſt / wirſt nu

But why is Tesseract so sensitive to cropping and minimal deskewing?

stweil commented 2 months ago

Maybe it gets a wrong DPI value in your original workflow? Is the DPI value correct for the input image (which is the result of previous OCR-D processors)?

jbarth-ubhd commented 2 months ago

Too bad I didn't think of that myself:

jb@nuc:~/faber1566$ find . \( -iname "*.png" -o -iname "*.tif" \) -printf "identify -format '%%x %%y %%U\\\n' %p\n"|bash -x
+ identify -format '%x %y %U\n' ./faber1566_-_0075r.tif
1225.29296875 1225.29296875 PixelsPerInch  # page has 8° format, dpi is +-20% correct
+ identify -format '%x %y %U\n' ./OCR-D-004/OCR-D-IMG_00001-BIN_wolf.png
72 72 Undefined
+ identify -format '%x %y %U\n' ./OCR-D-OCR3/OCR-D-OCR3_00001.IMG-BIN.png
72 72 Undefined
+ identify -format '%x %y %U\n' ./OCR-D-005/OCR-D-005_00001.IMG-DESKEW.png
72 72 Undefined
+ identify -format '%x %y %U\n' ./OCR-D-OCR/OCR-D-OCR_00001.IMG-BIN.png
72 72 Undefined
+ identify -format '%x %y %U\n' ./OCR-D-002/OCR-D-002_00001-BIN_wolf.png
72 72 Undefined
+ identify -format '%x %y %U\n' ./OCR-D-OCR2/OCR-D-OCR2_00001.IMG-BIN.png
72 72 Undefined
+ identify -format '%x %y %U\n' ./OCR-D-003/OCR-D-003_00001.IMG-CROP.png
72 72 Undefined

stweil commented 2 months ago

So the image resolution gets lost early in the OCR-D workflow in olena-binarize? I think this looks like a bug. And the following processors might have the same issue.

I wonder why nobody noticed this up to now. But maybe high resolution images like in your case are rare, and for 300 dpi images the damage is less severe.
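
To check whether a processor's PNG output really lacks resolution metadata, one can look for the pHYs chunk directly instead of relying on identify (which, as far as I know, falls back to "72 72 Undefined" when nothing is stored). The following is a minimal sketch using only the Python standard library; the function name `png_dpi` is made up for illustration and is not part of OCR-D or ocrd_tesserocr.

```python
import struct

PNG_SIGNATURE = b"\x89PNG\r\n\x1a\n"


def png_dpi(data: bytes):
    """Return (x_dpi, y_dpi) from a PNG's pHYs chunk.

    Returns None if the chunk is missing or its unit is not the metre.
    `png_dpi` is a hypothetical helper, not an OCR-D API.
    """
    if not data.startswith(PNG_SIGNATURE):
        raise ValueError("not a PNG file")
    pos = len(PNG_SIGNATURE)
    # Each chunk: 4-byte length, 4-byte type, payload, 4-byte CRC.
    while pos + 8 <= len(data):
        length, chunk_type = struct.unpack(">I4s", data[pos:pos + 8])
        payload = data[pos + 8:pos + 8 + length]
        if chunk_type == b"pHYs" and length == 9:
            x_ppu, y_ppu, unit = struct.unpack(">IIB", payload)
            if unit != 1:  # 1 means pixels per metre; 0 is aspect ratio only
                return None
            return (x_ppu * 0.0254, y_ppu * 0.0254)  # pixels/metre -> pixels/inch
        if chunk_type in (b"IDAT", b"IEND"):
            return None  # pHYs has to come before the image data
        pos += 12 + length
    return None
```

Usage would be something like `png_dpi(open(path, "rb").read())` on each intermediate image; a result of None would mean the resolution was not written by the step that produced the file.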

jbarth-ubhd commented 2 months ago

I think relying on DPI alone, without taking the "reading distance" into account, is not sufficient for all cases (though it covers 99% of the "usual" ones): a microfilm scan might have 2540 dpi; a poster might have been scanned at 300 dpi, but is typically read from several metres away.

stweil commented 2 months ago

I agree. My example is text with huge letters written on a wall. Ideally Tesseract should not depend on DPI values.
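
The point about DPI versus letter size can be made concrete with a bit of arithmetic: what matters for recognition is how many pixels tall a glyph ends up, which depends on both the scan resolution and the physical letter size. This is a minimal sketch; the function name and the letter heights used in the usage note are illustrative assumptions, not measurements from the page discussed here.

```python
MM_PER_INCH = 25.4


def glyph_height_px(height_mm: float, dpi: float) -> float:
    """Pixel height of a glyph that is `height_mm` tall on paper, scanned at `dpi`.

    Hypothetical helper for illustration only.
    """
    if height_mm < 0 or dpi <= 0:
        raise ValueError("height_mm must be >= 0 and dpi must be > 0")
    return height_mm / MM_PER_INCH * dpi
```

For example, `glyph_height_px(2.0, 300)` and `glyph_height_px(50.0, 300)` describe a small book typeface and a poster headline at the same nominal 300 dpi, yet the second is 25 times taller in pixels. The DPI value alone therefore says little about the glyph size the OCR engine actually sees.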

jbarth-ubhd commented 2 months ago

At the very least, OCR-D could try to preserve the resolution information. Otherwise I'll have to write a workaround, perhaps with exiftool.

jbarth-ubhd commented 2 months ago

I tried the workaround exiftool -tagsFromFile OCR-D-IMG/00001.tif OCR-D-.../*.png after each step, but the result is still bad:

ocrdcluster/finished/faber1566/run11/0075r> ocrd-show-text OCR-D-OCR/*.xml|egrep -v '^$'
5 Die je feind ſindv pberwu unden. O 5 l
. wo in St Siachſ 0/ Ey r 7 *0 Lans
ſtatet fv voꝛ dem brahen wege ge⸗
warn 6t/ vnnd eer ün dem en⸗

For the complete workspace, see https://digi.ub.uni-heidelberg.de/diglitData/faber1566_-_0075r.tar

Image file resolutions:

dwork@pers109:/mnt/sds/sd22d001/ocrdcluster/finished/faber1566/run11/0075r$ identify -format "%d/%f %x %y %U\n" */*.tif */*.png*
OCR-D-IMG/00001.tif 1225.2930908203125 1225.2930908203125 PixelsPerInch
OCR-D-001/OCR-D-001_00001-BIN_wolf.png 1225.2931228861330055 1225.2931228861330055 PixelsPerInch
OCR-D-001/OCR-D-001_00001-BIN_wolf.png_original 72 72 Undefined
OCR-D-002/OCR-D-002_00001.IMG-CROP.png 1225.2931228861330055 1225.2931228861330055 PixelsPerInch
OCR-D-002/OCR-D-002_00001.IMG-CROP.png_original 72 72 Undefined
OCR-D-003/OCR-D-IMG_00001-BIN_wolf.png 1225.2931228861330055 1225.2931228861330055 PixelsPerInch
OCR-D-003/OCR-D-IMG_00001-BIN_wolf.png_original 72 72 Undefined
OCR-D-004/OCR-D-004_00001.IMG-DESKEW.png 1225.2931228861330055 1225.2931228861330055 PixelsPerInch
OCR-D-004/OCR-D-004_00001.IMG-DESKEW.png_original 72 72 Undefined
OCR-D-OCR/OCR-D-OCR_00001.IMG-BIN.png 72 72 Undefined

What is really astonishing is that Tesseract does pick up the correct DPI:

...
11:17:50.743 INFO processor.TesserocrCrop - INPUT FILE 0 / P_00001
11:17:51.072 INFO processor.TesserocrCrop - Page 'P_00001' images will use 1225 DPI from image meta-data
11:17:51.072 INFO processor.TesserocrCrop - Cropping with Tesseract
11:17:53.757 INFO processor.TesserocrCrop - Ignoring region 'region0000' because its width is too small (43)
11:17:53.758 INFO processor.TesserocrCrop - Ignoring region 'region0001' because its width is too small (35)
...

jbarth-ubhd commented 2 months ago

OK... I changed my workflow to "you can have any resolution, as long as it's 300 dpi" (convert ... -resample 300 ...). That helps.
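
For a sense of what resampling to 300 dpi does to this scan: the image keeps its physical size, so the pixel dimensions shrink by the ratio of target to source resolution. The sketch below only does that arithmetic; the function name is made up, and the exact rounding ImageMagick applies may differ.

```python
def resample_size(width_px: int, height_px: int, src_dpi: float, dst_dpi: float = 300.0):
    """Return (new_width, new_height, scale) when resampling from src_dpi to dst_dpi.

    Hypothetical helper; it mirrors the idea behind `-resample`
    (same physical size, different pixel density), not its implementation.
    """
    if src_dpi <= 0 or dst_dpi <= 0:
        raise ValueError("resolutions must be positive")
    scale = dst_dpi / src_dpi
    new_width = max(1, round(width_px * scale))
    new_height = max(1, round(height_px * scale))
    return new_width, new_height, scale
```

With the roughly 1225 dpi reported for the original TIFF, the scale factor is about 0.245, so each dimension drops to about a quarter. The pixel dimensions of the page are not given in this thread, so any concrete width and height passed in are placeholders.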