Open jbarth-ubhd opened 2 months ago
Does this workflow work better: https://github.com/slub/ocrd_manager/blob/main/workflows/ocr-workflow-default.sh?
Yes, a pure Tesseract-only workflow
ocrd-tesserocr-recognize -P segmentation_level region -P model frak2021 -I OCR-D-IMG -O OCR-D-OCR3
gives
Jetzt wirſtu ruhe finden dei⸗
ner Seele. Jetzt wird dir ein Tiſch
gedecket / an welchem CH RIſtus
ſelbs der Haußknecht ſein wil. Der⸗
halben verſuch alles / wag alles / ver⸗
dag nicht / kempff hindurch / Laß
erſchlagenen hinder dir liegen / denn
hie werden deine ihrenen alle ab⸗
getruͤcknet / deine arbeit belohnet / der
du zuuor / ein weil ein
vnder die kinder Goͤttes gezelet / vnd
wirſt den Glorioſen geſang ſingen:
Die feind ſind vberwunden. O Hell
wo iſt deine Victori? O Codt wo iſt
dein Stachel / Ey jhr ſeid alle ver⸗
ſchlungen im ſieg / 7.
Vnnd auff daß diß alles alſo an
allen Himliſchen Landferern / er⸗
ſtattet / ſie voꝛ dem breiten wege ge⸗
warnet / vnnd jmmer auff dem en⸗
gen pfade / da zu einer ſeiten Waſſer /
Ketzer vñ Sab
bath der Welt geachtet biſt / wirſt nu
But why is Tesseract so sensitive to cropping and minimal deskewing?
Maybe it gets a wrong DPI value in your original workflow? Is the DPI value correct for the input image (which is the result of previous OCR-D processors)?
Bad that this wasn't my idea:
jb@nuc:~/faber1566$ find . \( -iname "*.png" -o -iname "*.tif" \) -printf "identify -format '%%x %%y %%U\\\n' %p\n"|bash -x
+ identify -format '%x %y %U\n' ./faber1566_-_0075r.tif
1225.29296875 1225.29296875 PixelsPerInch # page has 8° format, dpi is +-20% correct
+ identify -format '%x %y %U\n' ./OCR-D-004/OCR-D-IMG_00001-BIN_wolf.png
72 72 Undefined
+ identify -format '%x %y %U\n' ./OCR-D-OCR3/OCR-D-OCR3_00001.IMG-BIN.png
72 72 Undefined
+ identify -format '%x %y %U\n' ./OCR-D-005/OCR-D-005_00001.IMG-DESKEW.png
72 72 Undefined
+ identify -format '%x %y %U\n' ./OCR-D-OCR/OCR-D-OCR_00001.IMG-BIN.png
72 72 Undefined
+ identify -format '%x %y %U\n' ./OCR-D-002/OCR-D-002_00001-BIN_wolf.png
72 72 Undefined
+ identify -format '%x %y %U\n' ./OCR-D-OCR2/OCR-D-OCR2_00001.IMG-BIN.png
72 72 Undefined
+ identify -format '%x %y %U\n' ./OCR-D-003/OCR-D-003_00001.IMG-CROP.png
72 72 Undefined
So the image resolution gets lost early in the OCR-D workflow in olena-binarize? I think this looks like a bug. And the following processors might have the same issue.
I wonder why nobody noticed this up to now. But maybe high resolution images like in your case are rare, and for 300 dpi images the damage is less severe.
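To spot where the resolution disappears, the identify listing above can be filtered automatically. The following is only a minimal sketch: flag_missing_dpi is a hypothetical helper name, and it assumes that identify prints "Undefined" as the unit whenever a file carries no resolution metadata, as it does in the output above.

```shell
#!/bin/sh
# Hypothetical helper (not part of OCR-D or ImageMagick).
# Reads lines of the form "<path> <xres> <yres> <unit>" on stdin and prints
# the path of every file whose resolution unit is "Undefined".
flag_missing_dpi() {
    # Strip the last three fields so that paths containing spaces survive.
    awk '$NF == "Undefined" { sub(/ [^ ]+ [^ ]+ Undefined$/, ""); print }'
}
```

It could be fed like this, reusing the format string from the listing further down in this thread:

find . \( -iname '*.png' -o -iname '*.tif' \) -exec identify -format '%d/%f %x %y %U\n' {} + | flag_missing_dpi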
I think relying on DPI without "reading distance" is not sufficient in 100% of cases (though it is in 99% of the "usual" ones): a microfilm scan might have 2540 dpi; a poster might have been scanned at 300 dpi but is typically read from meters away.
I agree. My example is text with huge letters written on a wall. Ideally Tesseract should not depend on DPI values.
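To put numbers on why a lost or wrong DPI value matters, here is a small arithmetic sketch. It only uses the definition 1 pt = 1/72 inch; the 10 pt glyph is a made-up example size, and pt_to_px and px_to_pt are hypothetical helper names. How strongly a given OCR engine actually relies on the declared resolution is not something this sketch shows.

```shell
#!/bin/sh
# Pixel height of a glyph of <pt> points scanned at <dpi>, rounded to whole pixels.
pt_to_px() {
    awk -v pt="$1" -v dpi="$2" 'BEGIN { printf "%.0f\n", pt * dpi / 72 }'
}

# Point size that a glyph of <px> pixels appears to have when the file
# declares <dpi> as its resolution.
px_to_pt() {
    awk -v px="$1" -v dpi="$2" 'BEGIN { printf "%.1f\n", px * 72 / dpi }'
}
```

With these, pt_to_px 10 300 gives 42 and pt_to_px 10 1225.29 gives 170. Conversely, px_to_pt 170 1225.29 gives 10.0, but px_to_pt 170 72 gives 170.0: the same pixels look like 170 pt type to any consumer that trusts a declared 72 dpi.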
At least OCR-D could try to keep the resolution information. Otherwise I'll have to write a workaround, perhaps with exiftool.
I tried the workaround
exiftool -tagsFromFile OCR-D-IMG/00001.tif OCR-D-.../*.png
after each step, but the result is bad:
ocrdcluster/finished/faber1566/run11/0075r> ocrd-show-text OCR-D-OCR/*.xml|egrep -v '^$'
5 Die je feind ſindv pberwu unden. O 5 l
. wo in St Siachſ 0/ Ey r 7 *0 Lans
ſtatet fv voꝛ dem brahen wege ge⸗
warn 6t/ vnnd eer ün dem en⸗
For the complete workspace, see https://digi.ub.uni-heidelberg.de/diglitData/faber1566_-_0075r.tar
Image file resolutions:
dwork@pers109:/mnt/sds/sd22d001/ocrdcluster/finished/faber1566/run11/0075r$ identify -format "%d/%f %x %y %U\n" */*.tif */*.png*
OCR-D-IMG/00001.tif 1225.2930908203125 1225.2930908203125 PixelsPerInch
OCR-D-001/OCR-D-001_00001-BIN_wolf.png 1225.2931228861330055 1225.2931228861330055 PixelsPerInch
OCR-D-001/OCR-D-001_00001-BIN_wolf.png_original 72 72 Undefined
OCR-D-002/OCR-D-002_00001.IMG-CROP.png 1225.2931228861330055 1225.2931228861330055 PixelsPerInch
OCR-D-002/OCR-D-002_00001.IMG-CROP.png_original 72 72 Undefined
OCR-D-003/OCR-D-IMG_00001-BIN_wolf.png 1225.2931228861330055 1225.2931228861330055 PixelsPerInch
OCR-D-003/OCR-D-IMG_00001-BIN_wolf.png_original 72 72 Undefined
OCR-D-004/OCR-D-004_00001.IMG-DESKEW.png 1225.2931228861330055 1225.2931228861330055 PixelsPerInch
OCR-D-004/OCR-D-004_00001.IMG-DESKEW.png_original 72 72 Undefined
OCR-D-OCR/OCR-D-OCR_00001.IMG-BIN.png 72 72 Undefined
What is really astonishing is that Tesseract picks up the correct DPI:
...
11:17:50.743 INFO processor.TesserocrCrop - INPUT FILE 0 / P_00001
11:17:51.072 INFO processor.TesserocrCrop - Page 'P_00001' images will use 1225 DPI from image meta-data
11:17:51.072 INFO processor.TesserocrCrop - Cropping with Tesseract
11:17:53.757 INFO processor.TesserocrCrop - Ignoring region 'region0000' because its width is too small (43)
11:17:53.758 INFO processor.TesserocrCrop - Ignoring region 'region0001' because its width is too small (35)
...
OK, I changed my workflow to "you can have any resolution, as long as it's 300 dpi" (convert ... -resample 300 ...). That helps.
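For anyone who wants to try the same normalization, a rough sketch of a wrapper follows. It is not the exact command used above (the elided arguments are unknown). resample_to and the DRY_RUN switch are hypothetical names, it assumes an ImageMagick installation that provides convert (ImageMagick 7 may want magick instead), and it assumes the input still carries its resolution metadata, as the original TIFF does.

```shell
#!/bin/sh
# Hypothetical wrapper: resample <src> to <dpi> and write the result to <dst>.
# With DRY_RUN=1 the command is only printed, not executed.
resample_to() {
    dpi=$1 src=$2 dst=$3
    # -resample changes the pixel dimensions so that the image has the
    # requested resolution; it relies on the density stored in the input.
    set -- convert "$src" -resample "$dpi" "$dst"
    if [ "${DRY_RUN:-0}" = 1 ]; then
        printf '%s\n' "$*"
    else
        "$@"
    fi
}
```

For example, DRY_RUN=1 resample_to 300 faber1566_-_0075r.tif faber1566_-_0075r_300.tif (the output name is made up) prints the convert call without running it. Going from about 1225 dpi to 300 dpi shrinks each pixel dimension to roughly a quarter.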
After updating to ocrd/all:maximum (2024-07-10 15:00 CEST),
when OCR'ing https://digi.ub.uni-heidelberg.de/diglitData/v/faber1566_-_0075r.tif
with this workflow:
I'll get this text:
When running
tesseract -l frak2021 OCR-D-005/OCR-D-005_00001.IMG-DESKEW.png pure-tesseract
(with the image after deskew), I'll get this text (see below).
And when running
tesseract -l frak2021 faber1566_-_0075r.tif pure-tesseract-from-original
I'll get
And this problem, a lot of text missing in OCR-D, occurs on approx. 70-80% of all pages (depending on the book, of course).