OCR-D / core

Collection of OCR-related python tools and wrappers from @OCR-D
https://ocr-d.de/core/
Apache License 2.0

[Model Documentation] Fraktur call is not correct #567

Closed: GrazingScientist closed this issue 4 years ago

GrazingScientist commented 4 years ago

Used Docker image: docker.io/ocrd/all, tag maximum, image ID 7bfeac60c4cb, created 5 days ago, size 12.4 GB

Problem Description: Using the OCR-D Docker image, the call to the Tesseract Fraktur model given in the documentation fails.

When I install the Fraktur model via apt install tesseract-ocr-script-frak, the call to ocrd-tesserocr-recognize -I OCR-D-SEG-LINE -O OCR-D-OCR-TESS_DEU_FRAK -p '{"model": "deu+frk"}' as given in the documentation fails.

The correct call would be:

ocrd-tesserocr-recognize -I OCR-D-SEG-LINE -O OCR-D-OCR-TESS_DEU_FRAK -p '{"model": "deu+Fraktur"}'

Also, by the way, I had to set TESSDATA_PREFIX in the Dockerfile as

ENV TESSDATA_PREFIX=/usr/share/tesseract-ocr/4.00/tessdata

because otherwise the models are not found. Can you reproduce this?
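
The failure described above comes down to which .traineddata files exist in the directory Tesseract searches. As a minimal Python sketch, the hypothetical helper missing_models (an illustration, not part of OCR-D or Tesseract) splits a model string such as deu+frk on "+" and reports the parts that have no matching file. It assumes TESSDATA_PREFIX points directly at the tessdata directory, as in the ENV line above.

```python
import os
from pathlib import Path


def missing_models(model_spec, tessdata_dir):
    """Return the parts of a '+'-joined model spec with no .traineddata file.

    Hypothetical helper for illustration; not part of OCR-D or Tesseract.
    """
    tessdata = Path(tessdata_dir)
    return [
        name
        for name in model_spec.split("+")
        if not (tessdata / f"{name}.traineddata").is_file()
    ]


if __name__ == "__main__":
    # Assumption: TESSDATA_PREFIX is the tessdata directory itself.
    prefix = os.environ.get(
        "TESSDATA_PREFIX", "/usr/share/tesseract-ocr/4.00/tessdata"
    )
    for spec in ("deu+frk", "deu+Fraktur"):
        print(spec, "-> missing:", missing_models(spec, prefix))
```

Running this inside the container before calling ocrd-tesserocr-recognize shows whether a model name like frk or Fraktur is actually available under that directory.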

bertsky commented 4 years ago

I am afraid your case – ocrd_all via Docker image – is not covered by the ocr-d.de documentation yet. So that's actually a documentation issue (should be moved to ocrd-website). Or you could say it's an ocrd_all issue, but not core.

The situation in the Docker image is different, because Tesseract has been installed from source there (not apt), so models reside in a custom path (/usr/local/share/tessdata), but you cannot use the ocrd_all make rules to fetch additional models.

> Also, by the way, I had to set TESSDATA_PREFIX in the Dockerfile as
>
> ENV TESSDATA_PREFIX=/usr/share/tesseract-ocr/4.00/tessdata
>
> because otherwise the models are not found

This makes your ocrd_tesserocr use only models you can install via apt. The default path for the Docker version is /usr/local/share/tessdata (with only the minimal osd/eng/equ installed). It's not exported to the shell, though.

I agree we should find a better solution (either use the apt default path, or at least export TESSDATA_PREFIX correctly).
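
To see which of the two locations mentioned in this thread is in effect, a small diagnostic can help. The following Python sketch is an assumption-laden illustration, not how Tesseract or ocrd_tesserocr resolve the path internally: the hypothetical function find_tessdata_dir returns TESSDATA_PREFIX if it is set and otherwise the first candidate directory that exists, and installed_models lists the model names found there.

```python
import os
from pathlib import Path

# The two locations named in this thread; other installations may differ.
CANDIDATE_DIRS = (
    "/usr/local/share/tessdata",               # Tesseract built from source
    "/usr/share/tesseract-ocr/4.00/tessdata",  # models installed via apt
)


def find_tessdata_dir(env=None, candidates=CANDIDATE_DIRS):
    """Return TESSDATA_PREFIX if set, else the first existing candidate, else None."""
    env = os.environ if env is None else env
    prefix = env.get("TESSDATA_PREFIX")
    if prefix:
        return Path(prefix)
    for candidate in candidates:
        if Path(candidate).is_dir():
            return Path(candidate)
    return None


def installed_models(tessdata_dir):
    """List model names (file stems) of all .traineddata files in a directory."""
    return sorted(p.stem for p in Path(tessdata_dir).glob("*.traineddata"))


if __name__ == "__main__":
    tessdata = find_tessdata_dir()
    if tessdata is None:
        print("no tessdata directory found")
    else:
        print(tessdata, installed_models(tessdata))
```

If the listing shows only the minimal models while apt-installed ones are missing, the two directories have diverged in the way described above.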

Also, note that frk and Fraktur are actually different kinds of fraktur models:

frk / tesseract-ocr-frk: unlike the apt description, this is not Frankish, but a modern (LSTM-based) Fraktur model for German

Fraktur / tesseract-ocr-script-frak: pure (without LM/dict) LSTM-based Fraktur model for all languages

You can even combine them: frk+Fraktur.
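
Because the model parameter is passed as a JSON string, combining models is just a matter of joining their names with "+". The Python sketch below builds such a call; recognize_command is a hypothetical helper, and the output group name in the example is made up for illustration.

```python
import json
import shlex


def recognize_command(models, input_group, output_group):
    """Build an ocrd-tesserocr-recognize argument list for the given models."""
    params = {"model": "+".join(models)}
    return [
        "ocrd-tesserocr-recognize",
        "-I", input_group,
        "-O", output_group,
        "-p", json.dumps(params),
    ]


if __name__ == "__main__":
    # The output group name is an invented example, not taken from the docs.
    cmd = recognize_command(
        ["frk", "Fraktur"], "OCR-D-SEG-LINE", "OCR-D-OCR-TESS_FRK_FRAKTUR"
    )
    # shlex.join (Python 3.8+) quotes the JSON so it can be pasted into a shell.
    print(shlex.join(cmd))
```

Generating the JSON with json.dumps avoids quoting mistakes when the parameter string is assembled by hand.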

GrazingScientist commented 4 years ago

> I am afraid your case – ocrd_all via Docker image – is not covered by the ocr-d.de documentation yet. So that's actually a documentation issue (should be moved to ocrd-website). Or you could say it's an ocrd_all issue, but not core.

Thank you for pointing this out. I have to admit that I posted this in core out of habit. Sorry for that! Is it possible to move the ticket, or shall I open a new one in ocrd-website?

> I agree we should find a better solution (either use the apt default path, or at least export TESSDATA_PREFIX correctly).

This would be awesome, since I was very confused (though fortunately I had run into this situation before).

> Also, note that frk and Fraktur are actually different kinds of fraktur models:
>
> frk / tesseract-ocr-frk: unlike the apt description, this is not Frankish, but a modern (LSTM-based) Fraktur model for German
>
> Fraktur / tesseract-ocr-script-frak: pure (without LM/dict) LSTM-based Fraktur model for all languages
>
> You can even combine them: frk+Fraktur.

This differentiation should definitely be covered in the documentation!

Edit: And thanks for all the insights! :)