OCR-D / ocrd_all

Master repository which includes most other OCR-D repositories as submodules
MIT License

error in ocrd-tesserocr-deskew: PyTessBaseAPI.__cinit__ #367

Closed jbarth-ubhd closed 1 year ago

jbarth-ubhd commented 1 year ago

I tried to run a workflow (ocrd.sif built from the Docker image ocrd/all:maximum, 2023-06-13, approx. 18:00 CEST) with these files:

https://digi.ub.uni-heidelberg.de/diglitData/v/duerer1527_-_aa2.tgz

main command = run.sh

and got the following error message (excerpt from ocrd.log):

...
+ /home/hd/hd_hd/hd_wu120/local/bin/time singularity exec -e --env-file /home/hd/hd_hd/hd_wu120/ocrd.env --env MAGICK_TEMPORARY_PATH=/scratch/hd_wu120_job_948894_m08n15 --env TMPDIR=/scratch/hd_wu120_job_948894_m08n15 --env TESSDATA_PREFIX=/home/hd/hd_hd/hd_wu120/ocrd_models/tessdata /home/hd/hd_hd/hd_wu120/ocrd.sif ocrd-tesserocr-deskew -P operation_level page -I OCR-D-003 -O OCR-D-004
GID: readonly variable
UID: readonly variable
09:07:57.923 ERROR ocrd.processor.helpers.run_processor - Failure in processor 'ocrd-tesserocr-deskew'
Traceback (most recent call last):
  File "/build/core/ocrd/ocrd/processor/helpers.py", line 128, in run_processor
    processor.process()
  File "/usr/local/lib/python3.8/site-packages/ocrd_tesserocr/deskew.py", line 61, in process
    with PyTessBaseAPI(
  File "tesserocr.pyx", line 1219, in tesserocr.PyTessBaseAPI.__cinit__
  File "tesserocr.pyx", line 1233, in tesserocr.PyTessBaseAPI._init_api
RuntimeError: Failed to init API, possibly an invalid tessdata path: /home/hd/hd_hd/hd_wu120/ocrd_models/tessdata/
...

See ocrd.log in the .tgz linked above for the complete log.

I added an ls -l of the directory named in the error message ("... possibly an invalid tessdata path").

Note: ocrd-tesserocr-crop ran before that without error.

Complete listing of files:

[hd_wu120@o05i15 ~]$ ls -lR ocrd_models/tessdata/
ocrd_models/tessdata/:
total 36825
drwxr-xr-x 2 hd_wu120 hd_hd     8192 Feb 15 13:04 configs
-rw-rw-r-- 1 hd_wu120 hd_hd  8628461 Feb 15 12:48 deu.traineddata
-rw-r--r-- 1 hd_wu120 hd_hd 15400601 Feb 15 14:44 eng.traineddata
-rw-r--r-- 1 hd_wu120 hd_hd  5060763 Dez  9  2021 frak2021_1.069.traineddata
-rw-r--r-- 1 hd_wu120 hd_hd  3972885 Feb 15 13:30 fra.traineddata
-rw-r--r-- 1 hd_wu120 hd_hd  4591424 Nov  3  2019 GT4HistOCR_50000000.997_191951.traineddata
-rw-r--r-- 1 hd_wu120 hd_hd      572 Feb 15 13:04 pdf.ttf
drwxrwxr-x 2 hd_wu120 hd_hd     8192 Feb 15 13:24 script
drwxr-xr-x 2 hd_wu120 hd_hd     8192 Feb 15 13:04 tessconfigs

ocrd_models/tessdata/configs:
total 13
-rw-r--r-- 1 hd_wu120 hd_hd  23 Feb 15 13:04 alto
-rw-r--r-- 1 hd_wu120 hd_hd 146 Feb 15 13:04 ambigs.train
-rw-r--r-- 1 hd_wu120 hd_hd  26 Feb 15 13:04 api_config
-rw-r--r-- 1 hd_wu120 hd_hd 129 Feb 15 13:04 bigram
-rw-r--r-- 1 hd_wu120 hd_hd 311 Feb 15 13:04 box.train
-rw-r--r-- 1 hd_wu120 hd_hd 311 Feb 15 13:04 box.train.stderr
-rw-r--r-- 1 hd_wu120 hd_hd  37 Feb 15 13:04 digits
-rw-r--r-- 1 hd_wu120 hd_hd  24 Feb 15 13:04 get.images
-rw-r--r-- 1 hd_wu120 hd_hd  40 Feb 15 13:04 hocr
-rw-r--r-- 1 hd_wu120 hd_hd  59 Feb 15 13:04 inter
-rw-r--r-- 1 hd_wu120 hd_hd 101 Feb 15 13:04 kannada
-rw-r--r-- 1 hd_wu120 hd_hd  70 Feb 15 13:04 linebox
-rw-r--r-- 1 hd_wu120 hd_hd  25 Feb 15 13:04 logfile
-rw-r--r-- 1 hd_wu120 hd_hd  26 Feb 15 13:04 lstmbox
-rw-r--r-- 1 hd_wu120 hd_hd  98 Feb 15 13:04 lstmdebug
-rw-r--r-- 1 hd_wu120 hd_hd 282 Feb 15 13:04 lstm.train
-rw-r--r-- 1 hd_wu120 hd_hd  26 Feb 15 13:04 makebox
-rw-r--r-- 1 hd_wu120 hd_hd  22 Feb 15 13:04 pdf
-rw-r--r-- 1 hd_wu120 hd_hd  21 Feb 15 13:04 quiet
-rw-r--r-- 1 hd_wu120 hd_hd  65 Feb 15 13:04 rebox
-rw-r--r-- 1 hd_wu120 hd_hd 377 Feb 15 13:04 strokewidth
-rw-r--r-- 1 hd_wu120 hd_hd  22 Feb 15 13:04 tsv
-rw-r--r-- 1 hd_wu120 hd_hd 166 Feb 15 13:04 txt
-rw-r--r-- 1 hd_wu120 hd_hd  45 Feb 15 13:04 unlv
-rw-r--r-- 1 hd_wu120 hd_hd  29 Feb 15 13:04 wordstrbox

ocrd_models/tessdata/script:
total 99040
-rw-r--r-- 1 hd_wu120 hd_hd 101402885 Feb 15 13:30 Latin.traineddata

ocrd_models/tessdata/tessconfigs:
total 3
-rw-r--r-- 1 hd_wu120 hd_hd  49 Feb 15 13:04 batch
-rw-r--r-- 1 hd_wu120 hd_hd  37 Feb 15 13:04 batch.nochop
-rw-r--r-- 1 hd_wu120 hd_hd 243 Feb 15 13:04 matdemo
-rw-r--r-- 1 hd_wu120 hd_hd 368 Feb 15 13:04 msdemo
-rw-r--r-- 1 hd_wu120 hd_hd   1 Feb 15 13:04 nobatch
-rw-r--r-- 1 hd_wu120 hd_hd 295 Feb 15 13:04 segdemo

bertsky commented 1 year ago

As the readme states, deskew needs osd.traineddata.

By setting TESSDATA_PREFIX, you make Tesseract use a custom model directory that has not been filled by the installer (ocrd_all compiles Tesseract with /usr/local/share/tessdata, which is also used as the module resource location for ocrd_tesserocr).
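To verify this before starting a job, a quick check of the model directory can help. The following is only a minimal sketch, not part of ocrd_tesserocr: the helper name missing_models is made up for illustration, and it assumes that a model counts as installed when a file named <name>.traineddata sits directly in the directory (as in the listing above).

```python
import os
from pathlib import Path


def missing_models(tessdata_dir, required=("osd",)):
    """Return the names in `required` that have no <name>.traineddata file.

    Hypothetical helper for illustration. "osd" is the default because,
    per the readme, deskew needs osd.traineddata.
    """
    root = Path(tessdata_dir)
    return [name for name in required
            if not (root / f"{name}.traineddata").is_file()]


# Check the directory Tesseract would be pointed at. The fallback is the
# path ocrd_all compiles in, according to the explanation above.
prefix = os.environ.get("TESSDATA_PREFIX", "/usr/local/share/tessdata")
print(prefix, "is missing:", missing_models(prefix) or "nothing")
```

Run inside the container with the same environment as the failing processor call. For the directory listed in the report, this would name osd, since no osd.traineddata appears there.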

Please follow the OCR-D user guide for Docker, which gives the following cmdline:

docker run --user $(id -u) --workdir /data --volume $PWD:/data --volume $PWD/models:/usr/local/share/tessdata --volume $PWD/models:/usr/local/share/ocrd-resources -it ocrd/all bash

(where bash can of course be replaced with any single processor call or workflow script)

jbarth-ubhd commented 1 year ago

Thanks!