OCR-D / ocrd_all

Master repository which includes most other OCR-D repositories as submodules
MIT License

tesserocr-deskew - directory $TESSDATA_PREFIX ? #351

Closed jbarth-ubhd closed 1 year ago

jbarth-ubhd commented 1 year ago

PS: ocrd.sif is from ocrd/all:2022-08-15

> singularity exec -e --env-file /home/hd/hd_hd/hd_wu120/ocrd.env --env MAGICK_TEMPORARY_PATH=/scratch/hd_wu120_job_700507_p01n10 --env TMPDIR=/scratch/hd_wu120_job_700507_p01n10 --env TESSDATA_PREFIX=/home/hd/hd_hd/hd_wu120/ocrd_models/tessdata /home/hd/hd_hd/hd_wu120/ocrd.sif ocrd-tesserocr-deskew -P operation_level page -I OCR-D-003 -O OCR-D-004
GID: readonly variable
UID: readonly variable
Traceback (most recent call last):
  File "/usr/local/bin/ocrd-tesserocr-deskew", line 8, in <module>
    sys.exit(ocrd_tesserocr_deskew())
  File "/usr/local/lib/python3.6/dist-packages/click/core.py", line 1128, in __call__
    return self.main(*args, **kwargs)
  File "/usr/local/lib/python3.6/dist-packages/click/core.py", line 1053, in main
    rv = self.invoke(ctx)
  File "/usr/local/lib/python3.6/dist-packages/click/core.py", line 1395, in invoke
    return ctx.invoke(self.callback, **ctx.params)
  File "/usr/local/lib/python3.6/dist-packages/click/core.py", line 754, in invoke
    return __callback(*args, **kwargs)
  File "/usr/local/lib/python3.6/site-packages/ocrd_tesserocr/cli.py", line 58, in ocrd_tesserocr_deskew
    return ocrd_cli_wrap_processor(TesserocrDeskew, *args, **kwargs)
  File "/build/core/ocrd/ocrd/decorators/__init__.py", line 108, in ocrd_cli_wrap_processor
    run_processor(processorClass, ocrd_tool, mets, workspace=workspace, **kwargs)
  File "/build/core/ocrd/ocrd/processor/helpers.py", line 88, in run_processor
    processor.process()
  File "/usr/local/lib/python3.6/site-packages/ocrd_tesserocr/deskew.py", line 68, in process
    psm=PSM.AUTO_OSD
  File "tesserocr.pyx", line 1219, in tesserocr.PyTessBaseAPI.__cinit__
    self._init_api(cpath, clang, oem, NULL, 0, NULL, NULL, False, psm)
  File "tesserocr.pyx", line 1233, in tesserocr.PyTessBaseAPI._init_api
    raise RuntimeError('Failed to init API, possibly an invalid tessdata path: {}'.format(path))
RuntimeError: Failed to init API, possibly an invalid tessdata path: /home/hd/hd_hd/hd_wu120/ocrd_models/tessdata
Command exited with non-zero status 1

but the directory does contain these files:

$ cd /home/hd/hd_hd/hd_wu120/ocrd_models/tessdata && find . -type f -printf "%-30p %9s\n"|sort
./configs/alto                        23
./configs/ambigs.train               146
./configs/api_config                  26
./configs/bigram                     129
./configs/box.train                  311
./configs/box.train.stderr           311
./configs/digits                      37
./configs/get.images                  24
./configs/hocr                        40
./configs/inter                       59
./configs/kannada                    101
./configs/linebox                     70
./configs/logfile                     25
./configs/lstmbox                     26
./configs/lstmdebug                   98
./configs/lstm.train                 282
./configs/makebox                     26
./configs/pdf                         22
./configs/quiet                       21
./configs/rebox                       65
./configs/strokewidth                377
./configs/tsv                         22
./configs/txt                        166
./configs/unlv                        45
./configs/wordstrbox                  29
./deu.traineddata                8628461
./eng.traineddata               15400601
./frak2021_1.069.traineddata     5060763
./fra.traineddata                3972885
./GT4HistOCR_50000000.997_191951.traineddata   4591424
./pdf.ttf                            572
./script/Latin.traineddata     101402885
./tessconfigs/batch                   49
./tessconfigs/batch.nochop            37
./tessconfigs/matdemo                243
./tessconfigs/msdemo                 368
./tessconfigs/nobatch                  1
./tessconfigs/segdemo                295

bertsky commented 1 year ago

Thanks @jbarth-ubhd for the detailed report and analysis.

Simple reason: osd.traineddata is missing. It used to get installed; checking why it no longer is.
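
A quick pre-flight check along these lines could catch the problem before the processor starts. This is only a sketch: `missing_models` is a hypothetical helper, not part of OCR-D or tesserocr, and it assumes the deskew step needs an `osd` model in the top level of the tessdata directory, as the diagnosis above suggests.

```python
from pathlib import Path
from typing import Iterable, List


def missing_models(tessdata_dir: str, required: Iterable[str] = ("osd",)) -> List[str]:
    """Return the required model names that have no <name>.traineddata file.

    Hypothetical helper for illustration; only the top level of
    tessdata_dir is inspected, subdirectories such as script/ are ignored.
    """
    root = Path(tessdata_dir)
    return [
        name
        for name in required
        if not (root / f"{name}.traineddata").is_file()
    ]
```

For the directory listing in the report, which has deu, eng, fra and two Fraktur models but no osd.traineddata, the default call would return `['osd']`.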

bertsky commented 1 year ago

Got it!

https://github.com/OCR-D/ocrd_all/blob/dab852a64180e679528e24709f650334ab968cd5/Makefile#L719

… must now be $(VIRTUAL_ENV)/share/ocrd-resources/ocrd-tesserocr-recognize.

So we have a mismatch between the install-time location and the runtime/resmgr location.

bertsky commented 1 year ago

> must now be $(VIRTUAL_ENV)/share/ocrd-resources/ocrd-tesserocr-recognize.

No, that would not work either, because we use configure --prefix=$(VIRTUAL_ENV), so Tesseract will be compiled to look for its models in $(VIRTUAL_ENV)/share/tessdata.

Rather, there was a superfluous environment variable override: https://github.com/OCR-D/ocrd_all/blob/dab852a64180e679528e24709f650334ab968cd5/Dockerfile#L47
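
The mismatch can be illustrated with a small sketch of how the effective model directory is chosen. It assumes the usual Tesseract behaviour that a set TESSDATA_PREFIX takes precedence over the location compiled in via configure --prefix; `effective_tessdata_dir` is a hypothetical name, and this is not how ocrd_all or Tesseract implement the lookup internally.

```python
from pathlib import PurePosixPath
from typing import Mapping


def effective_tessdata_dir(env: Mapping[str, str], install_prefix: str) -> str:
    """Return the directory where models would be looked up.

    Assumption for illustration: a non-empty TESSDATA_PREFIX wins;
    otherwise the compiled-in default <install_prefix>/share/tessdata
    is used.
    """
    override = env.get("TESSDATA_PREFIX")
    if override:
        return str(PurePosixPath(override))
    return str(PurePosixPath(install_prefix) / "share" / "tessdata")
```

If the variable is set while models are being installed but points somewhere else (or is unset) at runtime, the two calls return different directories, which is the kind of install-time versus runtime mismatch described above.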

jbarth-ubhd commented 1 year ago

Just wanted to check ocrd resmgr list-available on my workstation (Ubuntu 20.04 with Docker; Docker pulled a lot of files for ocrd/all):

jb@pers16:~> alias docker_ocrd
alias docker_ocrd='sudo docker run --user $(id -u) --workdir /data --volume $PWD/data:/data --volume $PWD/models:/usr/local/share/ocrd-resources ocrd/all'

jb@pers16:~> docker_ocrd ocrd resmgr list-available
Traceback (most recent call last):
  File "/usr/local/bin/ocrd", line 33, in <module>
    sys.exit(load_entry_point('ocrd', 'console_scripts', 'ocrd')())
  File "/usr/local/lib/python3.6/dist-packages/click/core.py", line 1128, in __call__
    return self.main(*args, **kwargs)
  File "/usr/local/lib/python3.6/dist-packages/click/core.py", line 1053, in main
    rv = self.invoke(ctx)
  File "/usr/local/lib/python3.6/dist-packages/click/core.py", line 1659, in invoke
    return _process_result(sub_ctx.command.invoke(sub_ctx))
  File "/usr/local/lib/python3.6/dist-packages/click/core.py", line 1659, in invoke
    return _process_result(sub_ctx.command.invoke(sub_ctx))
  File "/usr/local/lib/python3.6/dist-packages/click/core.py", line 1395, in invoke
    return ctx.invoke(self.callback, **ctx.params)
  File "/usr/local/lib/python3.6/dist-packages/click/core.py", line 754, in invoke
    return __callback(*args, **kwargs)
  File "/build/core/ocrd/ocrd/cli/resmgr.py", line 47, in list_available
    resmgr = OcrdResourceManager()
  File "/build/core/ocrd/ocrd/resource_manager.py", line 34, in __init__
    self.user_list.parent.mkdir(parents=True)
  File "/usr/lib/python3.6/pathlib.py", line 1248, in mkdir
    self._accessor.mkdir(self, mode)
  File "/usr/lib/python3.6/pathlib.py", line 387, in wrapped
    return strfunc(str(pathobj), *args)
PermissionError: [Errno 13] Permission denied: '/.config/ocrd'

jbarth-ubhd commented 1 year ago

ah... with --volume $PWD/.config:/.config it works

jb@pers16:~> sudo docker run --user $(id -u) --workdir /data --volume $PWD/data:/data --volume $PWD/models:/usr/local/share/ocrd-resources --volume $PWD/.config:/.config ocrd/all ocrd resmgr list-available
ocrd-tesserocr-recognize
- Fraktur_GT4HistOCR.traineddata  (https://ub-backup.bib.uni-mannheim.de/~stweil/ocrd-train/data/Fraktur_5000000/tessdata_fast/Fraktur_50000000.334_450937.traineddata)
  Tesseract LSTM model trained on GT4HistOCR
- ONB.traineddata  (https://ub-backup.bib.uni-mannheim.de/~stweil/ocrd-train/data/ONB/tessdata_best/ONB_1.195_300718_989100.traineddata)
  Tesseract LSTM model based on Austrian National Library newspaper data
- equ.traineddata  (https://github.com/tesseract-ocr/tessdata_fast/raw/main/equ.traineddata)
  Tesseract equ model
...

jbarth-ubhd commented 1 year ago

... almost

jb@pers16:~> docker_ocrd ocrd resmgr download ocrd-tesserocr-recognize configs
12:30:17.190 INFO ocrd.cli.resmgr - Downloading resource {'url': 'https://github.com/tesseract-ocr/tesseract/archive/main.tar.gz', 'name': 'configs', 'description': 'Tesseract configs (parameter sets) for use with the standalone tesseract CLI', 'size': 1915529, 'type': 'tarball', 'path_in_archive': 'tesseract-main/tessdata/configs', 'parameter_usage': 'as-is', 'version_range': '>= 0.0.1'}
12:30:17.193 INFO ocrd.resource_manager._download_impl - Downloading https://github.com/tesseract-ocr/tesseract/archive/main.tar.gz to download.tar.xx
Traceback (most recent call last):
  File "/usr/local/lib/python3.6/dist-packages/urllib3/connection.py", line 175, in _new_conn
    (self._dns_host, self.port), self.timeout, **extra_kw
  File "/usr/local/lib/python3.6/dist-packages/urllib3/util/connection.py", line 72, in create_connection
    for res in socket.getaddrinfo(host, port, family, socket.SOCK_STREAM):
  File "/usr/lib/python3.6/socket.py", line 745, in getaddrinfo
    for res in _socket.getaddrinfo(host, port, family, type, proto, flags):
socket.gaierror: [Errno -3] Temporary failure in name resolution
...

Is this caused by my Ubuntu 20.04 setup, which has dnsmasq in NetworkManager.conf?

root@pers16:/home/jb# cat /etc/NetworkManager/NetworkManager.conf
[main]
plugins=ifupdown,keyfile,ofono
dns=dnsmasq

no-auto-default=00:01:02:12:40:C5,00:21:9B:5E:BE:17,90:1B:0E:42:7D:AE,

[ifupdown]
managed=false

jbarth-ubhd commented 1 year ago

sudo docker run --dns A.B.C.D ... helped.
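
To tell a DNS problem apart from other download failures, name resolution can be tested on its own inside the container. The following is a minimal sketch; `can_resolve` is a hypothetical helper (not part of OCR-D), and the `resolver` parameter exists only so the check can be exercised without network access.

```python
import socket
from typing import Callable


def can_resolve(host: str, port: int = 443,
                resolver: Callable = socket.getaddrinfo) -> bool:
    """Return True if host resolves to at least one address.

    socket.gaierror is the exception seen in the traceback above
    ("Temporary failure in name resolution").
    """
    try:
        return bool(resolver(host, port))
    except socket.gaierror:
        return False
```

Running `can_resolve("github.com")` with the container's Python, once with and once without the --dns option, would show whether the option is what makes the difference.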

jbarth-ubhd commented 1 year ago

BTW no osd.traineddata in ~/models/ocrd-tesserocr-recognize/

bertsky commented 1 year ago

> ah... with --volume $PWD/.config:/.config it works

yes, sorry, we forgot to document this on https://ocr-d.de/en/models#models-and-docker

now tracking under https://github.com/OCR-D/ocrd-website/issues/318
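
Why the path in the error is '/.config/ocrd' can be sketched as follows. This assumes the usual XDG convention (XDG_CONFIG_HOME if set, otherwise $HOME/.config) and that HOME is "/" or unset for a user ID without a passwd entry in the container; `ocrd_config_dir` is a hypothetical function, not the actual OCR-D code.

```python
from pathlib import PurePosixPath
from typing import Mapping


def ocrd_config_dir(env: Mapping[str, str]) -> str:
    """Return the per-user config directory under XDG-style rules.

    Assumption for illustration: XDG_CONFIG_HOME wins if set and
    non-empty; otherwise <HOME>/.config is used, with "/" as the
    fallback when HOME is missing or empty.
    """
    xdg = env.get("XDG_CONFIG_HOME")
    if xdg:
        base = PurePosixPath(xdg)
    else:
        base = PurePosixPath(env.get("HOME") or "/") / ".config"
    return str(base / "ocrd")
```

Under these assumptions, either mounting a writable volume at the resolved path (as done in the thread) or pointing HOME or XDG_CONFIG_HOME at a writable directory should avoid the PermissionError.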

bertsky commented 1 year ago

> BTW no osd.traineddata in ~/models/ocrd-tesserocr-recognize/

As I said above (see the PR with the fix), TESSDATA_PREFIX must not be set at install time (make all or make install-tesseract).

bertsky commented 1 year ago

> sudo docker run --dns A.B.C.D ... helped.

I remember seeing this problem before. It also happens at build time (docker build). You can also try --network=host or --network=bridge.

jbarth-ubhd commented 1 year ago

sniff (from the German "schnief")