Closed jbarth-ubhd closed 1 year ago
Thanks @jbarth-ubhd for the detailed report and analysis.
Simple reason: osd.traineddata is missing. Used to get installed – checking why not.
Got it!
https://github.com/OCR-D/ocrd_all/blob/dab852a64180e679528e24709f650334ab968cd5/Makefile#L719
… must now be $(VIRTUAL_ENV)/share/ocrd-resources/ocrd-tesserocr-recognize.
So we have a mismatch between the install-time location and the runtime/resmgr location.
must now be $(VIRTUAL_ENV)/share/ocrd-resources/ocrd-tesserocr-recognize.
No, that would not work either, because we use configure --prefix=$(VIRTUAL_ENV), so Tesseract will be compiled to use share/tessdata under that prefix.
Rather, there was a superfluous environment variable override: https://github.com/OCR-D/ocrd_all/blob/dab852a64180e679528e24709f650334ab968cd5/Dockerfile#L47
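To see which of the two locations actually holds a model, one could check both directories under the install prefix. A minimal sketch, assuming only the two paths named in this thread (share/tessdata and share/ocrd-resources/ocrd-tesserocr-recognize); the helper name find_model is hypothetical and not part of ocrd or Tesseract:

```python
from pathlib import Path

# The two candidate locations discussed in this thread, relative to the
# install prefix (e.g. $(VIRTUAL_ENV) in the Makefile).
CANDIDATE_DIRS = (
    "share/tessdata",
    "share/ocrd-resources/ocrd-tesserocr-recognize",
)


def find_model(prefix, name="osd.traineddata", candidates=CANDIDATE_DIRS):
    """Return the candidate directories under prefix that contain the model.

    Hypothetical helper for illustration only.
    """
    prefix = Path(prefix)
    return [str(prefix / d) for d in candidates if (prefix / d / name).is_file()]
```

An empty result for the prefix used at runtime would match the symptom reported here (osd.traineddata installed nowhere the runtime looks).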
Just wanted to check ocrd resmgr list-available on my workstation (Ubuntu 20.04, Docker; docker pulled a lot of files for ocrd/all):
jb@pers16:~> alias docker_ocrd
alias docker_ocrd='sudo docker run --user $(id -u) --workdir /data --volume $PWD/data:/data --volume $PWD/models:/usr/local/share/ocrd-resources ocrd/all'
jb@pers16:~> docker_ocrd ocrd resmgr list-available
Traceback (most recent call last):
File "/usr/local/bin/ocrd", line 33, in <module>
sys.exit(load_entry_point('ocrd', 'console_scripts', 'ocrd')())
File "/usr/local/lib/python3.6/dist-packages/click/core.py", line 1128, in __call__
return self.main(*args, **kwargs)
File "/usr/local/lib/python3.6/dist-packages/click/core.py", line 1053, in main
rv = self.invoke(ctx)
File "/usr/local/lib/python3.6/dist-packages/click/core.py", line 1659, in invoke
return _process_result(sub_ctx.command.invoke(sub_ctx))
File "/usr/local/lib/python3.6/dist-packages/click/core.py", line 1659, in invoke
return _process_result(sub_ctx.command.invoke(sub_ctx))
File "/usr/local/lib/python3.6/dist-packages/click/core.py", line 1395, in invoke
return ctx.invoke(self.callback, **ctx.params)
File "/usr/local/lib/python3.6/dist-packages/click/core.py", line 754, in invoke
return __callback(*args, **kwargs)
File "/build/core/ocrd/ocrd/cli/resmgr.py", line 47, in list_available
resmgr = OcrdResourceManager()
File "/build/core/ocrd/ocrd/resource_manager.py", line 34, in __init__
self.user_list.parent.mkdir(parents=True)
File "/usr/lib/python3.6/pathlib.py", line 1248, in mkdir
self._accessor.mkdir(self, mode)
File "/usr/lib/python3.6/pathlib.py", line 387, in wrapped
return strfunc(str(pathobj), *args)
PermissionError: [Errno 13] Permission denied: '/.config/ocrd'
ah... with --volume $PWD/.config:/.config it works
jb@pers16:~> sudo docker run --user $(id -u) --workdir /data --volume $PWD/data:/data --volume $PWD/models:/usr/local/share/ocrd-resources --volume $PWD/.config:/.config ocrd/all ocrd resmgr list-available
ocrd-tesserocr-recognize
- Fraktur_GT4HistOCR.traineddata (https://ub-backup.bib.uni-mannheim.de/~stweil/ocrd-train/data/Fraktur_5000000/tessdata_fast/Fraktur_50000000.334_450937.traineddata)
Tesseract LSTM model trained on GT4HistOCR
- ONB.traineddata (https://ub-backup.bib.uni-mannheim.de/~stweil/ocrd-train/data/ONB/tessdata_best/ONB_1.195_300718_989100.traineddata)
Tesseract LSTM model based on Austrian National Library newspaper data
- equ.traineddata (https://github.com/tesseract-ocr/tessdata_fast/raw/main/equ.traineddata)
Tesseract equ model
...
... almost
jb@pers16:~> docker_ocrd ocrd resmgr download ocrd-tesserocr-recognize configs
12:30:17.190 INFO ocrd.cli.resmgr - Downloading resource {'url': 'https://github.com/tesseract-ocr/tesseract/archive/main.tar.gz', 'name': 'configs', 'description': 'Tesseract configs (parameter sets) for use with the standalone tesseract CLI', 'size': 1915529, 'type': 'tarball', 'path_in_archive': 'tesseract-main/tessdata/configs', 'parameter_usage': 'as-is', 'version_range': '>= 0.0.1'}
12:30:17.193 INFO ocrd.resource_manager._download_impl - Downloading https://github.com/tesseract-ocr/tesseract/archive/main.tar.gz to download.tar.xx
Traceback (most recent call last):
File "/usr/local/lib/python3.6/dist-packages/urllib3/connection.py", line 175, in _new_conn
(self._dns_host, self.port), self.timeout, **extra_kw
File "/usr/local/lib/python3.6/dist-packages/urllib3/util/connection.py", line 72, in create_connection
for res in socket.getaddrinfo(host, port, family, socket.SOCK_STREAM):
File "/usr/lib/python3.6/socket.py", line 745, in getaddrinfo
for res in _socket.getaddrinfo(host, port, family, type, proto, flags):
socket.gaierror: [Errno -3] Temporary failure in name resolution
...
Is this caused by my Ubuntu 20.04 setup, with dnsmasq in NetworkManager.conf?
root@pers16:/home/jb# cat /etc/NetworkManager/NetworkManager.conf
[main]
plugins=ifupdown,keyfile,ofono
dns=dnsmasq
no-auto-default=00:01:02:12:40:C5,00:21:9B:5E:BE:17,90:1B:0E:42:7D:AE,
[ifupdown]
managed=false
sudo docker run --dns A.B.C.D ... helped.
BTW no osd.traineddata in ~/models/ocrd-tesserocr-recognize/
ah... with --volume $PWD/.config:/.config it works
yes, sorry, we forgot to document this on https://ocr-d.de/en/models#models-and-docker
now tracking under https://github.com/OCR-D/ocrd-website/issues/318
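For context, the /.config path in the traceback is what one would get if the config directory is derived from HOME and HOME is / (presumably the case for a --user UID without a passwd entry in the image). A minimal sketch, assuming the usual XDG convention (XDG_CONFIG_HOME, else HOME/.config); the function name is hypothetical and the actual logic in ocrd's resource_manager.py may differ:

```python
import os
from pathlib import Path


def ocrd_config_dir(env=None):
    """Guess the per-user config directory following the XDG convention.

    Hypothetical helper; not the actual ocrd implementation.
    """
    env = os.environ if env is None else env
    # Prefer XDG_CONFIG_HOME; otherwise fall back to HOME/.config.
    base = env.get("XDG_CONFIG_HOME") or str(Path(env.get("HOME", "/")) / ".config")
    return Path(base) / "ocrd"
```

Under that assumption, pointing HOME or XDG_CONFIG_HOME at a writable mounted directory (via docker run --env) might be an alternative to the extra --volume; this is untested here.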
BTW no osd.traineddata in ~/models/ocrd-tesserocr-recognize/
like I said above (see PR with fix), TESSDATA_PREFIX must not be set at install time (make all or make install-tesseract).
sudo docker run --dns A.B.C.D ... helped.
I remember seeing this problem before. It also happens at build time (docker build). You can also try with --network=host or --network=bridge.
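To tell a DNS problem inside the container apart from a download problem, a quick check with the same call that fails in the traceback (socket.getaddrinfo) can help. A minimal sketch; can_resolve and its resolver parameter are illustrative names, not part of ocrd:

```python
import socket


def can_resolve(host, port=443, resolver=socket.getaddrinfo):
    """Return True if host resolves, False on a name-resolution error.

    The resolver argument exists only so the check can be exercised
    without network access; by default the real resolver is used.
    """
    try:
        resolver(host, port)
        return True
    except socket.gaierror:
        return False
```

Running this for github.com inside the container, with and without --dns or --network=host, should show whether name resolution is the culprit.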
sniff (German: "schnief")
PS: ocrd.sif is from ocrd/all:2022-08-15
but the directory does contain these files: