Closed (kba closed this 4 months ago)
To clarify: I can still use /models without problems for other recognizers. Only tesserocr is problematic in my case.
Me too:
[xxx@o05i14 ~]$ singularity exec --bind /tmp:/tmp --bind $HOME/ocrd_models:/models ocrd.sif ocrd resmgr download ocrd-tesserocr-recognize frak2021.traineddata
16:32:48.299 INFO ocrd.cli.resmgr - Downloading registered resource 'frak2021.traineddata' (https://ub-backup.bib.uni-mannheim.de/~stweil/tesstrain/frak2021/tessdata_best/frak2021-0.905.traineddata)
[------------------------------------] 0%
16:32:50.718 INFO ocrd.cli.resmgr - [Errno 17] File exists: '/usr/local/share/tessdata'
16:32:50.718 INFO ocrd.cli.resmgr - Use in parameters as 'frak2021'
[xxx@o05i14 ~]$ singularity exec ocrd.sif ls -l /usr/local/share/tessdata
lrwxrwxrwx 1 root root 56 Feb 20 18:18 /usr/local/share/tessdata -> /usr/local/share/ocrd-resources/ocrd-tesserocr-recognize
[xxx@o05i14 ~]$ singularity exec ocrd.sif ls -l /usr/local/share/ocrd-resources
lrwxrwxrwx 1 root root 7 Feb 20 18:18 /usr/local/share/ocrd-resources -> /models
sbb-binarize works:
[xxx@o05i14 ~]$ singularity exec --bind $HOME/ocrd_models:/models ocrd.sif ocrd resmgr download ocrd-sbb-binarize default-2021-03-09
16:41:02.869 INFO ocrd.cli.resmgr - Downloading registered resource 'default-2021-03-09' (https://github.com/qurator-spk/sbb_binarization/releases/download/v0.0.11/saved_model_2021_03_09.zip)
[------------------------------------] 0%
16:41:06.202 INFO ocrd.resource_manager._download_impl - Downloading https://github.com/qurator-spk/sbb_binarization/releases/download/v0.0.11/saved_model_2021_03_09.zip to download.tar.xx
[####################################] 100%
16:41:07.722 INFO ocrd.resource_manager.download - Extracting application/zip archive to /tmp/tmpcwi79zat/out
16:41:08.534 INFO ocrd.resource_manager.download - Copying '.' from archive to /usr/local/share/ocrd-resources/ocrd-sbb-binarize/default-2021-03-09
16:41:08.698 INFO ocrd.cli.resmgr - Installed resource https://github.com/qurator-spk/sbb_binarization/releases/download/v0.0.11/saved_model_2021_03_09.zip under /usr/local/share/ocrd-resources/ocrd-sbb-binarize/default-2021-03-09
16:41:08.698 INFO ocrd.cli.resmgr - Use in parameters as 'default-2021-03-09'
I think I now know what's going on:
https://github.com/OCR-D/ocrd_all/blob/56507f1f89fcef43eab7daac06cc8ef6143aee21/Dockerfile#L136-L142
What is supposed to happen here is that we end up with /usr/local/share/tessdata -> /usr/local/share/ocrd-resources/ocrd-tesserocr-recognize, with the preinstalled models.
However, since we recently switched to a staged build (core → minimum → medium → maximum), we now enter the above castling move over and over again. This means:
- mv in L141 will not create /usr/local/share/ocrd-resources/ocrd-tesserocr-recognize, but place tessdata (which by now is merely a symlink) beneath its own target, leading to an infinite chain /usr/local/share/ocrd-resources/ocrd-tesserocr-recognize/tessdata/tessdata/tessdata/tessdata/...
- ln in L142 will mirror that to /usr/local/share/tessdata/tessdata/tessdata/...
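The repeated move can be reproduced outside the image. The following is only a minimal sketch in a temporary sandbox, assuming the step boils down to an mv followed by an ln -s as described above; the castle function and the $demo directory are made up for illustration and are not the literal Dockerfile lines.

```shell
#!/bin/sh
# Hypothetical sandbox standing in for /usr/local/share in the image.
demo=$(mktemp -d)
share="$demo/share"
res="$share/ocrd-resources/ocrd-tesserocr-recognize"
mkdir -p "$share/tessdata" "$share/ocrd-resources"

# Unguarded "castling" step (an assumption about what the Dockerfile does).
castle() {
    mv "$share/tessdata" "$res"
    ln -s "$res" "$share/tessdata"
}

castle   # 1st stage: tessdata is a real directory and gets renamed to $res
castle   # 2nd stage: tessdata is a symlink and $res exists, so mv drops the link inside $res

# The moved link now sits inside the directory it points to.
ls -l "$res"
readlink "$res/tessdata"
```

After the second call, every additional /tessdata path component resolves to the same directory again, which is the chain seen in the paths above.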
At runtime, the OCR-D resource manager will choke on this structure:
10:20:50.776 INFO ocrd.resource_manager - ocrd-tesserocr-recognize resource 'tessdata' (/usr/local/share/tessdata/tessdata/tessdata/tessdata/tessdata/tessdata/tessdata/tessdata/tessdata/tessdata/tessdata/tessdata/tessdata/tessdata/tessdata/tessdata/tessdata/tessdata/tessdata/tessdata/tessdata) not a known resource, creating stub in /usr/local/share/ocrd-resources/ocrd/resources.yml'
Traceback (most recent call last):
  File "/usr/local/bin/ocrd", line 8, in <module>
    sys.exit(cli())
  File "/usr/local/lib/python3.8/site-packages/click/core.py", line 1157, in __call__
    return self.main(*args, **kwargs)
  File "/usr/local/lib/python3.8/site-packages/click/core.py", line 1078, in main
    rv = self.invoke(ctx)
  File "/usr/local/lib/python3.8/site-packages/click/core.py", line 1688, in invoke
    return _process_result(sub_ctx.command.invoke(sub_ctx))
  File "/usr/local/lib/python3.8/site-packages/click/core.py", line 1688, in invoke
    return _process_result(sub_ctx.command.invoke(sub_ctx))
  File "/usr/local/lib/python3.8/site-packages/click/core.py", line 1434, in invoke
    return ctx.invoke(self.callback, **ctx.params)
  File "/usr/local/lib/python3.8/site-packages/click/core.py", line 783, in invoke
    return __callback(*args, **kwargs)
  File "/usr/local/lib/python3.8/site-packages/ocrd/cli/resmgr.py", line 64, in list_installed
    for executable, reslist in resmgr.list_installed(executable):
  File "/usr/local/lib/python3.8/site-packages/ocrd/resource_manager.py", line 168, in list_installed
    resdict = self.add_to_user_database(this_executable, res_filename, resource_type=res_type)
  File "/usr/local/lib/python3.8/site-packages/ocrd/resource_manager.py", line 183, in add_to_user_database
    res_size = Path(res_filename).stat().st_size
  File "/usr/lib/python3.8/pathlib.py", line 1198, in stat
    return self._accessor.stat(self)
OSError: [Errno 40] Too many levels of symbolic links: '/usr/local/share/tessdata/tessdata/tessdata/tessdata/tessdata/tessdata/tessdata/tessdata/tessdata/tessdata/tessdata/tessdata/tessdata/tessdata/tessdata/tessdata/tessdata/tessdata/tessdata/tessdata/tessdata'
The same goes for https://github.com/OCR-D/ocrd_all/blob/56507f1f89fcef43eab7daac06cc8ef6143aee21/Dockerfile#L144-L148, which due to the repetition creates a loop /models/ocrd-resources/ocrd-resources/ocrd-resources/ocrd-resources/...
So IMO what needs to be done in both cases is to check whether the target already exists.
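One way such a check could look is sketched below, again in a temporary sandbox rather than the real image. The castle_guarded function, the $demo directory and the exact conditions are assumptions for illustration, not the fix that was actually merged.

```shell
#!/bin/sh
# Hypothetical sandbox standing in for /usr/local/share in the image.
demo=$(mktemp -d)
share="$demo/share"
res="$share/ocrd-resources/ocrd-tesserocr-recognize"
mkdir -p "$share/tessdata" "$share/ocrd-resources"

# Guarded variant: only move and link while tessdata is still a real
# directory and the target does not exist yet.
castle_guarded() {
    if [ ! -L "$share/tessdata" ] && [ ! -e "$res" ]; then
        mv "$share/tessdata" "$res"
        ln -s "$res" "$share/tessdata"
    fi
}

# Simulate several build stages running the same step.
castle_guarded
castle_guarded
castle_guarded

ls -l "$share" "$res"
```

With the guard in place, repeated stages leave a single symlink at tessdata and no nested tessdata entry inside the resource directory.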
Executing:
singularity exec ocrd_all_maximum_image.sif ls -la /models
Outputs:
drwxrwxrwx 5 root root 111 Jul 1 17:49 .
drwxr-xr-x 26 u11874 GWDG 780 Jul 3 13:24 ..
drwxr-xr-x 2 root root 3 Jul 1 17:45 matplotlib
drwxr-xr-x 2 root root 36 Jul 1 17:18 ocrd
lrwxrwxrwx 1 root root 7 Jul 1 17:38 ocrd-resources -> /models
drwxr-xr-x 4 root root 161 Jul 1 17:49 ocrd-tesserocr-recognize
When I volume map to /models, could it be that the processor still expects tessdata instead of ocrd-tesserocr-recognize?
I have the latest ocrd_all image from 01.07. But I am unsure whether something is odd, because core --version also prints 2.64.0 instead of 2.66.1 on my end.
Does not work. However, if I do it the old way, with
OCRD_MODELS_DIR_IN_DOCKER="/usr/local/share"
then it works, but it effectively overwrites that path, which leads to other issues such as https://github.com/OCR-D/ocrd_olena/issues/87

Originally posted by @MehmedGIT in https://github.com/OCR-D/ocrd_all/issues/380#issuecomment-1768626251