OCR-D / ocrd_all

Master repository which includes most other OCR-D repositories as submodules
MIT License

/models not working #394

Closed kba closed 4 months ago

kba commented 1 year ago
Unfortunately, I cannot confirm that this works...
SIF_PATH="/scratch1/users/${USER}/ocrd_all_maximum_image.sif"
OCRD_MODELS_DIR="/scratch1/users/${USER}/ocrd_models"
OCRD_MODELS_DIR_IN_DOCKER="/models"

singularity exec --bind "${OCRD_MODELS_DIR}:${OCRD_MODELS_DIR_IN_DOCKER}" "${SIF_PATH}" ocrd resmgr download ocrd-tesserocr-recognize '*'

This does not work. However, if I do it the old way, with OCRD_MODELS_DIR_IN_DOCKER="/usr/local/share", then it works, but the bind mount effectively overwrites that path, which leads to other issues such as https://github.com/OCR-D/ocrd_olena/issues/87

_Originally posted by @MehmedGIT in https://github.com/OCR-D/ocrd_all/issues/380#issuecomment-1768626251_

MehmedGIT commented 1 year ago

To clarify: I can still use /models for other recognizers. Only tesserocr is problematic in my case.

kba commented 1 year ago

Related https://github.com/OCR-D/ocrd_tesserocr/issues/195

jbarth-ubhd commented 9 months ago

Me too:

[xxx@o05i14 ~]$ singularity exec --bind /tmp:/tmp --bind $HOME/ocrd_models:/models ocrd.sif ocrd resmgr download ocrd-tesserocr-recognize frak2021.traineddata
16:32:48.299 INFO ocrd.cli.resmgr - Downloading registered resource 'frak2021.traineddata' (https://ub-backup.bib.uni-mannheim.de/~stweil/tesstrain/frak2021/tessdata_best/frak2021-0.905.traineddata)
  [------------------------------------]    0%
16:32:50.718 INFO ocrd.cli.resmgr - [Errno 17] File exists: '/usr/local/share/tessdata'
16:32:50.718 INFO ocrd.cli.resmgr - Use in parameters as 'frak2021'

[xxx@o05i14 ~]$ singularity exec ocrd.sif ls -l /usr/local/share/tessdata
lrwxrwxrwx 1 root root 56 Feb 20 18:18 /usr/local/share/tessdata -> /usr/local/share/ocrd-resources/ocrd-tesserocr-recognize

[xxx@o05i14 ~]$ singularity exec ocrd.sif ls -l /usr/local/share/ocrd-resources
lrwxrwxrwx 1 root root 7 Feb 20 18:18 /usr/local/share/ocrd-resources -> /models

jbarth-ubhd commented 9 months ago

sbb-binarize works:

[xxx@o05i14 ~]$ singularity exec --bind $HOME/ocrd_models:/models ocrd.sif ocrd resmgr download ocrd-sbb-binarize default-2021-03-09
16:41:02.869 INFO ocrd.cli.resmgr - Downloading registered resource 'default-2021-03-09' (https://github.com/qurator-spk/sbb_binarization/releases/download/v0.0.11/saved_model_2021_03_09.zip)
  [------------------------------------]    0%16:41:06.202 INFO ocrd.resource_manager._download_impl - Downloading https://github.com/qurator-spk/sbb_binarization/releases/download/v0.0.11/saved_model_2021_03_09.zip to download.tar.xx
  [####################################]  100%          16:41:07.722 INFO ocrd.resource_manager.download - Extracting application/zip archive to /tmp/tmpcwi79zat/out
16:41:08.534 INFO ocrd.resource_manager.download - Copying '.' from archive to /usr/local/share/ocrd-resources/ocrd-sbb-binarize/default-2021-03-09
16:41:08.698 INFO ocrd.cli.resmgr - Installed resource https://github.com/qurator-spk/sbb_binarization/releases/download/v0.0.11/saved_model_2021_03_09.zip under /usr/local/share/ocrd-resources/ocrd-sbb-binarize/default-2021-03-09
16:41:08.698 INFO ocrd.cli.resmgr - Use in parameters as 'default-2021-03-09'

bertsky commented 5 months ago

I think I now know what's going on:

https://github.com/OCR-D/ocrd_all/blob/56507f1f89fcef43eab7daac06cc8ef6143aee21/Dockerfile#L136-L142

So what is supposed to happen here is that we end up with /usr/local/share/tessdata -> /usr/local/share/ocrd-resources/ocrd-tesserocr-recognize with the preinstalled models.

However, since we recently switched to a staged build (core → minimum → medium → maximum), we now enter the above castling move over and over again. That means:

  1. the mv in L141 will not create /usr/local/share/ocrd-resources/ocrd-tesserocr-recognize, but place tessdata (which by now is merely a symlink) beneath its own target, leading to an infinite chain /usr/local/share/ocrd-resources/ocrd-tesserocr-recognize/tessdata/tessdata/tessdata/tessdata/...
  2. the ln in L142 will mirror that to /usr/local/share/tessdata/tessdata/tessdata/...
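
A minimal sketch of how the nesting arises, assuming the two Dockerfile lines amount to a plain mv followed by ln -s (the actual lines are only linked, not quoted, in this thread). The scratch paths, the `castle` function name and the `eng.traineddata` file are stand-ins for illustration:

```shell
#!/bin/sh
# Reproduce the repeated "castling move" in a scratch directory.
set -eu

root=$(mktemp -d)
share="$root/share"              # stand-in for /usr/local/share
res="$share/ocrd-resources"      # stand-in for /usr/local/share/ocrd-resources
mkdir -p "$share/tessdata" "$res"
touch "$share/tessdata/eng.traineddata"

# Assumed shape of the move: relocate the directory, leave a symlink behind.
castle() {
    mv "$share/tessdata" "$res/ocrd-tesserocr-recognize"
    ln -s "$res/ocrd-tesserocr-recognize" "$share/tessdata"
}

castle   # first stage: tessdata becomes a symlink to the new location
castle   # second stage: the target already exists, so mv drops the
         # symlink inside it, and ln creates a fresh symlink outside

# The target directory now contains a 'tessdata' link pointing at itself.
ls -l "$res/ocrd-tesserocr-recognize"
```

After the second call, the listing should show a `tessdata` symlink sitting next to the model file inside the target directory, which is why paths of the form tessdata/tessdata/... resolve at any depth.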

At runtime, the OCR-D resource manager will choke on this structure:

10:20:50.776 INFO ocrd.resource_manager - ocrd-tesserocr-recognize resource 'tessdata' (/usr/local/share/tessdata/tessdata/tessdata/tessdata/tessdata/tessdata/tessdata/tessdata/tessdata/tessdata/tessdata/tessdata/tessdata/tessdata/tessdata/tessdata/tessdata/tessdata/tessdata/tessdata/tessdata) not a known resource, creating stub in /usr/local/share/ocrd-resources/ocrd/resources.yml'
Traceback (most recent call last):
  File "/usr/local/bin/ocrd", line 8, in <module>
    sys.exit(cli())
  File "/usr/local/lib/python3.8/site-packages/click/core.py", line 1157, in __call__
    return self.main(*args, **kwargs)
  File "/usr/local/lib/python3.8/site-packages/click/core.py", line 1078, in main
    rv = self.invoke(ctx)
  File "/usr/local/lib/python3.8/site-packages/click/core.py", line 1688, in invoke
    return _process_result(sub_ctx.command.invoke(sub_ctx))
  File "/usr/local/lib/python3.8/site-packages/click/core.py", line 1688, in invoke
    return _process_result(sub_ctx.command.invoke(sub_ctx))
  File "/usr/local/lib/python3.8/site-packages/click/core.py", line 1434, in invoke
    return ctx.invoke(self.callback, **ctx.params)
  File "/usr/local/lib/python3.8/site-packages/click/core.py", line 783, in invoke
    return __callback(*args, **kwargs)
  File "/usr/local/lib/python3.8/site-packages/ocrd/cli/resmgr.py", line 64, in list_installed
    for executable, reslist in resmgr.list_installed(executable):
  File "/usr/local/lib/python3.8/site-packages/ocrd/resource_manager.py", line 168, in list_installed
    resdict = self.add_to_user_database(this_executable, res_filename, resource_type=res_type)
  File "/usr/local/lib/python3.8/site-packages/ocrd/resource_manager.py", line 183, in add_to_user_database
    res_size = Path(res_filename).stat().st_size
  File "/usr/lib/python3.8/pathlib.py", line 1198, in stat
    return self._accessor.stat(self)
OSError: [Errno 40] Too many levels of symbolic links: '/usr/local/share/tessdata/tessdata/tessdata/tessdata/tessdata/tessdata/tessdata/tessdata/tessdata/tessdata/tessdata/tessdata/tessdata/tessdata/tessdata/tessdata/tessdata/tessdata/tessdata/tessdata/tessdata'

The same goes for https://github.com/OCR-D/ocrd_all/blob/56507f1f89fcef43eab7daac06cc8ef6143aee21/Dockerfile#L144-L148 which, due to the repetition, creates a loop /models/ocrd-resources/ocrd-resources/ocrd-resources/ocrd-resources/....

So IMO what needs to be done in both cases is checking whether the target already exists.
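
One way such a check could look, as a rough sketch rather than the actual fix: `relocate_once` is a hypothetical helper name, and the scratch paths only stand in for the real ones. The idea is to skip the move when the source is already a symlink, and to refuse when the target already exists.

```shell
#!/bin/sh
set -eu

# Hypothetical helper: move SRC to DST and leave a symlink at SRC,
# but only if an earlier build stage has not already done so.
relocate_once() {
    src=$1
    dst=$2
    if [ -L "$src" ]; then
        return 0    # already a symlink: nothing left to do
    fi
    if [ -e "$dst" ]; then
        echo "refusing to move: $dst already exists" >&2
        return 1
    fi
    mkdir -p "$(dirname "$dst")"
    mv "$src" "$dst"
    ln -s "$dst" "$src"
}

# Demo in a scratch directory (stand-in paths).
root=$(mktemp -d)
mkdir -p "$root/share/tessdata"
touch "$root/share/tessdata/eng.traineddata"

target="$root/share/ocrd-resources/ocrd-tesserocr-recognize"
relocate_once "$root/share/tessdata" "$target"   # performs the move
relocate_once "$root/share/tessdata" "$target"   # no-op on repetition
```

With the guard in place, repeating the step in a later stage leaves the directory layout unchanged instead of nesting a symlink inside its own target.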

MehmedGIT commented 5 months ago

Executing:

singularity exec ocrd_all_maximum_image.sif ls -la /models

Outputs:

drwxrwxrwx  5 root   root 111 Jul  1 17:49 .
drwxr-xr-x 26 u11874 GWDG 780 Jul  3 13:24 ..
drwxr-xr-x  2 root   root   3 Jul  1 17:45 matplotlib
drwxr-xr-x  2 root   root  36 Jul  1 17:18 ocrd
lrwxrwxrwx  1 root   root   7 Jul  1 17:38 ocrd-resources -> /models
drwxr-xr-x  4 root   root 161 Jul  1 17:49 ocrd-tesserocr-recognize

When I bind-mount my models directory to /models, could it be that the processor still expects tessdata instead of ocrd-tesserocr-recognize?

I have the latest ocrd_all image from 01.07., but I am unsure whether something is off, because core --version also prints 2.64.0 instead of 2.66.1 on my end.