OCR-D / core

Collection of OCR-related python tools and wrappers from @OCR-D
https://ocr-d.de/core/
Apache License 2.0

resmgr list-installed only knows about 3 processors with preconfigured resources #1251

Open bertsky opened 5 days ago

bertsky commented 5 days ago

But surely, such a user database was at least created by that call. And if you did not run any list-available prior to that, then that database would be just a mirror of the distributed ocrd/resource_list.yml (hence only those 3 processors).

So we just uncovered another serious bug: initialisation does not search the PATH for ocrd-* executables, only list-available does. But without these database entries, list-installed never even attempts to look for other executables!

_Originally posted by @bertsky in https://github.com/OCR-D/core/pull/1246#discussion_r1665951656_

bertsky commented 5 days ago

explanation

ResourceManager inits its database from the predistributed ocrd/resource_list.yml: https://github.com/OCR-D/core/blob/79c61e303c87f229d5c96aedc0da31ef82b0f5d3/src/ocrd/resource_manager.py#L42-L47

New database entries only get made by either

  1. list-available (with some executable glob pattern) https://github.com/OCR-D/core/blob/79c61e303c87f229d5c96aedc0da31ef82b0f5d3/src/ocrd/resource_manager.py#L100-L109
  2. list-installed (when explicitly naming the executable) https://github.com/OCR-D/core/blob/79c61e303c87f229d5c96aedc0da31ef82b0f5d3/src/ocrd/resource_manager.py#L168

(So not even a download ensures the respective entry exists!)

However, list-installed only lists models found for processors in the database, plus any found under XDG_DATA_HOME (data location) and /usr/local/share (system location). https://github.com/OCR-D/core/blob/79c61e303c87f229d5c96aedc0da31ef82b0f5d3/src/ocrd/resource_manager.py#L136-L141
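The selection described above can be sketched roughly as follows. This is a simplified illustration, not the code at the linked lines: the function name `executables_considered`, the plain-dict database, and the layout of one subdirectory per executable under each resource directory are all assumptions made for the sketch.

```python
from pathlib import Path


def executables_considered(database, resource_dirs):
    """Return the executables a database-driven listing would look at.

    database: dict mapping executable name -> list of resource entries
    resource_dirs: base directories assumed to hold one subdirectory
                   per executable (hypothetical layout for this sketch)
    """
    # Everything that already has a database entry is considered.
    names = set(database)
    # In addition, anything that has a directory in a resource location.
    for base in map(Path, resource_dirs):
        if base.is_dir():
            names.update(p.name for p in base.iterdir() if p.is_dir())
    return sorted(names)
```

An executable that is installed in PATH but has neither a database entry nor a directory in one of the resource locations never appears in the result, which is the gap this issue is about.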

So it does not cover:

expectation

list-installed * or just list-installed (without a name) should look for all executables in PATH, regardless of existing database entries.

Perhaps, considering #1250, we could make an exception if some ocrd-all-tool.json is installed: in that case, one should not waste time searching PATH, but can just pick the precomputed list.
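A minimal sketch of that expectation, using only the standard library. The function names and parameters are hypothetical and not part of core. The shortcut also assumes that an installed ocrd-all-tool.json is a JSON object keyed by executable name, which should be checked against the actual file.

```python
import json
import os
from pathlib import Path


def find_ocrd_executables(path_env=None, prefix="ocrd-"):
    """Return sorted names of executable files in PATH that start with prefix."""
    if path_env is None:
        path_env = os.environ.get("PATH", "")
    found = set()
    for directory in path_env.split(os.pathsep):
        if not directory or not Path(directory).is_dir():
            continue
        try:
            entries = list(Path(directory).iterdir())
        except OSError:
            # unreadable PATH entries are skipped rather than treated as fatal
            continue
        for entry in entries:
            if (entry.name.startswith(prefix)
                    and entry.is_file()
                    and os.access(entry, os.X_OK)):
                found.add(entry.name)
    return sorted(found)


def list_executables(all_tool_json=None, path_env=None):
    """Prefer a precomputed tool list if one is given, else search PATH."""
    if all_tool_json and Path(all_tool_json).is_file():
        with open(all_tool_json, encoding="utf-8") as f:
            # assumption: top-level keys are the executable names
            return sorted(json.load(f))
    return find_ocrd_executables(path_env)
```

Neither function consults a database, so the result does not depend on which commands were run before.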

bertsky commented 5 days ago

No, wait.

  • processors other than the 3 in ocrd/resource_list.yml if XDG_DATA_HOME is just a symlink (as is the case in ocrd/all Docker)

That's not true, it should be independent of whether it's a symlink. More likely, we just ran into https://github.com/OCR-D/ocrd_all/issues/394 again – without noticing.

(So not even a download ensures the respective entry exists!)

If you enter via cli.resmgr.download, then a (dynamic) list_available (creating entries) will be part of the process.

MehmedGIT commented 4 days ago

I removed ~/.config/ocrd/resources.yml, then reinstalled core from the current master branch. This is the result:

(venv38-core) mm@MM-Notebook:~/repos/core$ ocrd resmgr list-installed
12:38:19.387 INFO ocrd.resource_manager - ocrd-cis-ocropy-recognize resource '3gs.csv.gz' (/home/mm/venv38-all/lib/python3.8/site-packages/ocrd_cis/data/3gs.csv.gz) not a known resource, creating stub in /home/mm/.config/ocrd/resources.yml'
12:38:20.374 INFO ocrd.resource_manager - ocrd-cis-ocropy-recognize resource 'config.json' (/home/mm/venv38-all/lib/python3.8/site-packages/ocrd_cis/data/config.json) not a known resource, creating stub in /home/mm/.config/ocrd/resources.yml'
12:38:20.387 INFO ocrd.resource_manager - ocrd-cis-ocropy-recognize resource 'model.zip' (/home/mm/venv38-all/lib/python3.8/site-packages/ocrd_cis/data/model.zip) not a known resource, creating stub in /home/mm/.config/ocrd/resources.yml'
12:38:20.402 INFO ocrd.resource_manager - ocrd-cis-ocropy-recognize resource 'ocrd-cis.jar' (/home/mm/venv38-all/lib/python3.8/site-packages/ocrd_cis/data/ocrd-cis.jar) not a known resource, creating stub in /home/mm/.config/ocrd/resources.yml'
12:38:20.418 INFO ocrd.resource_manager - ocrd-cis-ocropy-recognize resource 'stopwords.json' (/home/mm/venv38-all/lib/python3.8/site-packages/ocrd_cis/div/stopwords.json) not a known resource, creating stub in /home/mm/.config/ocrd/resources.yml'
12:38:25.828 INFO ocrd.resource_manager - ocrd-tesserocr-recognize resource 'Fraktur.traineddata' (/home/mm/venv38-all/share/tessdata/Fraktur.traineddata) not a known resource, creating stub in /home/mm/.config/ocrd/resources.yml'
12:38:26.614 INFO ocrd.resource_manager - ocrd-tesserocr-recognize resource 'alto' (/home/mm/venv38-all/share/tessdata/configs/alto) not a known resource, creating stub in /home/mm/.config/ocrd/resources.yml'
12:38:26.644 ERROR ocrd.resource_manager - [ocrd-tesserocr-recognize.2] Additional properties are not allowed ('path' was unexpected)
Traceback (most recent call last):
  File "/home/mm/venv38-core/bin/ocrd", line 8, in <module>
    sys.exit(cli())
  File "/home/mm/venv38-core/lib/python3.8/site-packages/click/core.py", line 1157, in __call__
    return self.main(*args, **kwargs)
  File "/home/mm/venv38-core/lib/python3.8/site-packages/click/core.py", line 1078, in main
    rv = self.invoke(ctx)
  File "/home/mm/venv38-core/lib/python3.8/site-packages/click/core.py", line 1688, in invoke
    return _process_result(sub_ctx.command.invoke(sub_ctx))
  File "/home/mm/venv38-core/lib/python3.8/site-packages/click/core.py", line 1688, in invoke
    return _process_result(sub_ctx.command.invoke(sub_ctx))
  File "/home/mm/venv38-core/lib/python3.8/site-packages/click/core.py", line 1434, in invoke
    return ctx.invoke(self.callback, **ctx.params)
  File "/home/mm/venv38-core/lib/python3.8/site-packages/click/core.py", line 783, in invoke
    return __callback(*args, **kwargs)
  File "/home/mm/repos/core/build/__editable__.ocrd-2.66.1-py3-none-any/ocrd/cli/resmgr.py", line 64, in list_installed
    for executable, reslist in resmgr.list_installed(executable):
  File "/home/mm/repos/core/build/__editable__.ocrd-2.66.1-py3-none-any/ocrd/resource_manager.py", line 168, in list_installed
    resdict = self.add_to_user_database(this_executable, res_filename, resource_type=res_type)
  File "/home/mm/repos/core/build/__editable__.ocrd-2.66.1-py3-none-any/ocrd/resource_manager.py", line 202, in add_to_user_database
    self.load_resource_list(self.user_list)
  File "/home/mm/repos/core/build/__editable__.ocrd-2.66.1-py3-none-any/ocrd/resource_manager.py", line 84, in load_resource_list
    raise ValueError("Resource list %s is invalid!" % (list_filename))
ValueError: Resource list /home/mm/.config/ocrd/resources.yml is invalid!

I am not even sure why we have something like a database. It is for caching purposes obviously, but the state becomes inconsistent and leads to unexpected errors over time.

bertsky commented 4 days ago

I am not even sure why we have something like a database. It is for caching purposes obviously, but the state becomes inconsistent and leads to unexpected errors over time.

I agree – the user database (as a file) does not seem useful. Any subsequent list-installed will have to do a filesystem search anyway. And we do get lots of false positive entries – like the config/* stuff in Tesseract, or in other cases confusing model directories with model files.
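To illustrate the point that a filesystem search is needed anyway, here is a rough sketch of a listing that keeps no state between calls. It is not a proposed patch: `scan_resources` and the `base_dir/executable` layout are assumptions. It reports only top-level entries and labels each as file or directory, so nested files such as config/* are not listed as separate resources and directories are not mistaken for files.

```python
from pathlib import Path


def scan_resources(executable, base_dirs):
    """Yield (name, path, kind) for top-level entries under base_dir/executable.

    kind is 'directory' or 'file'; nothing below the top level is visited.
    """
    for base in base_dirs:
        resdir = Path(base) / executable
        if not resdir.is_dir():
            continue
        for entry in sorted(resdir.iterdir()):
            kind = "directory" if entry.is_dir() else "file"
            yield entry.name, str(entry), kind
```

Because nothing is written back to a user file, there is no stored state that can become inconsistent or fail validation later.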

We should also get rid of the preconfigured ocrd/resource_list.yml: the ocrd-sbb-binarize model info is outdated, the ocrd-cis-ocropy-recognize resources I have just added to the ocrd-tool.json (this just needs an update in ocrd_all), and the ocrd-calamari-recognize resources will follow as soon as https://github.com/OCR-D/ocrd_calamari/pull/112 gets merged and updated in ocrd_all.