Open bertsky opened 5 days ago
ResourceManager initializes its database from the predistributed ocrd/resource_list.yml:
https://github.com/OCR-D/core/blob/79c61e303c87f229d5c96aedc0da31ef82b0f5d3/src/ocrd/resource_manager.py#L42-L47
New database entries only get made by either
- list-available (with some executable glob pattern)
  https://github.com/OCR-D/core/blob/79c61e303c87f229d5c96aedc0da31ef82b0f5d3/src/ocrd/resource_manager.py#L100-L109
- list-installed (when explicitly naming the executable)
  https://github.com/OCR-D/core/blob/79c61e303c87f229d5c96aedc0da31ef82b0f5d3/src/ocrd/resource_manager.py#L168

(So not even a download ensures the respective entry exists!)
However, list-installed only lists models found for processors in the database, plus any found under XDG_DATA_HOME (data location) and /usr/local/share (system location).
https://github.com/OCR-D/core/blob/79c61e303c87f229d5c96aedc0da31ef82b0f5d3/src/ocrd/resource_manager.py#L136-L141
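For illustration, the two locations named above could be computed roughly as follows. This is only a sketch: the helper name `candidate_resource_dirs` is hypothetical, and the `ocrd-resources/<executable>` subdirectory layout is an assumption rather than code copied from `resource_manager.py`. The fallback to `~/.local/share` follows the XDG base directory convention.

```python
import os
from pathlib import Path


def candidate_resource_dirs(executable, env=None):
    """Return the 'data' and 'system' directories that would be searched.

    Hypothetical helper: the 'ocrd-resources/<executable>' layout is an
    assumption for illustration, not taken from ocrd.resource_manager.
    """
    env = os.environ if env is None else env
    # XDG convention: fall back to ~/.local/share if XDG_DATA_HOME is unset or empty
    xdg_data_home = env.get("XDG_DATA_HOME") or os.path.join(
        os.path.expanduser("~"), ".local", "share")
    return {
        "data": Path(xdg_data_home) / "ocrd-resources" / executable,
        "system": Path("/usr/local/share") / "ocrd-resources" / executable,
    }


if __name__ == "__main__":
    dirs = candidate_resource_dirs("ocrd-tesserocr-recognize")
    for location, path in dirs.items():
        print(location, path, "(exists)" if path.is_dir() else "(missing)")
```

Note that nothing in this sketch looks at the module location, which is exactly the gap described next.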
So it does not cover:
- module location for resources
- processors other than the 3 in ocrd/resource_list.yml
- if XDG_DATA_HOME is just a symlink (as is the case in ocrd/all Docker)

list-installed * or just list-installed (without a name) should look for all executables in PATH, regardless of existing database entries.
Perhaps, considering #1250, we could make an exception if some ocrd-all-tool.json is installed: in that case, one should not waste time searching PATH, but can just pick the precomputed list.
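A minimal sketch of what "look for all executables in PATH" could mean, independent of any database entries. The function name `find_processor_executables` and the `ocrd-` prefix filter are assumptions for illustration; this is not the existing ResourceManager code, and it leaves out the ocrd-all-tool.json shortcut.

```python
import os
from pathlib import Path


def find_processor_executables(path_env=None, prefix="ocrd-"):
    """Return sorted names of executables found in PATH that start with `prefix`.

    Hypothetical helper: scans the filesystem on every call and consults
    no resource database at all.
    """
    if path_env is None:
        path_env = os.environ.get("PATH", "")
    found = set()
    for directory in path_env.split(os.pathsep):
        if not directory or not os.path.isdir(directory):
            continue
        try:
            entries = list(Path(directory).iterdir())
        except OSError:
            # unreadable PATH entry: skip it instead of failing the whole scan
            continue
        for entry in entries:
            # keep regular files (or symlinks to them) that are executable
            if (entry.name.startswith(prefix)
                    and entry.is_file()
                    and os.access(entry, os.X_OK)):
                found.add(entry.name)
    return sorted(found)


if __name__ == "__main__":
    for name in find_processor_executables():
        print(name)
```

Scanning every PATH directory is the cost that a precomputed list would avoid, which is the trade-off raised above.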
No, wait.
- processors other than the 3 in ocrd/resource_list.yml
if XDG_DATA_HOME is just a symlink (as is the case in ocrd/all Docker)
That's not true; it should be independent of whether it's a symlink. More likely, we just ran into https://github.com/OCR-D/ocrd_all/issues/394 again without noticing.
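To illustrate why a symlink alone should not make a difference: listing a directory through a symlink follows the link transparently. The sketch below uses made-up temporary paths and a hypothetical helper `list_resources`; it does not reproduce the ocrd/all Docker setup.

```python
import os
import tempfile
from pathlib import Path


def list_resources(base_dir):
    """Return sorted entry names directly under base_dir (a symlinked base is followed)."""
    return sorted(p.name for p in Path(base_dir).iterdir())


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        real = Path(tmp) / "real-data"
        real.mkdir()
        (real / "model.zip").write_bytes(b"")
        link = Path(tmp) / "xdg-data-home"  # stands in for a symlinked XDG_DATA_HOME
        os.symlink(real, link, target_is_directory=True)
        # The listing is the same either way; only resolve() reveals the target
        print(list_resources(real) == list_resources(link))
        print(link.resolve() == real.resolve())
```

One caveat: `os.walk` does not descend into symlinked subdirectories unless `followlinks=True` is passed, so a recursive search could still behave differently for links below the top-level directory.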
(So not even a download ensures the respective entry exists!)
If you enter via cli.resmgr.download, then a (dynamic) list_available (creating entries) will be part of the process.
I have removed ~/.config/ocrd/resources.yml, then installed core again from the current master branch. This is the result:
(venv38-core) mm@MM-Notebook:~/repos/core$ ocrd resmgr list-installed
12:38:19.387 INFO ocrd.resource_manager - ocrd-cis-ocropy-recognize resource '3gs.csv.gz' (/home/mm/venv38-all/lib/python3.8/site-packages/ocrd_cis/data/3gs.csv.gz) not a known resource, creating stub in /home/mm/.config/ocrd/resources.yml'
12:38:20.374 INFO ocrd.resource_manager - ocrd-cis-ocropy-recognize resource 'config.json' (/home/mm/venv38-all/lib/python3.8/site-packages/ocrd_cis/data/config.json) not a known resource, creating stub in /home/mm/.config/ocrd/resources.yml'
12:38:20.387 INFO ocrd.resource_manager - ocrd-cis-ocropy-recognize resource 'model.zip' (/home/mm/venv38-all/lib/python3.8/site-packages/ocrd_cis/data/model.zip) not a known resource, creating stub in /home/mm/.config/ocrd/resources.yml'
12:38:20.402 INFO ocrd.resource_manager - ocrd-cis-ocropy-recognize resource 'ocrd-cis.jar' (/home/mm/venv38-all/lib/python3.8/site-packages/ocrd_cis/data/ocrd-cis.jar) not a known resource, creating stub in /home/mm/.config/ocrd/resources.yml'
12:38:20.418 INFO ocrd.resource_manager - ocrd-cis-ocropy-recognize resource 'stopwords.json' (/home/mm/venv38-all/lib/python3.8/site-packages/ocrd_cis/div/stopwords.json) not a known resource, creating stub in /home/mm/.config/ocrd/resources.yml'
12:38:25.828 INFO ocrd.resource_manager - ocrd-tesserocr-recognize resource 'Fraktur.traineddata' (/home/mm/venv38-all/share/tessdata/Fraktur.traineddata) not a known resource, creating stub in /home/mm/.config/ocrd/resources.yml'
12:38:26.614 INFO ocrd.resource_manager - ocrd-tesserocr-recognize resource 'alto' (/home/mm/venv38-all/share/tessdata/configs/alto) not a known resource, creating stub in /home/mm/.config/ocrd/resources.yml'
12:38:26.644 ERROR ocrd.resource_manager - [ocrd-tesserocr-recognize.2] Additional properties are not allowed ('path' was unexpected)
Traceback (most recent call last):
  File "/home/mm/venv38-core/bin/ocrd", line 8, in <module>
    sys.exit(cli())
  File "/home/mm/venv38-core/lib/python3.8/site-packages/click/core.py", line 1157, in __call__
    return self.main(*args, **kwargs)
  File "/home/mm/venv38-core/lib/python3.8/site-packages/click/core.py", line 1078, in main
    rv = self.invoke(ctx)
  File "/home/mm/venv38-core/lib/python3.8/site-packages/click/core.py", line 1688, in invoke
    return _process_result(sub_ctx.command.invoke(sub_ctx))
  File "/home/mm/venv38-core/lib/python3.8/site-packages/click/core.py", line 1688, in invoke
    return _process_result(sub_ctx.command.invoke(sub_ctx))
  File "/home/mm/venv38-core/lib/python3.8/site-packages/click/core.py", line 1434, in invoke
    return ctx.invoke(self.callback, **ctx.params)
  File "/home/mm/venv38-core/lib/python3.8/site-packages/click/core.py", line 783, in invoke
    return __callback(*args, **kwargs)
  File "/home/mm/repos/core/build/__editable__.ocrd-2.66.1-py3-none-any/ocrd/cli/resmgr.py", line 64, in list_installed
    for executable, reslist in resmgr.list_installed(executable):
  File "/home/mm/repos/core/build/__editable__.ocrd-2.66.1-py3-none-any/ocrd/resource_manager.py", line 168, in list_installed
    resdict = self.add_to_user_database(this_executable, res_filename, resource_type=res_type)
  File "/home/mm/repos/core/build/__editable__.ocrd-2.66.1-py3-none-any/ocrd/resource_manager.py", line 202, in add_to_user_database
    self.load_resource_list(self.user_list)
  File "/home/mm/repos/core/build/__editable__.ocrd-2.66.1-py3-none-any/ocrd/resource_manager.py", line 84, in load_resource_list
    raise ValueError("Resource list %s is invalid!" % (list_filename))
ValueError: Resource list /home/mm/.config/ocrd/resources.yml is invalid!
I am not even sure why we have something like a database. It is for caching purposes obviously, but the state becomes inconsistent and leads to unexpected errors over time.
I agree: the user database (as a file) does not seem useful. Any subsequent list-installed will have to do a filesystem search anyway. And we do get lots of false-positive entries, like the config/* stuff in Tesseract, or in other cases model directories being confused with model files.
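As a rough sketch of what a database-free listing could look like: scan the resource directories on every call and report whether each entry is a file or a directory, so nothing has to be cached in resources.yml. The function name `scan_installed` and its return shape are hypothetical, not part of the current ResourceManager API.

```python
from pathlib import Path


def scan_installed(resource_dirs):
    """Yield (name, path, kind) for every entry directly under the given dirs.

    kind is 'directory' or 'file', so callers can tell model directories
    from model files instead of relying on a cached database entry.
    """
    for base in map(Path, resource_dirs):
        if not base.is_dir():
            # a location that does not exist simply contributes nothing
            continue
        for entry in sorted(base.iterdir()):
            kind = "directory" if entry.is_dir() else "file"
            yield entry.name, entry, kind


if __name__ == "__main__":
    import sys
    for name, path, kind in scan_installed(sys.argv[1:]):
        print(f"{kind:9} {name} ({path})")
```

This does not by itself filter out false positives such as Tesseract config files; that would still need per-processor knowledge of what counts as a resource.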
We should also get rid of the preconfigured ocrd/resource_list.yml: the ocrd-sbb-binarize model info is outdated, ocrd-cis-ocropy-recognize I have just added to the ocrd-tool.json (it just needs an update in ocrd_all), and ocrd-calamari-recognize will follow as soon as https://github.com/OCR-D/ocrd_calamari/pull/112 gets merged and updated in ocrd_all.
_Originally posted by @bertsky in https://github.com/OCR-D/core/pull/1246#discussion_r1665951656_