Open kba opened 1 year ago
What does the resmgr log say?
Nothing interesting: it only logs what it is downloading, not what it's supposed to be downloading or how it decided which processors should be included. I'll add such a log statement when debugging.
Here is a snippet from my sbatch script that downloads all models:
singularity exec --bind "${SCRATCH_OCRD_MODELS_BASE}:/usr/local/share" "${SIF_PATH}" ocrd resmgr download '*'
singularity exec --bind "${SCRATCH_OCRD_MODELS_BASE}:/usr/local/share" "${SIF_PATH}" ocrd resmgr download ocrd-tesserocr-recognize '*'
For comparison, check the models downloaded with an older version of ocrd/all:maximum (not sure which one, the latest one in January), from when the ocrd-tesserocr-recognize models used to be located under the ocrd-resources folder:
docker run --rm -v "/home/cloud/ocrd_models/:/usr/local/share/ocrd-resources" -- ocrd/all:maximum ocrd resmgr download '*'
Far fewer models are downloaded than before: the total size is just 687MB, where it used to be around 5.4GB. Also, the models of some processors are now missing entirely.
It's clear the reason for this is that ResourceManager.list_available only returns database results – it does not look up all ocrd- executables in PATH. (For comparison, ResourceManager.list_installed returns database results and all resource location paths with ocrd- prefix, which is somewhat better, but still misses out on processors' module locations, as in ocrd_tesserocr.) The database then is simply the distributed resource_list.yml plus any user resources.yml. At no time do we guarantee that the latter gets filled from PATH dynamically!
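To check which processors are affected, one could compare the ocrd- executables found in PATH against the keys of the database. The following is only a rough sketch: find_ocrd_executables and missing_from_database are hypothetical helpers, not part of core, and the database is assumed to be a plain mapping of executable name to resource list (the shape of resource_list.yml once loaded).

```python
import os
from pathlib import Path


def find_ocrd_executables(path_env=None):
    """Return the names of all executable files starting with 'ocrd-' on PATH."""
    if path_env is None:
        path_env = os.environ.get("PATH", "")
    names = set()
    for directory in path_env.split(os.pathsep):
        if not directory or not os.path.isdir(directory):
            continue
        try:
            entries = list(Path(directory).iterdir())
        except OSError:
            continue  # unreadable PATH entry, skip it
        for entry in entries:
            if (entry.name.startswith("ocrd-")
                    and entry.is_file()
                    and os.access(entry, os.X_OK)):
                names.add(entry.name)
    return names


def missing_from_database(database, path_env=None):
    """List executables found on PATH that have no entry in the database.

    Assumption: `database` maps executable name -> list of resource dicts.
    """
    return sorted(find_ocrd_executables(path_env) - set(database))
```

Any name this returns is a processor that is installed but that a purely database-driven listing would not know about.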
I cannot find when exactly this broke, but this change looks somewhat fishy.
Since we never know when the user installs (additional) processor modules, and the database files can be out of date (as is currently the case with the distributed resource_list.yml which still contains sbb-textline-detector), IMO the correct behaviour would be:
- list-available *: unless short-circuited with ocrd-all-tool.json, and unless dynamic=False, look up all ocrd- executables in PATH via --dump-json, add their resource specs to the user database, and then output all known resources
- list-installed *: unless short-circuited with ocrd-all-tool.json, and unless dynamic=False, look up all ocrd- executables in PATH via --dump-json, add their resource specs to the user database, and then look up all known resource locations
Speaking of short-circuiting with ocrd-all-tool.json: we do not have a dedicated issue for that, but it's probably tied to the solution here anyway. The idea would be to have a lookup mechanism like for ocrd_logging.conf (i.e. system location, XDG-based user location, CWD) as an opt-in for ocrd-all-tool.json. If that file can be found, then replace all dynamic lookups with queries into the list of all tools and their resources. (Of course, relying on that file creates new problems like keeping ocrd-all-tool.json up to date if you install more tools, but let's first concentrate on the substantial performance gains that this will yield.)
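The step shared by both subcommands (query each processor, merge its resource specs into the user database, optionally short-circuit) might look roughly like this. It is a sketch under assumptions, not the actual core implementation: collect_resources and dump_tool_json are hypothetical names, the --dump-json output is assumed to carry an optional "resources" list whose entries have a "name", and ocrd-all-tool.json is assumed to map executable names to that same structure.

```python
import json
import subprocess


def dump_tool_json(executable):
    """Ask a processor for its tool description via `--dump-json`."""
    result = subprocess.run(
        [executable, "--dump-json"],
        capture_output=True, text=True, check=True,
    )
    return json.loads(result.stdout)


def collect_resources(executables, user_database, all_tool_json=None,
                      dynamic=True, dump=dump_tool_json):
    """Merge the resource specs of `executables` into `user_database`.

    Assumptions: `user_database` maps executable name -> list of resource
    dicts; `all_tool_json`, if given, maps executable name -> tool description
    and replaces all subprocess calls (the short-circuit).
    """
    if all_tool_json is None and not dynamic:
        return user_database  # neither short-circuit nor dynamic lookup
    for executable in executables:
        if all_tool_json is not None:
            tool = all_tool_json.get(executable, {})
        else:
            try:
                tool = dump(executable)
            except (OSError, subprocess.CalledProcessError,
                    json.JSONDecodeError):
                continue  # skip processors that cannot be queried
        known = user_database.setdefault(executable, [])
        known_names = {res.get("name") for res in known}
        for res in tool.get("resources", []):
            if res.get("name") not in known_names:
                known.append(res)
                known_names.add(res.get("name"))
    return user_database
```

After this step, list-available would output everything in the database, and list-installed would go on to check the known resource locations for each entry. Existing database entries win over dumped ones with the same name, so user edits are not overwritten.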
I've opened a separate issue for the ocrd-all-tool.json aspect in https://github.com/OCR-D/core/issues/1059
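For the opt-in file lookup itself, something along these lines could work. This is only a sketch: the directory names ("ocrd" under the XDG config directory and under /etc) and the precedence (CWD first, then user, then system) are my assumptions, and candidate_locations / find_all_tool_json are hypothetical names.

```python
import os
from pathlib import Path


def candidate_locations(filename="ocrd-all-tool.json"):
    """Candidate paths, most specific first (assumed precedence)."""
    # XDG Base Directory spec: fall back to ~/.config if unset or empty
    xdg_config = os.environ.get("XDG_CONFIG_HOME") or os.path.join(
        os.path.expanduser("~"), ".config")
    return [
        Path.cwd() / filename,                # current working directory
        Path(xdg_config) / "ocrd" / filename,  # user location (hypothetical subdir)
        Path("/etc") / "ocrd" / filename,      # system location (hypothetical)
    ]


def find_all_tool_json(candidates=None):
    """Return the first existing candidate, or None to fall back to dynamic lookup."""
    if candidates is None:
        candidates = candidate_locations()
    for candidate in candidates:
        if candidate.is_file():
            return candidate
    return None
```

Returning None when nothing is found keeps the file strictly opt-in: without it, the dynamic lookup via PATH stays in effect.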
When running ocrd resmgr download '*' in the latest ocrd_all Docker image, only some models are installed:
E.g. ocrd-tesserocr-recognize models are missing entirely.
ocrd resmgr download ocrd-tesserocr-recognize '*' works as expected.
So, something is wrong with iterating over the processors for the wildcard case.