OCR-D / ocrd_tesserocr

Run tesseract with the tesserocr bindings with @OCR-D's interfaces
MIT License

list-resources not correct #174

bertsky closed 1 year ago

bertsky commented 3 years ago

After we resmgrized ocrd_tesserocr in #166, running any of the CLIs with -L|--list-resources is supposed to show the exact list of models available. However, we could not and did not adopt the scheme with multiple resource locations. Instead, we use only a single directory (OcrdResourceManager's default, which is XDG_DATA_HOME/ocrd-resources/EXECUTABLE) and allow overriding it via an environment variable (TESSDATA_PREFIX) for compatibility reasons. As a result, the default implementation in ocrd_utils.list_all_resources does not work here.

Thus, we should extend the constructors of all of ocrd_tesserocr's processors to handle list_resources=True in their own way (by using .config.get_tessdata_path()).
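A minimal sketch of the single-directory lookup described above, in plain Python. The helper names resolve_tessdata_dir and list_models are hypothetical and are not part of ocrd_tesserocr or core. The default executable name, the XDG fallback to ~/.local/share, and the *.traineddata file pattern are assumptions for illustration. The real processors would delegate to .config.get_tessdata_path() instead.

```python
import os
from pathlib import Path


def resolve_tessdata_dir(executable="ocrd-tesserocr-recognize", environ=None):
    """Hypothetical helper: return the one directory models are loaded from.

    TESSDATA_PREFIX, if set, overrides the resource manager default
    XDG_DATA_HOME/ocrd-resources/EXECUTABLE.
    """
    env = os.environ if environ is None else environ
    prefix = env.get("TESSDATA_PREFIX")
    if prefix:
        return Path(prefix)
    # Assumption: fall back to the XDG default when XDG_DATA_HOME is unset.
    data_home = env.get("XDG_DATA_HOME") or str(Path.home() / ".local" / "share")
    return Path(data_home) / "ocrd-resources" / executable


def list_models(tessdata_dir):
    """Hypothetical helper: list model names found in exactly that directory."""
    directory = Path(tessdata_dir)
    if not directory.is_dir():
        return []
    # Assumption: models are the *.traineddata files; other files are ignored.
    return sorted(p.stem for p in directory.glob("*.traineddata"))
```

With this shape, -L would print list_models(resolve_tessdata_dir()) and nothing from any other location.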

kba commented 3 years ago

What do you mean by "does not work"? It does list the resources in XDG_DATA_HOME/ocrd-resources/ocrd-tesserocr-recognize for ocrd-tesserocr-recognize -L. Do you mean that it does not correctly handle TESSDATA_PREFIX?

bertsky commented 3 years ago

Yes, it does not respect TESSDATA_PREFIX and still shows files under /usr/local/share/ocrd-resources and CWD, which will in fact not be available. Delegating to .config.get_tessdata_path would fix that. (Probably applies to --show-resource, too.)

kba commented 3 years ago

With https://github.com/OCR-D/spec/pull/181 merged and implemented in core, the restriction on location can be expressed as

tools:
  ocrd-tesserocr-recognize:
    resource_locations: ['data']

list_all_resources can then be extended to take a list of locations to look in from the ocrd-tool.json and only list those.

We'll still need custom code here to handle TESSDATA_PREFIX, though, so I am not sure whether it's worth it, since ocrd_tesserocr is the only processor that would have a differing resource_locations :/
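The location filter described above could look roughly like the following sketch. It is not the implementation in core. The functions candidate_dirs and list_resources_in are hypothetical, and the location names "system" and "cwd", as well as the per-executable subdirectory under /usr/local/share/ocrd-resources, are assumptions. Only 'data' appears in the snippet above. TESSDATA_PREFIX is deliberately not handled here, which is the custom part that would remain in ocrd_tesserocr.

```python
import os
from pathlib import Path

# Assumption: names other than "data" are made up for this sketch.
ALL_LOCATIONS = ["data", "system", "cwd"]


def candidate_dirs(executable, environ=None, cwd=None):
    """Hypothetical helper: map each location name to a directory."""
    env = os.environ if environ is None else environ
    data_home = env.get("XDG_DATA_HOME") or str(Path.home() / ".local" / "share")
    return {
        "data": Path(data_home) / "ocrd-resources" / executable,
        "system": Path("/usr/local/share/ocrd-resources") / executable,
        "cwd": Path(cwd) if cwd else Path.cwd(),
    }


def list_resources_in(tool_json, executable, environ=None, cwd=None):
    """List file names only from the locations the tool declares."""
    tool = tool_json.get("tools", {}).get(executable, {})
    # Without a declaration, fall back to searching every location.
    allowed = tool.get("resource_locations", ALL_LOCATIONS)
    dirs = candidate_dirs(executable, environ, cwd)
    found = []
    for name in allowed:
        directory = dirs.get(name)
        if directory is not None and directory.is_dir():
            found.extend(sorted(p.name for p in directory.iterdir() if p.is_file()))
    return found
```

Given a tool_json whose resource_locations is ['data'], files lying around in CWD or the system directory are no longer reported.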