OCR-D / core

Collection of OCR-related python tools and wrappers from @OCR-D
https://ocr-d.de/core/
Apache License 2.0
117 stars 31 forks source link

warnings in processor output garble dump-json #1223

Open bertsky opened 2 months ago

bertsky commented 2 months ago

We have…

ocrd process \
"cis-ocropy-binarize -I DEFAULT -O OCR-D-BIN"
"anybaseocr-crop -I OCR-D-BIN -O OCR-D-CROP"
"skimage-binarize -I OCR-D-CROP -O OCR-D-BIN2 -P method li"
"skimage-denoise -I OCR-D-BIN2 -O OCR-D-BIN-DENOISE -P level-of-operation page"
"tesserocr-deskew -I OCR-D-BIN-DENOISE -O OCR-D-BIN-DENOISE-DESKEW -P operation_level page"
"cis-ocropy-segment -I OCR-D-BIN-DENOISE-DESKEW -O OCR-D-SEG -P level-of-operation page"
"cis-ocropy-dewarp -I OCR-D-SEG -O OCR-D-SEG-LINE-RESEG-DEWARP"
"calamari-recognize -I OCR-D-SEG-LINE-RESEG-DEWARP -O OCR-D-OCR -P checkpoint_dir qurator-gt4histocr-1.0"
The TensorFlow library was compiled to use AVX instructions, but these aren't available on your machine.
09:00:38.996 ERROR ocrd.utils.get_ocrd_tool_json - ocrd-calamari-recognize --dump-json produced invalid JSON: Expecting value: line 1 column 1 (char 0)
Traceback (most recent call last):
File "/usr/local/bin/ocrd", line 8, in <module>
sys.exit(cli())
File "/usr/local/lib/python3.8/site-packages/click/[core.py](http://core.py/)", line 1157, in call
return self.main(*args, **kwargs)
File "/usr/local/lib/python3.8/site-packages/click/[core.py](http://core.py/)", line 1078, in main
rv = self.invoke(ctx)
File "/usr/local/lib/python3.8/site-packages/click/[core.py](http://core.py/)", line 1688, in invoke
return _process_result(sub_ctx.command.invoke(sub_ctx))
File "/usr/local/lib/python3.8/site-packages/click/[core.py](http://core.py/)", line 1434, in invoke
return ctx.invoke(self.callback, **ctx.params)
File "/usr/local/lib/python3.8/site-packages/click/[core.py](http://core.py/)", line 783, in invoke
return __callback(*args, **kwargs)
File "/usr/local/lib/python3.8/site-packages/ocrd/cli/[process.py](http://process.py/)", line 31, in process_cli
run_tasks(mets, log_level, page_id, tasks, overwrite)
File "/usr/local/lib/python3.8/site-packages/ocrd/task_[sequence.py](http://sequence.py/)", line 125, in run_tasks
validate_tasks(tasks, workspace, page_id, overwrite)
File "/usr/local/lib/python3.8/site-packages/ocrd/task_[sequence.py](http://sequence.py/)", line 101, in validate_tasks
task.validate()
File "/usr/local/lib/python3.8/site-packages/ocrd/task_[sequence.py](http://sequence.py/)", line 74, in validate
raise Exception(report.errors)
Exception: ["[] Additional properties are not allowed ('checkpoint_dir' was unexpected)"]

So because TF spew a warning about a CPU mismatch, the --dump-json output is broken and ocrd process cannot validate the workflow.

IMO this likely affects all processors in one form or another. When we instantiate a processor in the CLI decorator, there is no initLogging call for the --dump-json target, but TF can still do its own basicConfig, which will use stdout.

So perhaps we should do our own basicConfig(handlers=[logging.NullHandler()], force=True), or alternatively just call logging.disable(sys.maxsize) in core's ocrd_cli_wrap_processor (which will be before any processor-specific imports)?

bertsky commented 2 months ago

To be precise, in this case the message from TF logging causing the hickup was not a warning (as the issue suggests) but an error. All our TF-based processors now suppress log messages below that level anyway. But my point is about log messages from imported modules in general, and if we find a solution here, then these special TF rules may not even be necessary anymore.

bertsky commented 2 months ago

So perhaps we should do our own basicConfig(handlers=[logging.NullHandler()], force=True), or alternatively just call logging.disable(sys.maxsize) in core's ocrd_cli_wrap_processor (which will be before any processor-specific imports)?

Duh! That by itself won't suffice, of course: when the processor wrapper gets used, TF usually is already imported and thus initialised.

So perhaps we should make sure that our processors don't make imports other than OCR-D related – until required to do process(). That would also speed up responses for --list-resources, --dump-json, --help a lot – TF and Pytorch take quite some time.

bertsky commented 2 months ago

So perhaps we should make sure that our processors don't make imports other than OCR-D related

…which in the case of ocrd_calamari is tricky: it wants to from tensorflow import __version__ to show that in the --version (as well as in the METS agent entry after processing).

bertsky commented 1 month ago

In another example, I get garbled JSON from the following error:

/ocrd_all/venv38/lib/python3.8/site-packages/requests/__init__.py:102: RequestsDependencyWarning: urllib3 (1.26.12) or chardet (5.2.0)/charset_normalizer (2.0.12) doesn't match a supported version!
  warnings.warn("urllib3 ({}) or chardet ({})/charset_normalizer ({}) doesn't match a supported "
bertsky commented 1 month ago

Regarding the urllib3 problem – for anyone who encounters this, the fix is to update requests as specified by core:

pip install -U "requests<2.30"
bertsky commented 1 month ago

So perhaps we should make sure that our processors don't make imports other than OCR-D related – until required to do process(). That would also speed up responses for --list-resources, --dump-json, --help a lot – TF and Pytorch take quite some time.

We should implement this along with the #322 API changes – so we don't need touch the processor code bases twice.

…which in the case of ocrd_calamari is tricky: it wants to from tensorflow import __version__ to show that in the --version (as well as in the METS agent entry after processing).

In this case it is still possible to initialise logging before importing Tensorflow, so even that should not be a problem.

As to the logging setup itself – it would not even have to be logging.disable(...) or basicConfig with no handlers or a null handler: AFAICS it would be fully sufficient if we did our normal initLogging prior to all other imports – as long as our ocrd_logging.config separates logging from stdout.