OCR-D / core

Collection of OCR-related python tools and wrappers from @OCR-D
https://ocr-d.de/core/
Apache License 2.0
119 stars 31 forks source link

ocrd process fails due to processor's logging mixed with JSON dump #540

Closed GrazingScientist closed 4 years ago

GrazingScientist commented 4 years ago

Problem Description We use the docker image ocrd/all:minimum for testing purposes. We are running the ocrd process example from the documentation in a slightly derived form:

ocrd process -l DEBUG --overwrite  
'cis-ocropy-binarize -I OCR-D-IMG -O OCR-D-SEG-PAGE'   
'tesserocr-segment-region -I OCR-D-SEG-PAGE -O OCR-D-SEG-BLOCK'  
 'tesserocr-segment-line -I OCR-D-SEG-BLOCK -O OCR-D-SEG-LINE'  
 'tesserocr-recognize -I OCR-D-SEG-LINE -O OCR-D-OCR-TESSEROCR'

(Please not that we removed -p param-tess-fraktur.json from the original example) When running these processes each seperately, everything is fine and runs without error.

However, the ocrd process pipeline causes the following log output:

2020-07-15 14:13:32,208.208 INFO root - Overriding log level globally to DEBUG
2020-07-15 14:13:32,209.209 DEBUG ocrd.resolver - Deriving dst_dir /data/images from /data/images/mets.xml
2020-07-15 14:13:32,209.209 DEBUG ocrd.resolver - workspace_from_url
mets_basename='mets.xml'
mets_url='/data/images/mets.xml'
src_baseurl='/data/images'
dst_dir='/data/images'
2020-07-15 14:13:32,210.210 DEBUG ocrd.resolver.download_to_directory - directory=|/data/images| url=|/data/images/mets.xml| basename=|mets.xml| if_exists=|skip| subdir=|None|
2020-07-15 14:13:32,210.210 DEBUG ocrd.resolver.download_to_directory - Stop early, src_path and dst_path are the same: '/data/images/mets.xml' (url: '/data/images/mets.xml')
Traceback (most recent call last):
  File "/usr/bin/ocrd", line 8, in <module>
    sys.exit(cli())
  File "/usr/lib/python3.6/site-packages/click/core.py", line 829, in __call__
    return self.main(*args, **kwargs)
  File "/usr/lib/python3.6/site-packages/click/core.py", line 782, in main
    rv = self.invoke(ctx)
  File "/usr/lib/python3.6/site-packages/click/core.py", line 1259, in invoke
    return _process_result(sub_ctx.command.invoke(sub_ctx))
  File "/usr/lib/python3.6/site-packages/click/core.py", line 1066, in invoke
    return ctx.invoke(self.callback, **ctx.params)
  File "/usr/lib/python3.6/site-packages/click/core.py", line 610, in invoke
    return callback(*args, **kwargs)
  File "/usr/lib/python3.6/site-packages/ocrd/cli/process.py", line 27, in process_cli
    run_tasks(mets, log_level, page_id, tasks, overwrite)
  File "/usr/lib/python3.6/site-packages/ocrd/task_sequence.py", line 125, in run_tasks
    validate_tasks(tasks, workspace, page_id, overwrite)
  File "/usr/lib/python3.6/site-packages/ocrd/task_sequence.py", line 94, in validate_tasks
    first_task.validate()
  File "/usr/lib/python3.6/site-packages/ocrd/task_sequence.py", line 71, in validate
    param_validator = ParameterValidator(self.ocrd_tool_json)
  File "/usr/lib/python3.6/site-packages/ocrd/task_sequence.py", line 49, in ocrd_tool_json
    self._ocrd_tool_json = json.loads(result.stdout)
  File "/usr/lib/python3.6/json/__init__.py", line 354, in loads
    return _default_decoder.decode(s)
  File "/usr/lib/python3.6/json/decoder.py", line 342, in decode
    raise JSONDecodeError("Extra data", s, end)
json.decoder.JSONDecodeError: Extra data: line 1 column 5 (char 4)

Then, running the same script again, without changing anything. The log gives:

2020-07-15 13:30:10,292.292 INFO root - Overriding log level globally to DEBUG
2020-07-15 13:30:10,293.293 DEBUG ocrd.resolver - Deriving dst_dir /data/images from /data/images/mets.xml
2020-07-15 13:30:10,293.293 DEBUG ocrd.resolver - workspace_from_url
mets_basename='mets.xml'
mets_url='/data/images/mets.xml'
src_baseurl='/data/images'
dst_dir='/data/images'
2020-07-15 13:30:10,293.293 DEBUG ocrd.resolver.download_to_directory - directory=|/data/images| url=|/data/images/mets.xml| basename=|mets.xml| if_exists=|skip| subdir=|None|
2020-07-15 13:30:10,294.294 DEBUG ocrd.resolver.download_to_directory - Stop early, src_path and dst_path are the same: '/data/images/mets.xml' (url: '/data/images/mets.xml')
2020-07-15 13:30:12,289.289 DEBUG ocrd.workspace_validator - input_file_grp=['OCR-D-IMG'] output_file_grp=[]
2020-07-15 13:30:15,959.959 INFO ocrd.task_sequence.run_tasks - Start processing task 'tesserocr-segment-region -I OCR-D-SEG-PAGE -O OCR-D-SEG-BLOCK'
2020-07-15 13:30:15,960.960 DEBUG ocrd.processor - Running subprocess 'ocrd-tesserocr-segment-region --working-dir /data/images --mets mets.xml --log-level DEBUG --input-file-grp OCR-D-SEG-PAGE --output-file-grp OCR-D-SEG-BLOCK --overwrite'
2020-07-15 13:30:17,160.160 INFO root - Overriding log level globally to DEBUG
2020-07-15 13:30:17,162.162 DEBUG ocrd.resolver - workspace_from_url
mets_basename='mets.xml'
mets_url='/data/images/mets.xml'
src_baseurl='/data/images'
dst_dir='/data/images'
2020-07-15 13:30:17,162.162 DEBUG ocrd.resolver.download_to_directory - directory=|/data/images| url=|/data/images/mets.xml| basename=|mets.xml| if_exists=|skip| subdir=|None|
2020-07-15 13:30:17,163.163 DEBUG ocrd.resolver.download_to_directory - Stop early, src_path and dst_path are the same: '/data/images/mets.xml' (url: '/data/images/mets.xml')
2020-07-15 13:30:17,164.164 DEBUG ocrd.workspace_validator - input_file_grp=['OCR-D-SEG-PAGE'] output_file_grp=[]
Traceback (most recent call last):
  File "/usr/bin/ocrd-tesserocr-segment-region", line 8, in <module>
    sys.exit(ocrd_tesserocr_segment_region())
  File "/usr/lib/python3.6/site-packages/click/core.py", line 829, in __call__
    return self.main(*args, **kwargs)
  File "/usr/lib/python3.6/site-packages/click/core.py", line 782, in main
    rv = self.invoke(ctx)
  File "/usr/lib/python3.6/site-packages/click/core.py", line 1066, in invoke
    return ctx.invoke(self.callback, **ctx.params)
  File "/usr/lib/python3.6/site-packages/click/core.py", line 610, in invoke
    return callback(*args, **kwargs)
  File "/usr/lib/python3.6/site-packages/ocrd_tesserocr/cli.py", line 16, in ocrd_tesserocr_segment_region
    return ocrd_cli_wrap_processor(TesserocrSegmentRegion, *args, **kwargs)
  File "/usr/lib/python3.6/site-packages/ocrd/decorators.py", line 81, in ocrd_cli_wrap_processor
    raise Exception("Invalid input/output file grps:\n\t%s" % '\n\t'.join(report.errors))
Exception: Invalid input/output file grps:
    Input fileGrp[@USE='OCR-D-SEG-PAGE'] not in METS!
Traceback (most recent call last):
  File "/usr/bin/ocrd", line 8, in <module>
    sys.exit(cli())
  File "/usr/lib/python3.6/site-packages/click/core.py", line 829, in __call__
    return self.main(*args, **kwargs)
  File "/usr/lib/python3.6/site-packages/click/core.py", line 782, in main
    rv = self.invoke(ctx)
  File "/usr/lib/python3.6/site-packages/click/core.py", line 1259, in invoke
    return _process_result(sub_ctx.command.invoke(sub_ctx))
  File "/usr/lib/python3.6/site-packages/click/core.py", line 1066, in invoke
    return ctx.invoke(self.callback, **ctx.params)
  File "/usr/lib/python3.6/site-packages/click/core.py", line 610, in invoke
    return callback(*args, **kwargs)
  File "/usr/lib/python3.6/site-packages/ocrd/cli/process.py", line 27, in process_cli
    run_tasks(mets, log_level, page_id, tasks, overwrite)
  File "/usr/lib/python3.6/site-packages/ocrd/task_sequence.py", line 148, in run_tasks
    raise Exception("%s exited with non-zero return value %s" % (task.executable, returncode))
Exception: ocrd-tesserocr-segment-region exited with non-zero return value 1

So, in the second attempt, the script goes on, when in the first try it had an error.

Also, in the second attempt, although the first process in the given pipeline (cis-ocropy-binarize -I OCR-D-IMG -O OCR-D-SEG-PAGE) should generate a fileGrp[@USE='OCR-D-SEG-PAGE'] in the mets.xml, this process seems to not run and consequently the next process cannot access this information.

Reproduction On the host, I am in a folder with an images folder, containing two digitized page images Bild1.jpg and Bild2.jpg and a Shell script problem.sh:

> ls
images/    problem.sh
> ls images/
Bild1.jpg   Bild2.jpg

The content of problem.sh is:

ocrd-import -P -i -r 300 images/
cd images
ocrd process -l DEBUG --overwrite \
  'cis-ocropy-binarize -I OCR-D-IMG -O OCR-D-SEG-PAGE' \
  'tesserocr-segment-region -I OCR-D-SEG-PAGE -O OCR-D-SEG-BLOCK' \
  'tesserocr-segment-line -I OCR-D-SEG-BLOCK -O OCR-D-SEG-LINE' \
  'tesserocr-recognize -I OCR-D-SEG-LINE -O OCR-D-OCR-TESSEROCR'

Then I run:

docker run -v $PWD:/data:Z -w /data -it ocrd/all:minimum problem.sh
# The first error occurs
docker run -v $PWD:/data:Z -w /data -it ocrd/all:minimum problem.sh
# The second error occurs

Note that we use skip the -u $(id -u) parameter, because we have podman running in the background and this parameter causes issues with file permissions.

bertsky commented 4 years ago

Thanks @GrazingScientist for your detailed report!

File "/usr/lib/python3.6/site-packages/ocrd/task_sequence.py", line 94, in validate_tasks first_task.validate() File "/usr/lib/python3.6/site-packages/ocrd/task_sequence.py", line 71, in validate param_validator = ParameterValidator(self.ocrd_tool_json) File "/usr/lib/python3.6/site-packages/ocrd/task_sequence.py", line 49, in ocrd_tool_json self._ocrd_tool_json = json.loads(result.stdout) File "/usr/lib/python3.6/json/init.py", line 354, in loads return _default_decoder.decode(s) File "/usr/lib/python3.6/json/decoder.py", line 342, in decode raise JSONDecodeError("Extra data", s, end) json.decoder.JSONDecodeError: Extra data: line 1 column 5 (char 4)

This looks like ocrd-cis-ocropy-binarize -J produced some illegal JSON. Could you please state what version of Docker image you are running?

docker images ocrd/all

Input fileGrp[@USE='OCR-D-SEG-PAGE'] not in METS!

Yes, very strange indeed. The ocrd process task sequencer seems to just skip the first task in the second run.

Probably related to https://github.com/OCR-D/core/issues/529.

I can reproduce that part. It goes away when I omit --overwrite. @kba is it possible that click does something funny when the order of parameters is not exactly the same as the decorators leading up to it?

bertsky commented 4 years ago

@kba is it possible that click does something funny when the order of parameters is not exactly the same as the decorators leading up to it?

No, the problem is mundane: https://github.com/OCR-D/core/blob/903ac6cba493ef450a4730ede84fcd5ee81b9ddd/ocrd/ocrd/task_sequence.py#L93

This modifies tasks in-place, affecting the job list to execute: https://github.com/OCR-D/core/blob/903ac6cba493ef450a4730ede84fcd5ee81b9ddd/ocrd/ocrd/task_sequence.py#L128

bertsky commented 4 years ago

This looks like ocrd-cis-ocropy-binarize -J produced some illegal JSON

I do get an additional log message here from matplotlib. But it ends up on stderr (at least in my logging config)...

2020-07-16 13:29:46,118.118 INFO matplotlib.font_manager - generated new fontManager
{
 "executable": "ocrd-cis-ocropy-binarize",
...
GrazingScientist commented 4 years ago

The docker version should be the most current one. docker images ocrd/all gives me:

REPOSITORY           TAG       IMAGE ID       CREATED      SIZE
docker.io/ocrd/all   minimum   b4e4ce729fe8   8 days ago   1.65 GB
bertsky commented 4 years ago

I do get an additional log message here from matplotlib. But it ends up on stderr (at least in my logging config)...

That's it. In a vanilla config, everything ends up on stdout (which the task sequencer tries to parse). And matplotlib does not show the message when running a second time.

bertsky commented 4 years ago

@kba clearly the tool itself is misbehaving (due to matplotlib's bad design choices). But how do we tackle this danger generally?

Disabling logging does not help:

ocrd-cis-ocropy-binarize -l OFF -J
2020-07-16 14:00:45,839.839 INFO matplotlib.font_manager - generated new fontManager
2020-07-16 14:00:46,038.038 INFO root - Overriding log level globally to OFF
{
 "executable": "ocrd-cis-ocropy-binarize",

Setting up a ~/ocrd_logging.cfg with stderr handler helps, but...

bertsky commented 4 years ago

So second problem being identical to #529, should we make this issue about the first problem, renaming it (e.g. ocrd process fails due to processor's logging mixed with JSON dump)?

https://github.com/OCR-D/core/blob/903ac6cba493ef450a4730ede84fcd5ee81b9ddd/ocrd/ocrd/task_sequence.py#L48-L49

GrazingScientist commented 4 years ago

I am fine with that (if this refered to me). :)

kba commented 4 years ago

@kba is it possible that click does something funny when the order of parameters is not exactly the same as the decorators leading up to it?

No, the problem is mundane: https://github.com/OCR-D/core/blob/903ac6cba493ef450a4730ede84fcd5ee81b9ddd/ocrd/ocrd/task_sequence.py#L93

This modifies tasks in-place, affecting the job list to execute: https://github.com/OCR-D/core/blob/903ac6cba493ef450a4730ede84fcd5ee81b9ddd/ocrd/ocrd/task_sequence.py#L128

That was a conscious change about the first task but I obviously didn't consider the consequences. I'll add a fix shortly and put in place tests to prevent this in the future. Thanks for investigating.

bertsky commented 4 years ago

clearly the tool itself is misbehaving (due to matplotlib's bad design choices). But how do we tackle this danger generally?

Disabling logging [on the command line] does not help:

Here we set up logging even when all we are asked to do is dump JSON or version or help:

https://github.com/OCR-D/core/blob/903ac6cba493ef450a4730ede84fcd5ee81b9ddd/ocrd/ocrd/decorators.py#L53-L59

But moving the getLogger into the branches might still not be enough. All processors inherit from ocrd.processor.Processor which necessarily imports ocrd.processor which has a module-level logging setup:

https://github.com/OCR-D/core/blob/903ac6cba493ef450a4730ede84fcd5ee81b9ddd/ocrd/ocrd/processor/base.py#L9

Maybe the only thing we can do is to try to disable all logging as soon as we know the job is to only dump the JSON:

logging.disable(logging.CRITICAL)
kba commented 4 years ago

That was the wrong fix:

$ ocrd-dummy -J
13:32:15.762 INFO root - Overriding log level globally to OFF
{
 "executable": "ocrd-dummy",
 "description": "Bare-bones processor that copies file from input group to output group",
 "steps": [
  "preprocessing/optimization"
 ],
 "categories": [
  "Image preprocessing"
 ],
 "input_file_grp": "DUMMY_INPUT",
 "output_file_grp": "DUMMY_OUTPUT"
}