Closed GrazingScientist closed 4 years ago
Thanks @GrazingScientist for your detailed report!
  File "/usr/lib/python3.6/site-packages/ocrd/task_sequence.py", line 94, in validate_tasks
    first_task.validate()
  File "/usr/lib/python3.6/site-packages/ocrd/task_sequence.py", line 71, in validate
    param_validator = ParameterValidator(self.ocrd_tool_json)
  File "/usr/lib/python3.6/site-packages/ocrd/task_sequence.py", line 49, in ocrd_tool_json
    self._ocrd_tool_json = json.loads(result.stdout)
  File "/usr/lib/python3.6/json/__init__.py", line 354, in loads
    return _default_decoder.decode(s)
  File "/usr/lib/python3.6/json/decoder.py", line 342, in decode
    raise JSONDecodeError("Extra data", s, end)
json.decoder.JSONDecodeError: Extra data: line 1 column 5 (char 4)
This looks like ocrd-cis-ocropy-binarize -J produced some illegal JSON. Could you please state which version of the Docker image you are running?
docker images ocrd/all
Input fileGrp[@USE='OCR-D-SEG-PAGE'] not in METS!
Yes, very strange indeed. The ocrd process task sequencer seems to just skip the first task in the second run.
Probably related to https://github.com/OCR-D/core/issues/529.
I can reproduce that part. It goes away when I omit --overwrite.
@kba is it possible that click does something funny when the order of parameters is not exactly the same as the decorators leading up to it?
No, the problem is mundane: https://github.com/OCR-D/core/blob/903ac6cba493ef450a4730ede84fcd5ee81b9ddd/ocrd/ocrd/task_sequence.py#L93
This modifies tasks in-place, affecting the job list to execute:
https://github.com/OCR-D/core/blob/903ac6cba493ef450a4730ede84fcd5ee81b9ddd/ocrd/ocrd/task_sequence.py#L128
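The effect described here can be reproduced in isolation. The following is a minimal sketch, not the actual code in task_sequence.py: the function names and the use of pop(0) are assumptions, chosen only to show how a validation step that mutates the list it receives changes what the caller executes afterwards, and how working on a copy avoids that.

```python
def validate_in_place(tasks):
    """Hypothetical validator that consumes the list it is given."""
    first_task = tasks.pop(0)  # also removes the task from the caller's list
    return [first_task] + tasks  # order in which the tasks were checked


def validate_on_copy(tasks):
    """Same check, but on a shallow copy, so the caller's list is untouched."""
    remaining = list(tasks)
    first_task = remaining.pop(0)
    return [first_task] + remaining


# In-place validation: the first task silently disappears from the job list.
tasks = ["binarize", "segment", "recognize"]
validate_in_place(tasks)
after_in_place = list(tasks)

# Validation on a copy: the job list is still complete afterwards.
tasks = ["binarize", "segment", "recognize"]
validate_on_copy(tasks)
after_copy = list(tasks)

print(after_in_place)
print(after_copy)
```

Any loop that later iterates over the list left behind by validate_in_place starts with the second task, which matches the symptom of the first task being skipped.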
This looks like ocrd-cis-ocropy-binarize -J produced some illegal JSON
I do get an additional log message here from matplotlib. But it ends up on stderr (at least in my logging config)...
2020-07-16 13:29:46,118.118 INFO matplotlib.font_manager - generated new fontManager
{
"executable": "ocrd-cis-ocropy-binarize",
...
The Docker version should be the most current one. docker images ocrd/all gives me:
REPOSITORY TAG IMAGE ID CREATED SIZE
docker.io/ocrd/all minimum b4e4ce729fe8 8 days ago 1.65 GB
I do get an additional log message here from matplotlib. But it ends up on stderr (at least in my logging config)...
That's it. In a vanilla config, everything ends up on stdout (which the task sequencer tries to parse). And matplotlib does not show the message when running a second time.
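Why a single log line is enough to break parsing can be shown with the standard library alone. This is a minimal sketch, assuming the captured stdout consists of one log line followed by the JSON dump. The sample string and the helper extract_json are made up for illustration and are not part of OCR-D.

```python
import json

# Simulated stdout of a processor: one log line, then the JSON dump.
stdout = (
    "2020-07-16 13:29:46,118.118 INFO matplotlib.font_manager - generated new fontManager\n"
    '{"executable": "ocrd-cis-ocropy-binarize"}\n'
)

try:
    json.loads(stdout)
    error = None
except json.JSONDecodeError as exc:
    # The leading "2020" is parsed as a complete JSON number,
    # so everything from character 4 onwards counts as extra data.
    error = exc

print(error)


def extract_json(text):
    """Hypothetical, fragile workaround: decode starting at the first '{'.

    It breaks as soon as a log message itself contains a '{'.
    """
    start = text.index("{")
    obj, _end = json.JSONDecoder().raw_decode(text, start)
    return obj


print(extract_json(stdout))
```

The position reported by the decoder (line 1, column 5) is the point right after the year of the timestamp, which is consistent with the traceback in the report. Keeping log output off stdout in the first place is more robust than skipping ahead to the first brace.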
@kba clearly the tool itself is misbehaving (due to matplotlib's bad design choices). But how do we tackle this danger generally?
Disabling logging does not help:
ocrd-cis-ocropy-binarize -l OFF -J
2020-07-16 14:00:45,839.839 INFO matplotlib.font_manager - generated new fontManager
2020-07-16 14:00:46,038.038 INFO root - Overriding log level globally to OFF
{
"executable": "ocrd-cis-ocropy-binarize",
Setting up a ~/ocrd_logging.cfg with a stderr handler helps, but...
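What a stderr handler achieves can be sketched with Python's logging module directly. This is not the content of an ocrd_logging.cfg. The logger name demo_stderr and the function dump_json are illustrative assumptions. The sketch captures both streams only to show where each piece of output ends up.

```python
import contextlib
import io
import json
import logging
import sys


def dump_json(tool):
    """Hypothetical stand-in for a processor that logs, then dumps its JSON."""
    logging.getLogger("demo_stderr").info("generated new fontManager")
    print(json.dumps(tool))


out, err = io.StringIO(), io.StringIO()
with contextlib.redirect_stdout(out), contextlib.redirect_stderr(err):
    # Attach a handler that writes to stderr instead of stdout.
    handler = logging.StreamHandler(sys.stderr)
    logger = logging.getLogger("demo_stderr")
    logger.setLevel(logging.INFO)
    logger.propagate = False  # keep the record away from any root handlers
    logger.addHandler(handler)
    dump_json({"executable": "ocrd-dummy"})
    logger.removeHandler(handler)

# stdout now holds nothing but JSON, so parsing succeeds.
parsed = json.loads(out.getvalue())
print(parsed)
```

With the log record routed to stderr, a caller that parses only stdout never sees it.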
Since the second problem is identical to #529, should we make this issue about the first problem and rename it (e.g. "ocrd process fails due to processor's logging mixed with JSON dump")?
I am fine with that (if this referred to me). :)
@kba is it possible that click does something funny when the order of parameters is not exactly the same as the decorators leading up to it?
No, the problem is mundane: https://github.com/OCR-D/core/blob/903ac6cba493ef450a4730ede84fcd5ee81b9ddd/ocrd/ocrd/task_sequence.py#L93
This modifies tasks in-place, affecting the job list to execute: https://github.com/OCR-D/core/blob/903ac6cba493ef450a4730ede84fcd5ee81b9ddd/ocrd/ocrd/task_sequence.py#L128
That was a conscious change about the first task but I obviously didn't consider the consequences. I'll add a fix shortly and put in place tests to prevent this in the future. Thanks for investigating.
clearly the tool itself is misbehaving (due to matplotlib's bad design choices). But how do we tackle this danger generally?
Disabling logging [on the command line] does not help:
Here we set up logging even when all we are asked to do is dump JSON or version or help:
But moving the getLogger into the branches might still not be enough. All processors inherit from ocrd.processor.Processor, which necessarily imports ocrd.processor, which has a module-level logging setup:
Maybe the only thing we can do is to try to disable all logging as soon as we know the job is to only dump the JSON:
logging.disable(logging.CRITICAL)
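The idea behind that call can be sketched as follows. This is a minimal illustration under assumptions, not the change made in OCR-D: the helper dump_json_quietly and the logger name demo_disable are hypothetical, and the handler is attached to stdout on purpose to mimic a configuration in which log output and JSON share one stream.

```python
import contextlib
import io
import json
import logging
import sys


def dump_json_quietly(tool):
    """Hypothetical helper: silence all logging, then print the JSON."""
    logging.disable(logging.CRITICAL)  # drop records of every standard level
    try:
        logging.getLogger("demo_disable").info("this record is dropped")
        print(json.dumps(tool, indent=2))
    finally:
        logging.disable(logging.NOTSET)  # restore normal logging


buffer = io.StringIO()
with contextlib.redirect_stdout(buffer):
    # Mimic a configuration where log records go to stdout.
    handler = logging.StreamHandler(sys.stdout)
    logger = logging.getLogger("demo_disable")
    logger.setLevel(logging.INFO)
    logger.propagate = False
    logger.addHandler(handler)
    dump_json_quietly({"executable": "ocrd-dummy"})
    logger.removeHandler(handler)

parsed = json.loads(buffer.getvalue())
print(parsed)
```

Note that logging.disable only affects records emitted after the call. Anything logged earlier, for example during module import, has already been written by then.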
That was the wrong fix:
$ ocrd-dummy -J
13:32:15.762 INFO root - Overriding log level globally to OFF
{
"executable": "ocrd-dummy",
"description": "Bare-bones processor that copies file from input group to output group",
"steps": [
"preprocessing/optimization"
],
"categories": [
"Image preprocessing"
],
"input_file_grp": "DUMMY_INPUT",
"output_file_grp": "DUMMY_OUTPUT"
}
Problem Description
We use the Docker image ocrd/all:minimum for testing purposes. We are running the ocrd process example from the documentation in a slightly modified form. (Please note that we removed -p param-tess-fraktur.json from the original example.)
When running these processes separately, everything is fine and runs without error. However, the ocrd process pipeline causes the following log output:
Then, running the same script again without changing anything, the log gives:
So the second attempt proceeds past the point where the first attempt failed.
Also, in the second attempt, although the first process in the given pipeline (cis-ocropy-binarize -I OCR-D-IMG -O OCR-D-SEG-PAGE) should generate a fileGrp[@USE='OCR-D-SEG-PAGE'] in the mets.xml, this process seems not to run, and consequently the next process cannot access this information.
Reproduction
On the host, I am in a folder with an images folder, containing two digitized page images, Bild1.jpg and Bild2.jpg, and a shell script, problem.sh:
The content of problem.sh is:
Then I run:
Note that we skip the -u $(id -u) parameter, because we have podman running in the background and this parameter causes issues with file permissions.