kba closed this issue 4 years ago
@kba, I assume you meant ocrd/tesserocr:edge, not ocrd/tesserocr – the latter does not even contain ocrd-tesserocr-binarize yet (at least on dockerhub).
BTW, ocrd-tesserocr-binarize is not going to do any actual work on the page level (neither for input PAGE nor for PAGE generated from image). That's because Tesseract's API does not allow binarization on the page level. So no efforts have been invested in this CLI to apply the method with a fake PSM.SINGLE_BLOCK image of the page. And this binarization method is really not worth any actual effort (it merely offers global Otsu). But the way it fails is an error IMO.
You should at least see a new PAGE output.
I think what's happening is that the runtime parameters do not get passed to the processor somehow. Here's why:
- "No output file group for images specified, falling back to 'OCR-D-IMG-BIN'" warning, regardless of whether I actually provided one.
- --log-level setting is ignored.
- INPUT file group.

Here is a log output (obtained only via ocrd_logging.py):
19:57:23.571 DEBUG ocrd.processor - Running processor <class 'ocrd_tesserocr.binarize.TesserocrBinarize'>
19:57:23.572 INFO processor.TesserocrBinarize - No output file group for images specified, falling back to 'OCR-D-IMG-BIN'
19:57:23.572 DEBUG ocrd.processor - Processor instance <ocrd_tesserocr.binarize.TesserocrBinarize object at 0x7f2d7cb61cd0> (ocrd-tesserocr-binarize v0.4.1 doing preprocessing/optimization/binarization)
19:57:23.666 INFO ocrd.workspace - Saving mets '/data/mets.xml'
But as far as I can see the decorators and processor class are all set up correctly. Something wrong with your Dockerfile, at least in the edge version, perhaps?
I gave up debugging this because these files are not the same:
@kba, I assume you meant ocrd/tesserocr:edge, not ocrd/tesserocr – the latter does not even contain ocrd-tesserocr-binarize yet (at least on dockerhub).
Yeah, I should have been clearer: I built ocrd/tesserocr locally from the edge branch.
But the way it fails is an error IMO.
Yeah, I just want to ensure that the behavior for pip-installed and docker-run is the same. Binarization is a bad example, I agree.
I think what's happening is that the runtime parameters do not get passed to the processor somehow.
That could well be, thanks, it's a lead.
I gave up debugging this because these files are not the same:
https://github.com/OCR-D/ocrd_tesserocr/blob/master/Dockerfile
It's confusing. The first link should be https://github.com/OCR-D/ocrd_tesserocr/blob/edge/Dockerfile (i.e. built from the edge branch).
DockerHub only displays the dockerfile (and README) of the master branch but is configured to build master -> latest and edge -> edge.
If you are still willing to debug: The dockerfile in the edge branch builds this image on dockerhub: https://hub.docker.com/layers/ocrd/tesserocr/edge/images/sha256-1f2a30d2f2c2dfc81ba97387a51678c557f24fea672c1ac3670f70ea49f7d153
I'll check it next week!
This looks better (note the quotes):
$ docker run -u $(id -u) -w /data -v $PWD:/data -- ocrd/tesserocr:edge "ocrd-tesserocr-binarize -I OCR-D-IMG -O OCR-D-IMG-BIN-DOCKER -m mets.xml"
12:29:05.205 INFO processor.TesserocrBinarize - No output file group for images specified, falling back to 'OCR-D-IMG-BIN'
12:29:05.276 INFO processor.TesserocrBinarize - INPUT FILE 0 / phys_0001
12:29:05.282 INFO processor.TesserocrBinarize - Binarizing on 'region' level in page 'phys_0001'
12:29:05.282 WARNING processor.TesserocrBinarize - Page 'phys_0001' contains no text regions
12:29:05.283 INFO processor.TesserocrBinarize - INPUT FILE 1 / phys_0002
12:29:05.284 INFO processor.TesserocrBinarize - Binarizing on 'region' level in page 'phys_0002'
12:29:05.284 WARNING processor.TesserocrBinarize - Page 'phys_0002' contains no text regions
12:29:05.284 INFO processor.TesserocrBinarize - INPUT FILE 2 / phys_0003
12:29:05.285 INFO processor.TesserocrBinarize - Binarizing on 'region' level in page 'phys_0003'
12:29:05.285 WARNING processor.TesserocrBinarize - Page 'phys_0003' contains no text regions
12:29:05.286 INFO processor.TesserocrBinarize - INPUT FILE 3 / phys_0004
12:29:05.286 INFO processor.TesserocrBinarize - Binarizing on 'region' level in page 'phys_0004'
12:29:05.286 WARNING processor.TesserocrBinarize - Page 'phys_0004' contains no text regions
12:29:05.289 INFO ocrd.workspace - Saving mets '/data/mets.xml'
Suggested fix (so the quotes aren't needed anymore):
diff --git a/Dockerfile b/Dockerfile
index c7b5888..0a84f03 100644
--- a/Dockerfile
+++ b/Dockerfile
@@ -21,4 +21,4 @@ RUN apt-get update && \
RUN pip3 install --upgrade pip
RUN make PYTHON=python3 PIP=pip3 deps install
-ENTRYPOINT ["/bin/sh", "-c"]
+ENTRYPOINT []
This looks better (note the quotes):
That was it! You have to put all arguments into a single shell-expanded argument.
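To make the quoting requirement concrete, here is a minimal sketch of how `sh -c` treats its arguments, which is what happens inside the container when the entrypoint is `["/bin/sh", "-c"]`. The words `one two three` are placeholders standing in for the processor's options; nothing here calls docker or ocrd-tesserocr-binarize itself.

```shell
# With ENTRYPOINT ["/bin/sh", "-c"], every word after the image name
# arrives as a separate argument to sh.

# Unquoted: only the first word ("echo") is the command string, so echo
# runs without arguments and prints an empty line.
unquoted=$(/bin/sh -c 'echo' one two three)

# Quoted: the whole command line is one argument, so echo sees all words.
quoted=$(/bin/sh -c 'echo one two three')

# The extra words are not lost; sh assigns them to $0, $1, $2, ...
# of the command string, which the unquoted call simply never reads.
leftover=$(/bin/sh -c 'echo "$0 $1 $2"' one two three)

echo "unquoted: [$unquoted]"
echo "quoted:   [$quoted]"
echo "leftover: [$leftover]"
```

This matches the symptom reported above: the processor starts, but behaves as if no -I/-O options had been given.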
-ENTRYPOINT ["/bin/sh", "-c"]
+ENTRYPOINT []
Great! That's not going to work with our process substitution expressions (for ad-hoc parameter JSON files), but we should have the immediate JSON syntax by now.
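One possible reading of why process substitution stops working once the entrypoint is no longer a shell: the `<(...)` expression is expanded by the shell on the host, so the container would only receive a path that refers to a descriptor of that host shell. A minimal sketch, assuming bash is available locally; the `{}` JSON content is only an illustrative placeholder, not an actual parameter set:

```shell
# bash replaces <(...) with a path (typically /dev/fd/N) before the command
# runs; the program only ever sees that path, not the JSON itself.
fd_path=$(bash -c 'echo <(echo "{}")')
echo "with process substitution the program would see: $fd_path"

# An immediate JSON string is passed as a plain argument and needs no file,
# so it does not depend on where the shell expansion happens.
json_arg='{}'
echo "with immediate JSON the program would see:       $json_arg"
```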
@kba Can you recommend that for module projects' docker files in general?
Suggested fix (so the quotes aren't needed anymore):
Thanks!
@kba Can you recommend that for module projects' docker files in general?
Indeed we should.
This will run ocrd-tesserocr-binarize but will only change the serialization of the mets.xml and add the agent but not do the actual work. What am I doing wrong?
@mikegerber @bertsky @wrznr Input appreciated, thanks!