OCR-D / ocrd_tesserocr

Run tesseract with the tesserocr bindings with @OCR-D's interfaces
MIT License

Running in a docker volume doesn't work #84

Closed kba closed 4 years ago

kba commented 4 years ago
wget 'https://ocr-d-repo.scc.kit.edu/api/v1/dataresources/736a2f9a-92c6-4fe3-a457-edfa3eab1fe3/data/wundt_grundriss_1896.ocrd.zip'
unzip wundt_grundriss_1896.ocrd.zip
cd data
docker run -u $(id -u) -w /data -v $PWD:/data -- ocrd/tesserocr:edge ocrd-tesserocr-binarize -I OCR-D-IMG -O OCR-D-IMG-BIN-DOCKER

This will run ocrd-tesserocr-binarize but will only change the serialization of the mets.xml and add the agent but not do the actual work. What am I doing wrong?

@mikegerber @bertsky @wrznr Input appreciated, thanks!

bertsky commented 4 years ago

@kba, I assume you meant ocrd/tesserocr:edge, not ocrd/tesserocr – the latter does not even contain ocrd-tesserocr-binarize yet (at least on dockerhub).

BTW, ocrd-tesserocr-binarize is not going to do any actual work on the page level (neither for input PAGE nor for PAGE generated from image). That's because Tesseract's API does not allow binarization on the page level. So no efforts have been invested in this CLI to apply the method with a fake PSM.SINGLE_BLOCK image of the page. And this binarization method is really not worth any actual effort (it merely offers global Otsu). But the way it fails is an error IMO.

You should at least see a new PAGE output.

I think what's happening is that the runtime parameters do not get passed to the processor somehow. Here's why:

  1. I always get the "No output file group for images specified, falling back to 'OCR-D-IMG-BIN'" message, regardless of whether I actually provided one.
  2. Any --log-level setting is ignored.
  3. It would explain why no output is written: the processor never sees an input file group.

Here is a log output (obtained only via ocrd_logging.py):

19:57:23.571 DEBUG ocrd.processor - Running processor <class 'ocrd_tesserocr.binarize.TesserocrBinarize'>
19:57:23.572 INFO processor.TesserocrBinarize - No output file group for images specified, falling back to 'OCR-D-IMG-BIN'
19:57:23.572 DEBUG ocrd.processor - Processor instance <ocrd_tesserocr.binarize.TesserocrBinarize object at 0x7f2d7cb61cd0> (ocrd-tesserocr-binarize v0.4.1 doing preprocessing/optimization/binarization)
19:57:23.666 INFO ocrd.workspace - Saving mets '/data/mets.xml'

bertsky commented 4 years ago

But as far as I can see the decorators and processor class are all set up correctly. Something wrong with your Dockerfile, at least in the edge version, perhaps?

mikegerber commented 4 years ago

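I gave up debugging this because these files are not the same:

https://github.com/OCR-D/ocrd_tesserocr/blob/master/Dockerfile

https://hub.docker.com/r/ocrd/tesserocr/dockerfile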

kba commented 4 years ago

> @kba, I assume you meant ocrd/tesserocr:edge, not ocrd/tesserocr – the latter does not even contain ocrd-tesserocr-binarize yet (at least on dockerhub).

Yeah, I should have been clearer: I built ocrd/tesserocr locally from the edge branch.

> But the way it fails is an error IMO.

Yeah, I just want to ensure that the behavior for pip-installed and docker-run is the same. Binarization is a bad example, I agree.

> I think what's happening is that the runtime parameters do not get passed to the processor somehow.

That could well be, thanks, it's a lead.

kba commented 4 years ago

> I gave up debugging this because these files are not the same:
>
> https://github.com/OCR-D/ocrd_tesserocr/blob/master/Dockerfile
>
> https://hub.docker.com/r/ocrd/tesserocr/dockerfile

It's confusing. The first link should be

https://github.com/OCR-D/ocrd_tesserocr/blob/edge/Dockerfile

(i.e. built from the edge branch)

DockerHub only displays the Dockerfile (and README) of the master branch, but it is configured to build master -> latest and edge -> edge.

If you are still willing to debug: the Dockerfile in the edge branch builds this image on DockerHub: https://hub.docker.com/layers/ocrd/tesserocr/edge/images/sha256-1f2a30d2f2c2dfc81ba97387a51678c557f24fea672c1ac3670f70ea49f7d153

mikegerber commented 4 years ago

I'll check it next week!

mikegerber commented 4 years ago

This looks better (note the quotes):

$ docker run -u $(id -u) -w /data -v $PWD:/data -- ocrd/tesserocr:edge "ocrd-tesserocr-binarize -I OCR-D-IMG -O OCR-D-IMG-BIN-DOCKER -m mets.xml"
12:29:05.205 INFO processor.TesserocrBinarize - No output file group for images specified, falling back to 'OCR-D-IMG-BIN'
12:29:05.276 INFO processor.TesserocrBinarize - INPUT FILE 0 / phys_0001
12:29:05.282 INFO processor.TesserocrBinarize - Binarizing on 'region' level in page 'phys_0001'
12:29:05.282 WARNING processor.TesserocrBinarize - Page 'phys_0001' contains no text regions
12:29:05.283 INFO processor.TesserocrBinarize - INPUT FILE 1 / phys_0002
12:29:05.284 INFO processor.TesserocrBinarize - Binarizing on 'region' level in page 'phys_0002'
12:29:05.284 WARNING processor.TesserocrBinarize - Page 'phys_0002' contains no text regions
12:29:05.284 INFO processor.TesserocrBinarize - INPUT FILE 2 / phys_0003
12:29:05.285 INFO processor.TesserocrBinarize - Binarizing on 'region' level in page 'phys_0003'
12:29:05.285 WARNING processor.TesserocrBinarize - Page 'phys_0003' contains no text regions
12:29:05.286 INFO processor.TesserocrBinarize - INPUT FILE 3 / phys_0004
12:29:05.286 INFO processor.TesserocrBinarize - Binarizing on 'region' level in page 'phys_0004'
12:29:05.286 WARNING processor.TesserocrBinarize - Page 'phys_0004' contains no text regions
12:29:05.289 INFO ocrd.workspace - Saving mets '/data/mets.xml'

Suggested fix (so the quotes aren't needed anymore):

diff --git a/Dockerfile b/Dockerfile
index c7b5888..0a84f03 100644
--- a/Dockerfile
+++ b/Dockerfile
@@ -21,4 +21,4 @@ RUN apt-get update && \
 RUN pip3 install --upgrade pip
 RUN make PYTHON=python3 PIP=pip3 deps install

-ENTRYPOINT ["/bin/sh", "-c"]
+ENTRYPOINT []

bertsky commented 4 years ago

> This looks better (note the quotes):

That was it! You have to put all arguments into a single shell-expanded argument.
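
For anyone wondering why the quotes matter, here is a minimal sketch that reproduces the effect without Docker. It assumes that ENTRYPOINT ["/bin/sh", "-c"] simply puts `sh -c` in front of whatever follows the image name; `echo` is only a stand-in for the processor executable, not the real CLI.

```shell
# 'echo' stands in for the processor executable (assumption for illustration).

# Unquoted: sh -c takes only "echo" as the command string;
# the remaining words become $0, $1, ... of that shell and never reach echo.
unquoted=$(sh -c echo args: -I OCR-D-IMG -O OCR-D-IMG-BIN-DOCKER)

# Quoted: the whole command line is one command string, so echo sees every word.
quoted=$(sh -c "echo args: -I OCR-D-IMG -O OCR-D-IMG-BIN-DOCKER")

printf 'unquoted: [%s]\n' "$unquoted"
printf 'quoted:   [%s]\n' "$quoted"
```

In the unquoted case the stand-in runs with no arguments at all, which would match the processor falling back to its defaults and finding no input file group.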

> -ENTRYPOINT ["/bin/sh", "-c"]
> +ENTRYPOINT []

Great! That's not going to work with our process substitution expressions (for ad-hoc parameter JSON files), but we should have the immediate JSON syntax by now.

@kba Can you recommend that for module projects' docker files in general?

kba commented 4 years ago

> Suggested fix (so the quotes aren't needed anymore):

Thanks!

> @kba Can you recommend that for module projects' docker files in general?

Indeed we should.