OCR-D / ocrd_ocropy

OCRD CLI to ocropy
Apache License 2.0

cis-ocropy-segment crashes without error log #7

Closed beckstefan closed 4 years ago

beckstefan commented 4 years ago

With an up-to-date Docker image, I get the following when running

'olena-binarize -I OCR-D-IMG -O OCR-D-BIN' \
'cis-ocropy-deskew -I OCR-D-BIN -O OCR-D-DESKEW' \
'anybaseocr-crop -I OCR-D-DESKEW -O OCR-D-CROP' \
'cis-ocropy-segment -I OCR-D-CROP -O OCR-D-PAGE-SEG -P level-of-operation page' \
'tesserocr-recognize -I OCR-D-PAGE-SEG -O OCR-D-OCR -P model Fraktur'

(The workflow is an attempt to get the three columns recognized correctly in http://tudigit.ulb.tu-darmstadt.de/show/Gue-11660-24)

2020-09-16 11:33:03,918.918 INFO ocrd.task_sequence.run_tasks - Finished processing task 'anybaseocr-crop -I OCR-D-DESKEW -O OCR-D-CROP -p '{"force": true, "colSeparator": 0.04, "maxRularArea": 0.3, "minArea": 0.05, "minRularArea": 0.01, "positionBelow": 0.75, "positionLeft": 0.4, "positionRight": 0.6, "rularRatioMax": 10.0, "rularRatioMin": 3.0, "rularWidth": 0.95, "operation_level": "page"}''
2020-09-16 11:33:03,920.920 INFO ocrd.task_sequence.run_tasks - Start processing task 'cis-ocropy-segment -I OCR-D-CROP -O OCR-D-PAGE-SEG -p '{"level-of-operation": "page", "dpi": 0, "maxcolseps": 20, "maxseps": 20, "maximages": 10, "csminheight": 4, "hlminwidth": 10, "gap_height": 0.01, "gap_width": 1.5, "overwrite_order": true, "overwrite_separators": true, "overwrite_regions": true, "overwrite_lines": true, "spread": 2.4}''
Traceback (most recent call last):
  File "/usr/bin/ocrd", line 33, in <module>
    sys.exit(load_entry_point('ocrd', 'console_scripts', 'ocrd')())
  File "/usr/lib/python3.6/site-packages/click/core.py", line 829, in __call__
    return self.main(*args, **kwargs)
  File "/usr/lib/python3.6/site-packages/click/core.py", line 782, in main
    rv = self.invoke(ctx)
  File "/usr/lib/python3.6/site-packages/click/core.py", line 1259, in invoke
    return _process_result(sub_ctx.command.invoke(sub_ctx))
  File "/usr/lib/python3.6/site-packages/click/core.py", line 1066, in invoke
    return ctx.invoke(self.callback, **ctx.params)
  File "/usr/lib/python3.6/site-packages/click/core.py", line 610, in invoke
    return callback(*args, **kwargs)
  File "/build/core/ocrd/ocrd/cli/process.py", line 28, in process_cli
    run_tasks(mets, log_level, page_id, tasks, overwrite)
  File "/build/core/ocrd/ocrd/task_sequence.py", line 149, in run_tasks
    raise Exception("%s exited with non-zero return value %s. STDOUT:\n%s\nSTDERR:\n%s" % (task.executable, returncode, out, err))
Exception: ocrd-cis-ocropy-segment exited with non-zero return value -9. STDOUT:

STDERR:

Continuing manually to get the error:

docker run --rm -v /home/ocrd/workspace/gue-11660-24-e-1/:/data -- ocrd/all:maximum ocrd-cis-ocropy-segment -I OCR-D-CROP -O OCR-D-PAGE-SEG -P level-of-operation page -l DEBUG

This resulted in nothing happening (no output to the terminal, no OCR-D-PAGE-SEG folder), apart from an exit code of 137.

The source images are relatively big (10 MB, JPEG), but I can provide them as well if needed.

kba commented 4 years ago

the missing STDERR was reported this week, it's a bug in core I will try to fix asap.

no idea about the exit code. but I generally discourage ocrd_ocropy, there is a much better version in ocrd_cis.

beckstefan commented 4 years ago

I wasn't really aware that there is ocrd-ocropy and ocrd-cis(-ocropy?), so the title was maybe misleading. I used ocrd-cis-ocropy-segment.

About the STDERR I am not sure, because running directly doesn't give any STDERR (nor any STDOUT). Or does ocrd-cis-ocropy-segment use ocrd_core?

bertsky commented 4 years ago

the missing STDERR was reported this week, it's a bug in core I will try to fix asap.

@kba you mean https://github.com/OCR-D/core/issues/592?

I wasn't really aware that there is ocrd-ocropy and ocrd-cis(-ocropy?), so the title was maybe misleading. I used ocrd-cis-ocropy-segment.

@beckstefan The main work on wrapping Ocropy for OCR-D and improving it was done in ocrd_cis, whereas ocrd_ocropy does not offer anything useful yet and is currently inactive.

(I have no rights to transfer the issue to ocrd_cis, but also I am not sure it does belong there, as the problem seems to be in core's ocrd process.)

About the STDERR I am not sure, because running directly doesn't give any STDERR (nor any STDOUT). Or does ocrd-cis-ocropy-segment use ocrd_core?

All OCR-D wrappers (Python and bash based) use OCR-D/core. What @kba was saying was that the missing log messages are a problem specific to ocrd process (which is part of core).

So, could you please run your workflow directly, by calling the individual processor CLIs instead? (So we at least know what led up to the exit -9?)

For your workflow, that'll be:

ocrd-olena-binarize -I OCR-D-IMG -O OCR-D-BIN
ocrd-cis-ocropy-deskew -I OCR-D-BIN -O OCR-D-DESKEW
ocrd-anybaseocr-crop -I OCR-D-DESKEW -O OCR-D-CROP
ocrd-cis-ocropy-segment -I OCR-D-CROP -O OCR-D-PAGE-SEG -P level-of-operation page
ocrd-tesserocr-recognize -I OCR-D-PAGE-SEG -O OCR-D-OCR -P model Fraktur
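
If the steps are to be scripted rather than typed one by one, a minimal Python sketch could run them in order and stop at the first failure, keeping the raw return code. The helper name `run_steps` is made up for illustration, and the sketch assumes the processor executables are on the PATH and that the current directory is the workspace.

```python
import shlex
import subprocess

# The processor calls from the workflow above, one per step.
STEPS = [
    "ocrd-olena-binarize -I OCR-D-IMG -O OCR-D-BIN",
    "ocrd-cis-ocropy-deskew -I OCR-D-BIN -O OCR-D-DESKEW",
    "ocrd-anybaseocr-crop -I OCR-D-DESKEW -O OCR-D-CROP",
    "ocrd-cis-ocropy-segment -I OCR-D-CROP -O OCR-D-PAGE-SEG -P level-of-operation page",
    "ocrd-tesserocr-recognize -I OCR-D-PAGE-SEG -O OCR-D-OCR -P model Fraktur",
]


def run_steps(steps):
    """Run each command in order.

    Returns (command, returncode) for the first failing step,
    or None if every step exited with status 0.
    """
    for command in steps:
        # Output is not captured, so each processor logs to the terminal.
        result = subprocess.run(shlex.split(command))
        if result.returncode != 0:
            return command, result.returncode
    return None
```

Calling `run_steps(STEPS)` then reports which processor failed. On POSIX, `subprocess` reports a child killed by signal N as return code -N, which is presumably where the -9 in the traceback comes from.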

beckstefan commented 4 years ago

docker run --rm -u $(id -u) -v /home/ocrd/workspace/test:/data -- ocrd/all:maximum ocrd-olena-binarize -I OCR-D-IMG -O OCR-D-BIN
2020-09-17 10:33:39,691.691 INFO ocrd-olena-binarize - processing image/jpeg input file OCR-D-IMG-0001.jpg (img-0001.jpg)
Warning: integral_browsing - Adjusting window width since it was larger than image height.
Warning: integral_browsing - Adjusting window height since it was larger than image width.
Warning: integral_browsing - Adjusting window width since it was larger than image height.
Warning: integral_browsing - Adjusting window height since it was larger than image width.
Warning: integral_browsing - Adjusting window width since it was larger than image height.
Warning: integral_browsing - Adjusting window height since it was larger than image width.
Warning: integral_browsing - Adjusting window width since it was larger than image height.
Warning: integral_browsing - Adjusting window height since it was larger than image width.
Warning: integral_browsing - Adjusting window width since it was larger than image height.
Warning: integral_browsing - Adjusting window height since it was larger than image width.
Warning: integral_browsing - Adjusting window width since it was larger than image height.
Warning: integral_browsing - Adjusting window height since it was larger than image width.
2020-09-17 10:33:51,541.541 INFO ocrd-olena-binarize - processing image/jpeg input file OCR-D-IMG-0002.jpg (img-0002.jpg)
Warning: integral_browsing - Adjusting window width since it was larger than image height.
Warning: integral_browsing - Adjusting window height since it was larger than image width.
Warning: integral_browsing - Adjusting window width since it was larger than image height.
Warning: integral_browsing - Adjusting window height since it was larger than image width.
Warning: integral_browsing - Adjusting window width since it was larger than image height.
Warning: integral_browsing - Adjusting window width since it was larger than image height.
Warning: integral_browsing - Adjusting window height since it was larger than image width.
Warning: integral_browsing - Adjusting window width since it was larger than image height.
Warning: integral_browsing - Adjusting window height since it was larger than image width.
Warning: integral_browsing - Adjusting window width since it was larger than image height.
Warning: integral_browsing - Adjusting window width since it was larger than image height.
Warning: integral_browsing - Adjusting window height since it was larger than image width.
Warning: integral_browsing - Adjusting window width since it was larger than image height.
Warning: integral_browsing - Adjusting window height since it was larger than image width.
Warning: integral_browsing - Adjusting window width since it was larger than image height.
docker run --rm -u $(id -u) -v /home/ocrd/workspace/test:/data -- ocrd/all:maximum ocrd-cis-ocropy-deskew -I OCR-D-BIN -O OCR-D-DESKEW -P level-of-operation page -l DEBUG
2020-09-17 10:36:14,290.290 INFO ocrd.process.profile - Executing processor 'ocrd-cis-ocropy-deskew' took 92.366838s [--input-file-grp='OCR-D-BIN' --output-file-grp='OCR-D-DESKEW' --parameter='{"level-of-operation": "page", "maxskew": 5.0}'
docker run --rm -u $(id -u) -v /home/ocrd/workspace/test:/data -- ocrd/all:maximum ocrd-anybaseocr-crop -I OCR-D-DESKEW -O OCR-D-CROP
Matplotlib created a temporary config/cache directory at /tmp/matplotlib-w545rw0z because the default path (/.config/matplotlib) is not a writable directory; it is highly recommended to set the MPLCONFIGDIR environment variable to a writable directory, in particular to speed up the import of Matplotlib and to better support multiprocessing.
2020-09-17 10:38:34,867.867 INFO ocrd.process.profile - Executing processor 'ocrd-anybaseocr-crop' took 32.927905s [--input-file-grp='OCR-D-DESKEW' --output-file-grp='OCR-D-CROP' --parameter='{"force": true, "colSeparator": 0.04, "maxRularArea": 0.3, "minArea": 0.05, "minRularArea": 0.01, "positionBelow": 0.75, "positionLeft": 0.4, "positionRight": 0.6, "rularRatioMax": 10.0, "rularRatioMin": 3.0, "rularWidth": 0.95, "operation_level": "page"}']
docker run --rm -u $(id -u) -v /home/ocrd/workspace/test:/data -- ocrd/all:maximum ocrd-cis-ocropy-segment -I OCR-D-CROP -O OCR-D-PAGE-SEG -P level-of-operation page
ocrd@ocrd:~$ echo $?
137

bertsky commented 4 years ago

Thanks @beckstefan, we are getting there... it looks like core does not even start up the processor (as there's no message from the profile logger). Could you please run the last step with -l DEBUG?

beckstefan commented 4 years ago

No change in output.

(As I figured out, based on the above workflow, I also tried tesser-ocr-segment and anybaseocr-block-segmentation and got disgusting results, i.e. almost no segmentation was done, but that's a different issue for now.)

bertsky commented 4 years ago

No change in output.

I see. Well turns out I was wrong, the profile message only appears after the processor ran – unless it crashed.

Exit 137 could mean your container went out of memory for some reason, and ocrd-cis-ocropy-segment might be inefficient with large images. None of our processors currently downscale images in between, so if they are much higher than 600 DPI, consider downscaling externally. (This situation will improve in the future.)
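
For reference, the two numbers seen in this thread (137 from the shell, -9 from `ocrd process`) describe the same event. A small sketch of the decoding, assuming the usual conventions: a shell reports 128 + N for a process killed by signal N, and Python's `subprocess` reports -N. The function name `describe_exit` is made up for illustration.

```python
import signal


def describe_exit(code):
    """Describe a return code as seen from a shell (128 + N) or from
    Python's subprocess module (-N) when signal N killed the process."""
    if code < 0:
        signum = -code
    elif code > 128:
        signum = code - 128
    else:
        return "exited normally with status %d" % code
    try:
        name = signal.Signals(signum).name
    except ValueError:
        # Signal number not known on this platform.
        name = "signal %d" % signum
    return "killed by %s" % name


print(describe_exit(137))  # shell view of the docker run
print(describe_exit(-9))   # subprocess view inside ocrd process
```

Signal 9 is SIGKILL, which is what the kernel's out-of-memory killer sends. It cannot be caught, so the processor gets no chance to log anything before it dies.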

We really need to get our hands on the DEBUG level messages here. I'm afraid -l DEBUG being ineffective is a result of https://github.com/OCR-D/core/issues/597. As a workaround, you could set all loggers to that level in your ocrd_logging.conf. For your Docker installation, that would entail:

  1. create a text file with the following content:
    [loggers]
    keys=root
    [handlers]
    keys=consoleHandler
    [formatters]
    keys=defaultFormatter
    [logger_root]
    level=DEBUG
    handlers=consoleHandler
    [handler_consoleHandler]
    class=StreamHandler
    formatter=defaultFormatter
    args=(sys.stdout,)
    [formatter_defaultFormatter]
    format=%(levelname)s %(name)s - %(message)s
    datefmt=%H:%M:%S
  2. spin up the container for the processor with additional options mounting that file:
    docker run ... --mount type=bind,source=ocrd_logging.conf,destination=/etc/ocrd_logging.conf ...

beckstefan commented 4 years ago

Exit 137 could mean your container went out of memory for some reason, and ocrd-cis-ocropy-segment might be inefficient with large images. None of our processors currently downscale images in between, so if they are much higher than 600 DPI, consider downscaling externally. (This situation will improve in the future.)

Shrugs. I should have looked up 137 myself. Indeed, memory consumption is enormous. The image has a size of 4987x6199px at 400dpi. I downscaled to 300dpi and the process finished.
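
To put rough numbers on that, here is a small sketch of the arithmetic. The bytes-per-pixel values (1 for 8-bit grayscale, 8 for float64) are assumptions about what a processor might hold in memory, not measurements of ocrd-cis-ocropy-segment, and both function names are made up for illustration.

```python
def rescaled_size(width, height, dpi_from, dpi_to):
    """Pixel dimensions after resampling from one density to another."""
    factor = dpi_to / dpi_from
    return round(width * factor), round(height * factor)


def raw_megabytes(width, height, bytes_per_pixel):
    """Size of one uncompressed raster copy in MB (10**6 bytes)."""
    return width * height * bytes_per_pixel / 1e6


original = (4987, 6199)
smaller = rescaled_size(*original, dpi_from=400, dpi_to=300)

for label, (w, h) in (("400 dpi", original), ("300 dpi", smaller)):
    print(label, "%dx%d" % (w, h),
          "| 8-bit gray: %.0f MB" % raw_megabytes(w, h, 1),
          "| float64: %.0f MB" % raw_megabytes(w, h, 8))
```

A single copy is modest either way. What matters is how many full-size intermediate arrays are alive at the same time, which this sketch does not model.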

None of our processors currently downscale images in between, so if they are much higher than 600 DPI, consider downscaling externally. (This situation will improve in the future.)

Generally speaking, our standard is 400dpi and especially newspapers tend to be big, can you roughly estimate

* Memory consumption for given image resolution
* How does downscaling affect recognition quality

I know that there are no strict answers, but a rough tendency would be nice, knowing that in particular cases the statement won't apply.

And for completeness the output:

docker run --rm -u $(id -u) --mount type=bind,source=/home/ocrd/ocrd_logging.conf,destination=/etc/ocrd_logging.conf -v /home/ocrd/workspace/test:/data -- ocrd/all:maximum ocrd-cis-ocropy-segment -I OCR-D-CROP -O OCR-D-PAGE-SEG -P level-of-operation page -l DEBUG
DEBUG ocrd.resolver.download_to_directory - directory=|/data| url=|/data/mets.xml| basename=|mets.xml| if_exists=|skip| subdir=|None|
DEBUG ocrd.resolver.download_to_directory - Stop early, src_path and dst_path are the same: '/data/mets.xml' (url: '/data/mets.xml')
DEBUG PIL.PngImagePlugin - STREAM b'IHDR' 16 13
DEBUG PIL.PngImagePlugin - STREAM b'IDAT' 41 65536
DEBUG ocrd_utils.coords.shift_coordinates - shifting coordinates by [-2493.5 -3099.5]
DEBUG ocrd_utils.coords.rotate_coordinates - rotating coordinates by 0.50° around [2493.5 3099.5]
DEBUG ocrd_utils.coords.shift_coordinates - shifting coordinates by [2520.45295194 3121.1415968 ]
DEBUG ocrd_utils.coords.shift_coordinates - shifting coordinates by [-59 -80]

bertsky commented 4 years ago

Generally speaking, our standard is 400dpi and especially newspapers tend to be big

Indeed. OCR-D of course was developed mainly focussing on printed books. Existing processors don't downscale by themselves, and we have not yet allowed making downscaled annotations as a preprocessing step. (That's because we first need PAGE-XML to support representing scale.)

I don't think it's necessary to change the functional model to support crop-based partial processing though. Machines will become more powerful, while newspapers don't grow.

can you roughly estimate

* Memory consumption for given image resolution

Phew, that's a tough (but good) question. We've made some runtime performance statistics, but without varying/factoring DPI and without looking at memory consumption yet.

Generally, most rule-based processors will use algorithms of at least O(n²) in pixel resolution. But for the proportionality factor I don't even have a ballpark number. I could give you anecdotal measurements, but I have no statistics yet. Maybe we'll start gathering this though.
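
As a rough illustration of what such scaling implies, the sketch below compares the cost at two densities. It assumes that n means the total number of pixels and that cost grows as n to some exponent. Both are assumptions for the sake of the example, and `relative_cost` is a made-up name.

```python
def relative_cost(dpi_from, dpi_to, exponent):
    """Cost ratio after resampling, for an algorithm whose cost grows as
    (pixel count) ** exponent. Pixel count itself grows with dpi ** 2."""
    pixel_ratio = (dpi_to / dpi_from) ** 2
    return pixel_ratio ** exponent


# Going from 400 dpi to 300 dpi under a linear and a quadratic cost model.
for exponent in (1, 2):
    print("exponent", exponent, "->", round(relative_cost(400, 300, exponent), 3))
```

Under the quadratic model, going from 400 to 300 DPI cuts the cost to roughly a third, which would fit the observation that the smaller image went through. It says nothing about absolute memory use.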

* How does downscaling affect recognition quality

300 DPI should always be good enough. Some processors (esp. for preprocessing and segmentation) may even run suboptimally on larger (> 500 DPI) resolutions (if they are badly written, with fixed parameters assuming a certain density).

bertsky commented 4 years ago

Maybe you should open an issue on OCR-D/ocrd-website for documenting (rough estimates of) resource requirements.

Can we close?