Closed beckstefan closed 4 years ago
The missing STDERR was reported this week; it's a bug in core that I will try to fix ASAP.
No idea about the exit code. But I generally discourage ocrd_ocropy; there is a much better version in ocrd_cis.
I wasn't really aware that there is ocrd-ocropy and ocrd-cis(-ocropy?), so the title was maybe misleading. I used ocrd-cis-ocropy-segment.
About the STDERR I am not sure, because running directly doesn't give any STDERR (nor any STDOUT). Or does ocrd-cis-ocropy-segment use ocrd_core?
> the missing STDERR was reported this week, it's a bug in core I will try to fix asap.
@kba you mean https://github.com/OCR-D/core/issues/592?
> I wasn't really aware that there is ocrd-ocropy and ocrd-cis(-ocropy?), so the title was maybe misleading. I used ocrd-cis-ocropy-segment.
@beckstefan The main work on wrapping Ocropy for OCR-D and improving it was done in ocrd_cis, whereas ocrd_ocropy does not offer anything useful yet and is currently inactive.
(I have no rights to transfer the issue to ocrd_cis, but I am also not sure it belongs there, as the problem seems to be in core's ocrd process.)
> About the STDERR I am not sure, because running directly doesn't give any STDERR (and neither STDOUT). Or does ocrd-cis-ocropy-segment use ocrd_core?
All OCR-D wrappers (Python and bash based) use OCR-D/core. What @kba was saying was that the missing log messages are a problem specific to ocrd process (which is part of core).
So, could you please run your workflow directly, by calling the individual processor CLIs instead? (So we at least know what led up to the exit -9?)
For your workflow, that'll be:
ocrd-olena-binarize -I OCR-D-IMG -O OCR-D-BIN
ocrd-cis-ocropy-deskew -I OCR-D-BIN -O OCR-D-DESKEW
ocrd-anybaseocr-crop -I OCR-D-DESKEW -O OCR-D-CROP
ocrd-cis-ocropy-segment -I OCR-D-CROP -O OCR-D-PAGE-SEG -P level-of-operation page
ocrd-tesserocr-recognize -I OCR-D-PAGE-SEG -O OCR-D-OCR -P model Fraktur
docker run --rm -u $(id -u) -v /home/ocrd/workspace/test:/data -- ocrd/all:maximum ocrd-olena-binarize -I OCR-D-IMG -O OCR-D-BIN
2020-09-17 10:33:39,691.691 INFO ocrd-olena-binarize - processing image/jpeg input file OCR-D-IMG-0001.jpg (img-0001.jpg)
Warning: integral_browsing - Adjusting window width since it was larger than image height.
Warning: integral_browsing - Adjusting window height since it was larger than image width.
Warning: integral_browsing - Adjusting window width since it was larger than image height.
Warning: integral_browsing - Adjusting window height since it was larger than image width.
Warning: integral_browsing - Adjusting window width since it was larger than image height.
Warning: integral_browsing - Adjusting window height since it was larger than image width.
Warning: integral_browsing - Adjusting window width since it was larger than image height.
Warning: integral_browsing - Adjusting window height since it was larger than image width.
Warning: integral_browsing - Adjusting window width since it was larger than image height.
Warning: integral_browsing - Adjusting window height since it was larger than image width.
Warning: integral_browsing - Adjusting window width since it was larger than image height.
Warning: integral_browsing - Adjusting window height since it was larger than image width.
2020-09-17 10:33:51,541.541 INFO ocrd-olena-binarize - processing image/jpeg input file OCR-D-IMG-0002.jpg (img-0002.jpg)
Warning: integral_browsing - Adjusting window width since it was larger than image height.
Warning: integral_browsing - Adjusting window height since it was larger than image width.
Warning: integral_browsing - Adjusting window width since it was larger than image height.
Warning: integral_browsing - Adjusting window height since it was larger than image width.
Warning: integral_browsing - Adjusting window width since it was larger than image height.
Warning: integral_browsing - Adjusting window width since it was larger than image height.
Warning: integral_browsing - Adjusting window height since it was larger than image width.
Warning: integral_browsing - Adjusting window width since it was larger than image height.
Warning: integral_browsing - Adjusting window height since it was larger than image width.
Warning: integral_browsing - Adjusting window width since it was larger than image height.
Warning: integral_browsing - Adjusting window width since it was larger than image height.
Warning: integral_browsing - Adjusting window height since it was larger than image width.
Warning: integral_browsing - Adjusting window width since it was larger than image height.
Warning: integral_browsing - Adjusting window height since it was larger than image width.
Warning: integral_browsing - Adjusting window width since it was larger than image height.
docker run --rm -u $(id -u) -v /home/ocrd/workspace/test:/data -- ocrd/all:maximum ocrd-cis-ocropy-deskew -I OCR-D-BIN -O OCR-D-DESKEW -P level-of-operation page -l DEBUG
2020-09-17 10:36:14,290.290 INFO ocrd.process.profile - Executing processor 'ocrd-cis-ocropy-deskew' took 92.366838s [--input-file-grp='OCR-D-BIN' --output-file-grp='OCR-D-DESKEW' --parameter='{"level-of-operation": "page", "maxskew": 5.0}'
docker run --rm -u $(id -u) -v /home/ocrd/workspace/test:/data -- ocrd/all:maximum ocrd-anybaseocr-crop -I OCR-D-DESKEW -O OCR-D-CROP
Matplotlib created a temporary config/cache directory at /tmp/matplotlib-w545rw0z because the default path (/.config/matplotlib) is not a writable directory; it is highly recommended to set the MPLCONFIGDIR environment variable to a writable directory, in particular to speed up the import of Matplotlib and to better support multiprocessing.
2020-09-17 10:38:34,867.867 INFO ocrd.process.profile - Executing processor 'ocrd-anybaseocr-crop' took 32.927905s [--input-file-grp='OCR-D-DESKEW' --output-file-grp='OCR-D-CROP' --parameter='{"force": true, "colSeparator": 0.04, "maxRularArea": 0.3, "minArea": 0.05, "minRularArea": 0.01, "positionBelow": 0.75, "positionLeft": 0.4, "positionRight": 0.6, "rularRatioMax": 10.0, "rularRatioMin": 3.0, "rularWidth": 0.95, "operation_level": "page"}']
docker run --rm -u $(id -u) -v /home/ocrd/workspace/test:/data -- ocrd/all:maximum ocrd-cis-ocropy-segment -I OCR-D-CROP -O OCR-D-PAGE-SEG -P level-of-operation page
ocrd@ocrd:~$ echo $?
137
Thanks @beckstefan, we are getting there... it looks like core does not even start up the processor (as there's no message from the profile logger). Could you please run the last step with -l DEBUG?
No change in output.
(As I figured out, based on the above workflow, I also tried tesser-ocr-segment and anybaseocr-block-segmentation and got disgusting results, i.e. almost no segmentation was done, but that's a different issue for now.)
> No change in output.
I see. Well turns out I was wrong, the profile message only appears after the processor ran – unless it crashed.
Exit 137 could mean your container went out of memory for some reason, and ocrd-cis-ocropy-segment might be inefficient with large images. None of our processors currently downscale images in between, so if they are much higher than 600 DPI, consider downscaling externally. (This situation will improve in the future.)
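For reference, the two codes seen in this thread (the shell's 137 and the exit -9 mentioned earlier) can be decoded the same way. This is a minimal sketch, not part of OCR-D: the helper name `describe_exit_status` is made up for illustration. It relies on two conventions: shells report 128 + N for a process killed by signal N, and Python's `subprocess` reports -N.

```python
import signal


def describe_exit_status(status: int) -> str:
    """Interpret an exit status as printed by `echo $?` or by Python's subprocess."""
    if status < 0:
        # Python's subprocess module reports "killed by signal N" as -N.
        signum = -status
    elif status > 128:
        # POSIX shells report "killed by signal N" as 128 + N.
        signum = status - 128
    else:
        return f"exited with code {status}"
    try:
        name = signal.Signals(signum).name
    except ValueError:
        # Signal number not known on this platform.
        name = f"signal {signum}"
    return f"terminated by {name}"


print(describe_exit_status(137))
print(describe_exit_status(-9))
```

On Linux, both calls report SIGKILL, which is the signal the kernel's out-of-memory killer sends. The code alone does not prove an out-of-memory condition, since anything can send SIGKILL.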
We really need to get our hands on the DEBUG level messages here. I'm afraid -l DEBUG being ineffective is a result of https://github.com/OCR-D/core/issues/597. As a workaround, you could set all loggers to that level in your ocrd_logging.conf. For your Docker installation, that would entail:
[loggers]
keys=root
[handlers]
keys=consoleHandler
[formatters]
keys=defaultFormatter
[logger_root]
level=DEBUG
handlers=consoleHandler
[handler_consoleHandler]
class=StreamHandler
formatter=defaultFormatter
args=(sys.stdout,)
[formatter_defaultFormatter]
format=%(levelname)s %(name)s - %(message)s
datefmt=%H:%M:%S
docker run ... --mount type=bind,source=ocrd_logging.conf,destination=/etc/ocrd_logging.conf ...
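Before mounting the file, it may be worth checking that it parses. The sketch below assumes the file is read with Python's standard `logging.config.fileConfig`; that is an assumption about core, not something confirmed in this thread. The helper name `load_config` is made up for illustration. It feeds the same configuration (with the formatter key spelled `datefmt`) through a temporary file and checks the resulting root logger.

```python
import logging
import logging.config
import os
import tempfile

# Same content as the ocrd_logging.conf shown above.
CONFIG = """\
[loggers]
keys=root

[handlers]
keys=consoleHandler

[formatters]
keys=defaultFormatter

[logger_root]
level=DEBUG
handlers=consoleHandler

[handler_consoleHandler]
class=StreamHandler
formatter=defaultFormatter
args=(sys.stdout,)

[formatter_defaultFormatter]
format=%(levelname)s %(name)s - %(message)s
datefmt=%H:%M:%S
"""


def load_config(text: str) -> logging.Logger:
    """Write the config to a temporary file, load it, and return the root logger."""
    with tempfile.NamedTemporaryFile("w", suffix=".conf", delete=False) as handle:
        handle.write(text)
        path = handle.name
    try:
        logging.config.fileConfig(path, disable_existing_loggers=False)
    finally:
        os.remove(path)
    return logging.getLogger()


root = load_config(CONFIG)
root.debug("DEBUG messages are now visible")
```

If the file has a structural error (for example a missing section), `fileConfig` raises an exception here instead of failing inside the container.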
> Exit 137 could mean your container went out of memory for some reason, and ocrd-cis-ocropy-segment might be inefficient with large images. None of our processors currently downscale images in between, so if they are much higher than 600 DPI, consider downscaling externally. (This situation will improve in the future.)
Shrugs. I should have looked up 137 myself. Indeed, memory consumption is enormous. The image has a size of 4987x6199px at 400dpi. I downscaled to 300dpi and the process finished.
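To put numbers on that downscaling step, here is a rough back-of-the-envelope sketch. The 8 bytes per pixel (one float64 working array) is only an assumption for illustration, not a measured figure for ocrd-cis-ocropy-segment, and the helper names are made up.

```python
def scaled_size(width_px: int, height_px: int, dpi_from: int, dpi_to: int) -> tuple:
    """Pixel dimensions after resampling from one DPI to another."""
    factor = dpi_to / dpi_from
    return round(width_px * factor), round(height_px * factor)


def array_megabytes(width_px: int, height_px: int, bytes_per_pixel: int) -> float:
    """Size of one full-page array in MB (10**6 bytes)."""
    return width_px * height_px * bytes_per_pixel / 1e6


width, height = 4987, 6199                      # page size reported above, at 400dpi
small_w, small_h = scaled_size(width, height, 400, 300)

# ASSUMPTION: 8 bytes per pixel, i.e. a single float64 copy of the page.
before = array_megabytes(width, height, 8)
after = array_megabytes(small_w, small_h, 8)

print(f"400dpi: {width}x{height}px, {before:.1f} MB per float64 array")
print(f"300dpi: {small_w}x{small_h}px, {after:.1f} MB per float64 array")
print(f"pixel count ratio: {after / before:.2f}")
```

Going from 400dpi to 300dpi scales each side by 0.75, so the pixel count drops to about 56% of the original. A processor that keeps several full-page arrays in memory, or whose cost grows faster than linearly in the pixel count, saves correspondingly more.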
> None of our processors currently downscale images in between, so if they are much higher than 600 DPI, consider downscaling externally. (This situation will improve in the future.)
Generally speaking, our standard is 400dpi and especially newspapers tend to be big. Can you roughly estimate:
* Memory consumption for given image resolution
* How downscaling affects recognition quality
I know that there are no strict answers, but a rough tendency would be nice, knowing that in particular cases the statement won't apply.
And for completeness the output:
docker run --rm -u $(id -u) --mount type=bind,source=/home/ocrd/ocrd_logging.conf,destination=/etc/ocrd_logging.conf -v /home/ocrd/workspace/test:/data -- ocrd/all:maximum ocrd-cis-ocropy-segment -I OCR-D-CROP -O OCR-D-PAGE-SEG -P level-of-operation page -l DEBUG
DEBUG ocrd.resolver.download_to_directory - directory=|/data| url=|/data/mets.xml| basename=|mets.xml| if_exists=|skip| subdir=|None|
DEBUG ocrd.resolver.download_to_directory - Stop early, src_path and dst_path are the same: '/data/mets.xml' (url: '/data/mets.xml')
DEBUG PIL.PngImagePlugin - STREAM b'IHDR' 16 13
DEBUG PIL.PngImagePlugin - STREAM b'IDAT' 41 65536
DEBUG ocrd_utils.coords.shift_coordinates - shifting coordinates by [-2493.5 -3099.5]
DEBUG ocrd_utils.coords.rotate_coordinates - rotating coordinates by 0.50° around [2493.5 3099.5]
DEBUG ocrd_utils.coords.shift_coordinates - shifting coordinates by [2520.45295194 3121.1415968 ]
DEBUG ocrd_utils.coords.shift_coordinates - shifting coordinates by [-59 -80]
> Generally speaking, our standard is 400dpi and especially newspaper tend to be big
Indeed. OCR-D of course was developed mainly focussing on printed books. Existing processors don't downscale by themselves, and we have not yet allowed making downscaled annotations as a preprocessing step. (That's because we first need PAGE-XML to support representing scale.)
I don't think it's necessary to change the functional model to support crop-based partial processing though. Machines will become more powerful, while newspapers don't grow.
> can you roughly estimate
> * Memory consumption for given image resolution
Phew, that's a tough (but good) question. We've made some runtime performance statistics, but without varying/factoring DPI and without looking at memory consumption yet.
Generally, most rule-based processors will use algorithms of at least O(n²) in pixel resolution. But for the proportionality constant I don't even have a ballpark number. I could give you anecdotal measurements, but I have no statistics yet. Maybe we'll start gathering this though.
> * How does downscaling affect recognition quality
300 DPI should always be good enough. Some processors (esp. for preprocessing and segmentation) may even run suboptimally at larger (> 500 DPI) resolutions (if they are badly written, with fixed parameters assuming a certain density).
Maybe you should open an issue on OCR-D/ocrd-website for documenting (rough estimates of) resource requirements.
Can we close?
With up-to-date docker I get when running
(The workflow is an attempt to get the three columns recognized correctly in http://tudigit.ulb.tu-darmstadt.de/show/Gue-11660-24)
Continuing manually to get the error:
Resulted in nothing happening (no output to terminal, no folder OCR-D-PAGE-SEG), except for an exit code of 137.
The source images are relatively big (10MB, jpeg), but I can provide them in case of need as well.