OCR-D / ocrd_olena

Binarize with Olena/scribo
GNU General Public License v2.0

sauvola-ms-split in combination with anybaseocr-crop #48

Closed by jbarth-ubhd 4 years ago

jbarth-ubhd commented 4 years ago

I'm trying some workflow variations. All variants that use olena with impl:sauvola-ms-split, such as this workflow (similar to workflow-configuration/crop-...-tesseract.mk; also tried with k=0.34):

/usr/bin/time ocrd process \
"olena-binarize -I OCR-D-IMG -O OCR-D-N1,OCR-D-M1 -p '{\"impl\":\"sauvola-ms-split\",\"k\":0.08}'" \
"anybaseocr-crop -I OCR-D-N1 -O OCR-D-N2" \
"olena-binarize -I OCR-D-N2 -O OCR-D-N3,OCR-D-M3 -p '{\"impl\":\"sauvola-ms-split\",\"k\":0.08}'" \
"cis-ocropy-denoise -I OCR-D-N3 -O OCR-D-N4 -p '{\"level-of-operation\":\"page\",\"noise_maxsize\":3.0}'" \
"cis-ocropy-deskew -I OCR-D-N4 -O OCR-D-N5 -p '{\"level-of-operation\":\"page\",\"maxskew\":5}'" \
"tesserocr-segment-region -I OCR-D-N5 -O OCR-D-N6 -p '{\"padding\":5,\"find_tables\":false}'" \
"segment-repair -I OCR-D-N6 -O OCR-D-N7 -p '{\"plausibilize\":true}'" \
"cis-ocropy-clip -I OCR-D-N7 -O OCR-D-N8" \
"cis-ocropy-segment -I OCR-D-N8 -O OCR-D-N9 -p '{\"spread\":2.4}'" \
"cis-ocropy-dewarp -I OCR-D-N9 -O OCR-D-N10" \
"calamari-recognize -I OCR-D-N10 -O OCR-D-OCR -p '{\"checkpoint\":\"/usr/local/ocrd_models/calamari/calamari_models/fraktur_19th_century/*.ckpt.json\"}'"

fail with this error:

17:37:32.941 INFO ocrd-olena-binarize - processing PAGE-XML input file OCR-D-N2_0017 (P_0017)
17:37:32.966 INFO ocrd-olena-binarize - found AlternativeImage filename 'OCR-D-IMG-CROP/OCR-D-IMG-CROP_0017.png'
warning: magick read: sauvola_ms_split: No image was loaded.
terminate called after throwing an instance of 'Magick::ErrorCache'
  what():  sauvola_ms_split: no pixels defined in cache `OCR-D-IMG-CROP/OCR-D-IMG-CROP_0017.png' @ error/cache.c/OpenPixelCache/3906
Aborted (core dumped)
Traceback (most recent call last):
  File "/usr/local/ocrd_all/venv/bin/ocrd", line 8, in <module>
    sys.exit(cli())
  File "/usr/local/ocrd_all/venv/lib/python3.7/site-packages/click/core.py", line 829, in __call__
    return self.main(*args, **kwargs)
  File "/usr/local/ocrd_all/venv/lib/python3.7/site-packages/click/core.py", line 782, in main
    rv = self.invoke(ctx)
  File "/usr/local/ocrd_all/venv/lib/python3.7/site-packages/click/core.py", line 1259, in invoke
    return _process_result(sub_ctx.command.invoke(sub_ctx))
  File "/usr/local/ocrd_all/venv/lib/python3.7/site-packages/click/core.py", line 1066, in invoke
    return ctx.invoke(self.callback, **ctx.params)
  File "/usr/local/ocrd_all/venv/lib/python3.7/site-packages/click/core.py", line 610, in invoke
    return callback(*args, **kwargs)
  File "/usr/local/ocrd_all/venv/lib/python3.7/site-packages/ocrd/cli/process.py", line 26, in process_cli
    run_tasks(mets, log_level, page_id, tasks)
  File "/usr/local/ocrd_all/venv/lib/python3.7/site-packages/ocrd/task_sequence.py", line 131, in run_tasks
    raise Exception("%s exited with non-zero return value %s" % (task.executable, returncode))
Exception: ocrd-olena-binarize exited with non-zero return value 134
Command exited with non-zero status 1
997.71user 491.77system 9:25.67elapsed 263%CPU (0avgtext+0avgdata 2300620maxresident)k
568inputs+11874264outputs (304major+10785505minor)pagefaults 0swaps

It always fails on the 17th image. This does not happen with sauvola (without -ms-split). For the image, see https://digi.ub.uni-heidelberg.de/diglitData/jb/23_-_lehrgegenstaende1790a_-_0_030.tif

OCR-D-IMG-CROP/OCR-D-IMG-CROP_0017.png exists.
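
As a side note on the return value 134 in the traceback above: by the usual shell convention, an exit status above 128 means the process was terminated by signal (status - 128). The following is a minimal sketch of that arithmetic, not part of any OCR-D tool; the function name `describe_exit_status` is made up for illustration.

```python
import signal


def describe_exit_status(status: int) -> str:
    """Interpret a shell-style exit status.

    Assumption: the status follows the common shell convention where
    values above 128 encode "terminated by signal (status - 128)".
    """
    if status > 128:
        try:
            return "terminated by " + signal.Signals(status - 128).name
        except ValueError:
            # No signal with that number is known on this platform.
            return f"exit status {status} (no matching signal)"
    return f"exit status {status}"


# 134 - 128 = 6, which is SIGABRT on Linux.
print(describe_exit_status(134))
```

On Linux this reports SIGABRT, which matches the "Aborted (core dumped)" line: the C++ binary called abort() after the uncaught Magick::ErrorCache exception.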

bertsky commented 4 years ago

Do you still have a log/output of the processors up to that point? (ocrd-make will always create one named after each output filegrp plus .log)

If you have to run it again, please use verbose logging (-l DEBUG), too.

Also, can you please provide the input PAGE-XML file prior to failure?

Note: ocrd-olena-binarize does not read the AlternativeImage (because it's already binarized), but tries to crop from the original. So if it cannot read that file, there's a bigger problem...
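
The selection rule described in the note can be sketched roughly as follows. This is only an illustration using the Python standard library, not the actual ocrd-olena-binarize code: the function name `pick_image_to_binarize` and the abridged PAGE document are hypothetical, and falling back to `imageFilename` when every AlternativeImage is already binarized is an assumption. Cropping to the page border is left out.

```python
import xml.etree.ElementTree as ET

PAGE_NS = "http://schema.primaresearch.org/PAGE/gts/pagecontent/2019-07-15"

# Abridged, illustrative PAGE document (not a complete, valid PAGE file).
EXAMPLE_PAGE = """<pc:PcGts xmlns:pc="http://schema.primaresearch.org/PAGE/gts/pagecontent/2019-07-15">
  <pc:Page imageFilename="23_-_lehrgegenstaende1790a_-_0_030.tif" imageWidth="4942" imageHeight="8418">
    <pc:AlternativeImage filename="OCR-D-M1/OCR-D-N1_0001-BIN_sauvola-ms-split.png" comments="binarized"/>
    <pc:AlternativeImage filename="OCR-D-IMG-CROP/OCR-D-IMG-CROP_0001.png" comments="binarized,cropped"/>
  </pc:Page>
</pc:PcGts>"""


def pick_image_to_binarize(page_xml: str) -> str:
    """Return the last AlternativeImage that is not binarized yet.

    Assumption: if all derived images are already binarized, fall back
    to the original image named in Page/@imageFilename.
    """
    root = ET.fromstring(page_xml)
    page = root.find(f"{{{PAGE_NS}}}Page")
    if page is None:
        raise ValueError("no Page element found")
    chosen = page.get("imageFilename")
    for alt in page.findall(f"{{{PAGE_NS}}}AlternativeImage"):
        features = (alt.get("comments") or "").split(",")
        if "binarized" not in features:
            # Later entries override earlier ones, so the last match wins.
            chosen = alt.get("filename")
    return chosen


print(pick_image_to_binarize(EXAMPLE_PAGE))
```

For the example document both derived images carry the "binarized" feature, so the sketch falls back to the original TIFF.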

jbarth-ubhd commented 4 years ago

workflow with k=0.34 and segmentation "workflow-configuration style" (sorry, no .mk, I'm trying some variants of segmentation etc. automatically, hence N1 N2 N3...):

/usr/bin/time ocrd process \
"olena-binarize -I OCR-D-IMG -O OCR-D-N1,OCR-D-M1 -p '{\"impl\":\"sauvola-ms-split\",\"k\":0.34}'" \
"anybaseocr-crop -I OCR-D-N1 -O OCR-D-N2" \
"olena-binarize -I OCR-D-N2 -O OCR-D-N3,OCR-D-M3 -p '{\"impl\":\"sauvola-ms-split\",\"k\":0.34}'" \
"cis-ocropy-denoise -I OCR-D-N3 -O OCR-D-N4 -p '{\"level-of-operation\":\"page\",\"noise_maxsize\":3.0}'" \
"cis-ocropy-deskew -I OCR-D-N4 -O OCR-D-N5 -p '{\"level-of-operation\":\"page\",\"maxskew\":5}'" \
"tesserocr-segment-region -I OCR-D-N5 -O OCR-D-N6 -p '{\"padding\":5,\"find_tables\":false}'" \
"segment-repair -I OCR-D-N6 -O OCR-D-N7 -p '{\"plausibilize\":true}'" \
"cis-ocropy-clip -I OCR-D-N7 -O OCR-D-N8" \
"cis-ocropy-segment -I OCR-D-N8 -O OCR-D-N9 -p '{\"spread\":2.4}'" \
"cis-ocropy-dewarp -I OCR-D-N9 -O OCR-D-N10" \
"tesserocr-recognize -I OCR-D-N10 -O OCR-D-OCR -p '{\"textequiv_level\":\"glyph\",\"overwrite_words\":true,\"model\":\"GT4HistOCR_50000000.575_401209\"}'"

complete output:

13:12:37.937 INFO ocrd.workspace_validator - input_file_grp=['OCR-D-IMG'] output_file_grp=['OCR-D-N1', 'OCR-D-M1']
Using TensorFlow backend.
13:12:42.448 WARNING tensorflow - From /usr/local/ocrd_all/venv/lib/python3.7/site-packages/tensorflow_core/python/compat/v2_compat.py:68: disable_resource_variables (from tensorflow.python.ops.variable_scope) is deprecated and will be removed in a future version.
Instructions for updating:
non-resource variables are not supported in the long term
13:12:52.871 INFO ocrd.task_sequence.run_tasks - Start processing task 'olena-binarize -I OCR-D-IMG -O OCR-D-N1,OCR-D-M1 -p {"impl":"sauvola-ms-split","k":0.34}'
13:12:56.053 INFO ocrd-olena-binarize - processing image/tiff input file OCR-D-IMG_0001 (P_0001)
warning: magick read: sauvola_ms_split: Incompatible type for "RichTIFFIPTC"; tag ignored. `TIFFFetchNormalTag' @ warning/tiff.c/TIFFWarnings/949
13:13:07.678 INFO ocrd.workspace - Saving mets '/usr/local/jb/ocrd-auto/x,sauvola-ms-split,0.34,sachunsky,tesseract-5M/mets.xml'
identify-im6.q16: Incompatible type for "RichTIFFIPTC"; tag ignored. `TIFFFetchNormalTag' @ warning/tiff.c/TIFFWarnings/949.
identify-im6.q16: Unknown field with tag 34864 (0x8830) encountered. `TIFFReadCustomDirectory' @ warning/tiff.c/TIFFWarnings/949.
identify-im6.q16: Unknown field with tag 34866 (0x8832) encountered. `TIFFReadCustomDirectory' @ warning/tiff.c/TIFFWarnings/949.
identify-im6.q16: Unknown field with tag 42033 (0xa431) encountered. `TIFFReadCustomDirectory' @ warning/tiff.c/TIFFWarnings/949.
identify-im6.q16: Unknown field with tag 42034 (0xa432) encountered. `TIFFReadCustomDirectory' @ warning/tiff.c/TIFFWarnings/949.
identify-im6.q16: Unknown field with tag 42036 (0xa434) encountered. `TIFFReadCustomDirectory' @ warning/tiff.c/TIFFWarnings/949.
identify-im6.q16: Unknown field with tag 42037 (0xa435) encountered. `TIFFReadCustomDirectory' @ warning/tiff.c/TIFFWarnings/949.
13:13:09.113 INFO ocrd.workspace - Saving mets '/usr/local/jb/ocrd-auto/x,sauvola-ms-split,0.34,sachunsky,tesseract-5M/mets.xml'
13:13:09.249 INFO ocrd.task_sequence.run_tasks - Finished processing task 'olena-binarize -I OCR-D-IMG -O OCR-D-N1,OCR-D-M1 -p {"impl":"sauvola-ms-split","k":0.34}'
13:13:09.251 INFO ocrd.task_sequence.run_tasks - Start processing task 'anybaseocr-crop -I OCR-D-N1 -O OCR-D-N2'
Using TensorFlow backend.
13:13:11.347 WARNING tensorflow - From /usr/local/ocrd_all/venv/lib/python3.7/site-packages/tensorflow_core/python/compat/v2_compat.py:68: disable_resource_variables (from tensorflow.python.ops.variable_scope) is deprecated and will be removed in a future version.
Instructions for updating:
non-resource variables are not supported in the long term
13:13:11.385 INFO ocrd.workspace_validator - input_file_grp=['OCR-D-N1'] output_file_grp=['OCR-D-N2']
13:13:11.386 INFO OcrdAnybaseocrCropper - OUTPUT FILE OCR-D-N2
13:13:11.387 INFO OcrdAnybaseocrCropper - No output file group for images specified, falling back to 'OCR-D-IMG-CROP'
13:13:11.387 INFO OcrdAnybaseocrCropper - INPUT FILE 0 / P_0001
13:13:27.423 INFO ocrd.workspace - created file ID: OCR-D-IMG-CROP_0001, file_grp: OCR-D-IMG-CROP, path: OCR-D-IMG-CROP/OCR-D-IMG-CROP_0001.png
13:13:27.428 INFO ocrd.workspace - Saving mets '/usr/local/jb/ocrd-auto/x,sauvola-ms-split,0.34,sachunsky,tesseract-5M/mets.xml'
13:13:27.885 INFO ocrd.task_sequence.run_tasks - Finished processing task 'anybaseocr-crop -I OCR-D-N1 -O OCR-D-N2'
13:13:27.886 INFO ocrd.task_sequence.run_tasks - Start processing task 'olena-binarize -I OCR-D-N2 -O OCR-D-N3,OCR-D-M3 -p {"impl":"sauvola-ms-split","k":0.34}'
13:13:30.861 INFO ocrd-olena-binarize - processing PAGE-XML input file OCR-D-N2_0001 (P_0001)
13:13:30.885 INFO ocrd-olena-binarize - found AlternativeImage filename 'OCR-D-IMG-CROP/OCR-D-IMG-CROP_0001.png'
warning: magick read: sauvola_ms_split: No image was loaded.
terminate called after throwing an instance of 'Magick::ErrorCache'
  what():  sauvola_ms_split: no pixels defined in cache `OCR-D-IMG-CROP/OCR-D-IMG-CROP_0001.png' @ error/cache.c/OpenPixelCache/3906
Aborted (core dumped)
Traceback (most recent call last):
  File "/usr/local/ocrd_all/venv/bin/ocrd", line 8, in <module>
    sys.exit(cli())
  File "/usr/local/ocrd_all/venv/lib/python3.7/site-packages/click/core.py", line 829, in __call__
    return self.main(*args, **kwargs)
  File "/usr/local/ocrd_all/venv/lib/python3.7/site-packages/click/core.py", line 782, in main
    rv = self.invoke(ctx)
  File "/usr/local/ocrd_all/venv/lib/python3.7/site-packages/click/core.py", line 1259, in invoke
    return _process_result(sub_ctx.command.invoke(sub_ctx))
  File "/usr/local/ocrd_all/venv/lib/python3.7/site-packages/click/core.py", line 1066, in invoke
    return ctx.invoke(self.callback, **ctx.params)
  File "/usr/local/ocrd_all/venv/lib/python3.7/site-packages/click/core.py", line 610, in invoke
    return callback(*args, **kwargs)
  File "/usr/local/ocrd_all/venv/lib/python3.7/site-packages/ocrd/cli/process.py", line 26, in process_cli
    run_tasks(mets, log_level, page_id, tasks)
  File "/usr/local/ocrd_all/venv/lib/python3.7/site-packages/ocrd/task_sequence.py", line 131, in run_tasks
    raise Exception("%s exited with non-zero return value %s" % (task.executable, returncode))
Exception: ocrd-olena-binarize exited with non-zero return value 134
Command exited with non-zero status 1
122.25user 124.14system 0:55.52elapsed 443%CPU (0avgtext+0avgdata 2111448maxresident)k
419632inputs+3748640outputs (863major+1954118minor)pagefaults 0swaps

jbarth-ubhd commented 4 years ago

OCR-D-N1_0001.xml:

<?xml version="1.0" encoding="UTF-8"?>
<PcGts xmlns:xsl="http://www.w3.org/1999/XSL/Transform#" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xmlns="http://schema.primaresearch.org/PAGE/gts/pagecontent/2019-07-15" xsi:schemaLocation="http://schema.primaresearch.org/PAGE/gts/pagecontent/2019-07-15 http://schema.primaresearch.org/PAGE/gts/pagecontent/2019-07-15/pagecontent.xsd">
  <Metadata>
    <Creator>OCR-D/core 2.4.4</Creator>
    <Created>2020-04-08T13:13:08</Created>
    <LastChange>2020-04-08T13:13:08</LastChange>
    <MetadataItem type="processingStep" name="preprocessing/optimization/binarization" value="ocrd-olena-binarize">
      <Labels>
        <Label value="101" type="win-size"/>
        <Label value="sauvola-ms-split" type="impl"/>
        <Label value="0.34" type="k"/>
      </Labels>
    </MetadataItem>
  </Metadata>
  <Page imageFilename="23_-_lehrgegenstaende1790a_-_0_030.tif" imageWidth="4942" imageHeight="8418" type="content">
    <AlternativeImage filename="OCR-D-M1/OCR-D-N1_0001-BIN_sauvola-ms-split.png" comments="binarized"/>
  </Page>
</PcGts>

OCR-D-N2_0001.xml:

<?xml version="1.0" encoding="UTF-8"?>
<pc:PcGts xmlns:pc="http://schema.primaresearch.org/PAGE/gts/pagecontent/2019-07-15"  xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:schemaLocation="http://schema.primaresearch.org/PAGE/gts/pagecontent/2019-07-15 http://schema.primaresearch.org/PAGE/gts/pagecontent/2019-07-15/pagecontent.xsd">
    <pc:Metadata>
        <pc:Creator>OCR-D/core 2.4.4</pc:Creator>
        <pc:Created>2020-04-08T13:13:08</pc:Created>
        <pc:LastChange>2020-04-08T13:13:08</pc:LastChange>
        <pc:MetadataItem type="processingStep" name="preprocessing/optimization/binarization" value="ocrd-olena-binarize">
            <pc:Labels>
                <pc:Label value="101" type="win-size"/>
                <pc:Label value="sauvola-ms-split" type="impl"/>
                <pc:Label value="0.34" type="k"/>
            </pc:Labels>
        </pc:MetadataItem>
        <pc:MetadataItem type="processingStep" name="preprocessing/optimization/cropping" value="ocrd-anybaseocr-crop">
            <pc:Labels>
                <pc:Label value="True" type="force"/>
                <pc:Label value="0.04" type="colSeparator"/>
                <pc:Label value="0.3" type="maxRularArea"/>
                <pc:Label value="0.05" type="minArea"/>
                <pc:Label value="0.01" type="minRularArea"/>
                <pc:Label value="0.75" type="positionBelow"/>
                <pc:Label value="0.4" type="positionLeft"/>
                <pc:Label value="0.6" type="positionRight"/>
                <pc:Label value="10.0" type="rularRatioMax"/>
                <pc:Label value="3.0" type="rularRatioMin"/>
                <pc:Label value="0.95" type="rularWidth"/>
                <pc:Label value="page" type="operation_level"/>
            </pc:Labels>
        </pc:MetadataItem>
    </pc:Metadata>
    <pc:Page imageFilename="23_-_lehrgegenstaende1790a_-_0_030.tif" imageWidth="4942" imageHeight="8418" type="content">
        <pc:AlternativeImage filename="OCR-D-M1/OCR-D-N1_0001-BIN_sauvola-ms-split.png" comments="binarized"/>
        <pc:AlternativeImage filename="OCR-D-IMG-CROP/OCR-D-IMG-CROP_0001.png" comments="binarized,cropped"/>
        <pc:Border>
            <pc:Coords points="0,0 4942,0 4942,8418 0,8418"/>
        </pc:Border>
    </pc:Page>
</pc:PcGts>

OCR-D-N3_0001.xml:

<?xml version="1.0" encoding="UTF-8"?>
<pc:PcGts xmlns:pc="http://schema.primaresearch.org/PAGE/gts/pagecontent/2019-07-15"  xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:schemaLocation="http://schema.primaresearch.org/PAGE/gts/pagecontent/2019-07-15 http://schema.primaresearch.org/PAGE/gts/pagecontent/2019-07-15/pagecontent.xsd">
    <pc:Metadata>
        <pc:Creator>OCR-D/core 2.4.4</pc:Creator>
        <pc:Created>2020-04-08T13:13:08</pc:Created>
        <pc:LastChange>2020-04-08T13:13:08</pc:LastChange>
        <pc:MetadataItem type="processingStep" name="preprocessing/optimization/binarization" value="ocrd-olena-binarize">
            <pc:Labels>
                <pc:Label value="101" type="win-size"/>
                <pc:Label value="sauvola-ms-split" type="impl"/>
                <pc:Label value="0.34" type="k"/>
            </pc:Labels>
        </pc:MetadataItem>
        <pc:MetadataItem type="processingStep" name="preprocessing/optimization/cropping" value="ocrd-anybaseocr-crop">
            <pc:Labels>
                <pc:Label value="True" type="force"/>
                <pc:Label value="0.04" type="colSeparator"/>
                <pc:Label value="0.3" type="maxRularArea"/>
                <pc:Label value="0.05" type="minArea"/>
                <pc:Label value="0.01" type="minRularArea"/>
                <pc:Label value="0.75" type="positionBelow"/>
                <pc:Label value="0.4" type="positionLeft"/>
                <pc:Label value="0.6" type="positionRight"/>
                <pc:Label value="10.0" type="rularRatioMax"/>
                <pc:Label value="3.0" type="rularRatioMin"/>
                <pc:Label value="0.95" type="rularWidth"/>
                <pc:Label value="page" type="operation_level"/>
            </pc:Labels>
        </pc:MetadataItem>
    </pc:Metadata>
    <pc:Page imageFilename="23_-_lehrgegenstaende1790a_-_0_030.tif" imageWidth="4942" imageHeight="8418" type="content">
        <pc:AlternativeImage filename="OCR-D-M1/OCR-D-N1_0001-BIN_sauvola-ms-split.png" comments="binarized"/>
        <pc:AlternativeImage filename="OCR-D-IMG-CROP/OCR-D-IMG-CROP_0001.png" comments="binarized,cropped"/>
        <pc:Border>
            <pc:Coords points="0,0 4942,0 4942,8418 0,8418"/>
        </pc:Border>
    </pc:Page>
</pc:PcGts>

bertsky commented 4 years ago

> found AlternativeImage filename 'OCR-D-IMG-CROP/OCR-D-IMG-CROP_0001.png'

This is not from the current master of ocrd_olena. It therefore tries to re-binarize the already binarized image.

But this should still work (if badly). The error you get from IM...

no pixels defined in cache `OCR-D-IMG-CROP/OCR-D-IMG-CROP_0001.png' @ error/cache.c/OpenPixelCache/3906

...seems to suggest there's some kind of caching at work – which then fails. Also, you say this only happened in the 17th image, which fits that picture.

The only mention I find of this is this thread. So, which version of ImageMagick do you have compiled into your olena / installed on your system?

(On Debian/Ubuntu, it's dpkg-query -l libmagick++-*-dev.)

jbarth-ubhd commented 4 years ago

I've set up a fresh ocrd_all on 2020-04-03 13:06.

Imagemagick is

Desired=Unknown/Install/Remove/Purge/Hold
| Status=Not/Inst/Conf-files/Unpacked/halF-conf/Half-inst/trig-aWait/Trig-pend
|/ Err?=(none)/Reinst-required (Status,Err: uppercase=bad)
||/ Name                             Version                     Architecture Description
+++-================================-===========================-============-======================================================
un  libmagick++-6.defaultquantum-dev <none>                      <none>       (no description available)
ii  libmagick++-6.q16-dev:amd64      8:6.9.10.23+dfsg-2.1ubuntu3 amd64        C++ interface to ImageMagick - development files (Q16)

PS: It's Ubuntu 19.10 as schroot-container within Ubuntu 16.04

PixelCache: https://www.imagemagick.org/include/architecture.php#cache

identify OCR-D-IMG-CROP/OCR-D-IMG-CROP_0001.png 
OCR-D-IMG-CROP/OCR-D-IMG-CROP_0001.png PNG 4942x8418 4942x8418+0+0 8-bit Gray 2c 541039B 0.000u 0:00.009
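
To cross-check what identify reports without going through ImageMagick at all, the PNG header can be read directly. This is a minimal sketch using only the standard library; the function name `read_png_header` is made up, and the demo builds a synthetic header with the dimensions shown above instead of opening the real file.

```python
import struct

PNG_SIGNATURE = b"\x89PNG\r\n\x1a\n"


def read_png_header(data: bytes) -> dict:
    """Parse the PNG signature and IHDR chunk from the first 26 bytes."""
    if len(data) < 26:
        raise ValueError("need at least 26 bytes")
    if data[:8] != PNG_SIGNATURE:
        raise ValueError("not a PNG file")
    length, chunk_type = struct.unpack(">I4s", data[8:16])
    if chunk_type != b"IHDR" or length != 13:
        raise ValueError("missing or malformed IHDR chunk")
    width, height, bit_depth, color_type = struct.unpack(">IIBB", data[16:26])
    return {
        "width": width,
        "height": height,
        "bit_depth": bit_depth,
        "color_type": color_type,  # 0 means grayscale
    }


# Synthetic header for the demo: 4942x8418, 8-bit grayscale.
demo = PNG_SIGNATURE + struct.pack(">I4sIIBB", 13, b"IHDR", 4942, 8418, 8, 0)
print(read_png_header(demo))
```

For a real file, the same function could be applied to the first 26 bytes read in binary mode. A sane header only shows that the file starts out as a valid PNG; it says nothing about why the pixel cache could not be opened.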

Should I have done

"olena-binarize -I OCR-D-N2,OCR-D-IMG -O OCR-D-N3,OCR-D-M3 -p '{\"impl\":\"sauvola-ms-split\",\"k\":0.34}'" \

instead?

jbarth-ubhd commented 4 years ago

I took a look at ocrd-olena-binarize: does the comma-separated x,y syntax only work for output file groups?

jbarth-ubhd commented 4 years ago

PS: with sauvola (no -ms-split) it does work.

bertsky commented 4 years ago

> I've set up a fresh ocrd_all on 2020-04-03 13:06.

This does not include d69ac0eaa1859419e8f463d0a61e642cf94268aa. But current ocrd_all master does (and its prebuilt Docker images are nearing completion right now).

> ii libmagick++-6.q16-dev:amd64 8:6.9.10.23+dfsg-2.1ubuntu3 amd64 C++ interface to ImageMagick - development files (Q16)

That's quite recent. But a glance over the changes since then tells me our problem might already have been solved. Could you try an installation of 6.9.11 (perhaps from source)?
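
To compare an installed package version against 6.9.11, the upstream part of the Debian version string has to be separated from the epoch and packaging suffix. The following is a rough sketch under simplifying assumptions: the helper name `upstream_version` is made up, and it assumes the upstream part is purely numeric and ends at the first "+", "-" or "~", which holds for the strings in this thread but not for Debian versions in general.

```python
import re


def upstream_version(debian_version: str) -> tuple:
    """Extract the numeric upstream version from a Debian version string."""
    # Drop an optional epoch such as "8:".
    without_epoch = debian_version.split(":", 1)[-1]
    # Keep everything before the first "+", "~" or "-" (packaging suffix).
    upstream = re.split(r"[+~-]", without_epoch, maxsplit=1)[0]
    return tuple(int(part) for part in upstream.split("."))


installed = upstream_version("8:6.9.10.23+dfsg-2.1ubuntu3")
print(installed, installed >= (6, 9, 11))
```

This only compares version numbers. It does not establish which release, if any, changed the behaviour discussed here.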

> Should I have done
>
> "olena-binarize -I OCR-D-N2,OCR-D-IMG -O OCR-D-N3,OCR-D-M3 -p '{\"impl\":\"sauvola-ms-split\",\"k\":0.34}'" \
>
> instead?

No, the binarization only takes one input file group. (The derived images referenced in PAGE could be scattered across many file groups, so there's no point in making these explicit.)
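
The constraint can be pictured as a small check on the -I and -O values. This is a hypothetical sketch, not the processor's actual argument handling: the names `parse_file_groups` and `check_binarize_groups` are invented, and the only rules encoded are the one stated above (exactly one input file group) plus the assumption that at least one output file group is required.

```python
def parse_file_groups(option_value: str) -> list:
    """Split a comma-separated file group option into a list of names."""
    return [name.strip() for name in option_value.split(",") if name.strip()]


def check_binarize_groups(input_opt: str, output_opt: str) -> tuple:
    """Validate file group options for a binarization step.

    Assumptions: exactly one input file group; one or more output
    file groups (as in "-O OCR-D-N3,OCR-D-M3").
    """
    inputs = parse_file_groups(input_opt)
    outputs = parse_file_groups(output_opt)
    if len(inputs) != 1:
        raise ValueError(f"expected exactly one input file group, got {inputs}")
    if not outputs:
        raise ValueError("expected at least one output file group")
    return inputs[0], outputs


print(check_binarize_groups("OCR-D-N2", "OCR-D-N3,OCR-D-M3"))
```

With this check, passing "OCR-D-N2,OCR-D-IMG" as input would be rejected, while the comma-separated output groups used throughout this thread are accepted.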

> PS: with sauvola (no -ms-split) it does work.

Interesting. That does not use 3 color channels (but converts everything to grayscale first). But the IM I/O parts are exactly the same.

And you say this always reproduces after 17 images?

jbarth-ubhd commented 4 years ago

It is always reproducible with this image, either alone or after 16 others: https://digi.ub.uni-heidelberg.de/diglitData/jb/23_-_lehrgegenstaende1790a_-_0_030.tif

Will try imagemagick from source.

bertsky commented 4 years ago

> It is always reproducible with this image, either alone or after 16 others: https://digi.ub.uni-heidelberg.de/diglitData/jb/23_-_lehrgegenstaende1790a_-_0_030.tif
>
> Will try imagemagick from source.

I cannot reproduce with IM 6.9.7.4+dfsg-16ub and ocrd_olena master:

ocrd process \
 "olena-binarize -I OCR-D-IMG -O OCR-D-N1,OCR-D-M1 -p '{\"impl\":\"sauvola-ms-split\",\"k\":0.08}'" \
 "anybaseocr-crop -I OCR-D-N1 -O OCR-D-N2" \
 "olena-binarize -I OCR-D-N2 -O OCR-D-N3,OCR-D-M3 -p '{\"impl\":\"sauvola-ms-split\",\"k\":0.08}'"
19:35:21.984 INFO ocrd.workspace_validator - input_file_grp=['OCR-D-IMG'] output_file_grp=['OCR-D-N1', 'OCR-D-M1']
Using TensorFlow backend.
19:35:36.669 INFO ocrd.task_sequence.run_tasks - Start processing task 'olena-binarize -I OCR-D-IMG -O OCR-D-N1,OCR-D-M1 -p {"impl":"sauvola-ms-split","k":0.08}'
19:35:39.664 INFO ocrd-olena-binarize - processing image/tiff input file OCR-D-IMG_0001 ()
warning: magick read: sauvola_ms_split: Incompatible type for "RichTIFFIPTC"; tag ignored. `TIFFFetchNormalTag' @ warning/tiff.c/TIFFWarnings/912
19:35:47.354 INFO ocrd.workspace - Saving mets 'mets.xml'
identify-im6.q16: Incompatible type for "RichTIFFIPTC"; tag ignored. `TIFFFetchNormalTag' @ warning/tiff.c/TIFFWarnings/912.
identify-im6.q16: Unknown field with tag 34864 (0x8830) encountered. `TIFFReadCustomDirectory' @ warning/tiff.c/TIFFWarnings/912.
identify-im6.q16: Unknown field with tag 34866 (0x8832) encountered. `TIFFReadCustomDirectory' @ warning/tiff.c/TIFFWarnings/912.
identify-im6.q16: Unknown field with tag 42033 (0xa431) encountered. `TIFFReadCustomDirectory' @ warning/tiff.c/TIFFWarnings/912.
identify-im6.q16: Unknown field with tag 42034 (0xa432) encountered. `TIFFReadCustomDirectory' @ warning/tiff.c/TIFFWarnings/912.
identify-im6.q16: Unknown field with tag 42036 (0xa434) encountered. `TIFFReadCustomDirectory' @ warning/tiff.c/TIFFWarnings/912.
identify-im6.q16: Unknown field with tag 42037 (0xa435) encountered. `TIFFReadCustomDirectory' @ warning/tiff.c/TIFFWarnings/912.
identify-im6.q16: Incompatible type for "RichTIFFIPTC"; tag ignored. `TIFFFetchNormalTag' @ warning/tiff.c/TIFFWarnings/912.
19:35:48.495 INFO ocrd.workspace - Saving mets 'mets.xml'
19:35:48.608 INFO ocrd.task_sequence.run_tasks - Finished processing task 'olena-binarize -I OCR-D-IMG -O OCR-D-N1,OCR-D-M1 -p {"impl":"sauvola-ms-split","k":0.08}'
19:35:48.609 INFO ocrd.task_sequence.run_tasks - Start processing task 'anybaseocr-crop -I OCR-D-N1 -O OCR-D-N2'
Using TensorFlow backend.
19:35:50.857 INFO ocrd.workspace_validator - input_file_grp=['OCR-D-N1'] output_file_grp=['OCR-D-N2']
OUTPUT FILE  OCR-D-N2
19:35:50.858 INFO OcrdAnybaseocrCropper - No output file group for images specified, falling back to 'OCR-D-IMG-CROP'
19:35:50.858 INFO OcrdAnybaseocrCropper - INPUT FILE 0 / OCR-D-N1_0001
19:36:11.572 INFO ocrd.workspace - created file ID: OCR-D-IMG-CROP_0001, file_grp: OCR-D-IMG-CROP, path: OCR-D-IMG-CROP/OCR-D-IMG-CROP_0001.png
19:36:11.579 INFO ocrd.workspace - Saving mets 'mets.xml'
19:36:11.996 INFO ocrd.task_sequence.run_tasks - Finished processing task 'anybaseocr-crop -I OCR-D-N1 -O OCR-D-N2'
19:36:11.997 INFO ocrd.task_sequence.run_tasks - Start processing task 'olena-binarize -I OCR-D-N2 -O OCR-D-N3,OCR-D-M3 -p {"impl":"sauvola-ms-split","k":0.08}'
19:36:14.987 INFO ocrd-olena-binarize - processing PAGE-XML input file OCR-D-N2_0001 ()
19:36:15.009 INFO ocrd-olena-binarize - found imageFilename '23_-_lehrgegenstaende1790a_-_0_030.tif'
19:36:15.048 DEBUG ocrd-olena-binarize - Using explicitly set page border '0,0 4942,0 4942,8418 0,8418'
convert-im6.q16: Incompatible type for "RichTIFFIPTC"; tag ignored. `TIFFFetchNormalTag' @ warning/tiff.c/TIFFWarnings/912.
convert-im6.q16: Unknown field with tag 34864 (0x8830) encountered. `TIFFReadCustomDirectory' @ warning/tiff.c/TIFFWarnings/912.
convert-im6.q16: Unknown field with tag 34866 (0x8832) encountered. `TIFFReadCustomDirectory' @ warning/tiff.c/TIFFWarnings/912.
convert-im6.q16: Unknown field with tag 42033 (0xa431) encountered. `TIFFReadCustomDirectory' @ warning/tiff.c/TIFFWarnings/912.
convert-im6.q16: Unknown field with tag 42034 (0xa432) encountered. `TIFFReadCustomDirectory' @ warning/tiff.c/TIFFWarnings/912.
convert-im6.q16: Unknown field with tag 42036 (0xa434) encountered. `TIFFReadCustomDirectory' @ warning/tiff.c/TIFFWarnings/912.
convert-im6.q16: Unknown field with tag 42037 (0xa435) encountered. `TIFFReadCustomDirectory' @ warning/tiff.c/TIFFWarnings/912.
convert-im6.q16: Incompatible type for "RichTIFFIPTC"; tag ignored. `TIFFFetchNormalTag' @ warning/tiff.c/TIFFWarnings/912.
19:36:25.449 INFO ocrd.workspace - Saving mets 'mets.xml'
19:36:26.127 INFO ocrd.workspace - Saving mets 'mets.xml'
19:36:26.244 INFO ocrd.task_sequence.run_tasks - Finished processing task 'olena-binarize -I OCR-D-N2 -O OCR-D-N3,OCR-D-M3 -p {"impl":"sauvola-ms-split","k":0.08}'
19:36:26.245 INFO ocrd.cli.process - Finished

If you want, I'll try with the old version (which uses the binarized AlternativeImage).

jbarth-ubhd commented 4 years ago

updated ocrd_all; ocrd_olena is now

commit 5449745ef0397b6e7e982e561ecfee43221979e7 (HEAD, origin/master, origin/HEAD)
Author: Robert Sachunsky <38561704+bertsky@users.noreply.github.com>
Date:   Sat Apr 4 16:45:49 2020 +0200

The second ocrd_olena step still takes the already-binarized image(?), but I thought this had been fixed: If input fileGrp is PAGE-XML, prefer the last AlternativeImage that is _not_ already binarized.

same error (will try other imagemagick soon)

07:30:50.080 INFO ocrd.workspace_validator - input_file_grp=['OCR-D-IMG'] output_file_grp=['OCR-D-N1', 'OCR-D-M1']
Using TensorFlow backend.
07:30:51.867 WARNING tensorflow - From /usr/local/ocrd_all/venv/lib/python3.7/site-packages/tensorflow_core/python/compat/v2_compat.py:68: disable_resource_variables (from tensorflow.python.ops.variable_scope) is deprecated and will be removed in a future version.
Instructions for updating:
non-resource variables are not supported in the long term
07:30:59.538 INFO ocrd.task_sequence.run_tasks - Start processing task 'olena-binarize -I OCR-D-IMG -O OCR-D-N1,OCR-D-M1 -p {"impl":"sauvola-ms-split","k":0.34}'
07:31:02.008 INFO ocrd-olena-binarize - processing image/tiff input file OCR-D-IMG_0001 (P_0001)
warning: magick read: sauvola_ms_split: Incompatible type for "RichTIFFIPTC"; tag ignored. `TIFFFetchNormalTag' @ warning/tiff.c/TIFFWarnings/949
07:31:11.800 INFO ocrd.workspace - Saving mets '/usr/local/jb/ocrd-variants/x,sauvola-ms-split,0.34,sachunsky,tesseract-5M/mets.xml'
identify-im6.q16: Incompatible type for "RichTIFFIPTC"; tag ignored. `TIFFFetchNormalTag' @ warning/tiff.c/TIFFWarnings/949.
identify-im6.q16: Unknown field with tag 34864 (0x8830) encountered. `TIFFReadCustomDirectory' @ warning/tiff.c/TIFFWarnings/949.
identify-im6.q16: Unknown field with tag 34866 (0x8832) encountered. `TIFFReadCustomDirectory' @ warning/tiff.c/TIFFWarnings/949.
identify-im6.q16: Unknown field with tag 42033 (0xa431) encountered. `TIFFReadCustomDirectory' @ warning/tiff.c/TIFFWarnings/949.
identify-im6.q16: Unknown field with tag 42034 (0xa432) encountered. `TIFFReadCustomDirectory' @ warning/tiff.c/TIFFWarnings/949.
identify-im6.q16: Unknown field with tag 42036 (0xa434) encountered. `TIFFReadCustomDirectory' @ warning/tiff.c/TIFFWarnings/949.
identify-im6.q16: Unknown field with tag 42037 (0xa435) encountered. `TIFFReadCustomDirectory' @ warning/tiff.c/TIFFWarnings/949.
07:31:12.739 INFO ocrd.workspace - Saving mets '/usr/local/jb/ocrd-variants/x,sauvola-ms-split,0.34,sachunsky,tesseract-5M/mets.xml'
07:31:12.818 INFO ocrd.task_sequence.run_tasks - Finished processing task 'olena-binarize -I OCR-D-IMG -O OCR-D-N1,OCR-D-M1 -p {"impl":"sauvola-ms-split","k":0.34}'
07:31:12.820 INFO ocrd.task_sequence.run_tasks - Start processing task 'anybaseocr-crop -I OCR-D-N1 -O OCR-D-N2'
Using TensorFlow backend.
07:31:14.576 WARNING tensorflow - From /usr/local/ocrd_all/venv/lib/python3.7/site-packages/tensorflow_core/python/compat/v2_compat.py:68: disable_resource_variables (from tensorflow.python.ops.variable_scope) is deprecated and will be removed in a future version.
Instructions for updating:
non-resource variables are not supported in the long term
07:31:14.604 INFO ocrd.workspace_validator - input_file_grp=['OCR-D-N1'] output_file_grp=['OCR-D-N2']
07:31:14.605 INFO OcrdAnybaseocrCropper - OUTPUT FILE OCR-D-N2
07:31:14.605 INFO OcrdAnybaseocrCropper - No output file group for images specified, falling back to 'OCR-D-IMG-CROP'
07:31:14.605 INFO OcrdAnybaseocrCropper - INPUT FILE 0 / P_0001
07:31:29.383 INFO ocrd.workspace - created file ID: OCR-D-IMG-CROP_0001, file_grp: OCR-D-IMG-CROP, path: OCR-D-IMG-CROP/OCR-D-IMG-CROP_0001.png
07:31:29.388 INFO ocrd.workspace - Saving mets '/usr/local/jb/ocrd-variants/x,sauvola-ms-split,0.34,sachunsky,tesseract-5M/mets.xml'
07:31:29.729 INFO ocrd.task_sequence.run_tasks - Finished processing task 'anybaseocr-crop -I OCR-D-N1 -O OCR-D-N2'
07:31:29.731 INFO ocrd.task_sequence.run_tasks - Start processing task 'olena-binarize -I OCR-D-N2 -O OCR-D-N3,OCR-D-M3 -p {"impl":"sauvola-ms-split","k":0.34}'
07:31:32.259 INFO ocrd-olena-binarize - processing PAGE-XML input file OCR-D-N2_0001 (P_0001)
07:31:32.282 INFO ocrd-olena-binarize - found AlternativeImage filename 'OCR-D-IMG-CROP/OCR-D-IMG-CROP_0001.png'
warning: magick read: sauvola_ms_split: No image was loaded.
terminate called after throwing an instance of 'Magick::ErrorCache'
  what():  sauvola_ms_split: no pixels defined in cache `OCR-D-IMG-CROP/OCR-D-IMG-CROP_0001.png' @ error/cache.c/OpenPixelCache/3906
Aborted (core dumped)
Traceback (most recent call last):
  File "/usr/local/ocrd_all/venv/bin/ocrd", line 8, in <module>
    sys.exit(cli())
  File "/usr/local/ocrd_all/venv/lib/python3.7/site-packages/click/core.py", line 829, in __call__
    return self.main(*args, **kwargs)
  File "/usr/local/ocrd_all/venv/lib/python3.7/site-packages/click/core.py", line 782, in main
    rv = self.invoke(ctx)
  File "/usr/local/ocrd_all/venv/lib/python3.7/site-packages/click/core.py", line 1259, in invoke
    return _process_result(sub_ctx.command.invoke(sub_ctx))
  File "/usr/local/ocrd_all/venv/lib/python3.7/site-packages/click/core.py", line 1066, in invoke
    return ctx.invoke(self.callback, **ctx.params)
  File "/usr/local/ocrd_all/venv/lib/python3.7/site-packages/click/core.py", line 610, in invoke
    return callback(*args, **kwargs)
  File "/usr/local/ocrd_all/venv/lib/python3.7/site-packages/ocrd/cli/process.py", line 26, in process_cli
    run_tasks(mets, log_level, page_id, tasks)
  File "/usr/local/ocrd_all/venv/lib/python3.7/site-packages/ocrd/task_sequence.py", line 131, in run_tasks
    raise Exception("%s exited with non-zero return value %s" % (task.executable, returncode))
Exception: ocrd-olena-binarize exited with non-zero return value 134
Command exited with non-zero status 1
126.39user 130.31system 0:43.80elapsed 585%CPU (0avgtext+0avgdata 2109368maxresident)k
160inputs+3961624outputs (1major+2032576minor)pagefaults 0swaps

jbarth-ubhd commented 4 years ago

With a completely fresh ocrd_all install and ImageMagick-6.9.11-6, it works. PS: the second binarize step now uses the original `23_-_lehrgegenstaende1790a_-_0_030.tif`.

08:25:26.141 INFO ocrd.workspace_validator - input_file_grp=['OCR-D-IMG'] output_file_grp=['OCR-D-N1', 'OCR-D-M1']
2020-04-09 08:25:27.688284: W tensorflow/stream_executor/platform/default/dso_loader.cc:55] Could not load dynamic library 'libnvinfer.so.6'; dlerror: libnvinfer.so.6: cannot open shared object file: No such file or directory; LD_LIBRARY_PATH: /usr/local/lib
2020-04-09 08:25:27.688355: W tensorflow/stream_executor/platform/default/dso_loader.cc:55] Could not load dynamic library 'libnvinfer_plugin.so.6'; dlerror: libnvinfer_plugin.so.6: cannot open shared object file: No such file or directory; LD_LIBRARY_PATH: /usr/local/lib
2020-04-09 08:25:27.688362: W tensorflow/compiler/tf2tensorrt/utils/py_utils.cc:30] Cannot dlopen some TensorRT libraries. If you would like to use Nvidia GPU with TensorRT, please make sure the missing libraries mentioned above are installed properly.
Using TensorFlow backend.
08:25:28.235 WARNING tensorflow - From /usr/local/ocrd_all/venv/lib/python3.7/site-packages/tensorflow_core/python/compat/v2_compat.py:88: disable_resource_variables (from tensorflow.python.ops.variable_scope) is deprecated and will be removed in a future version.
Instructions for updating:
non-resource variables are not supported in the long term
08:25:36.158 INFO ocrd.task_sequence.run_tasks - Start processing task 'olena-binarize -I OCR-D-IMG -O OCR-D-N1,OCR-D-M1 -p {"impl":"sauvola-ms-split","k":0.34}'
08:25:38.491 INFO ocrd-olena-binarize - processing image/tiff input file OCR-D-IMG_0001 (P_0001)
warning: magick read: sauvola_ms_split: Incompatible type for "RichTIFFIPTC"; tag ignored. `TIFFFetchNormalTag' @ warning/tiff.c/TIFFWarnings/976
08:25:47.455 INFO ocrd.workspace - Saving mets '/usr/local/jb/ocrd-variants/x,sauvola-ms-split,0.34,sachunsky,tesseract-5M/mets.xml'
identify: Incompatible type for "RichTIFFIPTC"; tag ignored. `TIFFFetchNormalTag' @ warning/tiff.c/TIFFWarnings/976.
identify: Unknown field with tag 34864 (0x8830) encountered. `TIFFReadCustomDirectory' @ warning/tiff.c/TIFFWarnings/976.
identify: Unknown field with tag 34866 (0x8832) encountered. `TIFFReadCustomDirectory' @ warning/tiff.c/TIFFWarnings/976.
identify: Unknown field with tag 42033 (0xa431) encountered. `TIFFReadCustomDirectory' @ warning/tiff.c/TIFFWarnings/976.
identify: Unknown field with tag 42034 (0xa432) encountered. `TIFFReadCustomDirectory' @ warning/tiff.c/TIFFWarnings/976.
identify: Unknown field with tag 42036 (0xa434) encountered. `TIFFReadCustomDirectory' @ warning/tiff.c/TIFFWarnings/976.
identify: Unknown field with tag 42037 (0xa435) encountered. `TIFFReadCustomDirectory' @ warning/tiff.c/TIFFWarnings/976.
identify: CorruptImageProfile `xmp' @ warning/profile.c/SetImageProfileInternal/1698.
08:25:48.491 INFO ocrd.workspace - Saving mets '/usr/local/jb/ocrd-variants/x,sauvola-ms-split,0.34,sachunsky,tesseract-5M/mets.xml'
08:25:48.558 INFO ocrd.task_sequence.run_tasks - Finished processing task 'olena-binarize -I OCR-D-IMG -O OCR-D-N1,OCR-D-M1 -p {"impl":"sauvola-ms-split","k":0.34}'
08:25:48.559 INFO ocrd.task_sequence.run_tasks - Start processing task 'anybaseocr-crop -I OCR-D-N1 -O OCR-D-N2'
2020-04-09 08:25:50.034302: W tensorflow/stream_executor/platform/default/dso_loader.cc:55] Could not load dynamic library 'libnvinfer.so.6'; dlerror: libnvinfer.so.6: cannot open shared object file: No such file or directory; LD_LIBRARY_PATH: /usr/local/lib
2020-04-09 08:25:50.034384: W tensorflow/stream_executor/platform/default/dso_loader.cc:55] Could not load dynamic library 'libnvinfer_plugin.so.6'; dlerror: libnvinfer_plugin.so.6: cannot open shared object file: No such file or directory; LD_LIBRARY_PATH: /usr/local/lib
2020-04-09 08:25:50.034392: W tensorflow/compiler/tf2tensorrt/utils/py_utils.cc:30] Cannot dlopen some TensorRT libraries. If you would like to use Nvidia GPU with TensorRT, please make sure the missing libraries mentioned above are installed properly.
Using TensorFlow backend.
08:25:50.566 WARNING tensorflow - From /usr/local/ocrd_all/venv/lib/python3.7/site-packages/tensorflow_core/python/compat/v2_compat.py:88: disable_resource_variables (from tensorflow.python.ops.variable_scope) is deprecated and will be removed in a future version.
Instructions for updating:
non-resource variables are not supported in the long term
08:25:50.596 INFO ocrd.workspace_validator - input_file_grp=['OCR-D-N1'] output_file_grp=['OCR-D-N2']
08:25:50.597 INFO OcrdAnybaseocrCropper - OUTPUT FILE OCR-D-N2
08:25:50.597 INFO OcrdAnybaseocrCropper - No output file group for images specified, falling back to 'OCR-D-IMG-CROP'
08:25:50.597 INFO OcrdAnybaseocrCropper - INPUT FILE 0 / P_0001
08:26:06.272 INFO ocrd.workspace - created file ID: OCR-D-IMG-CROP_0001, file_grp: OCR-D-IMG-CROP, path: OCR-D-IMG-CROP/OCR-D-IMG-CROP_0001.png
08:26:06.277 INFO ocrd.workspace - Saving mets '/usr/local/jb/ocrd-variants/x,sauvola-ms-split,0.34,sachunsky,tesseract-5M/mets.xml'
08:26:06.702 INFO ocrd.task_sequence.run_tasks - Finished processing task 'anybaseocr-crop -I OCR-D-N1 -O OCR-D-N2'
08:26:06.704 INFO ocrd.task_sequence.run_tasks - Start processing task 'olena-binarize -I OCR-D-N2 -O OCR-D-N3,OCR-D-M3 -p {"impl":"sauvola-ms-split","k":0.34}'
08:26:09.237 INFO ocrd-olena-binarize - processing PAGE-XML input file OCR-D-N2_0001 (P_0001)
08:26:09.268 INFO ocrd-olena-binarize - found imageFilename '23_-_lehrgegenstaende1790a_-_0_030.tif'
08:26:09.322 WARNING ocrd-olena-binarize - image URL '23_-_lehrgegenstaende1790a_-_0_030.tif' not referenced
08:26:09.331 DEBUG ocrd-olena-binarize - Using explicitly set page border '0,0 4942,0 4942,8418 0,8418'
convert: Incompatible type for "RichTIFFIPTC"; tag ignored. `TIFFFetchNormalTag' @ warning/tiff.c/TIFFWarnings/976.
convert: Unknown field with tag 34864 (0x8830) encountered. `TIFFReadCustomDirectory' @ warning/tiff.c/TIFFWarnings/976.
convert: Unknown field with tag 34866 (0x8832) encountered. `TIFFReadCustomDirectory' @ warning/tiff.c/TIFFWarnings/976.
convert: Unknown field with tag 42033 (0xa431) encountered. `TIFFReadCustomDirectory' @ warning/tiff.c/TIFFWarnings/976.
convert: Unknown field with tag 42034 (0xa432) encountered. `TIFFReadCustomDirectory' @ warning/tiff.c/TIFFWarnings/976.
convert: Unknown field with tag 42036 (0xa434) encountered. `TIFFReadCustomDirectory' @ warning/tiff.c/TIFFWarnings/976.
convert: Unknown field with tag 42037 (0xa435) encountered. `TIFFReadCustomDirectory' @ warning/tiff.c/TIFFWarnings/976.
convert: CorruptImageProfile `xmp' @ warning/profile.c/SetImageProfileInternal/1698.
08:26:21.177 INFO ocrd.workspace - Saving mets '/usr/local/jb/ocrd-variants/x,sauvola-ms-split,0.34,sachunsky,tesseract-5M/mets.xml'
08:26:21.642 INFO ocrd.workspace - Saving mets '/usr/local/jb/ocrd-variants/x,sauvola-ms-split,0.34,sachunsky,tesseract-5M/mets.xml'
08:26:21.720 INFO ocrd.task_sequence.run_tasks - Finished processing task 'olena-binarize -I OCR-D-N2 -O OCR-D-N3,OCR-D-M3 -p {"impl":"sauvola-ms-split","k":0.34}'
08:26:21.722 INFO ocrd.task_sequence.run_tasks - Start processing task 'cis-ocropy-denoise -I OCR-D-N3 -O OCR-D-N4 -p {"level-of-operation":"page","noise_maxsize":3.0}'
08:26:22.430 INFO ocrd.workspace_validator - input_file_grp=['OCR-D-N3'] output_file_grp=['OCR-D-N4']
08:26:22.431 INFO processor.OcropyDenoise - No output file group for images specified, falling back to 'OCR-D-IMG-DESPECK'
08:26:22.431 INFO processor.OcropyDenoise - INPUT FILE 0 / P_0001
08:26:22.563 INFO processor.OcropyDenoise - Page "P_0001" uses 1250.000000 DPI
08:26:22.564 INFO processor.OcropyDenoise - About to despeckle 'OCR-D-IMG-DESPECK_0001'
08:26:28.177 INFO ocrd.workspace - created file ID: OCR-D-IMG-DESPECK_0001, file_grp: OCR-D-IMG-DESPECK, path: OCR-D-IMG-DESPECK/OCR-D-IMG-DESPECK_0001.png
08:26:28.181 INFO processor.OcropyDenoise - created file ID: OCR-D-N4_0001, file_grp: OCR-D-N4, path: OCR-D-N4/OCR-D-N4_0001.xml
08:26:28.184 INFO ocrd.workspace - Saving mets '/usr/local/jb/ocrd-variants/x,sauvola-ms-split,0.34,sachunsky,tesseract-5M/mets.xml'
08:26:28.394 INFO ocrd.task_sequence.run_tasks - Finished processing task 'cis-ocropy-denoise -I OCR-D-N3 -O OCR-D-N4 -p {"level-of-operation":"page","noise_maxsize":3.0}'
08:26:28.396 INFO ocrd.task_sequence.run_tasks - Start processing task 'cis-ocropy-deskew -I OCR-D-N4 -O OCR-D-N5 -p {"level-of-operation":"page","maxskew":5}'
08:26:29.106 INFO ocrd.workspace_validator - input_file_grp=['OCR-D-N4'] output_file_grp=['OCR-D-N5']
08:26:29.106 INFO processor.OcropyDeskew - No output file group for images specified, falling back to 'OCR-D-IMG-DESKEW'
08:26:29.107 INFO processor.OcropyDeskew - INPUT FILE 0 / P_0001
08:26:29.114 INFO processor.OcropyDeskew - About to deskew page 'P_0001'
08:28:21.631 INFO processor.OcropyDeskew - Found angle for page 'P_0001': -0.2
08:28:23.562 INFO ocrd.workspace - created file ID: OCR-D-IMG-DESKEW_0001, file_grp: OCR-D-IMG-DESKEW, path: OCR-D-IMG-DESKEW/OCR-D-IMG-DESKEW_0001.png
08:28:23.571 INFO processor.OcropyDeskew - created file ID: OCR-D-N5_0001, file_grp: OCR-D-N5, path: OCR-D-N5/OCR-D-N5_0001.xml
08:28:23.574 INFO ocrd.workspace - Saving mets '/usr/local/jb/ocrd-variants/x,sauvola-ms-split,0.34,sachunsky,tesseract-5M/mets.xml'
08:28:23.662 INFO ocrd.task_sequence.run_tasks - Finished processing task 'cis-ocropy-deskew -I OCR-D-N4 -O OCR-D-N5 -p {"level-of-operation":"page","maxskew":5}'
08:28:23.664 INFO ocrd.task_sequence.run_tasks - Start processing task 'tesserocr-segment-region -I OCR-D-N5 -O OCR-D-N6 -p {"padding":5,"find_tables":false}'
08:28:24.184 INFO ocrd.workspace_validator - input_file_grp=['OCR-D-N5'] output_file_grp=['OCR-D-N6']
08:28:24.248 INFO processor.TesserocrSegmentRegion - INPUT FILE 0 / P_0001
08:28:24.256 INFO processor.TesserocrSegmentRegion - Page 'P_0001' images will use 1250 DPI from image meta-data
08:28:24.256 INFO processor.TesserocrSegmentRegion - Detecting regions in page 'P_0001'
08:28:31.378 INFO processor.TesserocrSegmentRegion - Detected region 'region0000': 886,609 4860,592 4861,936 887,953 (FLOWING_TEXT)
08:28:31.383 INFO processor.TesserocrSegmentRegion - Detected region 'region0001': 567,1043 4945,1023 4959,4346 581,4366 (FLOWING_TEXT)
08:28:31.383 INFO processor.TesserocrSegmentRegion - Detected region 'region0002': 748,4549 4935,4531 4949,7664 762,7683 (FLOWING_TEXT)
08:28:31.383 INFO processor.TesserocrSegmentRegion - Detected region 'region0003': 4909,1814 4946,1813 4949,2482 4912,2483 (VERTICAL_TEXT)
08:28:31.384 INFO processor.TesserocrSegmentRegion - Detected region 'region0004': 4728,3123 4797,3123 4798,3381 4729,3381 (VERTICAL_TEXT)
08:28:31.385 INFO processor.TesserocrSegmentRegion - Detected region 'region0005': 4684,354 4770,353 4804,8081 4718,8081 (VERT_LINE)
08:28:31.386 INFO processor.TesserocrSegmentRegion - Detected region 'region0006': -37,0 447,-2 484,8427 0,8429 (FLOWING_IMAGE)
08:28:31.484 INFO ocrd.workspace - Saving mets '/usr/local/jb/ocrd-variants/x,sauvola-ms-split,0.34,sachunsky,tesseract-5M/mets.xml'
08:28:31.572 INFO ocrd.task_sequence.run_tasks - Finished processing task 'tesserocr-segment-region -I OCR-D-N5 -O OCR-D-N6 -p {"padding":5,"find_tables":false}'
08:28:31.574 INFO ocrd.task_sequence.run_tasks - Start processing task 'segment-repair -I OCR-D-N6 -O OCR-D-N7 -p {"plausibilize":true}'
08:28:31.972 INFO ocrd.workspace_validator - input_file_grp=['OCR-D-N6'] output_file_grp=['OCR-D-N7']
08:28:31.972 INFO processor.RepairSegmentation - INPUT FILE 0 / P_0001
08:28:31.975 WARNING processor.RepairSegmentation - Region "region0001" extends beyond Border of page "P_0001"
08:28:31.975 WARNING processor.RepairSegmentation - Region "region0002" extends beyond Border of page "P_0001"
08:28:31.975 WARNING processor.RepairSegmentation - Region "region0003" extends beyond Border of page "P_0001"
08:28:31.975 WARNING processor.RepairSegmentation - Region "region0006" extends beyond Border of page "P_0001"
08:28:31.976 WARNING processor.RepairSegmentation - Page "P_0001" region "region0004" is within "region0001" (removing)
08:28:31.976 WARNING processor.RepairSegmentation - Page "P_0001" region "region0003" is within "region0001" (removing)
08:28:31.977 INFO ocrd.workspace - Saving mets '/usr/local/jb/ocrd-variants/x,sauvola-ms-split,0.34,sachunsky,tesseract-5M/mets.xml'
08:28:32.025 INFO ocrd.task_sequence.run_tasks - Finished processing task 'segment-repair -I OCR-D-N6 -O OCR-D-N7 -p {"plausibilize":true}'
08:28:32.027 INFO ocrd.task_sequence.run_tasks - Start processing task 'cis-ocropy-clip -I OCR-D-N7 -O OCR-D-N8'
08:28:32.682 INFO ocrd.workspace_validator - input_file_grp=['OCR-D-N7'] output_file_grp=['OCR-D-N8']
08:28:32.682 INFO processor.OcropyClip - No output file group for images specified, falling back to 'OCR-D-IMG-CLIP'
08:28:32.683 INFO processor.OcropyClip - INPUT FILE 0 / P_0001
08:28:32.691 INFO processor.OcropyClip - Page "P_0001" uses 1250.000000 DPI
08:28:43.084 INFO ocrd.workspace - created file ID: OCR-D-IMG-CLIP_0001_region0000, file_grp: OCR-D-IMG-CLIP, path: OCR-D-IMG-CLIP/OCR-D-IMG-CLIP_0001_region0000.png
08:28:51.500 INFO ocrd.workspace - created file ID: OCR-D-IMG-CLIP_0001_region0001, file_grp: OCR-D-IMG-CLIP, path: OCR-D-IMG-CLIP/OCR-D-IMG-CLIP_0001_region0001.png
08:29:00.059 INFO ocrd.workspace - created file ID: OCR-D-IMG-CLIP_0001_region0002, file_grp: OCR-D-IMG-CLIP, path: OCR-D-IMG-CLIP/OCR-D-IMG-CLIP_0001_region0002.png
08:29:00.077 INFO processor.OcropyClip - created file ID: OCR-D-N8_0001, file_grp: OCR-D-N8, path: OCR-D-N8/OCR-D-N8_0001.xml
08:29:00.093 INFO ocrd.workspace - Saving mets '/usr/local/jb/ocrd-variants/x,sauvola-ms-split,0.34,sachunsky,tesseract-5M/mets.xml'
08:29:00.193 INFO ocrd.task_sequence.run_tasks - Finished processing task 'cis-ocropy-clip -I OCR-D-N7 -O OCR-D-N8'
08:29:00.196 INFO ocrd.task_sequence.run_tasks - Start processing task 'cis-ocropy-segment -I OCR-D-N8 -O OCR-D-N9 -p {"spread":2.4}'
08:29:00.842 INFO ocrd.workspace_validator - input_file_grp=['OCR-D-N8'] output_file_grp=['OCR-D-N9']
08:29:00.843 INFO processor.OcropySegment - INPUT FILE 0 / P_0001
08:29:00.851 INFO processor.OcropySegment - Page "P_0001" uses 1250.000000 DPI
08:30:14.899 WARNING ocrolib - line 2 has extreme height (334 vs 157)
08:30:15.003 WARNING ocrolib - line 5 has extreme height (404 vs 157)
08:32:18.655 INFO processor.OcropySegment - created file ID: OCR-D-N9_0001, file_grp: OCR-D-N9, path: OCR-D-N9/OCR-D-N9_0001.xml
08:32:18.664 INFO ocrd.workspace - Saving mets '/usr/local/jb/ocrd-variants/x,sauvola-ms-split,0.34,sachunsky,tesseract-5M/mets.xml'
08:32:18.859 INFO ocrd.task_sequence.run_tasks - Finished processing task 'cis-ocropy-segment -I OCR-D-N8 -O OCR-D-N9 -p {"spread":2.4}'
08:32:18.861 INFO ocrd.task_sequence.run_tasks - Start processing task 'cis-ocropy-dewarp -I OCR-D-N9 -O OCR-D-N10'
08:32:19.557 INFO ocrd.workspace_validator - input_file_grp=['OCR-D-N9'] output_file_grp=['OCR-D-N10']
08:32:19.558 INFO processor.OcropyDewarp - No output file group for images specified, falling back to 'OCR-D-IMG-DEWARP'
08:32:19.558 INFO processor.OcropyDewarp - INPUT FILE 0 / P_0001
08:32:19.567 INFO processor.OcropyDewarp - Page "P_0001" uses 1250.000000 DPI
08:32:20.584 INFO processor.OcropyDewarp - About to dewarp page 'P_0001' region 'region0000' line 'region0000_line0000'
08:32:26.035 INFO ocrd.workspace - created file ID: OCR-D-IMG-DEWARP_0001_region0000_region0000_line0000, file_grp: OCR-D-IMG-DEWARP, path: OCR-D-IMG-DEWARP/OCR-D-IMG-DEWARP_0001_region0000_region0000_line0000.png
08:32:27.035 INFO processor.OcropyDewarp - About to dewarp page 'P_0001' region 'region0001' line 'region0001_line0000'
08:32:28.601 ERROR processor.OcropyDewarp - cannot dewarp line "region0001_line0000": found more than 1 textline (only 0.61 fg), most likely from bad cropping
08:32:28.838 INFO processor.OcropyDewarp - About to dewarp page 'P_0001' region 'region0001' line 'region0001_line0001'
08:32:28.899 WARNING processor.OcropyDewarp - cannot dewarp line "region0001_line0001": too many connected components (got 144, wanted <=20)
08:32:29.001 INFO ocrd.workspace - created file ID: OCR-D-IMG-DEWARP_0001_region0001_region0001_line0001, file_grp: OCR-D-IMG-DEWARP, path: OCR-D-IMG-DEWARP/OCR-D-IMG-DEWARP_0001_region0001_region0001_line0001.png
08:32:29.222 INFO processor.OcropyDewarp - About to dewarp page 'P_0001' region 'region0001' line 'region0001_line0002'
08:32:29.608 INFO ocrd.workspace - created file ID: OCR-D-IMG-DEWARP_0001_region0001_region0001_line0002, file_grp: OCR-D-IMG-DEWARP, path: OCR-D-IMG-DEWARP/OCR-D-IMG-DEWARP_0001_region0001_region0001_line0002.png
08:32:29.831 INFO processor.OcropyDewarp - About to dewarp page 'P_0001' region 'region0001' line 'region0001_line0003'
08:32:29.876 WARNING processor.OcropyDewarp - cannot dewarp line "region0001_line0003": image too tall for a text line (880, 3813)
08:32:29.974 INFO ocrd.workspace - created file ID: OCR-D-IMG-DEWARP_0001_region0001_region0001_line0003, file_grp: OCR-D-IMG-DEWARP, path: OCR-D-IMG-DEWARP/OCR-D-IMG-DEWARP_0001_region0001_region0001_line0003.png
08:32:30.212 INFO processor.OcropyDewarp - About to dewarp page 'P_0001' region 'region0001' line 'region0001_line0004'
08:33:04.601 INFO ocrd.workspace - created file ID: OCR-D-IMG-DEWARP_0001_region0001_region0001_line0004, file_grp: OCR-D-IMG-DEWARP, path: OCR-D-IMG-DEWARP/OCR-D-IMG-DEWARP_0001_region0001_region0001_line0004.png
08:33:04.809 INFO processor.OcropyDewarp - About to dewarp page 'P_0001' region 'region0001' line 'region0001_line0005'
08:33:09.955 ERROR processor.OcropyDewarp - cannot dewarp line "region0001_line0005": found more than 1 textline (only 0.54 fg), most likely from bad cropping
08:33:10.187 INFO processor.OcropyDewarp - About to dewarp page 'P_0001' region 'region0001' line 'region0001_line0006'
08:33:25.827 INFO ocrd.workspace - created file ID: OCR-D-IMG-DEWARP_0001_region0001_region0001_line0006, file_grp: OCR-D-IMG-DEWARP, path: OCR-D-IMG-DEWARP/OCR-D-IMG-DEWARP_0001_region0001_region0001_line0006.png
08:33:26.055 INFO processor.OcropyDewarp - About to dewarp page 'P_0001' region 'region0001' line 'region0001_line0007'
08:33:37.023 INFO ocrd.workspace - created file ID: OCR-D-IMG-DEWARP_0001_region0001_region0001_line0007, file_grp: OCR-D-IMG-DEWARP, path: OCR-D-IMG-DEWARP/OCR-D-IMG-DEWARP_0001_region0001_region0001_line0007.png
08:33:37.252 INFO processor.OcropyDewarp - About to dewarp page 'P_0001' region 'region0001' line 'region0001_line0008'
08:33:41.150 ERROR processor.OcropyDewarp - cannot dewarp line "region0001_line0008": found more than 1 textline (only 0.77 fg), most likely from bad cropping
08:33:42.103 INFO processor.OcropyDewarp - About to dewarp page 'P_0001' region 'region0002' line 'region0002_line0000'
08:33:42.154 WARNING processor.OcropyDewarp - cannot dewarp line "region0002_line0000": too many connected components (got 147, wanted <=20)
08:33:42.240 INFO ocrd.workspace - created file ID: OCR-D-IMG-DEWARP_0001_region0002_region0002_line0000, file_grp: OCR-D-IMG-DEWARP, path: OCR-D-IMG-DEWARP/OCR-D-IMG-DEWARP_0001_region0002_region0002_line0000.png
08:33:42.450 INFO processor.OcropyDewarp - About to dewarp page 'P_0001' region 'region0002' line 'region0002_line0001'
08:33:42.458 WARNING processor.OcropyDewarp - cannot dewarp line "region0002_line0001": too many connected components (got 29, wanted <=11)
08:33:42.474 INFO ocrd.workspace - created file ID: OCR-D-IMG-DEWARP_0001_region0002_region0002_line0001, file_grp: OCR-D-IMG-DEWARP, path: OCR-D-IMG-DEWARP/OCR-D-IMG-DEWARP_0001_region0002_region0002_line0001.png
08:33:42.669 INFO processor.OcropyDewarp - About to dewarp page 'P_0001' region 'region0002' line 'region0002_line0002'
08:33:42.707 WARNING processor.OcropyDewarp - cannot dewarp line "region0002_line0002": image too tall for a text line (937, 3818)
08:33:42.805 INFO ocrd.workspace - created file ID: OCR-D-IMG-DEWARP_0001_region0002_region0002_line0002, file_grp: OCR-D-IMG-DEWARP, path: OCR-D-IMG-DEWARP/OCR-D-IMG-DEWARP_0001_region0002_region0002_line0002.png
08:33:42.977 INFO processor.OcropyDewarp - About to dewarp page 'P_0001' region 'region0002' line 'region0002_line0003'
08:33:43.019 WARNING processor.OcropyDewarp - cannot dewarp line "region0002_line0003": too many connected components (got 142, wanted <=20)
08:33:43.104 INFO ocrd.workspace - created file ID: OCR-D-IMG-DEWARP_0001_region0002_region0002_line0003, file_grp: OCR-D-IMG-DEWARP, path: OCR-D-IMG-DEWARP/OCR-D-IMG-DEWARP_0001_region0002_region0002_line0003.png
08:33:43.271 INFO processor.OcropyDewarp - About to dewarp page 'P_0001' region 'region0002' line 'region0002_line0004'
08:33:43.277 WARNING processor.OcropyDewarp - cannot dewarp line "region0002_line0004": image too tall for a text line (858, 757)
08:33:43.296 INFO ocrd.workspace - created file ID: OCR-D-IMG-DEWARP_0001_region0002_region0002_line0004, file_grp: OCR-D-IMG-DEWARP, path: OCR-D-IMG-DEWARP/OCR-D-IMG-DEWARP_0001_region0002_region0002_line0004.png
08:33:43.493 INFO processor.OcropyDewarp - About to dewarp page 'P_0001' region 'region0002' line 'region0002_line0005'
08:33:44.554 ERROR processor.OcropyDewarp - cannot dewarp line "region0002_line0005": found more than 1 textline (only 0.63 fg), most likely from bad cropping
08:33:44.748 INFO processor.OcropyDewarp - About to dewarp page 'P_0001' region 'region0002' line 'region0002_line0006'
08:33:44.801 WARNING processor.OcropyDewarp - cannot dewarp line "region0002_line0006": too many connected components (got 187, wanted <=21)
08:33:44.900 INFO ocrd.workspace - created file ID: OCR-D-IMG-DEWARP_0001_region0002_region0002_line0006, file_grp: OCR-D-IMG-DEWARP, path: OCR-D-IMG-DEWARP/OCR-D-IMG-DEWARP_0001_region0002_region0002_line0006.png
08:33:45.072 INFO processor.OcropyDewarp - About to dewarp page 'P_0001' region 'region0002' line 'region0002_line0007'
08:33:53.501 ERROR processor.OcropyDewarp - cannot dewarp line "region0002_line0007": found more than 1 textline (only 0.88 fg), most likely from bad cropping
08:33:53.503 INFO processor.OcropyDewarp - created file ID: OCR-D-N10_0001, file_grp: OCR-D-N10, path: OCR-D-N10/OCR-D-N10_0001.xml
08:33:53.516 INFO ocrd.workspace - Saving mets '/usr/local/jb/ocrd-variants/x,sauvola-ms-split,0.34,sachunsky,tesseract-5M/mets.xml'
# CenterNormalizer
08:33:53.617 INFO ocrd.task_sequence.run_tasks - Finished processing task 'cis-ocropy-dewarp -I OCR-D-N9 -O OCR-D-N10'
08:33:53.619 INFO ocrd.task_sequence.run_tasks - Start processing task 'tesserocr-recognize -I OCR-D-N10 -O OCR-D-OCR -p {"textequiv_level":"glyph","overwrite_words":true,"model":"GT4HistOCR_50000000.575_401209"}'
08:33:54.155 INFO ocrd.workspace_validator - input_file_grp=['OCR-D-N10'] output_file_grp=['OCR-D-OCR']
08:33:54.552 INFO processor.TesserocrRecognize - Using model 'GT4HistOCR_50000000.575_401209' in /usr/local/ocrd_all/venv/share/tessdata/ for recognition at the glyph level
08:33:54.553 INFO processor.TesserocrRecognize - INPUT FILE 0 / P_0001
08:33:54.563 INFO processor.TesserocrRecognize - Page 'P_0001' images will use 1250 DPI from image meta-data
08:33:54.563 INFO processor.TesserocrRecognize - Processing page 'P_0001'
08:33:55.567 ERROR ocrd.workspace - segment "region0000_line0000" image (binarized,despeckled,dewarped; 3973x372) has not been cropped properly (3973x343)
08:33:57.176 ERROR ocrd.workspace - segment "region0001_line0001" image (binarized,despeckled,dewarped; 4140x1244) has not been cropped properly (4140x830)
08:33:57.652 ERROR ocrd.workspace - segment "region0001_line0002" image (binarized,despeckled,dewarped; 443x354) has not been cropped properly (443x272)
08:33:57.948 ERROR ocrd.workspace - segment "region0001_line0003" image (binarized,despeckled,dewarped; 3813x1320) has not been cropped properly (3813x880)
08:33:58.426 ERROR ocrd.workspace - segment "region0001_line0004" image (binarized,despeckled,dewarped; 4377x818) has not been cropped properly (4377x825)
08:33:59.430 ERROR ocrd.workspace - segment "region0001_line0006" image (binarized,despeckled,dewarped; 3730x846) has not been cropped properly (3730x602)
08:33:59.920 ERROR ocrd.workspace - segment "region0001_line0007" image (binarized,despeckled,dewarped; 3864x810) has not been cropped properly (3864x495)
08:34:01.638 ERROR ocrd.workspace - segment "region0002_line0000" image (binarized,despeckled,dewarped; 3719x1143) has not been cropped properly (3719x763)
08:34:02.082 ERROR ocrd.workspace - segment "region0002_line0001" image (binarized,despeckled,dewarped; 1223x697) has not been cropped properly (1223x465)
08:34:02.356 ERROR ocrd.workspace - segment "region0002_line0002" image (binarized,despeckled,dewarped; 3818x1405) has not been cropped properly (3818x937)
08:34:02.822 ERROR ocrd.workspace - segment "region0002_line0003" image (binarized,despeckled,dewarped; 3710x1170) has not been cropped properly (3710x780)
08:34:02.952 WARNING processor.TesserocrRecognize - No text in line 'region0002_line0003'
08:34:03.156 ERROR ocrd.workspace - segment "region0002_line0004" image (binarized,despeckled,dewarped; 757x1286) has not been cropped properly (757x858)
08:34:03.183 WARNING processor.TesserocrRecognize - No text in line 'region0002_line0004'
08:34:03.671 ERROR ocrd.workspace - segment "region0002_line0006" image (binarized,despeckled,dewarped; 4177x1245) has not been cropped properly (4177x831)
08:34:04.477 INFO ocrd.workspace - Saving mets '/usr/local/jb/ocrd-variants/x,sauvola-ms-split,0.34,sachunsky,tesseract-5M/mets.xml'
08:34:04.554 INFO ocrd.task_sequence.run_tasks - Finished processing task 'tesserocr-recognize -I OCR-D-N10 -O OCR-D-OCR -p {"textequiv_level":"glyph","overwrite_words":true,"model":"GT4HistOCR_50000000.575_401209"}'
08:34:04.555 INFO ocrd.cli.process - Finished
615.21user 204.05system 8:38.85elapsed 157%CPU (0avgtext+0avgdata 2175028maxresident)k
203832inputs+518600outputs (118major+14157117minor)pagefaults 0swaps
bertsky commented 4 years ago

> the second ocrd_olena still takes the already-binarized image

re-install ocrd_olena?

> complete fresh ocrd_all install, with ImageMagick-6.9.11-6 - works

For the record: So was it the IM update or the ocrd_olena update that made the difference?

jbarth-ubhd commented 4 years ago

> the second ocrd_olena still takes the already-binarized image
>
> re-install ocrd_olena?

I did

git pull
git submodule sync
git submodule update --init --recursive

but forgot to run `make all` afterwards.
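
For reference, the update steps above and the missing build step can be chained so that a failure in any step stops the sequence. This is only a sketch: the `run` helper, the `DRY_RUN` switch and the `update_ocrd_all` function name are made up for illustration and are not part of ocrd_all.

```shell
#!/bin/sh
# Hypothetical helper: run one command, or only print it when DRY_RUN=1.
run() {
    if [ "${DRY_RUN:-0}" = 1 ]; then
        printf '%s\n' "$*"
    else
        "$@"
    fi
}

# Hypothetical wrapper around the steps named in this thread.
# $1: path to the ocrd_all checkout (defaults to the current directory).
update_ocrd_all() {
    (
        # The subshell keeps the caller's working directory unchanged.
        cd "${1:-.}" || exit 1
        run git pull &&
        run git submodule sync &&
        run git submodule update --init --recursive &&
        run make all
    )
}
```

Calling `DRY_RUN=1 update_ocrd_all /usr/local/ocrd_all` only lists the four commands (the path is an assumption based on the venv location in the logs above); without `DRY_RUN` they are executed in order.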

> complete fresh ocrd_all install, with ImageMagick-6.9.11-6 - works
>
> For the record: So was it the IM update or the ocrd_olena update that made the difference?

I think IM.

bertsky commented 4 years ago

> I think IM.

Then we have to keep in mind that certain IM versions between 6.9.7.4 and 6.9.11-6 don't work.
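
A rough way to flag an installation that falls into that range is sketched below. It assumes GNU `sort -V` is available, and the function names (`normalize`, `version_lt`, `im_version_suspect`) are hypothetical. Since only certain versions in the range are reported to fail, a match means "possibly affected", not "broken".

```shell
#!/bin/sh
# Turn an ImageMagick version such as "6.9.11-6" into "6.9.11.6".
normalize() {
    printf '%s\n' "$1" | sed 's/-/./g'
}

# Succeed if version $1 sorts strictly before version $2 (needs GNU sort -V).
version_lt() {
    a=$(normalize "$1")
    b=$(normalize "$2")
    [ "$a" != "$b" ] &&
        [ "$(printf '%s\n%s\n' "$a" "$b" | sort -V | head -n 1)" = "$a" ]
}

# Succeed if $1 lies strictly between the two versions named in this thread.
im_version_suspect() {
    version_lt 6.9.7.4 "$1" && version_lt "$1" 6.9.11-6
}
```

For example, `im_version_suspect 6.9.10-23 && echo "possibly affected"` prints the warning, while 6.9.7-4 and 6.9.11-6 themselves are not flagged. The installed version can usually be read from the first line of `convert -version`.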