OCR-D / ocrd_anybaseocr

DFKI Layout Detection for OCR-D
Apache License 2.0

ocrd-anybaseocr-crop not running with binary added to existing workspace #85

Closed SB2020-eye closed 3 years ago

SB2020-eye commented 3 years ago

I am trying to run ocrd-anybaseocr-crop within a workspace that consists of a very large image file and a binarized version of that image added manually. I'm on Windows 10, using WSL and Ubuntu 18.04.

My workflow is below. (Non-commands are in parentheses.)

(create folder sbb_test_enhancedx2_b as new workspace)
cd /mnt/c/Users/Scott/Desktop/Python2/Kells/sbb_test_enhancedx2_b
ocrd-import -C -P
(create folder OCR-D-BIN)
(put binary image in OCR-D-BIN)
ocrd workspace add -g P_00002 -G OCR-D-BIN -i OCR-D-BIN_00001 -m image/png OCR-D-BIN/Folio_073r-Enhanced2x_mask.png
ocrd-preprocess-image -P output_feature_added binarized -P level-of-operation page -P command "cp @INFILE @OUTFILE" -I OCR-D-IMG -O OCR-D-BIN --overwrite
ocrd-anybaseocr-crop -I OCR-D-BIN -O OCR-D-CROP

Once I got to the last command above (crop), after hitting enter, my computer ran for about 4 and a half hours, then gave

Exception: Found no AlternativeImage that satisfies all requirements selector="binarized" in page "P_00002"

The resulting PAGE-XML input file is below.

<?xml version="1.0" encoding="UTF-8"?>
<pc:PcGts xmlns:pc="http://schema.primaresearch.org/PAGE/gts/pagecontent/2019-07-15" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:schemaLocation="http://schema.primaresearch.org/PAGE/gts/pagecontent/2019-07-15 http://schema.primaresearch.org/PAGE/gts/pagecontent/2019-07-15/pagecontent.xsd" pcGtsId="OCR-D-BIN_Folio_073r-Enhanced2x">
    <pc:Metadata>
        <pc:Creator>OCR-D/core 2.23.2</pc:Creator>
        <pc:Created>2021-04-16T21:13:14.104408</pc:Created>
        <pc:LastChange>2021-04-16T21:13:14.104408</pc:LastChange>
        <pc:MetadataItem type="processingStep" name="preprocessing/optimization" value="ocrd-preprocess-image">
            <pc:Labels externalModel="ocrd-tool" externalId="parameters">
                <pc:Label value="binarized" type="output_feature_added"/>
                <pc:Label value="page" type="level-of-operation"/>
                <pc:Label value="cp @INFILE @OUTFILE" type="command"/>
                <pc:Label value="" type="input_feature_selector"/>
                <pc:Label value="" type="input_feature_filter"/>
                <pc:Label value="image/png" type="input_mimetype"/>
                <pc:Label value="image/png" type="output_mimetype"/>
            </pc:Labels>
            <pc:Labels externalModel="ocrd-tool" externalId="version">
                <pc:Label value="0.1.7" type="ocrd-preprocess-image"/>
                <pc:Label value="2.23.2" type="ocrd/core"/>
            </pc:Labels>
        </pc:MetadataItem>
    </pc:Metadata>
    <pc:Page imageFilename="Folio_073r-Enhanced2x.png" imageWidth="9732" imageHeight="12600">
        <pc:AlternativeImage filename="OCR-D-BIN/OCR-D-BIN_Folio_073r-Enhanced2x.IMG-BINARIZED.png" comments=",binarized"/>
    </pc:Page>
</pc:PcGts>

bertsky commented 3 years ago

Thanks for your report. It now becomes apparent that there has been a misunderstanding:

ocrd-import -C -P
(create folder OCR-D-BIN)
(put binary image in OCR-D-BIN)
ocrd workspace add -g P_00002 -G OCR-D-BIN -i OCR-D-BIN_00001 -m image/png OCR-D-BIN/Folio_073r-Enhanced2x_mask.png
ocrd-preprocess-image -P output_feature_added binarized -P level-of-operation page -P command "cp @INFILE @OUTFILE" -I OCR-D-IMG -O OCR-D-BIN --overwrite

The last command uses the wrong wiring. While you wanted to mark the (page-level, manually created) image under OCR-D-BIN as binarized, this instead marks the (original, very large) image file under OCR-D-IMG. So OCR-D-BIN will now contain two "pages":

  1. the original image from ocrd-import as page Folio_073r-Enhanced2x and (incorrectly) marked as binarized
  2. the binarized image from ocrd workspace add as page P_00002 and not marked at all (not even wrapped in a PAGE)
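
One way to check which images a PAGE file actually marks as binarized is to read the comments attribute of its AlternativeImage entries. The following is a minimal sketch using only the Python standard library, not the OCR-D API. The helper name binarized_images is hypothetical, and treating comments as a comma-separated feature list is an assumption based on the value ",binarized" in the file quoted above.

```python
import xml.etree.ElementTree as ET

# Namespace copied from the PAGE-XML file quoted earlier in this thread.
PAGE_NS = {"pc": "http://schema.primaresearch.org/PAGE/gts/pagecontent/2019-07-15"}


def binarized_images(page_xml):
    """Return filenames of page-level AlternativeImage entries marked 'binarized'.

    Hypothetical helper for illustration; it is not part of OCR-D.
    """
    root = ET.fromstring(page_xml)
    found = []
    for image in root.findall("pc:Page/pc:AlternativeImage", PAGE_NS):
        # Assumption: 'comments' holds a comma-separated list of features.
        features = image.get("comments", "").split(",")
        if "binarized" in features:
            found.append(image.get("filename"))
    return found


# Reduced copy of the page element from the report above.
sample = """<pc:PcGts xmlns:pc="http://schema.primaresearch.org/PAGE/gts/pagecontent/2019-07-15">
  <pc:Page imageFilename="Folio_073r-Enhanced2x.png" imageWidth="9732" imageHeight="12600">
    <pc:AlternativeImage filename="OCR-D-BIN/OCR-D-BIN_Folio_073r-Enhanced2x.IMG-BINARIZED.png" comments=",binarized"/>
  </pc:Page>
</pc:PcGts>"""

print(binarized_images(sample))
```

Run against the file from the report, this lists the copy of the original image, which matches item 1: the binarized mark ended up on the wrong page.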

Change your workflow to this:

mkdir workspace
cp path/to/original-images/*.png workspace
ocrd-import -C -P workspace
mkdir workspace/OCR-D-BIN
cp path/to/binarized-images/*.png workspace/OCR-D-BIN
cd workspace
ocrd workspace bulk-add -r '^.*/(?P<filegrp>[^/]+)/(?P<pageid>.*)\.(?P<ext>[^\.]*)$' -G '{{ filegrp }}' -i 'FILE_{{ filegrp }}_{{ pageid }}' -u '{{ fileGrp }}/{{ pageid }}.{{ ext }}' OCR-D-BIN/*.png
ocrd-preprocess-image -P output_feature_added binarized -P command "cp @INFILE @OUTFILE" -I OCR-D-BIN -O OCR-D-BIN-PAGE
ocrd-anybaseocr-crop -I OCR-D-BIN-PAGE -O OCR-D-CROP

I used the bulk-add recipe because I assume you ultimately want to process multiple files. This should yield the same pageId identifiers as ocrd-import -P, but only if you used the same base names for your manually crafted images. You could also skip copying and importing the (much larger) original images, because the OCR-D workflow from here on will not use or relate to them. (And since they are that large, I would not recommend trying anyway.)
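
To see how the -r expression splits a file path into filegrp, pageid and ext, the pattern can be tried in plain Python, since it uses the same named-group syntax as the re module. This is only a sketch: the paths below are made up, and whether bulk-add matches against absolute paths is an assumption.

```python
import re

# The regular expression from the bulk-add recipe, unchanged.
PATTERN = re.compile(r'^.*/(?P<filegrp>[^/]+)/(?P<pageid>.*)\.(?P<ext>[^\.]*)$')


def split_path(path):
    """Return the named groups for a path, or None if it does not match."""
    match = PATTERN.match(path)
    return match.groupdict() if match else None


# Hypothetical paths, used for illustration only.
same_name = split_path("/data/workspace/OCR-D-BIN/Folio_073r-Enhanced2x.png")
suffixed = split_path("/data/workspace/OCR-D-BIN/Folio_073r-Enhanced2x_bin.png")

print(same_name)
print(suffixed)
```

The second path shows why the base names matter: a suffix such as _bin becomes part of pageid, so the page would no longer share its identifier with the one created by ocrd-import -P.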

BTW, the reason ocrd-anybaseocr-crop took such an insane amount of time just to enter its main function and immediately exit with an error was definitely the huge image resolution. We currently cannot deal with such large objects well in OCR-D. So I think this issue can be closed IMO, but we can still continue discussing your workflow / use-case here if needed.
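
For a rough sense of scale, the image dimensions recorded in the PAGE-XML above can be turned into a pixel count and an approximate in-memory size. This is a back-of-the-envelope sketch assuming a single uncompressed 8-bit RGB copy. Actual memory use in a processor depends on its internals and may be several times higher.

```python
# imageWidth / imageHeight as recorded in the PAGE-XML quoted in the report.
width, height = 9732, 12600

pixels = width * height

# Assumption: 3 bytes per pixel (8-bit RGB), one uncompressed copy in memory.
rgb_bytes = pixels * 3
rgb_mib = rgb_bytes / 2**20

print(f"{pixels:,} pixels, about {rgb_mib:.0f} MiB per uncompressed RGB copy")
```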

SB2020-eye commented 3 years ago

Many thanks, @bertsky. I will definitely close this up very soon.

If you don't mind, though...

I got as far as ocrd workspace bulk-add -r '^.*/(?P<filegrp>[^/]+)/(?P<pageid>.*)\.(?P<ext>[^\.]*)$' -G '{{ filegrp }}' -i 'FILE_{{ filegrp }}_{{ pageid }}' -u '{{ fileGrp }}/{{ pageid }}.{{ ext }}' OCR-D-BIN/*.png but got the following AttributeError:

(venv) scott@Yogi:/mnt/c/Users/Scott/Desktop/Python2/Kells/workspace$ ocrd workspace bulk-add -r '^.*/(?P<filegrp>[^/]+)/(?P<pageid>.*)\.(?P<ext>[^\.]*)$' -G '{{ filegrp }}' -i 'FILE_{{ filegrp }}_{{ pageid }}' -u '{{ fileGrp }}/{{ pageid }}.{{ ext }}' OCR-D-BIN/*.png
20:53:24.276 INFO ocrd.cli.workspace.bulk-add - [   0/1] /mnt/c/Users/Scott/Desktop/Python2/Kells/workspace/OCR-D-BIN/Folio_073r-Enhanced2x_bin.png
Traceback (most recent call last):
  File "/home/scott/src/github/OCR-D/ocrd_all/venv/bin/ocrd", line 8, in <module>
    sys.exit(cli())
  File "/home/scott/src/github/OCR-D/ocrd_all/venv/lib/python3.6/site-packages/click/core.py", line 829, in __call__
    return self.main(*args, **kwargs)
  File "/home/scott/src/github/OCR-D/ocrd_all/venv/lib/python3.6/site-packages/click/core.py", line 782, in main
    rv = self.invoke(ctx)
  File "/home/scott/src/github/OCR-D/ocrd_all/venv/lib/python3.6/site-packages/click/core.py", line 1259, in invoke
    return _process_result(sub_ctx.command.invoke(sub_ctx))
  File "/home/scott/src/github/OCR-D/ocrd_all/venv/lib/python3.6/site-packages/click/core.py", line 1259, in invoke
    return _process_result(sub_ctx.command.invoke(sub_ctx))
  File "/home/scott/src/github/OCR-D/ocrd_all/venv/lib/python3.6/site-packages/click/core.py", line 1066, in invoke
    return ctx.invoke(self.callback, **ctx.params)
  File "/home/scott/src/github/OCR-D/ocrd_all/venv/lib/python3.6/site-packages/click/core.py", line 610, in invoke
    return callback(*args, **kwargs)
  File "/home/scott/src/github/OCR-D/ocrd_all/venv/lib/python3.6/site-packages/click/decorators.py", line 73, in new_func
    return ctx.invoke(f, obj, *args, **kwargs)
  File "/home/scott/src/github/OCR-D/ocrd_all/venv/lib/python3.6/site-packages/click/core.py", line 610, in invoke
    return callback(*args, **kwargs)
  File "/home/scott/src/github/OCR-D/ocrd_all/venv/lib/python3.6/site-packages/ocrd/cli/workspace.py", line 300, in workspace_cli_bulk_add
    file_dict[param_name] = file_dict[param_name].replace('{{ %s }}' % group_name, group_dict[group_name])
AttributeError: 'NoneType' object has no attribute 'replace'

bertsky commented 3 years ago

I'm sorry, I misspelled the -u parameter. It should be filegrp instead of fileGrp.

SB2020-eye commented 3 years ago

Thanks. I was hoping it was something small like that... Any chance there's another such small thing? Edited command and output:

ocrd workspace bulk-add -r '^.*/(?P<filegrp>[^/]+)/(?P<pageid>.*)\.(?P<ext>[^\.]*)$' -G '{{ filegrp }}' -i 'FILE_{{ filegrp }}_{{ pageid }}' -u '{{ filegrp }}/{{ pageid }}.{{ ext }}' OCR-D-BIN/*.png
09:59:58.930 INFO ocrd.cli.workspace.bulk-add - [   0/1] /mnt/c/Users/Scott/Desktop/Python2/Kells/workspace/OCR-D-BIN/Folio_073r-Enhanced2x_bin.png
Traceback (most recent call last):
  File "/home/scott/src/github/OCR-D/ocrd_all/venv/bin/ocrd", line 8, in <module>
    sys.exit(cli())
  File "/home/scott/src/github/OCR-D/ocrd_all/venv/lib/python3.6/site-packages/click/core.py", line 829, in __call__
    return self.main(*args, **kwargs)
  File "/home/scott/src/github/OCR-D/ocrd_all/venv/lib/python3.6/site-packages/click/core.py", line 782, in main
    rv = self.invoke(ctx)
  File "/home/scott/src/github/OCR-D/ocrd_all/venv/lib/python3.6/site-packages/click/core.py", line 1259, in invoke
    return _process_result(sub_ctx.command.invoke(sub_ctx))
  File "/home/scott/src/github/OCR-D/ocrd_all/venv/lib/python3.6/site-packages/click/core.py", line 1259, in invoke
    return _process_result(sub_ctx.command.invoke(sub_ctx))
  File "/home/scott/src/github/OCR-D/ocrd_all/venv/lib/python3.6/site-packages/click/core.py", line 1066, in invoke
    return ctx.invoke(self.callback, **ctx.params)
  File "/home/scott/src/github/OCR-D/ocrd_all/venv/lib/python3.6/site-packages/click/core.py", line 610, in invoke
    return callback(*args, **kwargs)
  File "/home/scott/src/github/OCR-D/ocrd_all/venv/lib/python3.6/site-packages/click/decorators.py", line 73, in new_func
    return ctx.invoke(f, obj, *args, **kwargs)
  File "/home/scott/src/github/OCR-D/ocrd_all/venv/lib/python3.6/site-packages/click/core.py", line 610, in invoke
    return callback(*args, **kwargs)
  File "/home/scott/src/github/OCR-D/ocrd_all/venv/lib/python3.6/site-packages/ocrd/cli/workspace.py", line 300, in workspace_cli_bulk_add
    file_dict[param_name] = file_dict[param_name].replace('{{ %s }}' % group_name, group_dict[group_name])
AttributeError: 'NoneType' object has no attribute 'replace'

bertsky commented 3 years ago

Sorry again. I think I forgot -g "{{ pageid }}", but I am not sure yet whether that explains the error message.
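
The traceback shows only one line of the substitution logic, so the following is a simplified stand-in modelled on that line, not the actual OCR-D code. The function name fill_templates, the dictionary keys, and the idea that an option which was not given is stored as None are all assumptions. Under those assumptions, it suggests that a misspelled placeholder would not raise this error, while a missing option would.

```python
def fill_templates(file_dict, group_dict):
    """Replace '{{ name }}' placeholders in every template of file_dict.

    Simplified stand-in for the line shown in the traceback; hypothetical.
    """
    for param_name in file_dict:
        for group_name in group_dict:
            file_dict[param_name] = file_dict[param_name].replace(
                '{{ %s }}' % group_name, group_dict[group_name])
    return file_dict


groups = {"filegrp": "OCR-D-BIN", "pageid": "Folio_073r-Enhanced2x_bin", "ext": "png"}

# A misspelled placeholder is simply left unreplaced; no exception is raised.
misspelled = fill_templates({"url": "{{ fileGrp }}/{{ pageid }}.{{ ext }}"}, groups)
print(misspelled["url"])

# An option that was never given (modelled here as None) fails on .replace().
try:
    fill_templates({"page_id": None}, groups)
    message = None
except AttributeError as err:
    message = str(err)
print(message)
```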

SB2020-eye commented 3 years ago

I believe that worked. No error, at least. Thank you for bearing with me on this!