SB2020-eye closed this issue 3 years ago
Thanks for your report. It is now apparent that there has been a misunderstanding:
ocrd-import -C -P
(create folder OCR-D-BIN)
(put binary image in OCR-D-BIN)
ocrd workspace add -g P_00002 -G OCR-D-BIN -i OCR-D-BIN_00001 -m image/png OCR-D-BIN/Folio_073r-Enhanced2x_mask.png
ocrd-preprocess-image -P output_feature_added binarized -P level-of-operation page -P command "cp @INFILE @OUTFILE" -I OCR-D-IMG -O OCR-D-BIN --overwrite
The last command uses the wrong wiring. While you wanted to mark the (page-level, manually created) image under OCR-D-BIN as binarized, this instead marks the (original, very large) image file under OCR-D-IMG. So OCR-D-BIN will now contain two "pages":
- the one from ocrd-import, as page Folio_073r-Enhanced2x and (incorrectly) marked as binarized
- the one from ocrd workspace add, as page P_00002 and not marked at all (not even wrapped in a PAGE)

Change your workflow to this:
mkdir workspace
cp path/to/original-images/*.png workspace
ocrd-import -C -P workspace
mkdir workspace/OCR-D-BIN
cp path/to/binarized-images/*.png workspace/OCR-D-BIN
cd workspace
ocrd workspace bulk-add -r '^.*/(?P<filegrp>[^/]+)/(?P<pageid>.*)\.(?P<ext>[^\.]*)$' -G '{{ filegrp }}' -i 'FILE_{{ filegrp }}_{{ pageid }}' -u '{{ fileGrp }}/{{ pageid }}.{{ ext }}' OCR-D-BIN/*.png
ocrd-preprocess-image -P output_feature_added binarized -P command "cp @INFILE @OUTFILE" -I OCR-D-BIN -O OCR-D-BIN-PAGE
ocrd-anybaseocr-crop -I OCR-D-BIN-PAGE -O OCR-D-CROP
I used the bulk-add recipe because I assume you ultimately want to process multiple files. This should yield the same pageId identifiers as ocrd-import -P, but only if you used the same base names for your manually crafted images. You could also skip copying and importing the (much larger) original images, because the OCR-D workflow from here on will not use or relate to them. (And since they are that large, I would not recommend trying anyway.)
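To make it easier to see which identifiers the regex in the bulk-add command produces, here is a minimal Python sketch. It only mimics the capture-and-substitute idea with the standard `re` module and is not OCR-D's actual implementation; the `fill` helper and the sample path are hypothetical.

```python
import re

# Regex copied from the bulk-add command above.
PATTERN = re.compile(r'^.*/(?P<filegrp>[^/]+)/(?P<pageid>.*)\.(?P<ext>[^\.]*)$')


def fill(template, groups):
    """Replace each '{{ name }}' placeholder with the captured value (hypothetical helper)."""
    for name, value in groups.items():
        template = template.replace('{{ %s }}' % name, value)
    return template


# Hypothetical path; the regex expects at least one directory above the file group.
path = '/data/workspace/OCR-D-BIN/Folio_073r-Enhanced2x.png'
groups = PATTERN.match(path).groupdict()

file_grp = fill('{{ filegrp }}', groups)
page_id = fill('{{ pageid }}', groups)
file_id = fill('FILE_{{ filegrp }}_{{ pageid }}', groups)
print(file_grp, page_id, file_id)
```

Under these assumptions the page ID is simply the file's base name without its extension, which is why the manually crafted images need the same base names as the originals: a suffix such as `_bin` would end up in the page ID.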
BTW, the reason ocrd-anybaseocr-crop took such an insane amount of time just to enter its main function and immediately exit with an error was definitely the huge image resolution. We currently cannot deal well with such large objects in OCR-D. So I think this issue can be closed IMO, but we can still continue discussing your workflow / use case here if needed.
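If you want to check how large an image is in pixels before feeding it to a processor, a minimal sketch using only the Python standard library could look like this. It reads the width and height from the IHDR chunk at the start of a PNG file; the function name `png_dimensions` and the synthetic header (with made-up dimensions) are hypothetical and only serve as illustration.

```python
import struct

PNG_SIGNATURE = b'\x89PNG\r\n\x1a\n'


def png_dimensions(data):
    """Return (width, height) read from the IHDR chunk at the start of PNG data."""
    if data[:8] != PNG_SIGNATURE or data[12:16] != b'IHDR':
        raise ValueError('not a PNG header')
    # Width and height are two big-endian unsigned 32-bit integers.
    width, height = struct.unpack('>II', data[16:24])
    return width, height


# Synthetic header with made-up dimensions: signature, IHDR length (13),
# chunk type, then width and height.
header = PNG_SIGNATURE + struct.pack('>I', 13) + b'IHDR' + struct.pack('>II', 12000, 16000)
print(png_dimensions(header))
```

For a real file, passing the first 24 bytes is enough, for example `png_dimensions(open(path, 'rb').read(24))`.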
Many thanks, @bertsky. I will definitely close this up very soon.
If you don't mind, though...
I got as far as
ocrd workspace bulk-add -r '^.*/(?P<filegrp>[^/]+)/(?P<pageid>.*)\.(?P<ext>[^\.]*)$' -G '{{ filegrp }}' -i 'FILE_{{ filegrp }}_{{ pageid }}' -u '{{ fileGrp }}/{{ pageid }}.{{ ext }}' OCR-D-BIN/*.png
but got the following AttributeError:
(venv) scott@Yogi:/mnt/c/Users/Scott/Desktop/Python2/Kells/workspace$ ocrd workspace bulk-add -r '^.*/(?P<filegrp>[^/]+)/(?P<pageid>.*)\.(?P<ext>[^\.]*)$' -G '{{ filegrp }}' -i 'FILE_{{ filegrp }}_{{ pageid }}' -u '{{ fileGrp }}/{{ pageid }}.{{ ext }}' OCR-D-BIN/*.png
20:53:24.276 INFO ocrd.cli.workspace.bulk-add - [ 0/1] /mnt/c/Users/Scott/Desktop/Python2/Kells/workspace/OCR-D-BIN/Folio_073r-Enhanced2x_bin.png
Traceback (most recent call last):
File "/home/scott/src/github/OCR-D/ocrd_all/venv/bin/ocrd", line 8, in <module>
sys.exit(cli())
File "/home/scott/src/github/OCR-D/ocrd_all/venv/lib/python3.6/site-packages/click/core.py", line 829, in __call__
return self.main(*args, **kwargs)
File "/home/scott/src/github/OCR-D/ocrd_all/venv/lib/python3.6/site-packages/click/core.py", line 782, in main
rv = self.invoke(ctx)
File "/home/scott/src/github/OCR-D/ocrd_all/venv/lib/python3.6/site-packages/click/core.py", line 1259, in invoke
return _process_result(sub_ctx.command.invoke(sub_ctx))
File "/home/scott/src/github/OCR-D/ocrd_all/venv/lib/python3.6/site-packages/click/core.py", line 1259, in invoke
return _process_result(sub_ctx.command.invoke(sub_ctx))
File "/home/scott/src/github/OCR-D/ocrd_all/venv/lib/python3.6/site-packages/click/core.py", line 1066, in invoke
return ctx.invoke(self.callback, **ctx.params)
File "/home/scott/src/github/OCR-D/ocrd_all/venv/lib/python3.6/site-packages/click/core.py", line 610, in invoke
return callback(*args, **kwargs)
File "/home/scott/src/github/OCR-D/ocrd_all/venv/lib/python3.6/site-packages/click/decorators.py", line 73, in new_func
return ctx.invoke(f, obj, *args, **kwargs)
File "/home/scott/src/github/OCR-D/ocrd_all/venv/lib/python3.6/site-packages/click/core.py", line 610, in invoke
return callback(*args, **kwargs)
File "/home/scott/src/github/OCR-D/ocrd_all/venv/lib/python3.6/site-packages/ocrd/cli/workspace.py", line 300, in workspace_cli_bulk_add
file_dict[param_name] = file_dict[param_name].replace('{{ %s }}' % group_name, group_dict[group_name])
AttributeError: 'NoneType' object has no attribute 'replace'
I'm sorry, I misspelled the placeholder in the -u parameter. It should be filegrp instead of fileGrp.
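The line shown in the traceback substitutes placeholders with a plain string `replace`, so placeholder names have to match the regex group names exactly, including case. A small sketch of that effect, assuming nothing beyond case-sensitive string replacement (the `fill` helper and the `groups` dictionary are hypothetical stand-ins for what the regex would capture):

```python
def fill(template, groups):
    """Replace each '{{ name }}' placeholder with its captured value (hypothetical helper)."""
    for name, value in groups.items():
        template = template.replace('{{ %s }}' % name, value)
    return template


# Values the regex would capture for the file in this thread.
groups = {'filegrp': 'OCR-D-BIN', 'pageid': 'Folio_073r-Enhanced2x_bin', 'ext': 'png'}

# 'fileGrp' does not match the group name 'filegrp', so it is left untouched.
wrong = fill('{{ fileGrp }}/{{ pageid }}.{{ ext }}', groups)
right = fill('{{ filegrp }}/{{ pageid }}.{{ ext }}', groups)
print(wrong)
print(right)
```

With the misspelled placeholder, the resulting path would still contain the literal text `{{ fileGrp }}` instead of the file group name.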
Thx. I was hoping it was something small like that... Any chance there's an additional such small thing? Edited command and output:
ocrd workspace bulk-add -r '^.*/(?P<filegrp>[^/]+)/(?P<pageid>.*)\.(?P<ext>[^\.]*)$' -G '{{ filegrp }}' -i 'FILE_{{ filegrp }}_{{ pageid }}' -u '{{ filegrp }}/{{ pageid }}.{{ ext }}' OCR-D-BIN/*.png
09:59:58.930 INFO ocrd.cli.workspace.bulk-add - [ 0/1] /mnt/c/Users/Scott/Desktop/Python2/Kells/workspace/OCR-D-BIN/Folio_073r-Enhanced2x_bin.png
Traceback (most recent call last):
File "/home/scott/src/github/OCR-D/ocrd_all/venv/bin/ocrd", line 8, in <module>
sys.exit(cli())
File "/home/scott/src/github/OCR-D/ocrd_all/venv/lib/python3.6/site-packages/click/core.py", line 829, in __call__
return self.main(*args, **kwargs)
File "/home/scott/src/github/OCR-D/ocrd_all/venv/lib/python3.6/site-packages/click/core.py", line 782, in main
rv = self.invoke(ctx)
File "/home/scott/src/github/OCR-D/ocrd_all/venv/lib/python3.6/site-packages/click/core.py", line 1259, in invoke
return _process_result(sub_ctx.command.invoke(sub_ctx))
File "/home/scott/src/github/OCR-D/ocrd_all/venv/lib/python3.6/site-packages/click/core.py", line 1259, in invoke
return _process_result(sub_ctx.command.invoke(sub_ctx))
File "/home/scott/src/github/OCR-D/ocrd_all/venv/lib/python3.6/site-packages/click/core.py", line 1066, in invoke
return ctx.invoke(self.callback, **ctx.params)
File "/home/scott/src/github/OCR-D/ocrd_all/venv/lib/python3.6/site-packages/click/core.py", line 610, in invoke
return callback(*args, **kwargs)
File "/home/scott/src/github/OCR-D/ocrd_all/venv/lib/python3.6/site-packages/click/decorators.py", line 73, in new_func
return ctx.invoke(f, obj, *args, **kwargs)
File "/home/scott/src/github/OCR-D/ocrd_all/venv/lib/python3.6/site-packages/click/core.py", line 610, in invoke
return callback(*args, **kwargs)
File "/home/scott/src/github/OCR-D/ocrd_all/venv/lib/python3.6/site-packages/ocrd/cli/workspace.py", line 300, in workspace_cli_bulk_add
file_dict[param_name] = file_dict[param_name].replace('{{ %s }}' % group_name, group_dict[group_name])
AttributeError: 'NoneType' object has no attribute 'replace'
Sorry again. I think I forgot -g "{{ pageid }}", but I am not sure yet whether that explains the error message.
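A missing -g is at least consistent with the traceback. The failing line calls `.replace` on every template, so a template that was never supplied would fail in exactly this way. The sketch below reproduces that under one assumption: an omitted option leaves its template as `None`. The dictionary keys and the `substitute` function are hypothetical; only the `replace` expression is taken from the traceback.

```python
def substitute(templates, groups):
    """Apply the replace step from the traceback to every template (hypothetical wrapper)."""
    result = dict(templates)
    for param_name in result:
        for group_name in groups:
            result[param_name] = result[param_name].replace(
                '{{ %s }}' % group_name, groups[group_name])
    return result


groups = {'filegrp': 'OCR-D-BIN', 'pageid': 'Folio_073r-Enhanced2x_bin', 'ext': 'png'}

# Assumption: without -g, the page ID template is None.
templates = {
    'file_grp': '{{ filegrp }}',
    'file_id': 'FILE_{{ filegrp }}_{{ pageid }}',
    'page_id': None,
}

try:
    substitute(templates, groups)
    error = None
except AttributeError as exc:
    error = str(exc)
print(error)

# Supplying a page ID template, as -g "{{ pageid }}" would, avoids the error.
fixed = substitute(dict(templates, page_id='{{ pageid }}'), groups)
print(fixed['page_id'])
```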
I believe that worked. No error, at least. Thank you for bearing with me on this!
I am trying to run ocrd-anybaseocr-crop within a workspace that consists of a very large image file and a binarized version of that image added manually. I'm on Windows 10, using WSL and Ubuntu 18.04.
My workflow is below. (Non-commands are in parentheses.)
Once I got to the last command above (crop), after hitting enter, my computer ran for about 4 and a half hours, then gave
The resulting PAGE-XML input file is below.