OCR-D / core

Collection of OCR-related python tools and wrappers from @OCR-D
https://ocr-d.de/core/
Apache License 2.0

Fix workspace handling of local files #342

Closed (wrznr closed this issue 4 years ago)

wrznr commented 5 years ago

As shown in the Lobby, I encounter problems when trying to create a workspace from existing (local) files:

$ ocrd workspace init .
$ ocrd workspace add 00009p.xml -G GT -i 00009_gt -g 00009 -m 'application/vnd.prima.page+xml'
$ ocrd workspace add 00009.tif -G IMG -i 00009_img -g 00009 -m 'image/tiff'

Running ocrd-tesserocr-binarize on this workspace then fails:

$ ocrd-tesserocr-binarize -I GT -O BIN -p '{"operation_level": "line"}'
10:46:23.324 INFO processor.TesserocrBinarize - No output file group for images specified, falling back to 'OCR-D-IMG-BIN'
10:46:23.442 INFO processor.TesserocrBinarize - INPUT FILE 0 / 00009
Traceback (most recent call last):
  File "/home/kmw/Documents/Work/OCR-D/env/lib/python3.6/site-packages/ocrd/workspace.py", line 109, in download_file
    f.url = self.resolver.download_to_directory(self.directory, f.url, subdir=f.fileGrp, basename=basename)
  File "/home/kmw/Documents/Work/OCR-D/env/lib/python3.6/site-packages/ocrd/resolver.py", line 77, in download_to_directory
    raise FileNotFoundError("File path passed as 'url' to download_to_directory does not exist: %s" % url)
FileNotFoundError: File path passed as 'url' to download_to_directory does not exist: 00009.tif

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/home/kmw/Documents/Work/OCR-D/env/bin/ocrd-tesserocr-binarize", line 10, in <module>
    sys.exit(ocrd_tesserocr_binarize())
  File "/home/kmw/Documents/Work/OCR-D/env/lib/python3.6/site-packages/click/core.py", line 764, in __call__
    return self.main(*args, **kwargs)
  File "/home/kmw/Documents/Work/OCR-D/env/lib/python3.6/site-packages/click/core.py", line 717, in main
    rv = self.invoke(ctx)
  File "/home/kmw/Documents/Work/OCR-D/env/lib/python3.6/site-packages/click/core.py", line 956, in invoke
    return ctx.invoke(self.callback, **ctx.params)
  File "/home/kmw/Documents/Work/OCR-D/env/lib/python3.6/site-packages/click/core.py", line 555, in invoke
    return callback(*args, **kwargs)
  File "/home/kmw/Documents/Work/OCR-D/env/lib/python3.6/site-packages/ocrd_tesserocr/cli.py", line 45, in ocrd_tesserocr_binarize
    return ocrd_cli_wrap_processor(TesserocrBinarize, *args, **kwargs)
  File "/home/kmw/Documents/Work/OCR-D/env/lib/python3.6/site-packages/ocrd/decorators.py", line 66, in ocrd_cli_wrap_processor
    run_processor(processorClass, ocrd_tool, mets, workspace=workspace, **kwargs)
  File "/home/kmw/Documents/Work/OCR-D/env/lib/python3.6/site-packages/ocrd/processor/base.py", line 56, in run_processor
    processor.process()
  File "/home/kmw/Documents/Work/OCR-D/env/lib/python3.6/site-packages/ocrd_tesserocr/binarize.py", line 82, in process
    page, page_id)
  File "/home/kmw/Documents/Work/OCR-D/env/lib/python3.6/site-packages/ocrd/workspace.py", line 320, in image_from_page
    page_image = self._resolve_image_as_pil(page.imageFilename)
  File "/home/kmw/Documents/Work/OCR-D/env/lib/python3.6/site-packages/ocrd/workspace.py", line 237, in _resolve_image_as_pil
    image_filename = self.download_file(f).local_filename
  File "/home/kmw/Documents/Work/OCR-D/env/lib/python3.6/site-packages/ocrd/workspace.py", line 112, in download_file
    raise Exception("No baseurl defined by workspace. Cannot retrieve '%s'" % f.url)
Exception: No baseurl defined by workspace. Cannot retrieve '00009.tif'

In addition, a file 00009_gt.xml is created in the GT directory.
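
The log above contains two tracebacks joined by "During handling of the above exception, another exception occurred". Python prints this when a new exception is raised inside an `except` block. The following is a minimal sketch of that pattern, not the actual ocrd code: both function bodies and the `baseurl` parameter are hypothetical stand-ins, and only the two error messages are taken from the log.

```python
def download_to_directory(url):
    # Hypothetical stand-in: always fails, like the missing local path in the report.
    raise FileNotFoundError(
        "File path passed as 'url' to download_to_directory does not exist: %s" % url
    )


def download_file(url, baseurl=None):
    # Hypothetical stand-in for the caller that catches the first error.
    try:
        return download_to_directory(url)
    except FileNotFoundError:
        if baseurl is None:
            # Raising inside an except block makes Python record the
            # FileNotFoundError as __context__ of the new exception,
            # which is why both tracebacks are printed.
            raise Exception(
                "No baseurl defined by workspace. Cannot retrieve '%s'" % url
            )
        raise


try:
    download_file("00009.tif")
except Exception as exc:
    chained = exc  # keep a reference for inspection after the except block
```

Here `chained.__context__` holds the original `FileNotFoundError`, so the second message ("No baseurl defined") is a follow-up error and the first traceback points at the underlying failure.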

00009.zip

kba commented 5 years ago

The problem remains that the input image should not be copied at all; this needs investigating.

kba commented 4 years ago

This is another flaw in the logic of download_to_directory: it SHOULD recognize that the source files are already in the workspace, but it does not, which leads to copies of all input files.
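
One way to express the check described above is to resolve both paths and test whether the source already lies under the workspace directory before copying anything. This is a rough sketch under that assumption, not the fix that went into ocrd: `is_inside` and `copy_unless_in_workspace` are hypothetical helper names, and only standard-library calls are used.

```python
import shutil
from pathlib import Path


def is_inside(directory, path):
    """Return True if `path` is `directory` itself or lies somewhere below it."""
    directory = Path(directory).resolve()
    path = Path(path).resolve()
    return path == directory or directory in path.parents


def copy_unless_in_workspace(workspace_dir, src, subdir):
    """Copy `src` into workspace_dir/subdir only if it is not already in the workspace.

    Returns the path that should be referenced afterwards.
    """
    src = Path(src).resolve()
    if is_inside(workspace_dir, src):
        # Already part of the workspace: reuse the file, do not duplicate it.
        return src
    target_dir = Path(workspace_dir).resolve() / subdir
    target_dir.mkdir(parents=True, exist_ok=True)
    target = target_dir / src.name
    shutil.copy2(src, target)
    return target
```

Resolving both paths first matters because a relative name such as `00009.tif` or a symlinked directory would otherwise defeat a plain string-prefix comparison.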