OCR-D / core

Collection of OCR-related python tools and wrappers from @OCR-D
https://ocr-d.de/core/
Apache License 2.0

bashlib input_files: ensure download_file (as in all Pythonic processors) #1216

Closed (bertsky closed 2 months ago)

bertsky commented 2 months ago

All of our processors written in Python call Workspace.download_file(input_file) in their processing loop. This ensures the file is available locally, even if it was still a URL (saving it under a reproducible temporary path).
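To illustrate what "available locally under a reproducible path" means, here is a minimal standalone sketch. It does not use OCR-D and is not how Workspace.download_file is implemented; the helper name ensure_local, the cache directory and the hash-based file naming are all assumptions made up for this example.

```python
import hashlib
import shutil
import tempfile
import urllib.request
from pathlib import Path
from urllib.parse import urlparse


def ensure_local(location: str, cache_dir: Path) -> Path:
    """Return a local path for `location`, fetching it first if it is a URL.

    Hypothetical helper: the target name is derived from a hash of the URL,
    so the same URL always maps to the same local path (an assumption for
    this sketch, not OCR-D's actual naming scheme).
    """
    local = Path(location)
    if local.exists():
        # Already on disk: nothing to do.
        return local
    digest = hashlib.sha256(location.encode("utf-8")).hexdigest()[:16]
    suffix = Path(urlparse(location).path).suffix
    target = cache_dir / f"{digest}{suffix}"
    if not target.exists():
        cache_dir.mkdir(parents=True, exist_ok=True)
        # urlopen raises for anything that is neither a file nor a valid URL.
        with urllib.request.urlopen(location) as response, open(target, "wb") as out:
            shutil.copyfileobj(response, out)
    return target


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        source = Path(tmp) / "page.jpg"
        source.write_bytes(b"placeholder bytes, not a real image")
        # A file:// URL stands in for a remote location here.
        cached = ensure_local(source.as_uri(), Path(tmp) / "cache")
        print(cached)
```

Calling the helper twice with the same URL returns the same path and skips the second fetch, which is the property the processing loop relies on.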

Unfortunately, our bashlib processors cannot get that behaviour: ocrd workspace find --download would inevitably persist the downloaded file, which is perhaps not entirely wrong, but differs from what the Python processors do. Regardless, it's not what we do in ocrd_olena, ocrd_pagetopdf, ocrd_fileformat, ocrd_im6convert etc.

Hence, if the input fileGrp is entirely remote, we only get messages like this:

ERROR ocrd.ocrd-olena-binarize - input file ID=FILE_0024_DEFAULT (pageId=PHYS_0024 MIME=image/jpg) is not on disk

The result is a nominally successful run without an actual output fileGrp:

Exception: Invalid state: expected output file group 'OCR-D-BIN' not in METS (despite processor success)

Now, the solution I propose is simple: have ocrd bashlib input-files (which does have access to Workspace.download_file(input_file)) do the job!
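A rough sketch of the idea, under stated assumptions: the step that enumerates input files resolves each one to a local copy before handing it on, so the shell side never sees a file that is "not on disk". The InputFile record, the with_local_copies function and the injected download callable are hypothetical names for illustration; they are not the actual ocrd bashlib input-files code or its output format, and the example.org URL is a placeholder.

```python
import tempfile
from dataclasses import dataclass, replace
from pathlib import Path
from typing import Callable, Iterable, Iterator, Optional


@dataclass(frozen=True)
class InputFile:
    """Hypothetical stand-in for one METS file entry."""
    file_id: str
    page_id: str
    url: str
    local_filename: Optional[str] = None


def with_local_copies(
    files: Iterable[InputFile],
    download: Callable[[str], Path],
) -> Iterator[InputFile]:
    """Yield every input file with a usable local_filename.

    Files already on disk pass through unchanged; all others are fetched
    via `download` (in OCR-D this role would be played by
    Workspace.download_file, which is only assumed here, not called).
    """
    for f in files:
        if f.local_filename and Path(f.local_filename).exists():
            yield f
        else:
            yield replace(f, local_filename=str(download(f.url)))


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        def fake_download(url: str) -> Path:
            # Pretend to fetch: write a placeholder next to the workspace.
            target = Path(tmp) / url.rsplit("/", 1)[-1]
            target.write_bytes(b"placeholder")
            return target

        remote = InputFile("FILE_0024_DEFAULT", "PHYS_0024",
                           "https://example.org/0024.jpg")
        for f in with_local_copies([remote], fake_download):
            print(f.file_id, f.local_filename)
```

Passing the downloader in as a parameter keeps the sketch independent of any workspace object and makes the "download only when missing" rule easy to check in isolation.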