krvoigt closed this issue 2 years ago
https://github.com/OCR-D/assets/tree/master/data/kant_aufklaerung_1784 is the most widely used sample project.
I was testing the execution of ocrd process "tesserocr-recognize -P segmentation_level region -P textequiv_level word -P find_tables true -P model Fraktur_GT4HistOCR -I MAX -O OCR-D-OCR"
with an existing METS file. One image seems to be missing, because workspace download
threw a 500 error for that image.
So I decided to process the successfully downloaded files anyway. Then tesserocr-recognize
threw the exception "Not already downloaded, moving on",
apparently tried to download that file itself, got the 500 error again, and then canceled the execution.
My question is, shouldn't the processors just take the workspace and try their best on it without retrying to download the images? Isn't it too much responsibility for a processor?
Since the processors iterate over the pages themselves and run a basic download / process
loop, it's currently the job of the processors to handle this. But you're right, the practice of downloading on the fly and changing the URL of images is problematic. One way to make this work is to remove the offending mets:file
or to explicitly provide the page IDs to process, excluding the one for the problematic file. Also, don't forget to tell the colleagues from the GDZ about the 500.
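For the second workaround (explicit page IDs), here is a minimal sketch of how one could collect the IDs of pages whose image is already local. It is not part of OCR-D; the helper name local_page_ids and the sample METS are made up for illustration, and it assumes that a file which was not downloaded still has an http(s) URL in its mets:FLocat while downloaded files point to a local path.

```python
import xml.etree.ElementTree as ET

NS = {"mets": "http://www.loc.gov/METS/"}
XLINK_HREF = "{http://www.w3.org/1999/xlink}href"

# Hypothetical, heavily shortened METS: page 2 was never downloaded.
SAMPLE_METS = """<mets:mets xmlns:mets="http://www.loc.gov/METS/"
                            xmlns:xlink="http://www.w3.org/1999/xlink">
  <mets:fileSec>
    <mets:fileGrp USE="MAX">
      <mets:file ID="MAX_0001"><mets:FLocat xlink:href="MAX/MAX_0001.tif"/></mets:file>
      <mets:file ID="MAX_0002"><mets:FLocat xlink:href="https://example.org/0002.tif"/></mets:file>
      <mets:file ID="MAX_0003"><mets:FLocat xlink:href="MAX/MAX_0003.tif"/></mets:file>
    </mets:fileGrp>
  </mets:fileSec>
  <mets:structMap TYPE="PHYSICAL">
    <mets:div TYPE="physSequence">
      <mets:div TYPE="page" ID="PHYS_0001"><mets:fptr FILEID="MAX_0001"/></mets:div>
      <mets:div TYPE="page" ID="PHYS_0002"><mets:fptr FILEID="MAX_0002"/></mets:div>
      <mets:div TYPE="page" ID="PHYS_0003"><mets:fptr FILEID="MAX_0003"/></mets:div>
    </mets:div>
  </mets:structMap>
</mets:mets>"""


def local_page_ids(mets_xml, file_grp):
    """Return IDs of pages whose file in `file_grp` has a non-http(s) location."""
    root = ET.fromstring(mets_xml)

    # Collect the IDs of all files in the group that are already local.
    local_files = set()
    for grp in root.iterfind(".//mets:fileGrp", NS):
        if grp.get("USE") != file_grp:
            continue
        for mets_file in grp.iterfind("mets:file", NS):
            flocat = mets_file.find("mets:FLocat", NS)
            href = flocat.get(XLINK_HREF, "") if flocat is not None else ""
            # Assumption: remote (not yet downloaded) files keep an http(s) URL.
            if href and not href.startswith(("http://", "https://")):
                local_files.add(mets_file.get("ID"))

    # Keep the pages of the physical structMap that reference a local file.
    page_ids = []
    for struct_map in root.iterfind("mets:structMap", NS):
        if struct_map.get("TYPE") != "PHYSICAL":
            continue
        for page in struct_map.iterfind(".//mets:div[@TYPE='page']", NS):
            file_ids = {fptr.get("FILEID") for fptr in page.iterfind("mets:fptr", NS)}
            if file_ids & local_files:
                page_ids.append(page.get("ID"))
    return page_ids


if __name__ == "__main__":
    # Comma-separated list, e.g. for a processor's page ID option.
    print(",".join(local_page_ids(SAMPLE_METS, "MAX")))
```

The resulting comma-separated list could then be passed to the processor's page ID option (as far as I know -g / --page-id in the OCR-D CLI), so the page with the broken download is skipped.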
@paulpestov creates an epic about "processors on demand downloading" for further discussion.
For comparing the results, use https://github.com/hnesk/browse-ocrd/
New epic here #58
https://pad.gwdg.de/8g20Q98xQoy-UpO3-zHnJA?view @mweidling testing our workflows
Thank you :)
I transferred the open task to a new ticket (see above) and will close this one.
As an OCR-D developer, I would like to process a test run with example data to get an idea of which parts of the software work well and which areas need improvement.