Test run with example data

OCR-D / zenhub

Repo for developing zenhub integration

Apache License 2.0

Test run with example data #43

Closed krvoigt closed 2 years ago

krvoigt commented 2 years ago

As an OCR-D developer, I would like to do a test run with example data to get an idea of which parts of the software work well and which areas need improvement.

[x] set up the workspace (do problems already arise in the workspace, or only during processing?)
[x] good example testing data
[x] document the testing to enable other developers to reuse it (maybe as issue in assets)

kba commented 2 years ago

extend https://github.com/OCR-D/assets

kba commented 2 years ago

https://github.com/OCR-D/assets/tree/master/data/kant_aufklaerung_1784 is the most widely used sample project.

paulpestov commented 2 years ago

I tested running ocrd process "tesserocr-recognize -P segmentation_level region -P textequiv_level word -P find_tables true -P model Fraktur_GT4HistOCR -I MAX -O OCR-D-OCR" against an existing METS file. One image seems to be missing: workspace download returned a 500 error for it. I decided to process the successfully downloaded files anyway. tesserocr-recognize then raised the exception "Not already downloaded, moving on", apparently tried to download that file itself, got the 500 error again, and aborted the run.

My question is, shouldn't the processors just take the workspace and try their best on it without retrying to download the images? Isn't it too much responsibility for a processor?

kba commented 2 years ago

> My question is, shouldn't the processors just take the workspace and try their best on it without retrying to download the images? Isn't it too much responsibility for a processor?

Since the processors iterate over the pages themselves and run a basic download / process loop, it's currently the job of the processors to handle this. But you're right, the practice of downloading on the fly and changing the URL of images is problematic. One way to make this work is to remove the offending mets:file, or to explicitly provide the page IDs to process, excluding the one for the problematic file. Also, don't forget to tell the colleagues from the GDZ about the 500.
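Both workarounds can be sketched with nothing but the Python standard library. This is a rough illustration, not OCR-D tooling: the embedded sample METS, the file and page IDs, and the helper names `page_ids_without` and `remove_file` are all made up for the example, and it assumes the usual layout of `mets:file` entries in a `mets:fileGrp` that are referenced by `mets:fptr` elements inside the physical `mets:structMap`.

```python
import xml.etree.ElementTree as ET

METS_NS = "http://www.loc.gov/METS/"
NS = {"mets": METS_NS}

# Hypothetical, heavily reduced METS document used only for illustration.
SAMPLE_METS = """<mets:mets xmlns:mets="http://www.loc.gov/METS/"
                          xmlns:xlink="http://www.w3.org/1999/xlink">
  <mets:fileSec>
    <mets:fileGrp USE="MAX">
      <mets:file ID="FILE_0001_MAX"><mets:FLocat xlink:href="MAX/0001.tif"/></mets:file>
      <mets:file ID="FILE_0002_MAX"><mets:FLocat xlink:href="MAX/0002.tif"/></mets:file>
      <mets:file ID="FILE_0003_MAX"><mets:FLocat xlink:href="MAX/0003.tif"/></mets:file>
    </mets:fileGrp>
  </mets:fileSec>
  <mets:structMap TYPE="PHYSICAL">
    <mets:div TYPE="physSequence">
      <mets:div TYPE="page" ID="PHYS_0001"><mets:fptr FILEID="FILE_0001_MAX"/></mets:div>
      <mets:div TYPE="page" ID="PHYS_0002"><mets:fptr FILEID="FILE_0002_MAX"/></mets:div>
      <mets:div TYPE="page" ID="PHYS_0003"><mets:fptr FILEID="FILE_0003_MAX"/></mets:div>
    </mets:div>
  </mets:structMap>
</mets:mets>"""


def page_ids_without(root, bad_file_id):
    """Return the IDs of all physical pages that do not reference bad_file_id."""
    keep = []
    for struct_map in root.iterfind(".//mets:structMap[@TYPE='PHYSICAL']", NS):
        for div in struct_map.iter(f"{{{METS_NS}}}div"):
            if div.get("TYPE") != "page":
                continue
            file_ids = {fptr.get("FILEID") for fptr in div.iterfind("mets:fptr", NS)}
            if bad_file_id not in file_ids:
                keep.append(div.get("ID"))
    return keep


def remove_file(root, bad_file_id):
    """Delete the mets:file with the given ID and every mets:fptr pointing to it.

    Returns the number of elements removed.
    """
    removed = 0
    # Snapshot the element list first so removal does not disturb iteration.
    for parent in list(root.iter()):
        for child in list(parent):
            is_file = child.tag == f"{{{METS_NS}}}file" and child.get("ID") == bad_file_id
            is_fptr = child.tag == f"{{{METS_NS}}}fptr" and child.get("FILEID") == bad_file_id
            if is_file or is_fptr:
                parent.remove(child)
                removed += 1
    return removed


if __name__ == "__main__":
    root = ET.fromstring(SAMPLE_METS)
    # Workaround 1: comma-separated page IDs that skip the problematic file.
    print(",".join(page_ids_without(root, "FILE_0002_MAX")))
    # Workaround 2: drop the offending mets:file and its references.
    print(remove_file(root, "FILE_0002_MAX"), "elements removed")
```

The comma-separated list from the first helper is meant to be handed to the processor as its page ID selection; the second helper leaves an empty page `mets:div` behind, which may or may not be acceptable depending on the workflow.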

krvoigt commented 2 years ago

@paulpestov will create an epic about "processors on demand downloading" for further discussion.

krvoigt commented 2 years ago

To compare the results, use https://github.com/hnesk/browse-ocrd/

paulpestov commented 2 years ago

New epic here #58

kba commented 2 years ago

https://pad.gwdg.de/8g20Q98xQoy-UpO3-zHnJA?view @mweidling testing our workflows

mweidling commented 2 years ago

Thank you :)

mweidling commented 2 years ago

I transferred the open task to a new ticket (see above) and will close this one.