OCR-D / zenhub

Repo for developing zenhub integration
Apache License 2.0

Common understanding about the workflow requirements #48

Closed krvoigt closed 2 years ago

krvoigt commented 2 years ago

As a team, we need a summary of the past discussion and the resulting requirements so that we can discuss them together and generate our next steps from them.

kba commented 2 years ago
cneud commented 2 years ago

summary of the past discussion

The aim for a workflow component for OCR-D goes back to the initial phase I of the project. I'll try to summarize the high level discussions so far.

A strength of OCR-D lies in its modularity and flexibility in the use of Processors, with often more than one Processor being available for a given task (or step). Experience, testing and user expectations have shown that, due to the complex and very diverse nature of the documents to be processed, often only tailor-made workflows can guarantee optimal result quality.

The OCR-D functional model shows that OCR is always composed of a sequence of such steps, with the condition that the output (image, text or metadata) of each Processor must also be made available for subsequent Processors to make use of (theoretically, loops are also conceivable).
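This chaining constraint can be illustrated with a minimal sketch: a workflow is a sequence of steps, and every step's output (stored under its file group) remains available to all subsequent steps. Names and the workspace representation here are purely illustrative, not the actual OCR-D core API.

```python
# Illustrative sketch of the functional model (not the OCR-D core API):
# a workflow is a sequence of steps whose outputs stay available in a
# shared workspace for all subsequent steps.

def run_workflow(steps, workspace):
    for name, input_grp, output_grp, func in steps:
        if input_grp not in workspace:
            raise ValueError(f"{name}: missing input group {input_grp!r}")
        # outputs are kept in the workspace so later steps can use them
        workspace[output_grp] = func(workspace[input_grp])
    return workspace

steps = [
    ("binarize",  "OCR-D-IMG", "OCR-D-BIN", lambda img: f"bin({img})"),
    ("segment",   "OCR-D-BIN", "OCR-D-SEG", lambda bin_: f"seg({bin_})"),
    ("recognize", "OCR-D-SEG", "OCR-D-OCR", lambda seg: f"ocr({seg})"),
]
ws = run_workflow(steps, {"OCR-D-IMG": "page1.tif"})
print(ws["OCR-D-OCR"])  # → ocr(seg(bin(page1.tif)))
```

Note that intermediate file groups (OCR-D-BIN, OCR-D-SEG) remain in the workspace afterwards, which is exactly what would make loops or later reuse possible.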

In practice, workflows are often specific to an institution or project, composed of ad-hoc solutions or local scripts and lacking standardized interfaces, descriptions and documentation. This greatly hinders reuse, comparability and replicability.

So far we can summarize the following needs, or rather benefits, of a standardized workflow management component:

A bit further down the line, standardized workflows could also be used for discovery (i.e. finding an appropriate workflow for a set of documents), or to dynamically compose sequences of Processors merely based on a workflow description.

Other aspects that could greatly benefit from standardized workflows are scalability/parallelization (e.g. while some steps require more time to complete, other steps could potentially already be started and run in parallel to reduce the overall execution time) and error handling/robustness (i.e. the ability to overcome exceptions or crashes of individual Processors without stopping the overall workflow execution).
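Both aspects can be sketched together in a few lines: pages of one step run in parallel, and a crash of an individual Processor on one page is recorded without aborting the rest of the workflow. This is an illustrative toy model, not the actual workflow server implementation.

```python
# Toy sketch of per-page parallelism with per-step error tolerance
# (illustrative only; not the OCR-D workflow server implementation).
from concurrent.futures import ThreadPoolExecutor

def run_step(step, page):
    if page in step.get("fails_on", ()):  # simulate a processor crash
        raise RuntimeError(f"{step['name']} crashed on {page}")
    return f"{step['name']}({page})"

def run_workflow(steps, pages):
    results, errors = {}, []
    with ThreadPoolExecutor() as pool:
        for step in steps:  # steps stay sequential, pages run in parallel
            futures = {page: pool.submit(run_step, step, page) for page in pages}
            for page, fut in futures.items():
                try:
                    results[(step["name"], page)] = fut.result()
                except RuntimeError as err:
                    errors.append(str(err))  # record, but keep going

    return results, errors

steps = [{"name": "binarize"}, {"name": "recognize", "fails_on": {"p2"}}]
results, errors = run_workflow(steps, ["p1", "p2"])
print(errors)  # → ['recognize crashed on p2']
```

A real implementation would of course also have to respect data dependencies between steps before parallelizing, which the toy model ignores.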

A first step towards standardized workflows was made with

Old WF repo: https://github.com/OCR-D/ocrd-workflows

which basically encapsulates workflows as a simple sequence of Processors in POSIX shell scripts. This has a few drawbacks, e.g. with regard to the validation of the sequence and the data exchange between Processors (see also the example and explanation in the wiki on workflows), or the capture of global provenance information.
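The validation drawback can be made concrete: in a shell script the `-I`/`-O` file groups are just argument strings, so any checker has to parse each line itself before it can verify the sequence. The script below is modelled on the old ocrd-workflows style (the processor names are real OCR-D CLIs); the checker itself is an illustrative sketch.

```python
# Sketch of why shell-script workflows are hard to validate: the -I/-O
# file groups are plain argument strings that an external checker must
# parse line by line. Script modelled on the old ocrd-workflows style;
# the checker is illustrative, not part of OCR-D core.
import shlex

SCRIPT = """
ocrd-olena-binarize -m mets.xml -I OCR-D-IMG -O OCR-D-BIN
ocrd-tesserocr-segment-region -m mets.xml -I OCR-D-BIN -O OCR-D-SEG
ocrd-tesserocr-recognize -m mets.xml -I OCR-D-SEG -O OCR-D-OCR
"""

def check(script, initial="OCR-D-IMG"):
    """Report every step whose input group was never produced before it."""
    produced = {initial}
    problems = []
    for line in filter(None, map(str.strip, script.splitlines())):
        args = shlex.split(line)
        in_grp = args[args.index("-I") + 1]
        out_grp = args[args.index("-O") + 1]
        if in_grp not in produced:
            problems.append(f"{args[0]}: unknown input group {in_grp}")
        produced.add(out_grp)
    return problems

print(check(SCRIPT))  # → []
```

A declarative workflow format makes such checks trivial by stating the file groups explicitly, instead of recovering them from command lines.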

Old WF format: https://github.com/OCR-D/spec/pull/171

tries to specify a standardized format for such workflows and also contains a good discussion of the benefits/drawbacks of this approach, while also touching on alternatives like CWL. [Note: while Taverna/SCUFL2 was initially also a candidate, we eventually discarded it after it was retired as an Apache Incubator project in early 2020.]

WF Server: https://github.com/OCR-D/core/pull/652

is an implementation of a workflow server based on https://github.com/OCR-D/spec/pull/171 that already aims to overcome some of the problems with parallelization and error recovery which are well explained in

RFC: Preloading OCR-D Processors: https://hackmd.io/23-JzLp_Q96cb6T0ttoFIA
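The core idea behind preloading can be sketched independently of the RFC's actual mechanism: instead of paying processor startup (imports, model loading) once per page, a resident worker initializes once and then serves many pages. Timings and class names here are purely illustrative.

```python
# Toy sketch of the preloading idea (illustrative; not the RFC's actual
# mechanism): amortize one-time initialization cost over many pages.
import time

STARTUP_COST = 0.05  # pretend model loading takes this long (seconds)

class Processor:
    def __init__(self):
        time.sleep(STARTUP_COST)  # one-time initialization
    def process(self, page):
        return f"ocr({page})"

def run_cold(pages):
    # naive CLI style: re-initialize the processor for every page
    return [Processor().process(p) for p in pages]

def run_preloaded(pages):
    proc = Processor()  # initialize once, reuse for all pages
    return [proc.process(p) for p in pages]

pages = [f"p{i}" for i in range(5)]
t0 = time.perf_counter(); cold_out = run_cold(pages); cold = time.perf_counter() - t0
t0 = time.perf_counter(); warm_out = run_preloaded(pages); warm = time.perf_counter() - t0
print(cold > warm)  # → True
```

With real Processors the startup cost (loading neural models) can dwarf the per-page work, which is what makes preloading attractive for server-side execution.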

Last but not least, another relevant piece of software in this context is

Ocrd Controller: https://github.com/bertsky/ocrd_controller

which provides a network interface (via SSH) for executing Processors based on a local installation of ocrd_all, and which could be exposed to third-party software such as Kitodo.

krvoigt commented 2 years ago

We need concrete requirements. There is a difference between workflow requirements and the requirements of the Processors. We need to provide an example workflow and ask IMPL for agreement. Triet will take one OCR-D example and prepare a CWL for Nextflow.