OCR-D / spec

Specification of the @OCR-D technical architecture, interface definitions and data exchange format(s)
https://ocr-d.de/en/spec/

start describing workflow format #208

Closed kba closed 2 years ago

MehmedGIT commented 2 years ago

My initial idea for the workflow format is as follows:

entry-point: INPUT-FOLDER

ocrd: processor name 1
   - parameters: {}

ocrd: processor name 2
   - parameters: {}

...

ocrd: processor name N
   - parameters: {}

So, here is a brief example of an OCR-D workflow:

entry-point: OCR-D-IMG

ocrd: cis-ocropy-binarize
   - parameters: {}

ocrd: anybaseocr-crop
   - parameters: {}

...

ocrd: tesserocr-recognize 
   - parameters: {
     "textequiv_level": "glyph",
     "overwrite_segments": "true",
     "model": "GT4HistOCR_50000000.997_191951"
     }

Notes:

  1. The execution order of the processors matches the description order in the workflow file.
  2. We provide default OUTPUT folder names based on the processors used, so only the initial input folder is needed. If users prefer their own output folder names for specific processors, we can still allow that with an additional processor option such as `- output: OUTPUT-FOLDER`.
  3. The INPUT-OUTPUT is matched automatically depending on the defaults, or replaced by the output options (if provided).
  4. Based on the needs we could further extend the list of available options.
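
As a rough illustration of notes 1 to 3, the input/output matching could be sketched in Python as below. This is only a sketch under assumptions, not part of the proposal: the `Step` class, the `resolve_folders` helper, and the `default_output` naming rule (upper-cased processor name with an `OCR-D-` prefix) are hypothetical, since the proposal only says that defaults are derived from the processor used.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Step:
    processor: str
    parameters: dict = field(default_factory=dict)
    output: Optional[str] = None  # optional user-supplied output folder (note 2)


def default_output(processor: str) -> str:
    # Hypothetical naming rule; the proposal does not fix the exact scheme.
    return "OCR-D-" + processor.upper()


def resolve_folders(entry_point: str, steps: list) -> list:
    """Return (processor, input folder, output folder) triples in execution order."""
    resolved = []
    current_input = entry_point
    for step in steps:  # note 1: execution order equals description order
        out = step.output or default_output(step.processor)
        resolved.append((step.processor, current_input, out))
        current_input = out  # note 3: the output of one step feeds the next
    return resolved


steps = [
    Step("cis-ocropy-binarize"),
    Step("anybaseocr-crop", output="OCR-D-CROP"),
    Step("tesserocr-recognize", parameters={"textequiv_level": "glyph"}),
]
for processor, in_dir, out_dir in resolve_folders("OCR-D-IMG", steps):
    print(f"{processor}: {in_dir} -> {out_dir}")
```

Only the second step overrides its output folder here; the other two fall back to the assumed default rule.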

Of course, a more advanced variation is also possible, for example with separate blocks for different processing categories. However, the more complex the format gets, the harder it becomes for the general user. Consider the following format:

entry-point: INPUT-FOLDER

step1: image-optimization-page-level
   - ocrd: processor name 1
      - parameters: {}

   - ocrd: processor name 2
      - parameters: {}

step2: layout-analysis
   - ocrd: processor name 1
      - parameters: {}

... 

stepN: text-recognition
   - ocrd: processor name 1
      - parameters: {}

...  

stepN+M: generic-data-management
   - archiving:
      - ocrd: processor name 1 (e.g., olahd-client)

Notes:

  1. Based on the step levels we could provide different options (as required).

tdoan2010 commented 2 years ago

Why did you come up with a new workflow format? We have already agreed to use the Nextflow syntax. Users will submit the whole Nextflow file (.nf) to the Workflow Server for execution.

Or do you have any objection to that approach?

MehmedGIT commented 2 years ago

In our last meeting on Friday, I mentioned providing a simpler workflow format that I could then translate to a Nextflow script. There were no objections, so I thought we were going to offer two ways of providing workflows:

  1. A basic workflow format, for basic workflows.
  2. A Nextflow script file (.nf), for more advanced workflows.

Update after today's planning: Since we agreed to use the Nextflow format in general, providing a basic workflow format is not required.

Example Nextflow workflows are available here.

Soon, I will provide Nextflow scripts for the recommended OCR-D workflows here.

kba commented 2 years ago

Questions the documentation should answer (not exhaustive):

lena-hinrichsen commented 2 years ago

Should https://github.com/OCR-D/zenhub/issues/72 (Proof of concept) be part of the documentation here or should that be somewhere else?

tdoan2010 commented 2 years ago

I think it should be part of the Reference Implementation, https://github.com/OCR-D/zenhub/issues/100

MehmedGIT commented 2 years ago

This post contains brief answers to the initially requested questions on Nextflow (NF). I will regularly update this post as more feedback comes in, and extend the answers as further questions or needs arise. Where a topic requires more detail, I reserve the right to answer with a documentation link. PS: I am not a Nextflow expert. I have only been using Nextflow for a few months and am still learning.

1. What is NF and why did we choose it?

Nextflow is a workflow framework that allows the integration of various scripting languages into a single cohesive pipeline. Nextflow also has its own Domain Specific Language (DSL) that extends Groovy, a language that runs on the Java Virtual Machine.

We chose it for its rich set of features:

2. How is the NF script structured?

The NF script contains the following structures:

2.1 DSL and Parameters

2.2 Definition of processes

2.3 Definition of workflows

2.4 Main workflow

Check this source code example: seq_ocrd_wf_many.nf

TODO: I will provide more structure-related details here based on the example above.

3. Which features of NF do we use, i.e. what features have to be implemented in potential implementations?

The minimum set of features used for local runs is: parameters, processes, process decorators (called directives in the Nextflow documentation), and workflows. I will provide further answers to any follow-up questions related to this main question. I am not sure what else to cover here for now.

4. How does parallelization work, both within works and across works?

A Nextflow workflow script contains several processes. Processes are executed independently and are isolated from each other (i.e. they do not have a shared memory space). Communication between the processes is possible only through data channels (similar to the pipes model in Unix). These channels are basically asynchronous FIFO queues. Any process can define one or more channels as input and output. The order of interaction between these processes, and ultimately the order of workflow execution depends on the communication channel dependencies between processes. For example, if process A writes data to channel A and process B reads data from channel A, then Nextflow knows that process A must be executed before process B.
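
The channel model described above can be imitated in plain Python to show how the ordering emerges from the data dependency alone. This is an analogy only, not Nextflow code: `process_a`, `process_b`, and `channel_a` are hypothetical names, and Python threads share memory, whereas Nextflow processes are isolated from each other.

```python
import queue
import threading

DONE = object()  # sentinel marking the end of the channel


def process_a(out_channel):
    # Writes its results to channel A, one item per page.
    for page in ["page1", "page2", "page3"]:
        out_channel.put(f"{page}:binarized")
    out_channel.put(DONE)


def process_b(in_channel, results):
    # Blocks until process A has produced an item, then consumes it.
    while True:
        item = in_channel.get()
        if item is DONE:
            break
        results.append(f"{item}:recognized")


channel_a = queue.Queue()  # FIFO queue standing in for a channel
results = []

# Process B is started first on purpose: it still has to wait for
# process A, because its only input is channel A.
workers = [
    threading.Thread(target=process_b, args=(channel_a, results)),
    threading.Thread(target=process_a, args=(channel_a,)),
]
for worker in workers:
    worker.start()
for worker in workers:
    worker.join()

print(results)
```

Because the queue is FIFO and has a single consumer, the items reach process B in the order process A wrote them, regardless of which worker was started first.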

Check this source code example: seq_ocrd_wf_many.nf

TODO: I will provide more parallelization details here based on the example above.

5. How does the NF script interact with the processing server?

There is still no running processing server. More details will be announced once there is more to talk about. The interaction will most probably happen with curl through a bash script inside the Nextflow process. Of course, if it is integrated inside the OCR-D core, then no direct interactions will be needed from inside the Nextflow script.

6. How does the NF script interact with the METS server?

There is still no running METS server. More details will be announced once there is more to talk about. The interaction will most probably happen with curl through a bash script inside the Nextflow process. Of course, if it is integrated inside the OCR-D core, then no direct interactions will be needed from inside the Nextflow script.

7. How to convert the existing OCR-D process workflows we reference to NF?

I have written an OtoN (OCR-D to Nextflow) converter which converts basic OCR-D process workflows to Nextflow workflow scripts. Check here: OtoN
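
To give an idea of the kind of input such a conversion starts from, here is a minimal Python sketch that parses a single `ocrd process` task string into its parts and rebuilds the standalone processor call that a workflow step would run. It is not OtoN's actual implementation: `parse_task` and `to_command` are hypothetical helpers, the folder names in the example are made up, and only the `-I`, `-O`, `-P`, and `-p` options with an inline JSON value are handled.

```python
import json
import shlex


def parse_task(task: str) -> dict:
    """Split one task string into processor, folders, and parameters."""
    tokens = shlex.split(task)
    step = {"processor": tokens[0], "input": None, "output": None, "parameters": {}}
    i = 1
    while i < len(tokens):
        flag = tokens[i]
        if flag == "-I":
            step["input"] = tokens[i + 1]
            i += 2
        elif flag == "-O":
            step["output"] = tokens[i + 1]
            i += 2
        elif flag == "-P":  # single key/value override, kept as a string here
            step["parameters"][tokens[i + 1]] = tokens[i + 2]
            i += 3
        elif flag == "-p":  # inline JSON object (assumption: not a file path)
            step["parameters"].update(json.loads(tokens[i + 1]))
            i += 2
        else:
            raise ValueError(f"unsupported option: {flag}")
    return step


def to_command(step: dict) -> str:
    """Rebuild the standalone processor call for one parsed step."""
    cmd = [f"ocrd-{step['processor']}", "-I", step["input"], "-O", step["output"]]
    if step["parameters"]:
        cmd += ["-p", json.dumps(step["parameters"])]
    return shlex.join(cmd)


# Hypothetical folder names, for illustration only.
step = parse_task("tesserocr-recognize -I OCR-D-SEG -O OCR-D-OCR -P textequiv_level glyph")
print(step)
print(to_command(step))
```

A real converter would additionally have to validate the folder chain between tasks and wrap each command in a Nextflow process, which this sketch leaves out.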

Currently, there are no known issues or bugs. Feel free to report any bugs, errors, or lack of errors (when an error is expected). The tool will probably be a part of the OCR-D software in the future when it is stable enough for general use.

8. How should NF scripts be written, tested, deployed, and evaluated?

That depends on the use case. Detailed instructions for local executions and example Nextflow workflow scripts can be found here: Nextflow

I will provide further answers to any follow-up questions related to this main question.

9. What conventions do we encourage, naming, structure, documentation, etc.?

Try to stick to the structure described in point 2 when writing Nextflow scripts. You can also check the Nextflow examples referenced in point 8. Variables, functions, processes, and workflows should be named in snake case. I will provide further answers to any follow-up questions related to this main question.

kba commented 2 years ago

Superseded by #216