bertsky / workflow-configuration

a makefilization for OCR-D workflows, with configuration examples
Apache License 2.0

chained workflows #25

Open bertsky opened 1 year ago

bertsky commented 1 year ago

It would help if workflows could be chained at runtime, e.g. `ocrd-make -f pre3.mk -f seg1.mk -f ocr4.mk -f post.mk`, where each makefile consumes the last fileGrp of the previous one, so that each stage can be replaced by an alternative configuration independently of the others. This in turn would allow writing very concise (sub-)configurations without repetition.

As for implementation, make allows passing multiple makefiles and reads them sequentially (w.r.t. first phase, i.e. expansion of immediate variables etc.), then combines them (second phase) and finally computes dependencies.
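The read-time behaviour this relies on can be seen in a small sketch. It assumes GNU make; the file name `demo.mk` and the targets `A`, `B`, `C` are hypothetical, and a single file stands in for two makefiles read back to back.

```make
# demo.mk: stand-in for two makefiles read one after the other (hypothetical names)

# --- "first makefile" ---
INPUT := A
# Prerequisites are expanded immediately, so this rule is recorded as "B: A".
B: $(INPUT)
	@echo "building $@ from $^ (INPUT is now $(INPUT))"

# --- "second makefile" ---
INPUT := B
# INPUT has changed in the meantime, so this rule is recorded as "C: B".
C: $(INPUT)
	@echo "building $@ from $^ (INPUT is now $(INPUT))"

A:
	@echo "building $@"

# Keep the demo independent of any files named A, B or C.
.PHONY: A B C
.DEFAULT_GOAL := C
```

With `make -f demo.mk`, the targets should be built in the order A, B, C. Both recipes report `INPUT` as `B`, because recipes are expanded only in the second phase, whereas each prerequisite list kept the value `INPUT` had when its rule was read.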

So we could by convention (for chainable configurations) allow defining a simply expanded variable (say) `OUTPUT` for the stage's output fileGrp name, and allow using `INPUT` for the stage's dynamic input fileGrp name. Internally then (i.e. in our Makefile that always needs to be included), we predefine `INPUT := $(or $(OUTPUT),$(INPUT))` and `.DEFAULT_GOAL := $(OUTPUT)`. For the very first stage (entry point), we then just have to pass `INPUT`, either in a separate (stage zero) non-rule config file or with an additional cmdline arg.
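A minimal sketch of that convention, assuming GNU make. One file simulates two chained stages; `chain-demo.mk` and the fileGrp names `IMG`, `CROP` and `SEG` are hypothetical, and the last rule is only a stand-in for the real processor rule. The `override` keyword is an addition not present in the proposal: in GNU make a variable set on the command line takes precedence over ordinary assignments in makefiles, so without it an `INPUT=...` argument would stay fixed for all stages.

```make
# chain-demo.mk: simulates `make -f stage1.mk -f stage2.mk` in a single file

# stage zero: entry point (could instead be passed as INPUT=... on the command line)
INPUT ?= IMG

# --- stage 1 ---
CROP: $(INPUT)
OUTPUT := CROP
# shared glue, as it would appear in the always-included Makefile
override INPUT := $(or $(OUTPUT),$(INPUT))
.DEFAULT_GOAL := $(OUTPUT)

# --- stage 2 ---
# INPUT is CROP by now, so this is recorded as "SEG: CROP".
SEG: $(INPUT)
OUTPUT := SEG
# the same glue again, re-evaluated after the stage
override INPUT := $(or $(OUTPUT),$(INPUT))
.DEFAULT_GOAL := $(OUTPUT)

# stand-in recipe: print each target with its prerequisites
IMG CROP SEG:
	@echo "$@ <- $^"
.PHONY: IMG CROP SEG
```

Running `make -f chain-demo.mk` should build `IMG`, `CROP` and `SEG` in that order, with `SEG` as the default goal. The sketch also shows that the glue has to be evaluated after every stage, not just once, for `INPUT` to advance along the chain.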

For example:

* pre3.mk
```make
DESK: BIN
DESK: TOOL = ocrd-cis-ocropy-deskew
DESK: PARAMS = "level-of-operation": "page"

CROP: DESK
CROP: TOOL = ocrd-anybaseocr-crop
CROP: PARAMS = "rulerAreaMax": 0

OUTPUT := CROP
```

* seg1.mk
```make
SEG: $(INPUT)
SEG: TOOL = ocrd-kraken-segment
SEG: PARAMS = "model": "blla.mlmodel"

RESEG: SEG
RESEG: TOOL = ocrd-cis-ocropy-resegment
RESEG: PARAMS = "method": "baseline"

OUTPUT := RESEG
```

* ocr4.mk
```make
OCR1: TOOL = ocrd-tesserocr-recognize
OCR1: OPTIONS += -P model frak2021+deu

OCR2: TOOL = ocrd-calamari-recognize
OCR2: OPTIONS += -P checkpoint_dir qurator-gt4histocr-1.0

OCR3: TOOL = ocrd-kraken-recognize
OCR3: OPTIONS += -P model austriannewspapers.mlmodel

MULTI: OCR1 OCR2 OCR3
MULTI: TOOL = ocrd-cor-asv-ann-align
MULTI: PARAMS = "method": "combined"

OUTPUT := MULTI
```

* post.mk
```make
ALTO: $(INPUT)
ALTO: TOOL = ocrd-fileformat-transform
ALTO: OPTIONS = -P from-to "page alto" -P script-args "--no-check-border --dummy-word"

OUTPUT := ALTO
```

Since this only requires these 2 additional lines and does not break existing makefiles, this is more of a documentation issue actually. (And probably, the old makefiles should be removed or updated or split into multi-stage configurations anyway.)

@mikegerber would that fit your need as well?