OCR-D / zenhub

Repo for developing zenhub integration
Apache License 2.0

Benchmarking Spike 1 (Basis for Kwalite Dashboard Workflow Tab) #107

Closed krvoigt closed 2 years ago

krvoigt commented 2 years ago

Current situation

The benchmarking concept is the basis for the Workflow tab of the kwalite dashboard.

Main user story for the workflow tab: As an OCR-D user, I want to quickly check which workflow is most suitable for my data (criteria: publication date, quality, font, layout, number of pages) so that I can select the appropriate workflow, then download it and process my data on my end.

We want to be able to measure and compare different works with different workflows in order to decide which workflows work best for which material. Currently we only have recommendations, but no concrete numbers to back them up. These measured values can also serve as a basis for further development and optimization.

How it should be

As an OCR-D developer, I need to gain an understanding – what is this about, where do we want to go, what do we need? – as a basis for further conception.

Steps

Prior Art

Pad WIP https://pad.gwdg.de/itXgt0gdR-y1kWq8QuCvqg?both

mweidling commented 2 years ago

Concept for benchmarking / data for the workflow tab

Data

In a discussion we identified the following properties of a workspace as crucial: publication date, font, layout, pages.

Metadata for data sets

Next steps for data

Ground Truths

:question: At this point I'm not sure whether we simply use an existing GT or create one ourselves.

Workflows

The main idea of the workflow tab is to enable OCR-D users to identify suitable workflows for their data (where suitability means CER/WER and/or performance of the workflow). Since we have a lot of processors, it's not feasible to perform a simple permutation of all processors for all data sets. A good starting point might be to use the findings and recommendations the KIT had in the second project phase combined with examples obtained from people using OCR-D on a daily basis (Maria?).

The first evaluation of the workflow results could be done with dinglehopper, which is suitable for simple text evaluation.
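To make the CER metric concrete: dinglehopper would do the actual evaluation, but the measure it reports boils down to edit distance over the ground truth length. A minimal, illustrative sketch (not dinglehopper's implementation):

```python
def levenshtein(a: str, b: str) -> int:
    # Classic dynamic-programming edit distance (insertions,
    # deletions, substitutions each cost 1).
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + (ca != cb)))  # substitution
        prev = curr
    return prev[-1]

def cer(gt: str, ocr: str) -> float:
    # Character error rate: edits needed to turn OCR output
    # into the ground truth, relative to GT length.
    return levenshtein(gt, ocr) / len(gt)

print(round(cer("kwalite", "qualite"), 2))  # → 0.29
```

WER works the same way on token lists instead of characters.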

Next steps for workflows

Getting the data relevant for the front end

JSON Output

The dashboard should be fed with JSON containing all relevant information. A first draft of the data looks like this:

[
    {
        "workflow-id": "1",
        "ocrd-workspace": "https://some-url-pointing-to.a/mets.xml",
        "properties":
            {
                "font": "antiqua",
                "date-of-creation": "19. century",
                "no-of-pages": "100",
                "layout": "simple"
            },
        "workflow-metrics": "https://link-to-nextflow-results.com",
        "cer_total": "5.7",
        "cer_per_page": "0.92",
        "time_per_page_in_seconds": "15"        
    }
]
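Given that draft, the front end's filtering could be as simple as matching the user's criteria against `properties`. A hedged sketch (the `matching_workflows` helper and the sample record are illustrative, field names follow the draft above):

```python
import json

# One record in the draft format proposed above.
SAMPLE = '''[
  {"workflow-id": "1",
   "ocrd-workspace": "https://some-url-pointing-to.a/mets.xml",
   "properties": {"font": "antiqua", "date-of-creation": "19th century",
                  "no-of-pages": "100", "layout": "simple"},
   "cer_total": "5.7", "time_per_page_in_seconds": "15"}
]'''

def matching_workflows(records, font=None, layout=None):
    # Hypothetical helper: IDs of workflows whose properties
    # match all of the given filters.
    hits = []
    for rec in records:
        props = rec["properties"]
        if font and props["font"] != font:
            continue
        if layout and props["layout"] != layout:
            continue
        hits.append(rec["workflow-id"])
    return hits

print(matching_workflows(json.loads(SAMPLE), font="antiqua"))  # → ['1']
```

One open design question is whether numeric fields (`cer_total`, `no-of-pages`, …) should be JSON numbers rather than strings, so the front end can sort and compare without casting.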

… and how to get it

In order to get a better understanding of how this is done, I will probably have to have a look at Nextflow and Mehmed's findings first.

mweidling commented 2 years ago

See also Mehmed's work in core: https://github.com/ocr-d/core/issues/883

mweidling commented 2 years ago

It is possible to retrieve a JSON output from Nextflow, see docs. This could possibly be leveraged for our front end to display the performance details of each workflow.
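Assuming the Nextflow run gives us per-process timings in some machine-readable form (the `trace` structure below is a placeholder, not Nextflow's actual report schema), deriving the dashboard's `time_per_page_in_seconds` field could look like:

```python
import json

# Hypothetical per-process timings exported from a Nextflow run;
# this structure is an assumption for illustration only.
trace = [
    {"process": "ocrd-cis-ocropy-binarize", "duration_seconds": 300},
    {"process": "ocrd-tesserocr-recognize", "duration_seconds": 1200},
]
no_of_pages = 100

# Sum the step durations and normalize by page count.
total = sum(step["duration_seconds"] for step in trace)
record = {
    "workflow-metrics": trace,
    "time_per_page_in_seconds": str(total // no_of_pages),
}
print(json.dumps(record["time_per_page_in_seconds"]))  # → "15"
```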

krvoigt commented 2 years ago

see Spike 2: https://github.com/OCR-D/zenhub/issues/123