OCR-D / zenhub

Repo for developing zenhub integration
Apache License 2.0

Benchmarking Spike 1 (Basis for Kwalite Dashboard Workflow Tab) #107

Closed krvoigt closed 2 years ago

krvoigt commented 2 years ago

Current situation

The benchmarking concept is the basis for the Workflow tab of the kwalite dashboard.

Main user story for the workflow tab: As an OCR-D user, I want to quickly check which workflow is most suitable for my data (criteria: publication date, quality, font, layout, number of pages) so that I can select the appropriate workflow, then download it and process my data on my end.

We want to be able to measure and compare different works with different workflows in order to decide which workflows work best for which material. Currently we only have recommendations, but no concrete numbers to back them up. These measured values can also serve as a basis for further development and optimization.

How it should be

As an OCR-D developer, I need to gain an understanding – what is this about, where do we want to go, what do we need? – as a basis for further conception.

Steps

Prior Art

Pad WIP https://pad.gwdg.de/itXgt0gdR-y1kWq8QuCvqg?both

mweidling commented 2 years ago

Concept for benchmarking / data for the workflow tab

Data

In a discussion we identified the following properties of a workspace as crucial: publication date, font, layout, pages.

Metadata for data sets

Next steps for data

Ground Truths

:question: At this point I'm not sure whether we simply use an existing GT or create one ourselves.

Workflows

The main idea of the workflow tab is to enable OCR-D users to identify suitable workflows for their data (where suitability means CER/WER and/or performance of the workflow). Since we have a lot of processors, it's not feasible to perform a simple permutation of all processors for all data sets. A good starting point might be to use the findings and recommendations the KIT had in the second project phase combined with examples obtained from people using OCR-D on a daily basis (Maria?).

The first evaluation of the workflow results could be done with dinglehopper, which is suitable for simple text evaluation.
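To make the CER metric concrete: dinglehopper would do the actual evaluation, but the measure it reports boils down to edit distance over the ground truth length. A minimal, illustrative sketch (not dinglehopper's implementation):

```python
def levenshtein(a: str, b: str) -> int:
    # Classic dynamic-programming edit distance (insertions,
    # deletions, substitutions each cost 1).
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + (ca != cb)))  # substitution
        prev = curr
    return prev[-1]

def cer(gt: str, ocr: str) -> float:
    # Character error rate: edits needed to turn OCR output
    # into the ground truth, relative to GT length.
    return levenshtein(gt, ocr) / len(gt)

print(round(cer("kwalite", "qualite"), 2))  # → 0.29
```

WER works the same way on token lists instead of characters.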

Next steps for workflows

Getting the data relevant for the front end

JSON Output

The dashboard should be fed with JSON containing all relevant information. A first draft of the data looks like this:

[
    {
        "workflow-id": "1",
        "ocrd-workspace": "https://some-url-pointing-to.a/mets.xml",
        "properties":
            {
                "font": "antiqua",
                "date-of-creation": "19. century",
                "no-of-pages": "100",
                "layout": "simple"
            },
        "workflow-metrics": "https://link-to-nextflow-results.com",
        "cer_total": "5.7",
        "cer_per_page": "0.92",
        "time_per_page_in_seconds": "15"        
    }
]
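Given that draft, the front end's filtering could be as simple as matching the user's criteria against `properties`. A hedged sketch (the `matching_workflows` helper and the sample record are illustrative, field names follow the draft above):

```python
import json

# One record in the draft format proposed above.
SAMPLE = '''[
  {"workflow-id": "1",
   "ocrd-workspace": "https://some-url-pointing-to.a/mets.xml",
   "properties": {"font": "antiqua", "date-of-creation": "19th century",
                  "no-of-pages": "100", "layout": "simple"},
   "cer_total": "5.7", "time_per_page_in_seconds": "15"}
]'''

def matching_workflows(records, font=None, layout=None):
    # Hypothetical helper: IDs of workflows whose properties
    # match all of the given filters.
    hits = []
    for rec in records:
        props = rec["properties"]
        if font and props["font"] != font:
            continue
        if layout and props["layout"] != layout:
            continue
        hits.append(rec["workflow-id"])
    return hits

print(matching_workflows(json.loads(SAMPLE), font="antiqua"))  # → ['1']
```

One open design question is whether numeric fields (`cer_total`, `no-of-pages`, …) should be JSON numbers rather than strings, so the front end can sort and compare without casting.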

… and how to get it

In order to get a better understanding of how this is done, I will probably have to have a look at Nextflow and Mehmed's findings first.

mweidling commented 2 years ago

See also Mehmed's work in core: https://github.com/ocr-d/core/issues/883

mweidling commented 2 years ago

It is possible to retrieve a JSON output from Nextflow, see docs. This could possibly be leveraged for our front end to display the performance details of each workflow.
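Assuming the Nextflow run gives us per-process timings in some machine-readable form (the `trace` structure below is a placeholder, not Nextflow's actual report schema), deriving the dashboard's `time_per_page_in_seconds` field could look like:

```python
import json

# Hypothetical per-process timings exported from a Nextflow run;
# this structure is an assumption for illustration only.
trace = [
    {"process": "ocrd-cis-ocropy-binarize", "duration_seconds": 300},
    {"process": "ocrd-tesserocr-recognize", "duration_seconds": 1200},
]
no_of_pages = 100

# Sum the step durations and normalize by page count.
total = sum(step["duration_seconds"] for step in trace)
record = {
    "workflow-metrics": trace,
    "time_per_page_in_seconds": str(total // no_of_pages),
}
print(json.dumps(record["time_per_page_in_seconds"]))  # → "15"
```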

krvoigt commented 2 years ago

see Spike 2: https://github.com/OCR-D/zenhub/issues/123