OCR-D / zenhub

Repo for developing zenhub integration
Apache License 2.0

Create corpora for benchmarking #130

Open mweidling opened 2 years ago

mweidling commented 2 years ago

To run the benchmarking, we need data with different characteristics to work on. @mweidling has already examined the OCR-D GT repository and wants to discuss useful corpora with @tboenig and @cneud.

TODOs:

mweidling commented 2 years ago

This is a first naive overview of my GT categorization: gt_overview.ods.

If this isn't much use, I'll take a deeper look.

mweidling commented 2 years ago

Here is a second, reviewed version of the sheet:

gt_overview.ods

EDIT: Replaced all instances of schwabacher with fraktur (20.09.22).

mweidling commented 2 years ago

First draft for the corpora

General thoughts

Categories

16th century, fraktur, simple layout

16th century, fraktur, complex layout

16th century, antiqua, simple layout

16th century, antiqua, complex layout

16th century, font mix, simple layout

16th century, font mix, complex layout


17th century, fraktur, simple layout

17th century, fraktur, complex layout

17th century, antiqua, simple layout

17th century, antiqua, complex layout

17th century, font mix, simple layout

fraktur, antiqua

fraktur, antiqua, ancient Greek, Hebrew

17th century, font mix, complex layout

fraktur, antiqua


18th century, fraktur, simple layout

18th century, fraktur, complex layout

18th century, antiqua, simple layout

18th century, antiqua, complex layout

18th century, font mix, simple layout

18th century, font mix, complex layout


19th century, antiqua [1]

19th century, fraktur [1]

[1] We only have two works with text GT for the 19th century, blumenbach_anatomie_1805.ocrd and arnimb_goethe03_1835.ocrd. Since the 19th century isn't part of our scope, we'll limit ourselves to the material we already have.
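The category list above is a grid of century, font, and layout, plus the two 19th-century cases that have no layout split. As a minimal sketch (the function name `category_labels` and the constant names are only illustrative, not part of any OCR-D tooling), the labels could be generated like this:

```python
from itertools import product

# Dimensions taken from the category list in this comment.
CENTURIES = ["16th", "17th", "18th"]
FONTS = ["fraktur", "antiqua", "font mix"]
LAYOUTS = ["simple", "complex"]

# The 19th century is out of scope, so it only gets two fixed categories.
NINETEENTH_CENTURY = ["19th century, antiqua", "19th century, fraktur"]


def category_labels():
    """Return all category labels in the order used in the list above."""
    labels = [
        f"{century} century, {font}, {layout} layout"
        for century, font, layout in product(CENTURIES, FONTS, LAYOUTS)
    ]
    return labels + NINETEENTH_CENTURY


if __name__ == "__main__":
    for label in category_labels():
        print(label)
```

This gives 18 grid categories plus the two 19th-century ones. The font-mix subtypes (e.g. fraktur, antiqua, ancient Greek, Hebrew) are not modelled here.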

mweidling commented 2 years ago

Creating the simple cases

Categories

16th century, fraktur, simple layout

16th century, antiqua, simple layout

16th century, antiqua, complex layout


17th century, fraktur, simple layout

17th century, font mix, simple layout

fraktur, antiqua

fraktur, antiqua, ancient Greek, Hebrew


18th century, fraktur, simple layout

18th century, antiqua, simple layout

18th century, font mix, complex layout


19th century, antiqua

19th century, fraktur
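To see which cells of the 16th to 18th century grid are not in this list, a small comparison could help. This is only a sketch: it assumes the categories listed in this comment are the ones covered so far, and the names `SELECTED` and `uncovered` are made up for illustration.

```python
from itertools import product

# Full grid for the centuries in scope (the 19th century has no layout split).
ALL_CATEGORIES = set(
    product(
        (16, 17, 18),
        ("fraktur", "antiqua", "font mix"),
        ("simple", "complex"),
    )
)

# Categories listed in this comment, as (century, font, layout) tuples.
SELECTED = {
    (16, "fraktur", "simple"),
    (16, "antiqua", "simple"),
    (16, "antiqua", "complex"),
    (17, "fraktur", "simple"),
    (17, "font mix", "simple"),
    (18, "fraktur", "simple"),
    (18, "antiqua", "simple"),
    (18, "font mix", "complex"),
}


def uncovered():
    """Return the grid categories that are not in SELECTED, sorted."""
    return sorted(ALL_CATEGORIES - SELECTED)


if __name__ == "__main__":
    for century, font, layout in uncovered():
        print(f"{century}th century, {font}, {layout} layout")
```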

mweidling commented 2 years ago

The data is now available at https://github.com/OCR-D/quiver-data.git.