UB-Mannheim / AustrianNewspapers

NewsEye / READ OCR training dataset from Austrian Newspapers (1864–1911)

Info: current statistics #22

Open wollmers opened 4 years ago

wollmers commented 4 years ago

Compared the original XML ("ONB_newseye") with the current line texts ("AustrianNewspapers").

compare_xml.pl Version 0.01

Compare XML text output against ground truth (GRT):
XML: ONB_newseye
GRT: AustrianNewspapers


              lines   words   chars
items ocr:    57541  326524 2198240 matches + inserts + substitutions
items grt:    57541  326394 2198051 matches + deletions + substitutions
matches:      23961  265356 2125325 matches
edits:        33580   61346   73806 inserts + deletions + substitutions
 subs:        33580   60860   71835 substitutions
 inserts:         0     308    1080 inserts
 deletions:       0     178     891 deletions
precision:   0.4164  0.8127  0.9668 matches / (matches + substitutions + inserts)
recall:      0.4164  0.8130  0.9669 matches / (matches + substitutions + deletions)
accuracy:    0.4164  0.8122  0.9664 matches / (matches + substitutions + inserts + deletions)
f-score:     0.4164  0.8128  0.9669 ( 2 * recall * precision ) / (recall + precision )
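
The four scores follow directly from the counts, using the formulas printed next to each row. A minimal sketch in Python (not the language of compare_xml.pl; the names `Counts` and `scores` are made up for illustration) that recomputes the character column:

```python
from dataclasses import dataclass


@dataclass
class Counts:
    """Alignment counts for one granularity (lines, words or chars)."""
    matches: int
    subs: int
    inserts: int
    deletions: int


def scores(c: Counts) -> dict:
    """Apply the formulas listed in the statistics table."""
    precision = c.matches / (c.matches + c.subs + c.inserts)
    recall = c.matches / (c.matches + c.subs + c.deletions)
    accuracy = c.matches / (c.matches + c.subs + c.inserts + c.deletions)
    f_score = 2 * recall * precision / (recall + precision)
    return {
        "precision": precision,
        "recall": recall,
        "accuracy": accuracy,
        "f-score": f_score,
    }


# Character-level counts copied from the table above.
chars = Counts(matches=2125325, subs=71835, inserts=1080, deletions=891)

for name, value in scores(chars).items():
    print(f"{name + ':':<12} {value:.4f}")
```

Rounded to four decimals this gives 0.9668, 0.9669, 0.9664 and 0.9669, matching the chars column.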

Shortened list of the edits/mismatches:

Character match (confusion) table:
GRT => OCR  ratio  errors   count
---    --- ------ ------- -------
'ſ' => 's' 0.9985   56885   56971
'⸗' => '-' 0.0052      61   11639
'⸗' => '=' 0.3232    3762   11639
'⸗' => '¬' 0.6691    7788   11639
SUM                 68496
+ transcription      1000   estimated transcription level 1 -> 2
TOTAL transcription 69496

edits               73806
- transcription    -69496
corrections          4310  (0.20% of all characters)
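
The arithmetic behind this estimate can be retraced in a few lines. This is only a sketch in Python with illustrative variable names. It assumes that "all characters" means the GRT character count (2198051); the output does not say which total was used, but the OCR count gives the same rounded percentage.

```python
# Substitution counts from the confusion table: (GRT char, OCR char) -> errors.
confusions = {
    ("ſ", "s"): 56885,
    ("⸗", "-"): 61,
    ("⸗", "="): 3762,
    ("⸗", "¬"): 7788,
}
# Total occurrences of each GRT character (the "count" column).
totals = {"ſ": 56971, "⸗": 11639}

# Ratio column: share of each GRT character that appears as the OCR variant.
for (grt, ocr), errors in confusions.items():
    print(f"'{grt}' => '{ocr}' {errors / totals[grt]:.4f}")

OTHER_TRANSCRIPTION = 1000  # the author's estimate, taken as given
CHAR_EDITS = 73806          # character-level edits from the statistics
GRT_CHARS = 2198051         # assumption: denominator for the percentage

systematic = sum(confusions.values())
transcription = systematic + OTHER_TRANSCRIPTION
corrections = CHAR_EDITS - transcription
share = corrections / GRT_CHARS

print(f"systematic:    {systematic}")
print(f"transcription: {transcription}")
print(f"corrections:   {corrections} ({share:.2%} of all characters)")
```

The totals come out as 68496, 69496 and 4310, so the remaining genuine corrections are whatever is left after removing the systematic transcription differences from the edit count.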

Rough estimate of errors still remaining in the GRT: 1000–2000.