UB-Mannheim / AustrianNewspapers

NewsEye / READ OCR training dataset from Austrian Newspapers (1864–1911)

[GT Checked] Files: 1718. Fixed s,f and ſ confusions and some … #10

Closed JKamlah closed 4 years ago

JKamlah commented 4 years ago

…minor transcription errors.
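For readers unfamiliar with the s/f/ſ problem, here is a minimal sketch of how candidate corrections could be generated. It is not the method used in this pull request. The names `variants` and `suggest` and the tiny lexicon are hypothetical, chosen only for illustration.

```python
from itertools import product

# Characters that OCR of Fraktur print commonly confuses with one another.
CONFUSABLE = ("f", "s", "ſ")


def variants(word: str) -> set[str]:
    """Return every spelling obtained by swapping each s/f/ſ for any of the three.

    The number of variants grows as 3**n for n confusable characters,
    which is acceptable for single words but not for whole lines.
    """
    slots = [CONFUSABLE if ch in CONFUSABLE else (ch,) for ch in word]
    return {"".join(combo) for combo in product(*slots)}


def suggest(word: str, lexicon: set[str]) -> list[str]:
    """Suggest lexicon entries reachable by s/f/ſ swaps; empty if the word is known."""
    if word in lexicon:
        return []
    return sorted(variants(word) & lexicon)


# Hypothetical mini-lexicon in historical spelling.
lexicon = {"Waſſer", "iſt"}
print(suggest("Waffer", lexicon))
print(suggest("ift", lexicon))
```

A word with more than one matching variant would still need a human decision, so a sketch like this can only propose candidates.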

wollmers commented 4 years ago

@JKamlah I merged #10 into my clone https://github.com/wollmers/AustrianNewspapers and resolved the conflicts with https://github.com/wollmers/AustrianNewspapers/commit/a23d3cc5fd4498e82ba27add46386a8f89180e1c, which proofreads pages 1 and 2 of ONB_aze_18950706.

JKamlah commented 4 years ago

Hello @wollmers, thank you for the contribution. Yes, we used a mostly automatic approach, which is still in development, so it seems that not all cases get detected. If you need a tool to support the proofreading, https://github.com/UB-Mannheim/GTCheck might be of interest. The only problem at the moment is that it only shows modified files, so a workaround could be to put the data into a new git repo, make the corrections, and then move the files back.
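The workaround could be scripted roughly as follows. This is a sketch under assumptions: it only handles the git and file-copy steps, it assumes `git` is on the PATH and that file names are unique, and the helper names `stage_in_scratch_repo` and `copy_back` are hypothetical. How GTCheck is then pointed at the scratch repo is not covered here.

```python
import shutil
import subprocess
from pathlib import Path


def git(repo: Path, *args: str) -> None:
    """Run a git command inside `repo` and raise if it fails."""
    subprocess.run(["git", *args], cwd=repo, check=True, capture_output=True)


def stage_in_scratch_repo(files: list[Path], scratch: Path) -> None:
    """Copy `files` into a fresh git repo and commit them as the baseline.

    Any later edit in `scratch` then shows up as a modified file.
    Assumes all file names are unique, since they are copied flat.
    """
    scratch.mkdir(parents=True, exist_ok=True)
    git(scratch, "init")
    for f in files:
        shutil.copy2(f, scratch / f.name)
    git(scratch, "add", ".")
    # Identity and signing are set inline so the sketch does not
    # depend on the user's global git configuration.
    git(
        scratch,
        "-c", "user.name=proofreader",
        "-c", "user.email=proofreader@example.invalid",
        "-c", "commit.gpgsign=false",
        "commit", "-m", "Baseline before proofreading",
    )


def copy_back(files: list[Path], scratch: Path) -> None:
    """Overwrite the original files with their corrected copies from `scratch`."""
    for f in files:
        shutil.copy2(scratch / f.name, f)
```

The intended order is: call `stage_in_scratch_repo`, correct the copies in the scratch repo, then call `copy_back` with the same file list.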

wollmers commented 4 years ago

@JKamlah thanks for your hints. https://github.com/UB-Mannheim/GTCheck is a nice approach with some interesting ideas.

I have now developed my own proofreading web application, which also displays the images of the lines before and after for proofreading context. The other proofreader I developed is pure JavaScript, works only on hOCR, and displays the whole page, which is inconvenient for large pages like newspapers. I am still gaining experience with the usability and speed of proofreading.

My focus is the reconstruction of books in the domain of natural history. The side-step to newspapers is interesting because it exposes harder problems. My approach is to improve the optical recognition rate through font recognition and unsupervised learning of fonts and unknown characters. The same applies to vocabulary, as ~30% of words and names in scientific books are unknown (not in my existing dictionaries, ~20 M words, 2.4 M common German).

I proofread only 2 pages manually to learn the problems. Of course, manual proofreading detects different errors than automatic checks do. Automatic spellchecks have false positives at the single-word level, e.g. "seid" versus "seit".
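One way to read the "seid" versus "seit" example is shown in the sketch below: both are valid German words, so a lookup that considers one word at a time cannot tell which of them a line should contain. The word list and the function name `unknown_words` are hypothetical. A real checker would load a full dictionary.

```python
# Toy word list standing in for a full German dictionary.
KNOWN_WORDS = {"seid", "seit", "ihr", "gestern", "willkommen"}


def unknown_words(line: str) -> list[str]:
    """Return the tokens of `line` that are not in the word list.

    Tokens are lowercased and stripped of surrounding punctuation.
    No context is used, which is exactly the limitation being shown.
    """
    tokens = [t.strip('.,;:!?"') for t in line.lower().split()]
    return [t for t in tokens if t and t not in KNOWN_WORDS]


# Both lines pass, although only one of them can be the correct reading.
print(unknown_words("Seit gestern"))
print(unknown_words("Seid gestern"))
```

Deciding between the two readings needs at least the neighbouring words, which is why manual proofreading finds errors that a check like this cannot.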

wollmers commented 4 years ago

@JKamlah Something like this is hard to correct automatically and also manually without more context than 3 lines. From the paragraph context I would guess "Angeb.[ote]" instead of "Anged."

[Image: screenshot, 2020-07-03 at 11:57:51]

stweil commented 4 years ago

Thanks, merged now.