UB-Mannheim / AustrianNewspapers

NewsEye / READ OCR training dataset from Austrian Newspapers (1864–1911)

J/I transcription in Fraktur #34

Closed. wollmers closed this issue 3 years ago.

wollmers commented 3 years ago

At the moment all Fraktur J before a vowel are transcribed as I in the text files.

I am not sure if it harms the quality of training. Maybe not, because training takes the adjacent characters into account.

On the other hand, I prefer that a J in the image stays a J in the result. Most blackletter fonts had no separate glyph for I. After ~1900 a distinct I was cut for some (~25 %) blackletter fonts. The difference is only visible when both letters appear in the same text.

AFAIK GT4Hist keeps J. It can be a problem if GT4Hist is combined with AustrianNewspapers.

Quick proof:

$ grep -R --exclude='*.png' 'J[bcdfghjklmnpqrsſtvwxz]' .
ONB_ibn_19110701_037.tif_tl_13.gt.txt:Seit Ende Mai ds. Js. ſind ſowohl die Zugänge als als auch die
ONB_ibn_18640702_003.tif_tl_13.gt.txt:machung : In Folge hoher k. k. Statthalterei⸗Kundmachung vom 31. Mai d. Js.
ONB_ibn_18640702_003.tif_tl_16.gt.txt:d. Js. beginnt, und es haben ſich die aus dem Civilſtande Eintretenden mit dem
ONB_ibn_18640702_012.tif_tl_38.gt.txt:d. Js. um ſo gewiſſer anher einzuzahlen, als ſonſt nach Ablauf dieſer Friſt die
ONB_ibn_18640702_012.tif_tl_16.gt.txt:Am 4. Juli d. Js. um 9 Uhr früh angefangen, werden im Hauſe Nr. 57
ONB_ibn_19110701_035.tif_tl_174.gt.txt:Kufſtein (Tirol) iſt mit 1. September ds. Js.
ONB_ibn_19110701_027.tif_tl_7.gt.txt:Das Schuljahr 1911|12 beginnt am 16. September ds. Js. Die Schüleraufnahme
ONB_ibn_18640702_009.tif_tl_17.gt.txt:Pachtliebhaber wollen ſich bis Jakobi d. Js. bei der gräfl. v. Enzenberg⸗
ONB_ibn_18640702_009.tif_tl_12.gt.txt:Auf Martini d. Js. kommt zu verpachten:

$ grep -R --exclude='*.png' 'I[bcdefghjklmnpqrsſtvwxz]' . | wc -l
    2536
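The two grep calls above can also be expressed as a small Python sketch, which may be handier when the counts are needed per file or inside a larger script. The consonant classes are copied from the grep patterns (including the `e` in the I pattern). Restricting the scan to `*.gt.txt` files and the names `count_matching_lines` and `scan_corpus` are assumptions made for illustration, so the totals may differ from the grep output.

```python
import re
from pathlib import Path

# Character classes copied from the two grep patterns above.
J_BEFORE_CONSONANT = re.compile(r"J[bcdfghjklmnpqrsſtvwxz]")
I_BEFORE_CONSONANT = re.compile(r"I[bcdefghjklmnpqrsſtvwxz]")


def count_matching_lines(lines, pattern):
    """Count lines with at least one match, similar to `grep ... | wc -l`."""
    return sum(1 for line in lines if pattern.search(line))


def scan_corpus(root="."):
    """Return (J line count, I line count) over all *.gt.txt files below root.

    Assumption: only *.gt.txt files are ground truth and they are UTF-8.
    """
    j_total = i_total = 0
    for path in Path(root).rglob("*.gt.txt"):
        lines = path.read_text(encoding="utf-8").splitlines()
        j_total += count_matching_lines(lines, J_BEFORE_CONSONANT)
        i_total += count_matching_lines(lines, I_BEFORE_CONSONANT)
    return j_total, i_total
```

Calling `scan_corpus(".")` from the repository root would give the two line counts in one pass.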

stweil commented 3 years ago

> At the moment all Fraktur J before a vowel are transcribed as I in the text files.

Where did you find such patterns? I only found a few "Ia".

> AFAIK GT4Hist keeps J.

Yes, the original GT4HistOCR normalized all upper case I to J. That not only looks strange for Roman numerals, but also is wrong for some texts which don't use Fraktur fonts.

Therefore I started to restore upper case I and added it also for Fraktur texts, even when there was no visible difference to J.

The resulting models then normally recognize I and J, so a full text search can look for "Juli" and "Indien".
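For ground truth or OCR output in which I and J were normalized to a single letter (as described above for the original GT4HistOCR), a search has to tolerate both spellings. The following is a minimal sketch of such a tolerant search, not part of any existing tool; the function names `ij_tolerant_pattern` and `find_lines` are hypothetical.

```python
import re


def ij_tolerant_pattern(word):
    """Build a regex matching `word` with uppercase I and J interchangeable."""
    parts = ["[IJ]" if ch in "IJ" else re.escape(ch) for ch in word]
    return re.compile("".join(parts))


def find_lines(lines, word):
    """Return the lines that contain `word` under I/J-tolerant matching."""
    pattern = ij_tolerant_pattern(word)
    return [line for line in lines if pattern.search(line)]
```

With models that distinguish I and J reliably, a plain search for "Juli" or "Indien" is enough and this workaround is not needed.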

wollmers commented 3 years ago

> > At the moment all Fraktur J before a vowel are transcribed as I in the text files.

> Where did you find such patterns? I only found a few "Ia".

Sorry, I meant consonants. The "Ia" are correct and stand for "1a" with a Roman numeral.

> > AFAIK GT4Hist keeps J.

> Yes, the original GT4HistOCR normalized all upper case I to J. That not only looks strange for Roman numerals, but also is wrong for some texts which don't use Fraktur fonts.

AFAIK J or j isn't used in Roman numerals in this corpus. That practice was sometimes common before 1700.

> Therefore I started to restore upper case I and added it also for Fraktur texts, even when there was no visible difference to J.

> The resulting models then normally recognize I and J, so a full text search can look for "Juli" and "Indien".

OK, then I keep them like this:

[Screenshot: Bildschirmfoto 2021-07-24 um 23 42 19]

stweil commented 3 years ago

> AFAIK J or j isn't used in Roman numerals in this corpus.

It was heavily used in 19th century texts from DTA, but there is also an example from 1588. See https://code.bib.uni-mannheim.de/ocr-d/GT4HistOCR/-/commit/fbe5f5c127b3dbd3bfef2924c643d5a46bb8e725

wollmers commented 3 years ago

OK, DTA is wrong (got a long list):

GT4HistOCR/corpus/dta19/1854-candidus_christus/00104.gt.txt:— VJJJ —

[Image: 00104 nrm]

2 errors: no em-dash in the png, and no JJJ.
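Lines like the one above can be found automatically. The sketch below flags tokens that consist only of Roman-numeral letters plus J and contain at least one J, and offers a replacement with I for manual review. It is a heuristic under stated assumptions, not a validated tool: the names `ROMAN_WITH_J`, `suspicious_numerals` and `normalize_numerals` are hypothetical, and abbreviations that happen to use the same letters would be flagged as well, so every hit should be checked against the image.

```python
import re

# A token made only of the letters I, V, X, L, C, D, M, J that contains at
# least one J and has at least two characters (a lone "J" is not flagged).
ROMAN_WITH_J = re.compile(r"\b(?=[IVXLCDMJ]{2,}\b)[IVXLCDM]*J[IVXLCDMJ]*\b")


def suspicious_numerals(line):
    """Return all tokens in `line` that look like Roman numerals written with J."""
    return ROMAN_WITH_J.findall(line)


def normalize_numerals(line):
    """Replace J by I inside the flagged tokens only; other text is untouched."""
    return ROMAN_WITH_J.sub(lambda m: m.group(0).replace("J", "I"), line)
```

Words such as "Juli" or "Js." are not touched, because they contain letters outside the numeral set.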

DTA also has some books by Blumenbach in its corpus, which came from Blumenbach-Online and are transcribed with round s and the current alphabet, but old orthography. They are not in the GT4HistOCR selection, though. In any case, I need to build my own GT data to cover the domain of natural history 1750-1900.

I mean something like this with jjj (1481 by William Caxton):

[Screenshot: Bildschirmfoto 2021-06-18 um 06 10 47]