UB-Mannheim / AustrianNewspapers

NewsEye / READ OCR training dataset from Austrian Newspapers (1864–1911)

Some PNGs in gt/train are truncated #3

Open wollmers opened 4 years ago

wollmers commented 4 years ago

Some of the line images in gt/train are too short (truncated), e.g.:

ONB_ibn_19110701_018.tif_tl_6.gt.txt:###wertet. Das geringſte Gebot beträgt 10.020 Kronen.
ONB_ibn_19110701_018.tif_tl_63.gt.txt:b###chten Utenſilien, über welche die öffentliche Ver⸗
ONB_ibn_19110701_018.tif_tl_64.gt.txt:###ßerung ausgeſchrieben wird. Offerte bis 4. Juli.
ONB_ibn_19110701_018.tif_tl_69.gt.txt:lo###e der k. k. Forſt⸗ und Domänen ⸗ Direktion in
ONB_ibn_19110701_018.tif_tl_98.gt.txt:###war der Hausanteil auf 5538 Kronen und der Anteil

Just a reminder to explore this issue later.

stweil commented 4 years ago

I am afraid that most images with ### are unusable for training, because ### was evidently used to transcribe unreadable text. An experienced reader could supply a transcription for a small number of them.
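To get an overview of how many lines are affected, the ground-truth files could be split into usable and flagged ones. The following is only a minimal sketch: the function name `split_by_placeholder` is hypothetical, and it assumes that every transcription lives in a UTF-8 `*.gt.txt` file directly inside the given directory and that `###` is the only placeholder in use.

```python
from pathlib import Path

PLACEHOLDER = "###"  # marker seen in the affected transcriptions


def split_by_placeholder(gt_dir, marker=PLACEHOLDER):
    """Return (usable, flagged) lists of *.gt.txt file names in gt_dir.

    A file is flagged when its transcription contains the marker.
    Hypothetical helper, not part of the repository.
    """
    usable, flagged = [], []
    for path in sorted(Path(gt_dir).glob("*.gt.txt")):
        text = path.read_text(encoding="utf-8")
        if marker in text:
            flagged.append(path.name)
        else:
            usable.append(path.name)
    return usable, flagged
```

Calling `split_by_placeholder("gt/train")` would then give one list to train on and one list to review by hand.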

wollmers commented 4 years ago

I fixed about 150 of them, as far as I could guess the missing text; good language processing with dictionary lookup and word n-grams can solve such cases. Some gaps result from the transcribers' limited knowledge of old Viennese vocabulary, e.g. Kloth (a special cotton fabric), or of old place names in the Austrian monarchy. If numbers or personal names are unreadable, there is no chance without additional context.
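The dictionary-lookup part of such an approach might look roughly like the sketch below. It is not the procedure actually used for the fixes: the function `candidates` and the toy lexicon are hypothetical, each `###` is assumed to stand for one or more letters, and the n-gram ranking of candidates is left out.

```python
import re

PLACEHOLDER = "###"


def candidates(token, lexicon, marker=PLACEHOLDER):
    """Return the lexicon words that fit a token containing placeholders.

    Each marker is treated as "one or more letters"; the literal parts
    of the token must match exactly. Hypothetical helper for illustration.
    """
    parts = [re.escape(part) for part in token.split(marker)]
    # [^\W\d_]+ matches a run of Unicode letters (no digits, no underscore)
    pattern = re.compile(r"[^\W\d_]+".join(parts))
    return sorted(word for word in lexicon if pattern.fullmatch(word))


# Toy lexicon, made up for this example; a real run would need a large
# historical word list that includes long-s (ſ) spellings.
lexicon = {"beachten", "brachten", "brauchten", "bewertet", "wertet"}

print(candidates("b###chten", lexicon))  # ['beachten', 'brachten', 'brauchten']
print(candidates("###wertet", lexicon))  # ['bewertet']
```

The first call shows why lookup alone is not enough: several words fit, so the surrounding words (for example via word n-grams) are needed to choose between them, and the sketch makes no claim about which reading is correct.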

Hopefully I haven't overdone the fixes.