wollmers opened this issue 4 years ago
The example also shows typical transcription errors, here the mistyping of Fliehenden as Fliekenden. That's not good for training, of course.
We could run deskewing on all page images to find those with a large skew.
Maybe I fixed this particular transcription error in my clone.
I would first try to cut out the line images based on the bounding boxes in the PAGE XML, calculate the skew from them, and deskew with ImageMagick (which I know best and which has a nice Perl interface). Or are the bounding boxes broken?
I'm afraid that in this case the bounding boxes are broken. The lines on the newspaper page have different skew angles. Even the text region with the bad line starts with lines without skew, but the last lines of the region have a rather large skew angle, so this is some kind of warping.
We could run deskewing on all page images to find those with a large skew.
Thinking about it, we should not change the original page images. They are the starting point, the task to be solved by the programs using the GT files. Skew and warp are part of the OCR task.
Of course we can dewarp them in an intermediate step or place without changing the original images.
@stweil
Now I have looked into the details.
Let's take again https://github.com/UB-Mannheim/AustrianNewspapers/blob/master/gt/train/ONB_aze_18950706_1/ONB_aze_18950706_1.jpg_tl_303.png as an example:
Using Tesseract for the page image and extracting the hOCR for the line:
<span class='ocr_line' id='line_1_323'
title="bbox 1417 3219 2053 3283; baseline 0.027 -39; x_size 36; x_descenders 6; x_ascenders 13">
Compare the bounding polygons (or boxes) between original ONB-Newseye (downloaded 2020-03-01), https://github.com/UB-Mannheim/AustrianNewspapers and Tesseract:
ONB_newseye:
<TextLine id="tl_303"
primaryLanguage="German"
custom="readingOrder {index:0;}">
<Coords points="1417,3200 2046,3200 2046,3246 1417,3246"/>
<Baseline points="1417,3241 1700,3251 2049,3262"/>
AustrianNewspapers:
<TextLine id="tl_303" primaryLanguage="German" custom="readingOrder {index:0;}">
<Coords points="1417,3200 2046,3200 2046,3246 1417,3246"/>
<Baseline points=" 1417,3241 1700,3251 2049,3262"/>
Comparison:
left top right bottom
newseye 1417 3200 2046 3246
austrian 1417 3200 2046 3246
tesseract title="bbox 1417 3219 2053 3283
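For reference, those min/max boxes can be recomputed directly from the Coords strings in the PAGE XML. A small sketch (the helper `coords_to_bbox` is illustrative, not an existing tool):

```python
def coords_to_bbox(points: str):
    """Reduce a PAGE XML Coords 'points' string to (left, top, right, bottom)."""
    xs, ys = zip(*(map(int, p.split(",")) for p in points.split()))
    return min(xs), min(ys), max(xs), max(ys)

# Coords of tl_303 from the PAGE XML above:
print(coords_to_bbox("1417,3200 2046,3200 2046,3246 1417,3246"))
# → (1417, 3200, 2046, 3246)
```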
Tesseract (mostly) uses min/max(x/y) of the character-bboxes for the line-bbox. Let's try a cutout with the bbox of Tesseract:
Now we have the complete line, though still warped. Tesseract recognises this single line without errors and now reports better metrics:
<span class='ocr_line' id='line_1_2' title="bbox 2 2 637 50; baseline 0.039 -27; x_size 28; x_descenders 6; x_ascenders 6">
Taking the value 0.039 to calculate the rotation, arctan(0.039) ≈ 2.233°, we can rotate the image:
convert ONB_aze_18950706_1.jpg_tl_303.png -background none -shear 0,-2.2 shear_y-2.2.png
and crop it:
Now the result of Tesseract is nearly perfect:
<div class='ocr_page' id='page_1' title='image "shear_y-2.2_cropped.png"; bbox 0 0 638 31; ppageno 0'>
<div class='ocr_carea' id='block_1_1' title="bbox 1 0 636 31">
<p class='ocr_par' id='par_1_1' lang='deu' title="bbox 1 0 636 31">
<span class='ocr_line' id='line_1_1' title="bbox 1 0 636 31; baseline 0 -8; x_size 31; x_descenders 7; x_ascenders 8">
<span class='ocrx_word' id='word_1_1' title='bbox 1 2 39 31; x_wconf 96'>Mit</span>
<span class='ocrx_word' id='word_1_2' title='bbox 55 2 89 31; x_wconf 89'>den</span>
<span class='ocrx_word' id='word_1_3' title='bbox 93 1 217 31; x_wconf 81' lang='frk'>Fliehenden</span>
<span class='ocrx_word' id='word_1_4' title='bbox 231 0 365 30; x_wconf 96' lang='frk'>drangen</span>
<span class='ocrx_word' id='word_1_5' title='bbox 333 0 372 31; x_wconf 96' lang='frk'>wir</span>
<span class='ocrx_word' id='word_1_6' title='bbox 383 2 438 31; x_wconf 96' lang='frk'>durch</span>
<span class='ocrx_word' id='word_1_7' title='bbox 461 2 491 24; x_wconf 96' lang='frk'>die</span>
<span class='ocrx_word' id='word_1_8' title='bbox 505 0 569 30; x_wconf 95' lang='frk'>Thore</span>
<span class='ocrx_word' id='word_1_9' title='bbox 597 6 636 24; x_wconf 92'>von</span>
</span>
</p>
</div>
</div>
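The 2.2° shear used above can be double-checked from the hOCR baseline slope (a quick sketch; the slope value 0.039 is taken from the hOCR line output quoted earlier):

```python
import math

# hOCR reports "baseline 0.039 -27": slope 0.039 (rise per pixel), offset -27 px.
slope = 0.039
angle_deg = math.degrees(math.atan(slope))
print(round(angle_deg, 3))  # → 2.233
```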
IMHO the polygons in the Page-XMLs are broken and can't be used as a base for cutting out lines in the best way. The segmentation algorithm may have some issues.
The Transkribus PAGE XML also contains a baseline for that line:
<Baseline points="1417,3241 1700,3251 2049,3262"/>
The Transkribus OCR process seems to take only that baseline information, ignoring the bounding box coordinates. Therefore many Transkribus users don't care for those box coordinates. That might be the reason why they are often wrong.
So ideally we should have a tool to extract line images based on the baselines.
austrian 1405 3189 2061 3365
I don't understand that line. Where does it come from? ONB_newseye and AustrianNewspapers should be identical (the PAGE XML was only reformatted in your latest commit).
Sorry, confused something during copy and paste. You are right, they are the same. Corrected it now:
left top right bottom
newseye 1417 3200 2046 3246
austrian 1417 3200 2046 3246
tesseract title="bbox 1417 3219 2053 3283
The Transkribus PAGE XML also contains a baseline for that line:
<Baseline points="1417,3241 1700,3251 2049,3262"/>
What exactly is this baseline semantically? If I add the descender = 7 (from Tesseract) to max(y) = 3262, I get 3269 and miss 3283 (the max(y) of the Tesseract bbox); 3283 − 3262 = 21 pixels, more than 50% of the line height.
From https://www.primaresearch.org/schema/PAGE/gts/pagecontent/2013-07-15/pagecontent.xsd
<element name="Baseline" type="pc:BaselineType" minOccurs="0">
<annotation>
<documentation> Multiple connected points that mark the baseline of the glyphs </documentation>
</annotation>
</element>
In typography the baseline is the lower edge of glyphs without descenders. In the above example it seems not usable as given in the XML. OCR can only estimate a baseline.
The Transkribus OCR process seems to take only that baseline information, ignoring the bounding box coordinates. Therefore many Transkribus users don't care for those box coordinates. That might be the reason why they are often wrong.
I have never used Transkribus; it does not work on my Mac. So I can't tell what's possible and what the transcribers should have done. Maybe I will build my own web-based tool for segmentation correction.
So ideally we should have a tool to extract line images based on the baselines.
That's an interesting task, but I'm not sure if it will work well automatically. In any case I will do it, but not this week. Having good line images will improve the semi-automatic text corrections.
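A minimal sketch of what such a baseline-based extractor could compute (the helper `baseline_envelope` and the fixed ascent/descent margins are assumptions for illustration, not values from the PAGE XML; the resulting box could be passed to e.g. Pillow's `Image.crop()`):

```python
def baseline_envelope(baseline, ascent=40, descent=10):
    """Axis-aligned crop box (left, top, right, bottom) around a PAGE XML
    baseline, given as a list of (x, y) points from <Baseline points="...">.
    ascent/descent are guessed pixel margins above/below the baseline."""
    xs = [x for x, _ in baseline]
    ys = [y for _, y in baseline]
    return (min(xs), min(ys) - ascent, max(xs), max(ys) + descent)

# Baseline of tl_303 from the PAGE XML above:
print(baseline_envelope([(1417, 3241), (1700, 3251), (2049, 3262)]))
# → (1417, 3201, 2049, 3272)
```

A real tool would follow the baseline polyline instead of its axis-aligned envelope (i.e. dewarp along the baseline), which is exactly what the warped tl_303 line needs.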
Some line images are more or less skewed, containing fragments of the line before or after.
E. g. ONB_aze_18950706_1.jpg_tl_303.png
In the above image of TextLine id="tl_303" the text "Mit den Fliekenden drangen wir durch die Thore von" isn't even completely represented.
So I would suggest cutting them out in a better way with improved image processing.