UB-Mannheim / AustrianNewspapers

NewsEye / READ OCR training dataset from Austrian Newspapers (1864–1911)

Deskew the line images? #2

Open · wollmers opened this issue 4 years ago

wollmers commented 4 years ago

Some line images are more or less skewed and contain fragments of the preceding or following line.

E. g. ONB_aze_18950706_1.jpg_tl_303.png

(image: extracted line image for tl_303)

In the above image of TextLine id="tl_303", the text

Mit den Fliekenden drangen wir durch die Thore von

isn't even completely represented.

So I would suggest cutting the lines out more accurately with improved image processing.

stweil commented 4 years ago

The example also shows typical transcription errors, here in mistyping Fliehenden as Fliekenden. That's not good for training, of course.

stweil commented 4 years ago

We could run deskewing on all page images to find those with a large skew.

wollmers commented 4 years ago

Maybe I fixed this particular transcription error in my clone.

I would first try to cut out the line images based on the bounding boxes in the PAGE XML, calculate the skew from them, and deskew with ImageMagick (which I know best and which has a nice Perl interface). Or are the bounding boxes broken?
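The bbox step of this proposal can be sketched in a few lines of Python (stdlib only); the Coords string is the one for tl_303 quoted later in this thread, and the ImageMagick call in the trailing comment is an untested assumption:

```python
# Hedged sketch of the proposed approach: derive an axis-aligned
# bounding box from a PAGE XML Coords polygon, which could then be
# handed to ImageMagick for cropping. File names are illustrative.

def coords_to_bbox(points):
    """Turn a PAGE XML Coords 'points' string into (left, top, right, bottom)."""
    pairs = [tuple(map(int, p.split(","))) for p in points.split()]
    xs = [x for x, _ in pairs]
    ys = [y for _, y in pairs]
    return min(xs), min(ys), max(xs), max(ys)

# Coords of TextLine tl_303 from the PAGE XML:
left, top, right, bottom = coords_to_bbox("1417,3200 2046,3200 2046,3246 1417,3246")
print(left, top, right, bottom)  # 1417 3200 2046 3246

# A subsequent ImageMagick crop might then look like (untested sketch):
# convert page.jpg -crop 629x46+1417+3200 +repage line.png
```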

stweil commented 4 years ago

I'm afraid that in this case the bounding boxes are broken. The lines on the newspaper page have different skew angles. The text region with the bad line even starts with lines without skew, but the last lines of the region have a rather large skew angle, so this is some kind of warping.

wollmers commented 4 years ago

We could run deskewing on all page images to find those with a large skew.

Thinking about it, we should not change the original page images. They are the starting point: the task to be solved by the programs that use the GT files. Skew and warp are part of the OCR task.

Of course we can dewarp them in an intermediate step or place without changing the original images.

wollmers commented 4 years ago

@stweil

Now I have looked into the details.

Let's take again https://github.com/UB-Mannheim/AustrianNewspapers/blob/master/gt/train/ONB_aze_18950706_1/ONB_aze_18950706_1.jpg_tl_303.png as an example:

(image: ground-truth line image ONB_aze_18950706_1.jpg_tl_303.png)

Running Tesseract on the page image and extracting the hOCR for the line:

<span class='ocr_line' id='line_1_323' 
     title="bbox 1417 3219 2053  3283; baseline 0.027 -39; x_size 36; x_descenders 6; x_ascenders 13">

Compare the bounding polygons (or boxes) between the original ONB NewsEye data (downloaded 2020-03-01), https://github.com/UB-Mannheim/AustrianNewspapers, and Tesseract:

ONB_newseye:

<TextLine id="tl_303"
       primaryLanguage="German"
       custom="readingOrder {index:0;}">
  <Coords points="1417,3200 2046,3200 2046,3246 1417,3246"/>
  <Baseline points="1417,3241 1700,3251 2049,3262"/>

AustrianNewspapers:

<TextLine id="tl_303" primaryLanguage="German" custom="readingOrder {index:0;}">
  <Coords points="1417,3200  2046,3200  2046,3246 1417,3246"/>
  <Baseline points="  1417,3241 1700,3251  2049,3262"/>

Comparison:

                           left  top  right bottom
newseye                    1417  3200 2046  3246
austrian                   1417  3200 2046  3246
tesseract title="bbox      1417  3219 2053  3283

Tesseract (mostly) uses min/max(x/y) of the character-bboxes for the line-bbox. Let's try a cutout with the bbox of Tesseract:

(image: line cut out with the Tesseract bbox)
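The min/max rule described above can be sketched quickly; the two input boxes here are illustrative stand-ins for character or word boxes, chosen so that their union matches Tesseract's line bbox from the comparison table:

```python
# Sketch of deriving a line bbox as the min/max over the bboxes of its
# characters (or words), as Tesseract mostly does. The input boxes are
# hypothetical; only their union is taken from the thread.

def union_bbox(boxes):
    lefts, tops, rights, bottoms = zip(*boxes)
    return min(lefts), min(tops), max(rights), max(bottoms)

print(union_bbox([(1417, 3219, 1500, 3270),    # hypothetical first word
                  (1900, 3230, 2053, 3283)]))  # hypothetical last word
# (1417, 3219, 2053, 3283)
```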

Now we have the complete line, but it is still warped. Tesseract recognises this one line without errors and now reports better metrics:

<span class='ocr_line' id='line_1_2' title="bbox 2 2 637 50; baseline 0.039 -27; x_size 28; x_descenders 6; x_ascenders 6">
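As a quick sanity check of the slope reported in this hOCR line (a minimal Python snippet, stdlib only):

```python
import math

# The hOCR field "baseline 0.039 -27" gives a slope of 0.039 (note:
# 0.039, not 0.39); converting it to a rotation angle in degrees:
angle = math.degrees(math.atan(0.039))
print(round(angle, 3))  # 2.233
```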

Taking the value 0.039 to calculate the rotation, arctan(0.039) ≈ 2.233°, we can rotate the image:

convert ONB_aze_18950706_1.jpg_tl_303.png  -background none  -shear 0,-2.2  shear_y-2.2.png

(image: sheared line image shear_y-2.2.png)

and crop it:

(image: sheared and cropped line image)

Now the result of Tesseract is nearly perfect:

  <div class='ocr_page' id='page_1' title='image "shear_y-2.2_cropped.png"; bbox 0 0 638 31; ppageno 0'>
   <div class='ocr_carea' id='block_1_1' title="bbox 1 0 636 31">
    <p class='ocr_par' id='par_1_1' lang='deu' title="bbox 1 0 636 31">
     <span class='ocr_line' id='line_1_1' title="bbox 1 0 636 31; baseline 0 -8; x_size 31; x_descenders 7; x_ascenders 8">
      <span class='ocrx_word' id='word_1_1' title='bbox 1 2 39 31; x_wconf 96'>Mit</span>
      <span class='ocrx_word' id='word_1_2' title='bbox 55 2 89 31; x_wconf 89'>den</span>
      <span class='ocrx_word' id='word_1_3' title='bbox 93 1 217 31; x_wconf 81' lang='frk'>Fliehenden</span>
      <span class='ocrx_word' id='word_1_4' title='bbox 231 0 365 30; x_wconf 96' lang='frk'>drangen</span>
      <span class='ocrx_word' id='word_1_5' title='bbox 333 0 372 31; x_wconf 96' lang='frk'>wir</span>
      <span class='ocrx_word' id='word_1_6' title='bbox 383 2 438 31; x_wconf 96' lang='frk'>durch</span>
      <span class='ocrx_word' id='word_1_7' title='bbox 461 2 491 24; x_wconf 96' lang='frk'>die</span>
      <span class='ocrx_word' id='word_1_8' title='bbox 505 0 569 30; x_wconf 95' lang='frk'>Thore</span>
      <span class='ocrx_word' id='word_1_9' title='bbox 597 6 636 24; x_wconf 92'>von</span>
     </span>
    </p>
   </div>
  </div>

IMHO the polygons in the PAGE XMLs are broken and can't be used as a basis for cutting out lines in the best way. The segmentation algorithm may have some issues.

stweil commented 4 years ago

The Transkribus PAGE XML also contains a baseline for that line:

<Baseline points="1417,3241 1700,3251 2049,3262"/>

The Transkribus OCR process seems to take only that baseline information, ignoring the bounding box coordinates. Therefore many Transkribus users don't care for those box coordinates. That might be the reason why they are often wrong.

So ideally we should have a tool to extract line images based on the baselines.
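A first building block for such a tool could be parsing the Baseline points and estimating the local skew per segment; a minimal sketch (stdlib only), using the Baseline of tl_303 quoted above:

```python
import math

# Sketch: parse a PAGE XML Baseline 'points' string and estimate the
# skew of each segment. In image coordinates y grows downwards, so a
# positive angle means the line drops towards the right.

def parse_points(points):
    return [tuple(map(int, p.split(","))) for p in points.split()]

def segment_angles_deg(pts):
    return [math.degrees(math.atan2(y2 - y1, x2 - x1))
            for (x1, y1), (x2, y2) in zip(pts, pts[1:])]

pts = parse_points("1417,3241 1700,3251 2049,3262")
print([round(a, 2) for a in segment_angles_deg(pts)])  # [2.02, 1.81]
```

A line extractor could rotate each stretch of the line by its local segment angle, which would also address the warping mentioned earlier; whether that works well enough automatically is exactly the open question.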

stweil commented 4 years ago

austrian 1405 3189 2061 3365

I don't understand that line. Where does it come from? ONB_newseye and AustrianNewspapers should be identical (the PAGE XML was only reformatted in your latest commit).

wollmers commented 4 years ago

austrian 1405 3189 2061 3365

I don't understand that line. Where does it come from? ONB_newseye and AustrianNewspapers should be identical (the PAGE XML was only reformatted in your latest commit).

Sorry, confused something during copy and paste. You are right, they are the same. Corrected it now:

                           left  top  right bottom
newseye                    1417  3200 2046  3246
austrian                   1417  3200 2046  3246
tesseract title="bbox      1417  3219 2053  3283

wollmers commented 4 years ago

The Transkribus PAGE XML also contains a baseline for that line:

<Baseline points="1417,3241 1700,3251 2049,3262"/>

What exactly is this baseline semantically? If I add the descenders = 7 (from Tesseract) to max(y) = 3262, I get 3269 and miss 3283 (the max y of the Tesseract bbox); the difference 3283 − 3262 = 21 pixels is more than 50% of the line height.
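The arithmetic can be checked directly (all values taken from the hOCR output and the Baseline discussed above):

```python
# Baseline y plus the descender height falls well short of the
# Tesseract bbox bottom, so the XML baseline cannot be the
# typographic baseline of this line.
baseline_max_y = 3262  # max y of the Baseline points
descenders = 7         # x_descenders from Tesseract
bbox_bottom = 3283     # bottom of the Tesseract line bbox
x_size = 36            # Tesseract's line height estimate

print(baseline_max_y + descenders)                        # 3269
print(bbox_bottom - baseline_max_y)                       # 21
print(round((bbox_bottom - baseline_max_y) / x_size, 2))  # 0.58
```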

From https://www.primaresearch.org/schema/PAGE/gts/pagecontent/2013-07-15/pagecontent.xsd

<element name="Baseline" type="pc:BaselineType" minOccurs="0">
  <annotation>
    <documentation> Multiple connected points that mark the baseline of the glyphs </documentation>
  </annotation>
</element>

In typography, the baseline is the lower edge of glyphs without descenders. The baseline in the XML seems not usable in the above example. OCR can only estimate a baseline.

The Transkribus OCR process seems to take only that baseline information, ignoring the bounding box coordinates. Therefore many Transkribus users don't care for those box coordinates. That might be the reason why they are often wrong.

I have never used Transkribus; it does not work on my Mac. So I can't tell what is possible and what the transcribers should have done. Maybe I'll build my own tool for segmentation correction, web-based.

So ideally we should have a tool to extract line images based on the baselines.

That's an interesting task, but I'm not sure whether it will work well automatically. I will do it in any case, but not this week. Having good line images will improve the semi-automatic text corrections.