Open stweil opened 4 years ago
A simple workaround could ignore all line images with an image height larger than the image width. Images with a small height should not be ignored (otherwise line numbers like I or 1 might not be trained).
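A minimal sketch of that workaround, assuming the width and height of each line image are already known. The function name and the `min_height` threshold of 40 pixels are hypothetical and would need tuning against the real data:

```python
def keep_line_image(width: int, height: int, min_height: int = 40) -> bool:
    """Return True if a line image should be kept for training.

    min_height is an assumed threshold (in pixels), not a value from the data set.
    """
    # Small images (e.g. a single digit or letter) are kept even if they
    # are taller than wide, so that short line numbers can still be trained.
    if height <= min_height:
        return True
    # Otherwise drop images that are taller than wide (probably vertical text).
    return height <= width
```

With these assumptions, a crop of about 63 x 204 pixels (roughly the size of the vertical example line in this issue) would be dropped, while a 15 x 30 pixel crop containing a single digit would be kept.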
Sounds challenging.
An improved heuristic, under the assumption of mostly correct transcriptions, could estimate the line proportions from font metrics. But it would not work for nearly square images.
I just looked into the PAGE XML of the above example, ONB_ibn_19110701_010.xml:
<TextLine id="line_1547100913156_36" custom="readingOrder {index:2;}">
<Coords points="2307,2829 2320,2628 2370,2631 2357,2832"/>
<Baseline points="2352,2832 2365,2631"/>
<TextEquiv>
<Unicode>Celsiusgraden</Unicode>
</TextEquiv>
</TextLine>
The Baseline tells us:
x1 - x2 = 2352 - 2365 = -13
y1 - y2 = 2832 - 2631 = 201
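The angle implied by these differences could be computed roughly as follows. This is only a sketch: the function name is made up, it uses only the first and last baseline point, and it measures the direction from the first to the last point in image coordinates (y grows downwards), which is an assumed convention:

```python
import math


def baseline_angle(points: str) -> float:
    """Angle in degrees of the direction from the first to the last baseline point.

    points is the value of a PAGE XML points attribute, e.g. "2352,2832 2365,2631".
    The angle is measured in image coordinates, where y grows downwards, so a
    negative value means that the baseline runs upwards on the page.
    """
    pts = [tuple(int(v) for v in p.split(",")) for p in points.split()]
    (x1, y1), (x2, y2) = pts[0], pts[-1]
    return math.degrees(math.atan2(y2 - y1, x2 - x1))


angle = baseline_angle("2352,2832 2365,2631")
```

For the example baseline this gives an angle of about -86 degrees, so the text runs nearly vertically from bottom to top.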
We can calculate the skew (and orientation) from it, or just OCR the line image with Tesseract:
<p class='ocr_par' id='par_1_1' lang='ubma/frak2021_0.905_1587027_9141630' title="bbox 0 5 62 201">
<span class='ocr_line' id='line_1_1' title="bbox 27 5 62 201; baseline -65.333 522.667; x_size 34; x_descenders 8; x_ascenders 8">
<span class='ocrx_word' id='word_1_1' title='bbox 27 5 62 201; x_wconf 0'>vabsnni</span>
</span>
<span class='ocr_line' id='line_1_2' title="bbox 0 52 20 142; baseline -45 0; x_size 35.5; x_descenders 8.5; x_ascenders 8.5">
<span class='ocrx_word' id='word_1_2' title='bbox 2 52 20 79; x_wconf 40'>11</span>
<span class='ocrx_word' id='word_1_3' title='bbox 0 100 19 142; x_wconf 27'>an!</span>
</span>
</p>
This gives
baseline -65.333 => -89.1230877852261 ~ -90 degrees clockwise
baseline -45 => -88.7269699799433 ~ -90 degrees clockwise
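The conversion above is just the arctangent of the first number of the hOCR baseline property (the slope). A small sketch, with a hypothetical helper name, that extracts the slope from a title attribute and converts it to degrees:

```python
import math
import re

# Matches "baseline <slope> <offset>" inside an hOCR title attribute.
BASELINE_RE = re.compile(r"baseline\s+(-?\d+(?:\.\d+)?)\s+(-?\d+(?:\.\d+)?)")


def hocr_baseline_angle(title: str) -> float:
    """Return the baseline angle in degrees for an hOCR ocr_line title string."""
    match = BASELINE_RE.search(title)
    if match is None:
        raise ValueError("no baseline property in title")
    slope = float(match.group(1))
    return math.degrees(math.atan(slope))


angle = hocr_baseline_angle("bbox 27 5 62 201; baseline -65.333 522.667; x_size 34")
```

Applied to the two ocr_line titles above, this reproduces the values of about -89.12 and -88.73 degrees.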
Result of Tesseract on the rotated image (CER 0.0):
<p class='ocr_par' id='par_1_1' lang='ubma/frak2021_0.905_1587027_9141630' title="bbox 3 0 199 62">
<span class='ocr_line' id='line_1_1' title="bbox 62 0 152 20; baseline 0.011 -1; x_size 35.5; x_descenders 8.5; x_ascenders 8.5">
<span class='ocrx_word' id='word_1_1' title='bbox 62 0 104 19; x_wconf 34'>ur</span>
<span class='ocrx_word' id='word_1_2' title='bbox 125 2 152 20; x_wconf 82'>in</span>
</span>
<span class='ocr_line' id='line_1_2' title="bbox 3 27 199 62; baseline 0.015 -9; x_size 34; x_descenders 8; x_ascenders 8">
<span class='ocrx_word' id='word_1_3' title='bbox 3 27 199 62; x_wconf 86'>Celſiusgraden</span>
</span>
</p>
Now we can cut out the image of the most similar line and update the PAGE XML (keeping the semantics of <Baseline points="2352,2832 2365,2631" />).
Just for information:
There are 273 TextLine entries in the XML files where the skew of the Baseline is larger than (+/-) 10 degrees. Some of them have already rotated line images. Some of the skews are not a multiple of 90 degrees, mainly in advertisements.
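A count like this could be reproduced with a short script along the following lines. This is only a sketch, not the script that produced the number above: the function names are made up, the skew is taken from the first and last baseline point only, and element names are matched without their namespace so that different PAGE schema versions are accepted:

```python
import math
import xml.etree.ElementTree as ET


def local_name(tag: str) -> str:
    """Strip the "{namespace}" prefix that ElementTree puts in front of tag names."""
    return tag.rsplit("}", 1)[-1]


def skewed_lines(xml_text: str, threshold: float = 10.0) -> list:
    """Return (line id, angle in degrees) for TextLines whose baseline skew exceeds threshold."""
    root = ET.fromstring(xml_text)
    result = []
    for line in root.iter():
        if local_name(line.tag) != "TextLine":
            continue
        baseline = next((c for c in line if local_name(c.tag) == "Baseline"), None)
        if baseline is None:
            continue  # text lines without a baseline are skipped here
        pts = [tuple(int(v) for v in p.split(","))
               for p in baseline.get("points", "").split()]
        if len(pts) < 2:
            continue
        (x1, y1), (x2, y2) = pts[0], pts[-1]
        # Image coordinates: y grows downwards.
        angle = math.degrees(math.atan2(y2 - y1, x2 - x1))
        if abs(angle) > threshold:
            result.append((line.get("id"), angle))
    return result
```

For a file on disk, the text could be read with `open(path, encoding="utf-8").read()`; if the file starts with an XML declaration that names an encoding, `ET.parse(path).getroot()` would be the safer choice.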
@JKamlah, the latest update now has better baselines and bounding coordinates, but still no indicator whether some text is written vertically or otherwise rotated. I am not sure whether PAGE XML has a special indicator for rotated text or whether it only relies on the baseline information. Here is an example of a baseline for vertical text: <Baseline points="2761,4416 2764,4078 2761,3869"/>. Maybe that baseline could be simplified by removing the 2nd point.
What should we do with textlines without a baseline? Such textlines exist in the latest PAGE XML.
@stweil there are two attributes for the TextRegionType in the PAGE XML format: orientation and readingOrientation. I will see if it is possible to add some information about the text rotation in Transkribus. At least for vertically stacked text the baseline points will not be sufficient without further information.
There should be baseline information for every textline. We will fix this immediately.
The baselines should now all be in place: 51fd52e51e4eca65c7ed32114aecdec3f3c1fb1c
A certain number of pages contains text written vertically. This is typically used in head rows of tables.
Example page: ONB_ibn_19110701_010.tif
The corresponding line boxes are not rotated, so the line images also contain text written vertically. They are not suitable for training or evaluation. In addition, the sample line image contains two text lines.