UB-Mannheim / AustrianNewspapers

NewsEye / READ OCR training dataset from Austrian Newspapers (1864–1911)

Improve line images #38

Open wollmers opened 3 years ago

wollmers commented 3 years ago

As already mentioned in issues #29, #28, #3 and #2, there are problems with the line images: they don't contain 1:1 the text of the corresponding *.gt.txt files.

The individual problems are detailed in the issues referenced above. In short: line segmentation should be improved.

To do this automatically, the usual methods of a "best practice" OCR workflow should be used, without repeating the manual step of segmenting pages into regions. The regions in the Page-XML seem OK.

Rotate by multiples of 90 degrees

Judging from the Baseline tags in the Page-XML, there are 225 lines rotated within +/- 10 degrees around 90, 180 or 270 degrees, most of them within +/- 2 degrees. Rotation by exactly 90, 180 or 270 degrees would be lossless.
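A minimal sketch of such a lossless rotation with Pillow, assuming the baseline angle has already been derived from the Baseline points (the helper name snap_to_right_angle is made up for illustration, and the rotation direction may need adjusting to the sign convention of the angles):

```python
from PIL import Image

# Pillow >= 9.1; older versions use Image.ROTATE_90 etc. instead
TRANSPOSE = {
    90:  Image.Transpose.ROTATE_90,
    180: Image.Transpose.ROTATE_180,
    270: Image.Transpose.ROTATE_270,
}

def snap_to_right_angle(angle_deg, tolerance=10.0):
    """Nearest multiple of 90 degrees if within tolerance, else None."""
    for target in (90, 180, 270):
        if abs(angle_deg - target) <= tolerance:
            return target
    return None

def rotate_lossless(line_img, baseline_angle_deg):
    target = snap_to_right_angle(baseline_angle_deg)
    if target is None:
        return line_img          # leave small skews to the deskew step
    # transpose() only reorders pixels, so no interpolation losses occur
    return line_img.transpose(TRANSPOSE[target])
```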

Deskew within 10 degrees

10,603 of ~57,000 images have a skew between 0.5 and 10 degrees. The majority is skewed within +/- 2 degrees. It would be better to express the lower threshold in pixels (0.5 or 1.0), because it makes no sense to rotate when the difference at the left or right end of a line is less than 1 pixel. This would maybe need tests on a larger number of lines, because good image manipulation programs like ImageMagick work internally at sub-pixel level. Tesseract also has better accuracy with exactly deskewed lines.
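A sketch of how such a pixel threshold could be applied, assuming the Baseline points are already parsed from the Page-XML (the sign convention of the angle would have to be checked against the actual data):

```python
import math
from PIL import Image

def skew_from_baseline(points):
    """Skew angle in degrees between the first and last Baseline point."""
    (x0, y0), (x1, y1) = points[0], points[-1]
    return math.degrees(math.atan2(y1 - y0, x1 - x0))

def deskew_line(line_img, baseline_points, min_offset_px=1.0, max_angle=10.0):
    angle = skew_from_baseline(baseline_points)
    width = abs(baseline_points[-1][0] - baseline_points[0][0])
    # vertical displacement the skew causes at the end of the line
    offset_px = width * math.tan(math.radians(abs(angle)))
    if offset_px < min_offset_px or abs(angle) > max_angle:
        return line_img
    # bicubic resampling works at sub-pixel level; fill new corners with white
    return line_img.rotate(angle, resample=Image.BICUBIC,
                           expand=True, fillcolor="white")
```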

Baseline in XML isn't always reliable

Looking at some samples between 2 and 10 degrees, they seem to be short sequences with characters at different vertical positions, which may themselves also be skewed within 2 degrees. Thus it would be better to estimate the average skew of each TextRegion first and take this into account, as sketched below.
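A possible way to get a more robust per-region estimate, reusing skew_from_baseline from the sketch above (assumption: the median of the line baselines is a reasonable proxy for the region skew):

```python
import statistics

def region_skew(baselines):
    """Median baseline angle of all TextLines in a TextRegion; robust
    against a few unreliable Baseline entries."""
    return statistics.median(skew_from_baseline(points) for points in baselines)
```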

Deskew between 10 and 80 degrees

There are ~50 images in this range. Some are diagonal labels in tables. Some are more like illustrations that are part of advertisements. This sort of advertisement is a special problem without an easy solution (it deserves an issue of its own).

Segment regions into lines

Maybe it's a better approach to create region images of pure text regions first and use ocropy for segmentation and dewarping. The advantages are that ocropy uses masks, removes speckles outside the mask, and adds white pixels at the borders. The disadvantages are binary images (no color), loss of position information, and that ocropus-dewarp scales the images down (a closer look into the source code would be needed to maybe find a better solution). Also, ocropy does not ignore some noise at the beginning and end of a line. This can be cut away using the GT texts.
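A rough sketch of the masking idea, without going through ocropy's binarisation, assuming the region polygon is already parsed from the Page-XML:

```python
from PIL import Image, ImageDraw, ImageOps

def crop_region(page_img, polygon, pad=10):
    """Cut a TextRegion out of the page image, blank everything outside
    the region polygon and add a white border (sketch)."""
    xs = [x for x, _ in polygon]
    ys = [y for _, y in polygon]
    box = (min(xs), min(ys), max(xs), max(ys))
    region = page_img.crop(box)

    # mask: white inside the polygon, black outside
    mask = Image.new("L", region.size, 0)
    shifted = [(x - box[0], y - box[1]) for x, y in polygon]
    ImageDraw.Draw(mask).polygon(shifted, fill=255)

    # paste the region onto a white canvas through the mask,
    # so speckles outside the polygon disappear
    canvas = Image.new(region.mode, region.size, "white")
    canvas.paste(region, (0, 0), mask)

    # white border around the region, since some margin helps segmentation
    return ImageOps.expand(canvas, border=pad, fill="white")
```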

BTW, adding white pixels around the borders of the line images in GT4Hist would maybe be an improvement, too.

Each of these steps improves recognition accuracy on degraded images by a few percent. Should GT images for training be "too good"? IMHO it's not cheating as long as it's done with the available tools of a modern OCR workflow. The remaining image quality will be noisy enough.

Text of large height (e.g. titles)

They are all truncated at the top. The TextRegion seems OK, but the TextLine polygon has a wrong height. If such a region contains only one line, the region image can be cut out instead and segmented.
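A sketch of finding those candidates in the Page-XML with the standard library (the PAGE namespace URL depends on the schema version used in this dataset):

```python
import xml.etree.ElementTree as ET

# adjust to the PAGE-XML version actually used in the dataset
NS = {"pc": "http://schema.primaresearch.org/PAGE/gts/pagecontent/2013-07-15"}

def regions_with_single_line(page_xml_path):
    """Yield (region_polygon, line_id) for TextRegions containing exactly
    one TextLine, so the region polygon can replace the truncated line
    polygon (sketch)."""
    root = ET.parse(page_xml_path).getroot()
    for region in root.iter("{%s}TextRegion" % NS["pc"]):
        lines = region.findall("pc:TextLine", NS)
        if len(lines) != 1:
            continue
        points = region.find("pc:Coords", NS).get("points")
        polygon = [tuple(map(int, p.split(","))) for p in points.split()]
        yield polygon, lines[0].get("id")
```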

Update information in Page-XML

In the case of skew and warp it makes no sense to update the polygons, and it isn't easy either. For large titles it does make sense.

Also, fontFamily and fontSize in

<TextStyle fontFamily="Times New Roman" fontSize="4.5"/>

could be updated with a better guess. fontSize (which should be in points according to the Page-XML specification) needs reliable information about the dpi used during scanning or about the original paper format. For fontFamily it's questionable. It could be classified as "'Times New Roman', serif" by default and changed to sans-serif, Fraktur, Textura or Gothic where that can be classified automatically.
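For illustration, the conversion from a measured glyph height in pixels to points only needs the scanning dpi (the numbers here are made up):

```python
def font_size_pt(glyph_height_px, dpi):
    """Pixels to points: 1 pt = 1/72 inch."""
    return glyph_height_px / dpi * 72.0

print(font_size_pt(30, 300))   # a 30 px tall glyph at 300 dpi is 7.2 pt
```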

wollmers commented 3 years ago

Just a note:

The dpi of the page images does not correspond to the original paper size but to ~A4. This is only correct for the "Kronenzeitung"; AZ and NFP have 270 x 425 mm. Taking this into account, the fontSize values are plausible where they are defined. But that's only important for training font identification or if someone wants to display the text on a web page at a similar size.
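As a hypothetical example of that correction, the effective dpi can be derived from the real paper width instead of the assumed A4 width (the pixel width here is made up):

```python
def effective_dpi(pixel_width, paper_width_mm):
    """dpi implied by the actual paper width rather than the assumed A4 width."""
    return pixel_width / (paper_width_mm / 25.4)

# a 3200 px wide scan of a 270 mm wide NFP page -> ~301 dpi
print(round(effective_dpi(3200, 270)))
```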