wollmers opened 3 years ago
Just a note:
The `dpi` of the page images does not correspond to the original paper size but to ~A4. This is only correct for the "Kronenzeitung"; AZ and NFP have pages of 270 x 425 mm. Taking this into account, the values for `fontSize` are plausible where defined. But that is only important for training font identification, or if someone wants to display the text on a web page at a similar size.
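The effective dpi can be estimated from the known paper size. A minimal sketch; the pixel dimensions in the example are made up for illustration, not values from the dataset:

```python
MM_PER_INCH = 25.4

def effective_dpi(width_px, height_px, width_mm, height_mm):
    """Estimate the scan resolution from the image size in pixels
    and the known physical paper size in millimetres."""
    return (width_px / (width_mm / MM_PER_INCH),
            height_px / (height_mm / MM_PER_INCH))

# AZ and NFP pages are 270 x 425 mm; the pixel size here is a
# hypothetical example that would correspond to a ~300 dpi scan.
dpi_x, dpi_y = effective_dpi(3189, 5020, 270, 425)
```

If the two estimates disagree noticeably, either the paper size assumption or the scan's aspect ratio is off.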
As already mentioned in issues #29, #28, #3 and #2, there are problems with the line images: they don't contain 1:1 the text of the corresponding `*.gt.txt` files. In short, line segmentation should be improved.
To do this automatically, the usual methods of a best-practice OCR workflow should be used, without repeating the manual step of segmenting pages into regions. The regions in the Page-XML seem OK.
Rotate by multiples of 90 degrees
Using the `Baseline` tag of the Page-XML, there are 225 lines rotated within +/- 10 degrees around 90, 180 or 270 degrees, most of them within +/- 2 degrees. Rotation by exactly 90, 180 or 270 degrees would be lossless.

Deskew within 10 degrees
10,603 of ~57,000 images have a skew between 0.5 and 10 degrees. The majority are skewed within +/- 2 degrees. It would be better to express the lower threshold in pixels (0.5 or 1.0) rather than degrees, because it makes no sense to rotate when the resulting difference at the left or right end of a line is less than one pixel. This would maybe need tests on a larger number of lines, because good image-manipulation programs like ImageMagick work internally at subpixel level. Tesseract also achieves better accuracy on exactly deskewed lines.
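A sketch of the two ideas above (lossless snapping to multiples of 90 degrees, and a pixel rather than degree threshold). It assumes the baseline is available as a list of (x, y) points taken from the `Baseline` tag; the 2-degree snap tolerance is my own choice, and the sign of the angle depends on the y-axis convention of the image library used:

```python
import math

def deskew_angle(baseline, min_edge_px=1.0):
    """Return the rotation angle in degrees suggested by a baseline,
    or 0.0 if the correction would move the line ends by less than
    `min_edge_px` pixels (sub-pixel rotations are not worth doing)."""
    (x0, y0), (x1, y1) = baseline[0], baseline[-1]
    dx, dy = x1 - x0, y1 - y0
    angle = math.degrees(math.atan2(dy, dx))
    # Snap angles close to a multiple of 90 degrees:
    # that rotation is lossless.
    nearest = 90 * round(angle / 90)
    if abs(angle - nearest) <= 2.0 and nearest % 360 != 0:
        return nearest
    # Skip the deskew entirely when the vertical displacement at the
    # line ends stays below the pixel threshold.
    if abs(dy) < min_edge_px:
        return 0.0
    return angle
```

For a real pipeline the angle of short or unreliable baselines would be replaced by the average skew of the surrounding region, as suggested below.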
The `Baseline` in the XML isn't always reliable. Looking at some samples between 2 and 10 degrees, they seem to be short sequences of characters at different vertical positions, which may themselves also be skewed within 2 degrees. Thus it would be better to estimate the average skew of each `TextRegion` first and take this into consideration.

Deskew between 10 and 80 degrees
There are ~50 images in this range. Some are diagonal labels in tables. Some are more like illustrations that are part of advertisements. This sort of advertisement is a special problem without an easy solution (it's an extra issue).
Segment regions into lines
Maybe it's a better approach to create region images of the pure text regions first and use ocropy for segmentation and dewarping. The advantages are that ocropy uses masks, removes speckles outside the mask, and adds white pixels at the borders. The disadvantages are binary images (no colour), loss of position information, and that `ocropus-dewarp` scales the images down (this would need a closer look into the source code to maybe find a better solution). Also, ocropy does not ignore some noise at the beginning and end of a line; this can be cut away using the GT texts.

BTW, adding white pixels around the borders of the line images in `GT4Hist` would maybe be an improvement too.

Each of these steps improves recognition accuracy on degraded images by a few percent. Should GT images for training be "too good"? IMHO it's not cheating as long as it's done with the available tools of a modern OCR workflow. The remaining image quality will be noisy enough.
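The white-border idea is cheap to try. A toy sketch without imaging libraries, operating on a bitmap given as nested lists of pixel values; the border width of 4 px and the `white=255` value are arbitrary assumptions:

```python
def pad_bitmap(rows, pad=4, white=255):
    """Add a white border of `pad` pixels around a bitmap given as a
    list of rows of pixel values. Returns a new, larger bitmap."""
    width = len(rows[0])
    blank = [white] * (width + 2 * pad)
    body = [[white] * pad + list(r) + [white] * pad for r in rows]
    return ([blank[:] for _ in range(pad)]
            + body
            + [blank[:] for _ in range(pad)])
```

In practice the same effect is one call in an imaging library (e.g. Pillow's `ImageOps.expand` with a white fill); the point is only that the glyphs no longer touch the image edge.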
Text of large height (e.g. titles)
They are all truncated at the top. The `TextRegion` seems OK, but the polygon of the `TextLine` has a wrong height. If the region contains only one line, this allows cutting out a region image instead and segmenting that.

Update information in Page-XML
For skewed and dewarped lines it makes no sense, and isn't easy, to update the polygons. For the large titles it does make sense.
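For the single-line title regions, one rough fix is to copy the region's polygon onto its line. A sketch with the standard library; the 2013-07-15 PAGE namespace is an assumption about which schema version the files use, and copying the region `Coords` verbatim is only an approximation of the true line outline:

```python
import xml.etree.ElementTree as ET

NS = {"pc": "http://schema.primaresearch.org/PAGE/gts/pagecontent/2013-07-15"}

def fix_single_line_regions(root):
    """For every TextRegion with exactly one TextLine, replace the
    line's Coords points with the region's, so that the full title
    height survives the line-image cut-out. Returns the fix count."""
    fixed = 0
    for region in root.iterfind(".//pc:TextRegion", NS):
        lines = region.findall("pc:TextLine", NS)
        if len(lines) != 1:
            continue
        region_coords = region.find("pc:Coords", NS)
        line_coords = lines[0].find("pc:Coords", NS)
        if region_coords is not None and line_coords is not None:
            line_coords.set("points", region_coords.get("points"))
            fixed += 1
    return fixed
```

This deliberately leaves multi-line regions alone, where the region polygon says nothing about individual line heights.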
Also, `fontFamily` and `fontSize` could be updated with a better guess. `fontSize` (which should be in points according to the Page-XML specification) needs reliable information about the dpi used during scanning, or the original paper format. For `fontFamily` it's questionable. It could be classified as `"'Times New Roman', serif"` by default and changed to `sans-serif`, Fraktur, Textura or Gothic where automatically classified as such.
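Once the dpi is known, converting a measured glyph height in pixels to points is simple (1 pt = 1/72 inch). A sketch; note that a nominal font size covers ascenders and descenders, so the measured cap height underestimates it and a correction factor would be needed in practice:

```python
def font_size_pt(glyph_height_px, dpi):
    """Convert a measured glyph height in pixels to points
    (1 point = 1/72 inch)."""
    return glyph_height_px * 72.0 / dpi

# Hypothetical example: a capital letter 50 px high in a 300 dpi scan
# has a cap height of 12 pt; the nominal font size would be larger.
size = font_size_pt(50, 300)
```

The same function applied with a wrong dpi assumption (A4 instead of 270 x 425 mm) is exactly what makes the current `fontSize` values implausible for AZ and NFP.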