OCR-D / format-converters

Converters for various file formats used for representing OCR
Apache License 2.0

Extracted line images with wrong vertical offset #16

Closed: stweil closed this issue 4 years ago

stweil commented 4 years ago

Here is an example of line image and matching text, both extracted with page2img.py:

Donnerstag und Samstag wird das Blatt künftig

[Image: sample line image]

Obviously there is a vertical offset: the text belongs to the next line, so the wrong image was extracted. All other line images show a similar vertical offset. The PAGE XML file was created by Transkribus, and it contains data which might be the cause for that:

[...]
<Page imageFilename="0111_nzz_18901222_0_0_a1_p1_1.tif" imageWidth="3839" imageHeight="5551">
    <PrintSpace>
        <Coords points="4,-27 3842,-27 3842,5524 4,5524"/>
    </PrintSpace>
    [...]
    <TextRegion type="paragraph" id="r_5_3" custom="readingOrder {index:12;}">
        <Coords points="117,1676 975,1676 975,1941 117,1941"/>
        [...]
        <TextLine id="tl_32" primaryLanguage="German" custom="readingOrder {index:1;}">
            <Coords points="122,1689 976,1689 976,1741 122,1741"/>
            <Baseline points="121,1762 976,1764"/>
            [...]

The PrintSpace tag is not handled by page2img.py, nor is it handled in ocrd_segment.

ABBYY produced this PAGE XML which contains good coordinates for the text line:

[...]
<Page imageFilename="1200024.tif" imageWidth="3839" imageHeight="5551">
    <PrintSpace>
        <Coords points="0,0 3838,0 3838,5551 0,5551"/>
    </PrintSpace>
    [...]
    <TextRegion type="paragraph" id="r_5_3" custom="readingOrder {index:12;}">
        <Coords points="112,1678 970,1678 970,1943 112,1943"/>
        [...]
        <TextLine id="tl_32" primaryLanguage="German" custom="readingOrder {index:1;}">
            <Coords points="115,1723 969,1723 969,1775 115,1775"/>
            <Baseline points="115,1798 969,1798"/>
            [...]
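
To put a number on the difference, the two TextLine polygons for tl_32 quoted above can be compared directly. This is a minimal sketch in Python; the helper names parse_points and bbox are made up for this illustration and are not taken from page2img.py.

```python
def parse_points(points):
    """Turn a PAGE 'points' string such as '1,2 3,4' into (x, y) int tuples."""
    return [tuple(int(v) for v in pair.split(",")) for pair in points.split()]


def bbox(points):
    """Return (x_min, y_min, x_max, y_max) of a 'points' string."""
    xs, ys = zip(*parse_points(points))
    return min(xs), min(ys), max(xs), max(ys)


# TextLine tl_32 as exported by Transkribus and by ABBYY (values from above).
transkribus = bbox("122,1689 976,1689 976,1741 122,1741")
abbyy = bbox("115,1723 969,1723 969,1775 115,1775")

# Per-edge difference (ABBYY minus Transkribus): left, top, right, bottom.
offset = tuple(a - t for a, t in zip(abbyy, transkribus))
print(offset)  # (-7, 34, -7, 34)
```

For this line both boxes have the same size (854 x 52 pixels); they differ by 7 pixels horizontally and 34 pixels vertically.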

stweil commented 4 years ago

The PRImA page viewer complains about the negative coordinates. It also shows that vertical offset, so for the above PAGE XML and its corresponding TIFF image it displays texts which do not match the line under the mouse pointer. ocr-validate also reports an error:

$ ocr-validate page-2013-07-15 *1890*xml
mXSDFilename: /home/stweil/src/github/OCR-D/venv-20200408/share/ocr-fileformat/xsd/page-2013-07-15.xsd
mXMLFilename: /home/stweil/src/github/impresso/NZZ-black-letter-ground-truth/xml/NZZ_groundtruth/nzz_18901222_0_0_a1_p1_1.xml
/home/stweil/src/github/impresso/NZZ-black-letter-ground-truth/xml/NZZ_groundtruth/nzz_18901222_0_0_a1_p1_1.xml fails to validate because: 

cvc-pattern-valid: Value '4,-27 3842,-27 3842,5524 4,5524' is not facet-valid with respect to pattern '([0-9]+,[0-9]+ )+([0-9]+,[0-9]+)' for type 'PointsType'.
At: 16:63
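
The same check can be reproduced outside the validator. The sketch below, in Python, applies the pattern quoted in the error message to the two PrintSpace values from this issue; the function name is_valid_points is hypothetical. XSD patterns must match the whole value, which is why fullmatch is used.

```python
import re

# Pattern quoted in the ocr-validate error message for type 'PointsType'.
POINTS_PATTERN = re.compile(r"([0-9]+,[0-9]+ )+([0-9]+,[0-9]+)")


def is_valid_points(points):
    """True if the whole string matches the PointsType pattern."""
    return POINTS_PATTERN.fullmatch(points) is not None


print(is_valid_points("4,-27 3842,-27 3842,5524 4,5524"))  # False (Transkribus)
print(is_valid_points("0,0 3838,0 3838,5551 0,5551"))      # True (ABBYY)
```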

stweil commented 4 years ago

See new issue https://github.com/Transkribus/TranskribusCore/issues/45.

stweil commented 4 years ago

A closer look at nzz_18901222_0_0_a1_p1_1.xml with the PRImA page viewer shows that only some text regions with their text lines are affected by a vertical shift.

bertsky commented 4 years ago

The PAGE XML file was created by Transkribus, and it contains data which might be the cause for that:

    <PrintSpace>
        <Coords points="4,-27 3842,-27 3842,5524 4,5524"/>
    </PrintSpace>

This is invalid by any interpretation: PAGE-XML syntax forbids negative coordinates. This must be fixed in Transkribus.

The PrintSpace tag is not handled by page2img.py, nor is it handled in ocrd_segment.

There's no need to act on PrintSpace in any way for an image extractor. All PAGE-XML coordinates are absolute (i.e. they refer to imageFilename). Even on the page level, the only relevant element for cropping a bbox rectangle is Border.

In summary, I don't think this is a bug in either page2img or ocrd-segment-extract-*.

stweil commented 4 years ago

Thank you. That confirms my latest impression. The Transkribus PAGE for Neue Zürcher Zeitung is, at least partially, a complete mess: word boxes lie outside of the corresponding lines, and line boxes outside of regions. I see no chance to fix that programmatically and will now try to use the original coordinates which were generated by ABBYY FineReader.
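
A containment check of this kind can be sketched in Python with the coordinates quoted earlier in this issue. It assumes that tl_32 is a child of r_5_3, as the indentation of the excerpts suggests; the helpers bbox and contains are made up for this illustration. Comparing bounding boxes is exact here because all four polygons are rectangles; for arbitrary polygons it would only be an approximation.

```python
def bbox(points):
    """Return (x_min, y_min, x_max, y_max) of a PAGE 'points' string."""
    pairs = [tuple(int(v) for v in p.split(",")) for p in points.split()]
    xs, ys = zip(*pairs)
    return min(xs), min(ys), max(xs), max(ys)


def contains(outer, inner):
    """True if bounding box 'inner' lies entirely within 'outer'."""
    return (outer[0] <= inner[0] and outer[1] <= inner[1]
            and inner[2] <= outer[2] and inner[3] <= outer[3])


# Region r_5_3 and line tl_32 from the Transkribus file.
t_region = bbox("117,1676 975,1676 975,1941 117,1941")
t_line = bbox("122,1689 976,1689 976,1741 122,1741")
# The same ids in the ABBYY file.
a_region = bbox("112,1678 970,1678 970,1943 112,1943")
a_line = bbox("115,1723 969,1723 969,1775 115,1775")

print(contains(t_region, t_line))  # False: line x_max 976 exceeds region x_max 975
print(contains(a_region, a_line))  # True
```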

stweil commented 4 years ago

Closing this issue. I created https://github.com/Transkribus/TranskribusCore/issues/46 to address those errors.

bertsky commented 4 years ago

and will now try to use the original coordinates which were generated by ABBYY FineReader.

IIRC @wrznr also uses a pipeline to convert ABBYY output in ALTO format to PAGE (reducing bbox overlap via clipping and resegmentation) but recently discovered a bug introduced by a deskewing offset?

simon-clematide commented 4 years ago

We also noticed negative offsets in PAGE XML exports from Transkribus (one can just set them to 0). If I remember correctly, we sometimes had problems running HTR (after running ABBYY for layout recognition) on pages where line regions existed at the border of the page (presumably with negative coordinates).

stweil commented 4 years ago

Thanks for your report. Setting the negative values for PrintSpace to zero does indeed fix the invalid XML, so the data can be loaded in the viewer after that fix. It does not cure the wrong word and line boxes.
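
The workaround of setting negative values to zero can be sketched in Python as follows. The function name clamp_points is hypothetical, and the sketch works on a single points string rather than on a whole PAGE XML file.

```python
import re


def clamp_points(points):
    """Replace every negative coordinate in a PAGE 'points' string with 0."""
    return re.sub(r"-[0-9]+", "0", points)


print(clamp_points("4,-27 3842,-27 3842,5524 4,5524"))
# 4,0 3842,0 3842,5524 4,5524
```

This only touches the lower bound: the x value 3842 still exceeds the imageWidth of 3839 given in the Page element, and the pattern check does not catch that.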

bertsky commented 4 years ago

It does not cure the wrong word and line boxes.

Then the problem runs deeper. (There is at least one plausible and harmless reason for negative coordinates, and that's segmenting in a cropped and deskewed image, then converting back to absolute coordinates. The rotation will enlarge the image, introducing an offset, which has to be subtracted when converting the coordinates. But if the segments themselves have an apparent offset after conversion, then there's another problem.)
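
The harmless case can be illustrated with a small Python sketch. It assumes a rotation about the image centre onto an enlarged canvas; the sign convention of the angle depends on the tool that did the deskewing, and the function names rotated_size and to_absolute are made up for this illustration.

```python
import math


def rotated_size(width, height, angle_deg):
    """Canvas size needed to hold the image after rotating it about its centre."""
    a = math.radians(angle_deg)
    new_w = abs(width * math.cos(a)) + abs(height * math.sin(a))
    new_h = abs(width * math.sin(a)) + abs(height * math.cos(a))
    return new_w, new_h


def to_absolute(point, width, height, angle_deg):
    """Map a point from the enlarged, rotated image back to the original image."""
    a = math.radians(angle_deg)
    new_w, new_h = rotated_size(width, height, angle_deg)
    # Move the origin to the centre of the enlarged canvas.
    dx, dy = point[0] - new_w / 2, point[1] - new_h / 2
    # Undo the rotation.
    x = math.cos(a) * dx + math.sin(a) * dy
    y = -math.sin(a) * dx + math.cos(a) * dy
    # Move the origin back to the top-left corner of the original image.
    return x + width / 2, y + height / 2


# Top-left corner of the enlarged canvas for a 3839 x 5551 page rotated by 5 degrees.
x, y = to_absolute((0, 0), 3839, 5551, 5.0)
print(x < 0 or y < 0)  # True: the point lies outside the original image
```

A segment drawn near the border of the enlarged canvas therefore ends up with a negative coordinate in the original image, even though the conversion itself is correct.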