Closed stweil closed 4 years ago
The PRImA page viewer complains about the negative coordinates. It also shows that vertical offset, so it displays texts which do not match the line under the mouse pointer for the above PAGE XML and its corresponding TIFF image. ocr-validate also reports an error:
$ ocr-validate page-2013-07-15 *1890*xml
mXSDFilename: /home/stweil/src/github/OCR-D/venv-20200408/share/ocr-fileformat/xsd/page-2013-07-15.xsd
mXMLFilename: /home/stweil/src/github/impresso/NZZ-black-letter-ground-truth/xml/NZZ_groundtruth/nzz_18901222_0_0_a1_p1_1.xml
/home/stweil/src/github/impresso/NZZ-black-letter-ground-truth/xml/NZZ_groundtruth/nzz_18901222_0_0_a1_p1_1.xml fails to validate because:
cvc-pattern-valid: Value '4,-27 3842,-27 3842,5524 4,5524' is not facet-valid with respect to pattern '([0-9]+,[0-9]+ )+([0-9]+,[0-9]+)' for type 'PointsType'.
At: 16:63
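The validation failure can be reproduced outside the validator with a minimal sketch that applies the pattern quoted in the message above to the offending string. The function name `is_valid_points` is made up for illustration; only the pattern and the sample value come from the log.

```python
import re

# Pattern copied from the cvc-pattern-valid message above (type 'PointsType').
# XSD patterns match the whole value, hence fullmatch() below.
POINTS_PATTERN = re.compile(r"([0-9]+,[0-9]+ )+([0-9]+,[0-9]+)")


def is_valid_points(points: str) -> bool:
    """Return True if the complete string matches the PointsType pattern."""
    return POINTS_PATTERN.fullmatch(points) is not None


print(is_valid_points("4,-27 3842,-27 3842,5524 4,5524"))  # False
print(is_valid_points("4,0 3842,0 3842,5524 4,5524"))      # True
```

The minus sign is not part of the `[0-9]+` character class, so any negative value makes the whole attribute invalid.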
See new issue https://github.com/Transkribus/TranskribusCore/issues/45.
A closer look at nzz_18901222_0_0_a1_p1_1.xml with the PRImA page viewer shows that only some text regions with their text lines are affected by a vertical shift.
The PAGE XML file was created by Transkribus, and it contains data which might be the cause for that:
<PrintSpace> <Coords points="4,-27 3842,-27 3842,5524 4,5524"/> </PrintSpace>
This is invalid by any interpretation: PAGE-XML syntax forbids negative coordinates. This must be fixed in Transkribus.
The PrintSpace tag is not handled by page2img.py, nor is it handled in ocrd_segment.
There's no need to act on PrintSpace in any way for an image extractor. All PAGE-XML coordinates are absolute (i.e. they refer to imageFilename). Even on the page level, the only relevant element for cropping a bbox rectangle is Border.
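Because the coordinates are absolute, a crop rectangle follows directly from a points string. The sketch below is only an illustration of that idea, not how page2img.py or ocrd_segment implement it; `parse_points` and `bounding_box` are hypothetical helper names.

```python
def parse_points(points: str) -> list[tuple[int, int]]:
    """Turn a PAGE 'points' string such as '4,0 3842,0' into (x, y) tuples."""
    return [tuple(int(v) for v in pair.split(",")) for pair in points.split()]


def bounding_box(points: str) -> tuple[int, int, int, int]:
    """Return (left, upper, right, lower) in absolute image coordinates."""
    xs, ys = zip(*parse_points(points))
    return min(xs), min(ys), max(xs), max(ys)


# The PrintSpace value from the file discussed here; the negative upper
# edge shows up directly in the resulting rectangle.
print(bounding_box("4,-27 3842,-27 3842,5524 4,5524"))  # (4, -27, 3842, 5524)
```

A tuple in this order is what an image library's crop call typically expects, for example Pillow's `Image.crop`.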
In summary, I don't think this is a bug in either page2img or ocrd-segment-extract-*.
Thank you. That confirms my latest impression. The Transkribus PAGE for Neue Zürcher Zeitung is, at least in parts, a complete mess: word boxes lie outside of the corresponding lines, and line boxes lie outside of regions. I see no chance to fix that programmatically and will now try to use the original coordinates which were generated by ABBYY FineReader.
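Such inconsistencies can at least be detected automatically. The following is a minimal sketch of a containment test on bounding boxes; the function name and the sample rectangles are invented for illustration and are not taken from the NZZ files.

```python
Box = tuple[int, int, int, int]  # (left, upper, right, lower)


def contains(outer: Box, inner: Box) -> bool:
    """Return True if 'inner' lies completely inside 'outer'."""
    o_left, o_upper, o_right, o_lower = outer
    i_left, i_upper, i_right, i_lower = inner
    return (o_left <= i_left and o_upper <= i_upper
            and i_right <= o_right and i_lower <= o_lower)


# Made-up boxes: the second word sits below its line box.
line_box = (100, 200, 900, 260)
print(contains(line_box, (120, 205, 300, 255)))  # True
print(contains(line_box, (120, 270, 300, 320)))  # False
```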
Closing this issue. I created https://github.com/Transkribus/TranskribusCore/issues/46 to address those errors.
and will now try to use the original coordinates which were generated by ABBYY FineReader.
IIRC @wrznr also uses a pipeline to convert ABBYY output in ALTO format to PAGE (reducing bbox overlap via clipping and resegmentation) but recently discovered a bug introduced by deskewing offset?
We also noticed negative offsets in PAGE XML exports from Transkribus (one can just set them to 0). If I remember correctly, we sometimes had problems running HTR (after running ABBYY for layout recognition) on some pages, typically where line regions existed at the border of the page (presumably with negative coordinates).
Thanks for your report. Setting the negative values for PrintSpace to zero does indeed help to fix the invalid XML, so it is possible to load the data in the viewer after that fix. It does not cure the wrong word and line boxes.
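A minimal sketch of that workaround, applied to a single points string, could look like this. `clamp_points` is a hypothetical helper; it only makes the value schema-valid and does nothing about misplaced boxes.

```python
def clamp_points(points: str) -> str:
    """Replace every negative x or y value in a PAGE points string by 0."""
    clamped = []
    for pair in points.split():
        x, y = (int(v) for v in pair.split(","))
        clamped.append(f"{max(x, 0)},{max(y, 0)}")
    return " ".join(clamped)


print(clamp_points("4,-27 3842,-27 3842,5524 4,5524"))
# 4,0 3842,0 3842,5524 4,5524
```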
It does not cure the wrong word and line boxes.
Then the problem runs deeper. (There is at least one plausible and harmless reason for negative coordinates, and that's segmenting in a cropped and deskewed image, then converting back to absolute coordinates. The rotation will enlarge the image, introducing an offset, which has to be subtracted when converting the coordinates. But if the segments themselves have an apparent offset after conversion, then there's another problem.)
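To illustrate how such a harmless negative value can arise, here is a rough sketch under stated assumptions: the image is rotated about its centre onto an enlarged canvas, and only the translation part of the back-conversion is shown (the inverse rotation itself is left out). The function name, image size and angle are made up.

```python
import math


def rotation_offset(width: float, height: float, angle_deg: float) -> tuple[float, float]:
    """Offset (dx, dy) added on each side when a width x height image is
    rotated about its centre and the canvas is enlarged to fit the result."""
    theta = math.radians(angle_deg)
    cos_t, sin_t = abs(math.cos(theta)), abs(math.sin(theta))
    new_width = width * cos_t + height * sin_t
    new_height = width * sin_t + height * cos_t
    return (new_width - width) / 2, (new_height - height) / 2


# Assumed example: a 1000 x 2000 page deskewed by 2 degrees.
dx, dy = rotation_offset(1000, 2000, 2.0)

# A segment starting 5 px below the top of the enlarged canvas ends up
# above the original image once the offset is subtracted.
y_deskewed = 5
print(y_deskewed - dy < 0)  # True
```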
Here is an example of a line image and matching text, both extracted with page2img.py:
Donnerstag und Samstag wird das Blatt künftig
Obviously there is a vertical offset: the text belongs to the next line, so the wrong image was extracted. All other line images show a similar vertical offset. The PAGE XML file was created by Transkribus, and it contains data which might be the cause for that:
The PrintSpace tag is not handled by page2img.py, nor is it handled in ocrd_segment.
ABBYY produced this PAGE XML which contains good coordinates for the text line: