Closed (andbue closed 3 years ago)
Is this line potentially dangerous?
tequivs = [te for te in textline.findall("./ns:TextEquiv", namespaces=ns) if "index" not in te.attrib]
Won't it ignore every TextEquiv that has an index attribute? GT lines can carry an index too, can't they?
The block "if not tequivs: tequivs = [te..." is a fallback for files that contain only a single TextEquiv without an index attribute (cf. https://github.com/Calamari-OCR/calamari/pull/160). When GT lines with an index are present, they are found by the earlier lookup, so tequivs is not empty and the fallback never runs.
When there is more than one TextEquiv without index present, we'll just take the first one and print a warning.
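To make the order of the lookups explicit, here is a minimal sketch of the logic described above. It is not the actual Calamari code: the function name `pick_text_equiv`, the `text_index` parameter, and the namespace URI are assumptions for illustration, and it uses only the standard library's `xml.etree.ElementTree`.

```python
import warnings
import xml.etree.ElementTree as ET

# Assumed PAGE namespace; real files may use a different schema version.
NS = {"ns": "http://schema.primaresearch.org/PAGE/gts/pagecontent/2019-07-15"}


def pick_text_equiv(textline, text_index=0):
    """Return the text of the TextEquiv to use for one TextLine, or None.

    Hypothetical helper: looks for an indexed TextEquiv first and only
    falls back to TextEquivs without an index if nothing was found.
    """
    # 1) Preferred lookup: TextEquiv elements with the requested index.
    tequivs = textline.findall(
        f"./ns:TextEquiv[@index='{text_index}']", namespaces=NS
    )

    # 2) Fallback: only reached when step 1 found nothing.
    if not tequivs:
        tequivs = [
            te
            for te in textline.findall("./ns:TextEquiv", namespaces=NS)
            if "index" not in te.attrib
        ]

    if not tequivs:
        return None

    # 3) More than one candidate: take the first and warn.
    if len(tequivs) > 1:
        warnings.warn("Multiple matching TextEquiv elements; using the first one.")

    unicode_el = tequivs[0].find("./ns:Unicode", namespaces=NS)
    return unicode_el.text if unicode_el is not None else None
```

With this ordering, a line that has both an indexed GT TextEquiv and an unindexed one returns the indexed text, and the unindexed one is only considered when no TextEquiv with the requested index exists.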
Ah, I understand now!
I accidentally reverted my own fix that prevented the PAGE reader from including existing OCR text in the GT...