Calamari-OCR / calamari

Line based ATR Engine based on OCRopy
Apache License 2.0

Prevent PageXML training on previous OCR results #265

Closed by andbue 3 years ago

andbue commented 3 years ago

I accidentally reverted my own fix that prevented the PAGE reader from including existing OCR text in the GT...

ChWick commented 3 years ago

Is this line potentially dangerous?

tequivs = [te for te in textline.findall("./ns:TextEquiv", namespaces=ns) if "index" not in te.attrib]

Will it ignore every TextEquiv that has an index attribute? That could also be the case for GT lines, couldn't it?

andbue commented 3 years ago

The block "if not tequivs: tequivs = [te..." is meant for files that contain only one TextEquiv without an index attribute (cf. https://github.com/Calamari-OCR/calamari/pull/160). When there are GT lines with an index present, they will be found earlier, so tequivs will not be empty.

When there is more than one TextEquiv without index present, we'll just take the first one and print a warning.
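
The selection order described above can be sketched roughly as follows. This is a minimal illustration, not the actual Calamari reader: the function name `select_text_equiv`, the `text_index` parameter, and the namespace URL are assumptions made for the example. Only the fallback list comprehension is taken from the line quoted in this thread.

```python
import warnings
import xml.etree.ElementTree as ET

# Example namespace only (assumption); real PAGE files declare their own schema version.
NS = {"ns": "http://schema.primaresearch.org/PAGE/gts/pagecontent/2019-07-15"}


def select_text_equiv(textline, text_index=0, ns=NS):
    """Hypothetical helper: pick the TextEquiv to use as GT, or None."""
    # 1) Prefer TextEquiv elements whose index matches the requested one.
    tequivs = textline.findall(
        f"./ns:TextEquiv[@index='{text_index}']", namespaces=ns
    )
    # 2) Fallback for files with a TextEquiv that has no index attribute.
    #    TextEquivs with a different index (e.g. earlier OCR output) are skipped.
    if not tequivs:
        tequivs = [
            te
            for te in textline.findall("./ns:TextEquiv", namespaces=ns)
            if "index" not in te.attrib
        ]
    # 3) Several candidates: warn and take the first one.
    if len(tequivs) > 1:
        warnings.warn("TextLine has more than one matching TextEquiv; using the first")
    return tequivs[0] if tequivs else None
```

With this ordering, a line whose only TextEquiv has a non-matching index yields no GT at all, which is the point of the fix: earlier OCR results are not picked up as training text.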

ChWick commented 3 years ago

Ah, I understand now!