line files not following the XML id pattern tl_\d+ are missing

UB-Mannheim / AustrianNewspapers

NewsEye / READ OCR training dataset from Austrian Newspapers (1864–1911)


line files not following the XML id pattern tl_\d+ are missing #8

Open wollmers opened 4 years ago

wollmers commented 4 years ago

E.g. the training set file ONB_aze_18950706_1.xml contains

<TextLine id="line_1545028417729_5" custom="readingOrder {index:1;}">

but there are only line files following the id pattern tl_\d+.

Either we rename the line ids in the XML or use the existing ids from the XML for the line files.
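To see how many lines are affected in a given file, a minimal sketch along these lines could list the TextLine ids that do not match tl_\d+. The function name, the sample document, and the namespace URI in the sample are assumptions for illustration only; the check itself ignores the namespace and is not taken from the repository's scripts.

```python
import re
import xml.etree.ElementTree as ET

# Pattern that the existing line files follow, as stated in this issue.
LINE_ID_PATTERN = re.compile(r"tl_\d+")


def nonconforming_line_ids(xml_text):
    """Return the ids of all TextLine elements that do not match tl_\\d+.

    Hypothetical helper: it matches on the local tag name, so it works
    with or without an XML namespace on the root element.
    """
    root = ET.fromstring(xml_text)
    ids = []
    for elem in root.iter():
        # ElementTree spells namespaced tags as "{uri}TextLine".
        local_name = elem.tag.rsplit("}", 1)[-1]
        if local_name == "TextLine":
            line_id = elem.get("id", "")
            if not LINE_ID_PATTERN.fullmatch(line_id):
                ids.append(line_id)
    return ids


# Made-up sample; the namespace URI is an assumption about the files.
SAMPLE = """<PcGts xmlns="http://schema.primaresearch.org/PAGE/gts/pagecontent/2013-07-15">
  <Page>
    <TextRegion id="r1">
      <TextLine id="tl_1" custom="readingOrder {index:0;}"/>
      <TextLine id="line_1545028417729_5" custom="readingOrder {index:1;}"/>
    </TextRegion>
  </Page>
</PcGts>"""

print(nonconforming_line_ids(SAMPLE))  # ['line_1545028417729_5']
```

Running this over every XML file and comparing the result with the existing line files would show which ids need to be renamed or exported.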

stweil commented 4 years ago

11726 lines of that kind were missing. I had removed them because I (wrongly) thought that they were not relevant. That is fixed now by the commits fda352d9142b714f3d689bcb2e7f3b2e167b4c7f and 6e49c1baca4f0694b84c6e739487ced15ea6f763.

The new GT files still need all fixes which were applied to the other GT files.

wollmers commented 4 years ago

I should finish and test my script for updating the XML files. As a consequence, the XML will be reformatted, which will bloat the diffs because most lines will change.