LanguageMachines / foliautils

Command-line utilities for working with the Format for Linguistic Annotation (FoLiA), powered by libfolia (C++), written by Ko van der Sloot (CLST, Radboud University)
https://proycon.github.io/folia
GNU General Public License v3.0

any chance that the rows and cells in the Abbyy file would be kept? #64

Open kosloot opened 3 years ago

kosloot commented 3 years ago

Just a quick question: is there any chance that the rows and cells in the Abbyy file would be kept by the converter?

Originally posted by @pirolen in https://github.com/LanguageMachines/foliautils/issues/62#issuecomment-896096217

kosloot commented 3 years ago

I would have to look into that. FoLiA has constructions for <row> and <cell>, so it would be doable. (NB: your example file has rows with only one cell, which is quite odd.)
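
To make the idea concrete, here is a minimal sketch of how table rows and cells in an Abbyy export might be carried over into FoLiA-style <table>, <row> and <cell> elements. It is not how foliautils does the conversion. The sample input is a hypothetical, heavily simplified Abbyy-like fragment (no namespace, no coordinates, no per-character elements), and the function names are made up for illustration. The output also omits the FoLiA namespace, xml:id attributes and annotation declarations, so it is not a valid FoLiA document, only the table skeleton.

```python
import xml.etree.ElementTree as ET

# Hypothetical, simplified Abbyy-style input. Real exports carry a namespace,
# layout coordinates and usually one <charParams> element per character.
ABBYY_SAMPLE = """
<block blockType="Table">
  <row>
    <cell><text><par><line><formatting>Name</formatting></line></par></text></cell>
    <cell><text><par><line><formatting>Town</formatting></line></par></text></cell>
  </row>
  <row>
    <cell><text><par>
      <line><formatting>Entry, first line</formatting></line>
      <line><formatting>continued line</formatting></line>
    </par></text></cell>
  </row>
</block>
"""


def local(tag):
    """Drop a namespace prefix, e.g. '{uri}row' becomes 'row'."""
    return tag.rsplit("}", 1)[-1]


def children(elem, name):
    """Direct children of elem with the given local name."""
    return [c for c in elem if local(c.tag) == name]


def descendants(elem, name):
    """All elements below elem (any depth) with the given local name."""
    return [d for d in elem.iter() if local(d.tag) == name]


def line_text(line):
    """Concatenate all text inside one Abbyy <line>."""
    return "".join(line.itertext()).strip()


def abbyy_table_to_folia(block):
    """Build a FoLiA-like <table> that mirrors the rows and cells of block."""
    table = ET.Element("table")
    for row in children(block, "row"):
        f_row = ET.SubElement(table, "row")
        for cell in children(row, "cell"):
            f_cell = ET.SubElement(f_row, "cell")
            for par in descendants(cell, "par"):
                lines = [line_text(ln) for ln in descendants(par, "line")]
                # One <p> with a <t> text element per Abbyy paragraph;
                # the lines of a paragraph are joined with single spaces.
                p = ET.SubElement(f_cell, "p")
                t = ET.SubElement(p, "t")
                t.text = " ".join(x for x in lines if x)
    return table


folia_table = abbyy_table_to_folia(ET.fromstring(ABBYY_SAMPLE))
print(ET.tostring(folia_table, encoding="unicode"))
```

Joining the lines of a paragraph with spaces is a simplification: it ignores end-of-line hyphenation and discards the original line breaks, which a real converter would probably want to handle differently.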

pirolen commented 3 years ago

Nice, thank you very much for investigating it; I was just wondering.

The single-cell oddity arises because these tables actually hold data from registries whose entries can span several lines in a weakly structured way (e.g. using indentation levels).
The paragraphs that Abbyy recognizes do not properly capture the entry boundaries, since the entry-structuring logic of the printed pages is often complex. The 'table cells' can keep the lines together, so the table format is simply a workaround that the Abbyy OCR post-correction app allows: using the app, human correctors manually separate the entries from each other by drawing a table around them.

pirolen commented 3 years ago

Please find attached a proper table example in Abbyy XML; for the printed original, please see the PNG.

b1_3_1_mwtext_ostpreuss_pp109_277_036

b1_3_1_mwtext_ostpreuss_pp109_277_036.table.xml.txt