Open kosloot opened 3 years ago
I would have to look into that. There are FoLiA constructions for <row> and <cell>, so it would be doable.
(NB: your example file has rows with 1 cell, which is quite odd.)
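For illustration only, here is a minimal sketch of what such a mapping might look like. It is not the converter's actual code. It assumes a simplified Abbyy-like table block (the sample input, the helper names `local` and `abbyy_table_to_folia`, and joining the lines of a cell with spaces are all hypothetical choices), and the output only mimics the FoLiA table/row/cell nesting: it has no namespace, IDs, or declarations and is not validated FoLiA.

```python
import xml.etree.ElementTree as ET

# Simplified, hypothetical Abbyy-like input. Real Abbyy XML carries a
# namespace and character-level detail that this sketch ignores.
ABBYY_SAMPLE = """
<block blockType="Table">
  <row>
    <cell><text><par><line>Entry one, first line</line><line>  continued</line></par></text></cell>
  </row>
  <row>
    <cell><text><par><line>Entry two</line></par></text></cell>
  </row>
</block>
"""


def local(tag):
    """Return the tag name without any '{namespace}' prefix."""
    return tag.rsplit("}", 1)[-1]


def abbyy_table_to_folia(block):
    """Copy the row/cell nesting of a table block into table/row/cell elements."""
    table = ET.Element("table")
    for row in block:
        if local(row.tag) != "row":
            continue
        folia_row = ET.SubElement(table, "row")
        for cell in row:
            if local(cell.tag) != "cell":
                continue
            folia_cell = ET.SubElement(folia_row, "cell")
            # Collect the text of every line inside this cell, in document order.
            lines = [
                "".join(line.itertext()).strip()
                for line in cell.iter()
                if local(line.tag) == "line"
            ]
            # Assumption: all lines of a cell are joined into one text element.
            text = ET.SubElement(folia_cell, "t")
            text.text = " ".join(lines)
    return table


table = abbyy_table_to_folia(ET.fromstring(ABBYY_SAMPLE))
print(ET.tostring(table, encoding="unicode"))
```

The point of the sketch is only that each Abbyy row and cell has an obvious counterpart, so the lines inside one cell stay together in the output.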
Nice, thank you very much for investigating it; I was just wondering.
The single cell oddity is due to the fact that these tables actually hold data from registries that have entries that can span several lines in a weakly structured way (e.g. using indentation levels).
The paragraphs that Abbyy recognizes do not properly capture the entry boundaries, since the entry structuring logic of the printed pages is often complex.
The 'table cells' can keep the lines together, so the table format is simply a workaround that the Abbyy OCR postcorrection app allows: using the app, human correctors manually separate the entries from each other by drawing a table around them.
Please find attached a proper table example in Abbyy XML; for the printed original, please see the PNG.
Just a quick question: any chance that the rows and cells in the Abbyy file would be kept by the converter?
Originally posted by @pirolen in https://github.com/LanguageMachines/foliautils/issues/62#issuecomment-896096217