UB-Mannheim / ocr-fileformat

Validate and transform various OCR file formats (hOCR, ALTO, PAGE, FineReader)
https://digi.bib.uni-mannheim.de/ocr-fileformat/
MIT License
176 stars, 23 forks

Table extraction #164

Open kba opened 1 year ago

kba commented 1 year ago

From https://github.com/OCR-D/ocrd_fileformat/issues/46

@kba:

It would be very useful to have a transformation that extracts any tables from PAGE-XML to CSV.

@bertsky:

Thoughts:

  • Each TableRegion needs its own CSV, so it is not immediately clear how this fits the page→page converter paradigm (e.g. for page→text, one could simply paste the CSV into the middle of the plaintext, but creating multiple output files may usually be better).
  • CSV may already be too coarse (no cells spanning multiple rows or columns, no header distinction).
  • Perhaps this issue is better transferred to the ocr-fileformat subrepo?
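
To make the trade-offs above concrete, here is a minimal Python sketch of the "one CSV per TableRegion" idea. It is not part of ocr-fileformat or ocrd_fileformat. It assumes that table cells are encoded as `TextRegion` children of a `TableRegion`, each carrying a `Roles/TableCellRole` element with `rowIndex` and `columnIndex` attributes, and that the cell text sits in the region-level `TextEquiv/Unicode`. The function name `tables_to_csv` and the helper names are hypothetical. Row and column spans and the `header` flag are deliberately ignored, which illustrates why CSV may be too coarse.

```python
import csv
import io
import xml.etree.ElementTree as ET


def local(tag):
    """Strip the XML namespace so the sketch is independent of the PAGE schema version."""
    return tag.rsplit("}", 1)[-1]


def children(elem, name):
    """Direct children of elem whose local tag name equals name."""
    return [child for child in elem if local(child.tag) == name]


def cell_text(region):
    """Region-level TextEquiv/Unicode text (line- and word-level text is not consulted)."""
    for equiv in children(region, "TextEquiv"):
        for unicode_elem in children(equiv, "Unicode"):
            return unicode_elem.text or ""
    return ""


def tables_to_csv(page_xml):
    """Return {TableRegion id: CSV text}, one entry per TableRegion in the document.

    Assumption: cells are direct TextRegion children with Roles/TableCellRole.
    Spans and header flags are dropped; missing cells become empty fields.
    """
    root = ET.fromstring(page_xml)
    result = {}
    for table in root.iter():
        if local(table.tag) != "TableRegion":
            continue
        cells = {}
        for region in children(table, "TextRegion"):
            for roles in children(region, "Roles"):
                for role in children(roles, "TableCellRole"):
                    position = (int(role.get("rowIndex")), int(role.get("columnIndex")))
                    cells[position] = cell_text(region)
        buffer = io.StringIO()
        if cells:
            n_rows = max(row for row, _ in cells) + 1
            n_cols = max(col for _, col in cells) + 1
            writer = csv.writer(buffer, lineterminator="\n")
            for row in range(n_rows):
                writer.writerow([cells.get((row, col), "") for col in range(n_cols)])
        result[table.get("id")] = buffer.getvalue()
    return result
```

A caller could then write each value to its own file (for example, one named after the region id), or, for a page→text conversion, splice the CSV text into the plaintext at the table's position. Which of the two is preferable is exactly the open question raised above.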