UB-Mannheim / ocr-fileformat

Validate and transform various OCR file formats (hOCR, ALTO, PAGE, FineReader)
https://digi.bib.uni-mannheim.de/ocr-fileformat/
MIT License

Support conversion from and to Textract JSON #122

Open scottschreckengaust opened 4 years ago

scottschreckengaust commented 4 years ago

Amazon Textract returns its results in a JSON format.

https://docs.aws.amazon.com/textract/latest/dg/textract-dg.pdf

Specifically, it offers three types of analysis (see https://docs.aws.amazon.com/textract/latest/dg/how-it-works-analyzing.html), covering the categories:

  1. text,
  2. forms, and
  3. tables
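
For illustration, here is a minimal sketch of how these three categories show up in a Textract response. It assumes the usual response layout, a top-level `Blocks` list whose entries carry a `BlockType` such as `LINE`, `WORD`, `KEY_VALUE_SET`, `TABLE` or `CELL`. The function name, the mapping table and the sample data are hypothetical and hand-written for this example; they are not part of ocr-fileformat or of any converter.

```python
from collections import defaultdict

# Assumed mapping from Textract BlockType values to the three categories above.
# Block types not listed here (e.g. PAGE) end up under "other".
CATEGORY_BY_BLOCK_TYPE = {
    "LINE": "text",
    "WORD": "text",
    "KEY_VALUE_SET": "forms",
    "TABLE": "tables",
    "CELL": "tables",
}


def group_blocks_by_category(response):
    """Group the blocks of a Textract-style response dict by category."""
    groups = defaultdict(list)
    for block in response.get("Blocks", []):
        category = CATEGORY_BY_BLOCK_TYPE.get(block.get("BlockType"), "other")
        groups[category].append(block)
    return dict(groups)


# Hand-written sample, not real Textract output.
sample_response = {
    "Blocks": [
        {"BlockType": "PAGE", "Id": "p1"},
        {"BlockType": "LINE", "Id": "l1", "Text": "Name: Jane"},
        {"BlockType": "WORD", "Id": "w1", "Text": "Name:"},
        {"BlockType": "WORD", "Id": "w2", "Text": "Jane"},
        {"BlockType": "KEY_VALUE_SET", "Id": "k1", "EntityTypes": ["KEY"]},
        {"BlockType": "KEY_VALUE_SET", "Id": "v1", "EntityTypes": ["VALUE"]},
        {"BlockType": "TABLE", "Id": "t1"},
        {"BlockType": "CELL", "Id": "c1", "RowIndex": 1, "ColumnIndex": 1},
    ]
}

groups = group_blocks_by_category(sample_response)
for category, blocks in sorted(groups.items()):
    print(category, len(blocks))
```

A converter would additionally need each block's `Geometry` and `Relationships` to rebuild the layout; the sketch only shows which blocks belong to which kind of analysis.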

stweil commented 1 year ago

Conversion from Textract to PAGE XML has now been added with pull request #160.

bertsky commented 1 year ago

Alas, the new converter is still incomplete, so

  • forms, and
  • tables

do not work yet. See https://github.com/slub/textract2page/issues/2

bertsky commented 1 year ago

Update: tables work now, but the converter submodule needs to be updated here.

kba commented 1 year ago

> Update: tables work now, but the converter submodule needs to be updated here

I've updated the vendor submodules, including textract2page, in https://github.com/UB-Mannheim/ocr-fileformat/pull/166. The tables branch is not yet merged to master, though, and I think some files needed to run the tests properly are missing.