Open scottschreckengaust opened 4 years ago
Conversion from Textract to PAGE XML has now been added with pull request #160.
Alas, the new converter is still incomplete, so forms and tables do not work yet. See https://github.com/slub/textract2page/issues/2
Update: tables work now, but the converter submodule needs to be updated here
I've updated the vendor submodules, including textract2page, in https://github.com/UB-Mannheim/ocr-fileformat/pull/166. The tables branch is not yet merged to master, though, and I think some files needed to run the tests properly are missing.
Textract returns its results in a JSON format.
https://docs.aws.amazon.com/textract/latest/dg/textract-dg.pdf
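To give a rough idea of what a converter would have to read, here is a minimal sketch in Python. It assumes the response has already been loaded into a dict with a top-level `Blocks` list whose entries carry `BlockType`, `Id`, and (for text blocks) `Text`. The function name `summarize_blocks` and the sample data are hypothetical and hand-written for illustration; they are not taken from textract2page or from real Textract output.

```python
from collections import Counter


def summarize_blocks(response):
    """Count blocks per BlockType and collect the text of LINE blocks."""
    blocks = response.get("Blocks", [])
    counts = Counter(block["BlockType"] for block in blocks)
    lines = [block["Text"] for block in blocks
             if block["BlockType"] == "LINE" and "Text" in block]
    return counts, lines


# Hand-written sample for illustration only, not real Textract output.
sample = {
    "Blocks": [
        {"Id": "p1", "BlockType": "PAGE",
         "Relationships": [{"Type": "CHILD", "Ids": ["l1"]}]},
        {"Id": "l1", "BlockType": "LINE", "Text": "Hello world",
         "Relationships": [{"Type": "CHILD", "Ids": ["w1", "w2"]}]},
        {"Id": "w1", "BlockType": "WORD", "Text": "Hello"},
        {"Id": "w2", "BlockType": "WORD", "Text": "world"},
    ]
}

counts, lines = summarize_blocks(sample)
print(dict(counts))
print(lines)
```

A real converter would also need each block's geometry and the `Relationships` entries to rebuild the page hierarchy; this sketch only inspects block types and line text.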
Specifically, see the three types of analysis at https://docs.aws.amazon.com/textract/latest/dg/how-it-works-analyzing.html for the categories: