Support conversion from and to Textract JSON

UB-Mannheim / ocr-fileformat

Validate and transform various OCR file formats (hOCR, ALTO, PAGE, FineReader)

MIT License


Open scottschreckengaust opened 4 years ago

scottschreckengaust commented 4 years ago

Textract returns its results in a JSON format.

stweil commented 1 year ago

Conversion from Textract to PAGE XML has now been added with pull request #160.
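For readers unfamiliar with the two formats, here is a minimal sketch of the core geometric step in such a conversion. It is not the code from the pull request or from the textract2page converter. It only assumes the documented shape of a Textract response (a `Blocks` list whose `LINE` entries carry `Text` and a `Geometry.Polygon` with relative `X`/`Y` values between 0 and 1) and the PAGE convention of a `points` string in pixel coordinates. The function names, the sample data and the page size are hypothetical.

```python
import json

# Hypothetical sample shaped like a Textract response; the values are made up.
SAMPLE = json.loads("""
{
  "Blocks": [
    {"BlockType": "PAGE", "Id": "p1"},
    {"BlockType": "LINE", "Id": "l1", "Text": "Hello world",
     "Geometry": {"Polygon": [
        {"X": 0.1, "Y": 0.2}, {"X": 0.5, "Y": 0.2},
        {"X": 0.5, "Y": 0.25}, {"X": 0.1, "Y": 0.25}]}}
  ]
}
""")


def polygon_to_points(polygon, width, height):
    """Scale a relative Textract polygon to a PAGE-style 'x,y x,y ...' string."""
    return " ".join(
        f"{round(p['X'] * width)},{round(p['Y'] * height)}" for p in polygon
    )


def lines_to_page_coords(response, width, height):
    """Return (text, points) pairs for every LINE block in the response."""
    return [
        (block["Text"],
         polygon_to_points(block["Geometry"]["Polygon"], width, height))
        for block in response.get("Blocks", [])
        if block.get("BlockType") == "LINE"
    ]


# The image size must come from the image itself, because Textract
# coordinates are ratios of the page dimensions (assumed 1000x2000 here).
for text, points in lines_to_page_coords(SAMPLE, 1000, 2000):
    print(text, "->", points)
```

The resulting string is the kind of value that would go into the `points` attribute of a `Coords` element in PAGE XML. A real converter also has to handle words, reading order, and the block relationships that forms and tables depend on.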

bertsky commented 1 year ago

Alas, the new converter is still incomplete, so forms and tables are not supported yet.

bertsky commented 1 year ago

Update: tables work now, but the converter submodule needs to be updated here

kba commented 1 year ago

> Update: tables work now, but the converter submodule needs to be updated here

I've updated the vendor submodules, including textract2page, in https://github.com/UB-Mannheim/ocr-fileformat/pull/166. The tables branch is not yet merged into master, though, and I think some files needed to run the tests properly are missing.