aws-samples / amazon-textract-transformer-pipeline

Post-process Amazon Textract results with Hugging Face transformer models for document understanding
MIT No Attribution
92 stars 26 forks

[Enhancement] Drop-in alternative open-source OCR engine(s) #22

Closed: athewsey closed this issue 1 year ago

athewsey commented 2 years ago

This sample is compatible with multilingual layout-language models like LayoutXLM, but it uses Amazon Textract for the initial OCR, which today supports only a subset of those languages. For example, Thai and Vietnamese are supported by LayoutXLM but not currently by Amazon Textract.

It would be useful for this sample to support easy switching to an alternative, open-source OCR engine, for any users who want to work with low-resource languages.

Design Ideas