This sample is compatible with multilingual layout-language models such as LayoutXLM, but it uses Amazon Textract for the initial OCR, which today supports only a subset of those languages. For example, Thai and Vietnamese are supported by LayoutXLM but not currently by Amazon Textract.
It would be useful for this sample to support easy switching to an alternative, open-source OCR engine, for users who want to work with low-resource languages.
Design Ideas
In terms of engine, it would be interesting to compare options such as EasyOCR, TrOCR, and Tesseract.
The OCR response format should be wrapped to be Amazon Textract-like, to simplify switching to the fully managed AI service if and when possible.
Maybe SageMaker Async Inference could be a nice platform for the OCR deployment? It gives lots of flexibility on infrastructure, timing, and payload size, and its SNS callback mechanism is similar to how Amazon Textract works anyway.
Maybe a CDK construct option to toggle between Textract and the OSS OCR, rather than deploying supporting infrastructure for both?
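To make the Textract-like wrapping idea a bit more concrete, here is a minimal sketch of how word-level output from an open-source engine might be converted to a Textract-style response. The input shape (words grouped into lines, each word given as four pixel-space corner points, text, and a 0-1 confidence) and the function name `to_textract_like` are assumptions for illustration, not something any of the engines above emit directly. Only `PAGE`, `LINE`, and `WORD` blocks are produced, and line polygons are simplified to axis-aligned boxes.

```python
import uuid
from typing import Dict, List, Sequence, Tuple

Point = Tuple[float, float]
# Hypothetical word detection: (4 corner points in pixels, text, confidence 0-1)
Detection = Tuple[Sequence[Point], str, float]


def _geometry(points: Sequence[Point], width: float, height: float) -> Dict:
    """Build Textract-style geometry, with coordinates as ratios of page size."""
    xs = [x / width for x, _ in points]
    ys = [y / height for _, y in points]
    return {
        "BoundingBox": {
            "Left": min(xs),
            "Top": min(ys),
            "Width": max(xs) - min(xs),
            "Height": max(ys) - min(ys),
        },
        "Polygon": [{"X": x, "Y": y} for x, y in zip(xs, ys)],
    }


def _box_corners(points: Sequence[Point]) -> List[Point]:
    """Corners of the axis-aligned box enclosing all the given points."""
    xs = [x for x, _ in points]
    ys = [y for _, y in points]
    return [(min(xs), min(ys)), (max(xs), min(ys)),
            (max(xs), max(ys)), (min(xs), max(ys))]


def _block(block_type: str, geometry: Dict, page: int, **extra) -> Dict:
    """Create a block with a fresh ID and the fields common to all block types."""
    return {"BlockType": block_type, "Id": str(uuid.uuid4()),
            "Geometry": geometry, "Page": page, **extra}


def to_textract_like(lines: Sequence[Sequence[Detection]],
                     page_width: float, page_height: float,
                     page: int = 1) -> Dict:
    """Wrap one page of OCR words (grouped into lines) in a Textract-like dict."""
    full_page = [(0, 0), (page_width, 0),
                 (page_width, page_height), (0, page_height)]
    page_block = _block("PAGE", _geometry(full_page, page_width, page_height),
                        page, Relationships=[{"Type": "CHILD", "Ids": []}])
    blocks = [page_block]
    for words in lines:
        if not words:
            continue  # skip empty lines rather than emit blocks with no text
        word_blocks = [
            _block("WORD", _geometry(pts, page_width, page_height), page,
                   Text=text, Confidence=conf * 100.0)
            for pts, text, conf in words
        ]
        all_points = [pt for pts, _, _ in words for pt in pts]
        line_block = _block(
            "LINE",
            _geometry(_box_corners(all_points), page_width, page_height),
            page,
            Text=" ".join(text for _, text, _ in words),
            # Line confidence approximated as the mean of its word confidences
            Confidence=sum(b["Confidence"] for b in word_blocks) / len(word_blocks),
            Relationships=[{"Type": "CHILD",
                            "Ids": [b["Id"] for b in word_blocks]}],
        )
        page_block["Relationships"][0]["Ids"].append(line_block["Id"])
        blocks.append(line_block)
        blocks.extend(word_blocks)
    return {"DocumentMetadata": {"Pages": 1}, "Blocks": blocks}
```

An engine-specific adapter would still be needed in front of this to group words into lines and to report the page dimensions. That adapter is probably where most of the differences between EasyOCR, TrOCR, and Tesseract would show up.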