Add alternative open-source OCR integration options for users who need to work with documents in languages not supported by Amazon Textract. Previously, the multi-lingual LayoutXLM model supported by the sample could already work with non-Latin, low-resource languages such as Thai, but Amazon Textract OCR did not support them, which left the end-to-end pipeline unusable for those documents. With this change, an example integration is provided (using Tesseract OCR), and a framework is in place for users to add or customize alternative integrations.
This PR also adds a streamlined "workshop" notebook for users/events prioritising speed over explanation.
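For reviewers who want a mental model of the pluggable OCR framework described above, here is a minimal sketch of one way such an extension point can look. Every name in it (`BaseOCREngine`, `register_engine`, `get_engine`, `DummyEngine`, `Word`) is a hypothetical illustration and not necessarily what this PR implements. The stand-in engine exists only so the sketch runs without Tesseract installed.

```python
from abc import ABC, abstractmethod
from dataclasses import dataclass
from typing import Dict, List, Tuple, Type


@dataclass
class Word:
    """One recognised word plus its bounding box (left, top, width, height)."""

    text: str
    box: Tuple[int, int, int, int]


class BaseOCREngine(ABC):
    """Hypothetical interface that every OCR integration would implement."""

    @abstractmethod
    def process(self, image_bytes: bytes, languages: List[str]) -> List[Word]:
        """Run OCR on one page image and return the detected words."""


# Registry mapping an engine name (e.g. from a deployment parameter) to a class.
_ENGINES: Dict[str, Type[BaseOCREngine]] = {}


def register_engine(name: str):
    """Class decorator that makes an engine selectable by name."""

    def decorator(cls: Type[BaseOCREngine]) -> Type[BaseOCREngine]:
        _ENGINES[name] = cls
        return cls

    return decorator


def get_engine(name: str) -> BaseOCREngine:
    """Instantiate the engine registered under `name`, or raise ValueError."""
    if name not in _ENGINES:
        raise ValueError(
            f"Unknown OCR engine '{name}'. Available: {sorted(_ENGINES)}"
        )
    return _ENGINES[name]()


@register_engine("dummy")
class DummyEngine(BaseOCREngine):
    """Stand-in engine: reports the input size instead of doing real OCR."""

    def process(self, image_bytes: bytes, languages: List[str]) -> List[Word]:
        return [Word(text=f"{len(image_bytes)} bytes", box=(0, 0, 10, 10))]


if __name__ == "__main__":
    engine = get_engine("dummy")
    print(engine.process(b"\x00\x01", languages=["tha"]))
```

With a layout like this, a custom integration would only need to subclass the interface and register itself under a new name. The calling code selects an engine by name and does not change.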
Testing done:
Tested both standard Amazon Textract and custom Tesseract OCR flows in fresh environments. For the Tesseract OCR flow:
- Deployed via the bootstrap CloudFormation stack with tesseract parameters
- Used the Optional Extras NB instructions on OCR instead of the Tesseract OCR function
- Used the manifest file option for OCRing a subset of documents instead of the full set
By submitting this pull request, I confirm that you can use, modify, copy, and redistribute this contribution, under the terms of your choice.
Issue #, if available: #22