aws-samples / amazon-textract-transformer-pipeline

Post-process Amazon Textract results with Hugging Face transformer models for document understanding
MIT No Attribution

Alternative OCR options through Amazon SageMaker #24

Closed athewsey closed 1 year ago

athewsey commented 1 year ago

Issue #, if available: #22

Description of changes:

Add alternative open-source OCR integration options for users who need to work with documents in languages not supported by Amazon Textract. Previously, the multi-lingual LayoutXLM model supported by the sample could already work with non-Latin, low-resource languages (e.g. Thai), but Amazon Textract OCR support for those languages was missing, rendering the E2E pipeline unusable. With this change, an example integration is provided (using Tesseract OCR), and a framework is in place for users to add or customize alternative integrations.
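For readers who want a feel for what such an integration involves, here is a minimal sketch and not the code in this PR. It assumes the `pytesseract` and `Pillow` packages for the Tesseract call, and it maps word-level results into Textract-style `WORD` blocks so downstream steps can consume either engine. The names `OCR_ENGINES`, `register_engine`, `run_tesseract` and `tesseract_words_to_blocks` are hypothetical, chosen only for illustration.

```python
import uuid
from typing import Callable, Dict, List

# Hypothetical registry mapping an engine name to a function that returns
# Textract-style blocks. This is an illustration, not the sample's actual API.
OCR_ENGINES: Dict[str, Callable[..., List[dict]]] = {}


def register_engine(name: str):
    """Decorator that registers an OCR engine function under `name`."""
    def decorator(fn):
        OCR_ENGINES[name] = fn
        return fn
    return decorator


def tesseract_words_to_blocks(data: dict, page_width: int, page_height: int) -> List[dict]:
    """Convert Tesseract word-level output into Textract-style WORD blocks.

    `data` is a dict of parallel lists with the keys "text", "conf", "left",
    "top", "width" and "height", as produced by pytesseract's image_to_data
    with Output.DICT. Pixel boxes are normalized to the 0-1 range.
    """
    blocks = []
    for i, text in enumerate(data["text"]):
        conf = float(data["conf"][i])
        # Tesseract reports non-word layout rows with empty text and conf -1.
        if not text.strip() or conf < 0:
            continue
        blocks.append({
            "BlockType": "WORD",
            "Id": str(uuid.uuid4()),
            "Text": text,
            "Confidence": conf,
            "Geometry": {"BoundingBox": {
                "Left": data["left"][i] / page_width,
                "Top": data["top"][i] / page_height,
                "Width": data["width"][i] / page_width,
                "Height": data["height"][i] / page_height,
            }},
        })
    return blocks


@register_engine("tesseract")
def run_tesseract(image_path: str, lang: str = "tha") -> List[dict]:
    """Run Tesseract on one page image (assumes pytesseract and Pillow)."""
    # Imported lazily so the converter above works without Tesseract installed.
    import pytesseract
    from PIL import Image

    image = Image.open(image_path)
    data = pytesseract.image_to_data(
        image, lang=lang, output_type=pytesseract.Output.DICT
    )
    return tesseract_words_to_blocks(data, image.width, image.height)
```

With a layout like this, another engine could be added by registering one more function that returns the same block shape, and a caller would select it with something like `OCR_ENGINES["tesseract"]("page-1.png")` (the file name is a placeholder).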

This PR also adds a streamlined "workshop" notebook for users/events prioritising speed over explanation.

Testing done:

Tested both standard Amazon Textract and custom Tesseract OCR flows in fresh environments. For the Tesseract OCR flow:


By submitting this pull request, I confirm that you can use, modify, copy, and redistribute this contribution, under the terms of your choice.

athewsey commented 1 year ago

This is a big change and some rough edges are likely still lurking, but it has now been tested in multiple environments, so it is ready to merge and fix forward.