Alternative OCR options through Amazon SageMaker

Issue #, if available: #22

Description of changes:

Add alternative open-source OCR integration options for users who need to work with documents in languages not supported by Amazon Textract. Previously, the multi-lingual LayoutXLM model supported by the sample could already handle non-Latin, low-resource languages (e.g. Thai), but Amazon Textract OCR support for those languages was missing, rendering the E2E pipeline unusable. With this change, an example integration is provided (using Tesseract OCR), and a framework is in place for users to add or customize alternative integrations.
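
For reviewers who want a mental model of what a "framework for alternative integrations" can look like, here is a minimal sketch of a pluggable OCR engine registry. It is illustrative only and is not the code in this PR: `OcrWord`, `BaseOcrEngine`, `register_engine`, `get_engine` and `DummyEngine` are all hypothetical names, and the dummy engine returns a fixed word rather than calling Tesseract.

```python
from abc import ABC, abstractmethod
from dataclasses import dataclass
from typing import Callable, Dict, List, Type


@dataclass
class OcrWord:
    """One detected word (hypothetical structure, not the PR's data model)."""

    text: str
    confidence: float  # 0-100
    left: float  # Bounding box as fractions of page width/height
    top: float
    width: float
    height: float


class BaseOcrEngine(ABC):
    """Interface each alternative OCR integration would implement."""

    @abstractmethod
    def process(self, image_bytes: bytes, languages: List[str]) -> List[OcrWord]:
        """Run OCR on one page image and return the detected words."""


# Registry mapping an engine name (e.g. from a deployment parameter) to its class
ENGINES: Dict[str, Type[BaseOcrEngine]] = {}


def register_engine(name: str) -> Callable[[Type[BaseOcrEngine]], Type[BaseOcrEngine]]:
    """Class decorator that adds an engine to the registry under `name`."""

    def decorator(cls: Type[BaseOcrEngine]) -> Type[BaseOcrEngine]:
        ENGINES[name] = cls
        return cls

    return decorator


@register_engine("dummy")
class DummyEngine(BaseOcrEngine):
    """Stand-in engine: a real one would call Tesseract or another OCR tool here."""

    def process(self, image_bytes: bytes, languages: List[str]) -> List[OcrWord]:
        return [OcrWord("hello", 99.0, left=0.1, top=0.1, width=0.2, height=0.05)]


def get_engine(name: str) -> BaseOcrEngine:
    """Instantiate the engine registered under `name`, or fail with a clear error."""
    if name not in ENGINES:
        raise ValueError(f"Unknown OCR engine '{name}'. Available: {sorted(ENGINES)}")
    return ENGINES[name]()
```

With a layout like this, adding a custom integration would mean writing one new subclass and registering it under a name, while the rest of the pipeline keeps calling `get_engine(...).process(...)`.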

This PR also adds a streamlined "workshop" notebook for users/events prioritising speed over explanation.

Testing done:

Tested both standard Amazon Textract and custom Tesseract OCR flows in fresh environments. For the Tesseract OCR flow:

Deployed via the bootstrap CloudFormation stack with Tesseract parameters
Used the Optional Extras notebook instructions on OCR instead of the Tesseract OCR function
Used the manifest file option to OCR a subset of documents instead of the full set
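
As a rough illustration of the manifest-based subset test above, the sketch below writes and reads a JSON Lines manifest listing only the documents to OCR. The `source-ref` key follows the common SageMaker manifest convention, but whether this sample uses exactly that key is an assumption, and `write_manifest`, `read_manifest` and the example S3 URIs are hypothetical.

```python
import json
from typing import Iterable, List


def write_manifest(doc_uris: Iterable[str], path: str) -> None:
    """Write one JSON object per line, each pointing at a document to process."""
    with open(path, "w", encoding="utf-8") as f:
        for uri in doc_uris:
            # Assumed key name: "source-ref" (common SageMaker manifest convention)
            f.write(json.dumps({"source-ref": uri}) + "\n")


def read_manifest(path: str) -> List[str]:
    """Return the document URIs listed in a JSON Lines manifest, skipping blank lines."""
    uris = []
    with open(path, "r", encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if line:
                uris.append(json.loads(line)["source-ref"])
    return uris


if __name__ == "__main__":
    # Hypothetical bucket and keys, for illustration only
    subset = [
        "s3://example-bucket/docs/doc-001.pdf",
        "s3://example-bucket/docs/doc-002.pdf",
    ]
    write_manifest(subset, "subset.manifest.jsonl")
    print(read_manifest("subset.manifest.jsonl"))
```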

By submitting this pull request, I confirm that you can use, modify, copy, and redistribute this contribution, under the terms of your choice.

aws-samples / amazon-textract-transformer-pipeline

Alternative OCR options through Amazon SageMaker #24