aws-samples / amazon-textract-transformer-pipeline

Post-process Amazon Textract results with Hugging Face transformer models for document understanding
MIT No Attribution

Refactor PDF pre-processing into a SageMaker Processing Job #2

Closed by athewsey 2 years ago

athewsey commented 2 years ago

Issue #, if available: #1

Description of changes:

Move splitting of PDFs/TIFFs and rotation of EXIF-tagged images from a notebook utility to a SageMaker Processing job, so that large corpora can be processed faster by scaling horizontally (more instances) and vertically (larger instances).
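As a rough sketch of how the two scaling axes map onto a Processing job, the snippet below assembles a request in the shape accepted by boto3's `create_processing_job`. The helper name, input/output names, local paths, default instance type, and volume size are all hypothetical placeholders, and sharding the input by S3 key is an assumption about how work could be distributed across instances, not necessarily what this PR does.

```python
def build_preproc_job_request(
    job_name: str,
    image_uri: str,
    role_arn: str,
    input_s3_uri: str,
    output_s3_uri: str,
    instance_count: int = 2,               # horizontal scaling: more instances
    instance_type: str = "ml.c5.xlarge",   # vertical scaling: larger instances
) -> dict:
    """Build a CreateProcessingJob-style request (hypothetical helper)."""
    if instance_count < 1:
        raise ValueError("instance_count must be at least 1")
    return {
        "ProcessingJobName": job_name,
        "RoleArn": role_arn,
        "AppSpecification": {"ImageUri": image_uri},
        "ProcessingResources": {
            "ClusterConfig": {
                "InstanceCount": instance_count,
                "InstanceType": instance_type,
                "VolumeSizeInGB": 30,  # placeholder value
            }
        },
        "ProcessingInputs": [
            {
                "InputName": "raw",
                "S3Input": {
                    "S3Uri": input_s3_uri,
                    "LocalPath": "/opt/ml/processing/input/raw",
                    "S3DataType": "S3Prefix",
                    "S3InputMode": "File",
                    # Assumption: each instance receives a disjoint subset
                    # of the objects under the input prefix.
                    "S3DataDistributionType": "ShardedByS3Key",
                },
            }
        ],
        "ProcessingOutputConfig": {
            "Outputs": [
                {
                    "OutputName": "clean",
                    "S3Output": {
                        "S3Uri": output_s3_uri,
                        "LocalPath": "/opt/ml/processing/output/clean",
                        "S3UploadMode": "EndOfJob",
                    },
                }
            ]
        },
    }
```

With credentials configured, the resulting dict could be passed as `boto3.client("sagemaker").create_processing_job(**request)`.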

This involves building a custom container image for the job, since the pre-existing logic depends on poppler, which can't simply be pip-installed. Otherwise, we could have considered using a FrameworkProcessor.
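To illustrate why poppler has to be baked into the image, here is a minimal sketch of a startup check the job script could run. It assumes the pre-processing logic shells out to the poppler-utils command-line tools `pdftoppm` and `pdfinfo`; the helper name and the choice of binaries to check are hypothetical, not taken from this PR.

```python
import shutil
from typing import Callable, Iterable, List, Optional

# Assumption: these poppler-utils executables are what the PDF logic needs.
POPPLER_BINARIES = ("pdftoppm", "pdfinfo")


def missing_binaries(
    names: Iterable[str] = POPPLER_BINARIES,
    which: Callable[[str], Optional[str]] = shutil.which,
) -> List[str]:
    """Return the names of required executables not found on PATH."""
    return [name for name in names if which(name) is None]


if __name__ == "__main__":
    missing = missing_binaries()
    if missing:
        raise SystemExit("Missing system binaries: " + ", ".join(missing))
```

Because these are system executables rather than Python packages, they need to be installed in the image through the base OS package manager.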

The image is built in the notebook via the sm-docker CLI rather than plain docker, which may not be available in SageMaker Studio.
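A minimal sketch of driving that build from Python follows. It assumes the `sm-docker` CLI is already installed in the notebook environment; the helper names and the repository name used in the usage note are hypothetical.

```python
import subprocess
from typing import List


def sm_docker_build_cmd(repository: str, context_dir: str = ".") -> List[str]:
    """Assemble an `sm-docker build` command line (hypothetical helper)."""
    return ["sm-docker", "build", context_dir, "--repository", repository]


def build_image(repository: str, context_dir: str = ".") -> None:
    """Run the build, raising CalledProcessError if the CLI exits non-zero."""
    subprocess.run(sm_docker_build_cmd(repository, context_dir), check=True)
```

For example, `build_image("textract-preproc:latest")` would build from the Dockerfile in the current directory, where `textract-preproc:latest` is a placeholder repository and tag.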

Testing done:

Redeployed the CDK stack in a (not quite fresh) environment, cleared out the relevant S3 folders, and checked that data preparation still runs okay.


By submitting this pull request, I confirm that you can use, modify, copy, and redistribute this contribution, under the terms of your choice.