Refactor PDF pre-processing into a SageMaker Processing Job

Issue #, if available: #1

Description of changes:

Move PDF/TIFF splitting and rotation of EXIF-tagged images from a notebook utility into a SageMaker Processing job, so that large corpora can be processed faster by scaling horizontally (more instances) and vertically (larger instances).
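
For reviewers less familiar with Processing jobs, here is a minimal sketch of where the two scaling knobs sit. It only builds a request dict in the shape accepted by boto3's `create_processing_job`; the function name, default instance type, and container paths are illustrative assumptions, not what this PR actually uses.

```python
def build_processing_job_request(
    job_name,
    image_uri,
    role_arn,
    input_s3_uri,
    output_s3_uri,
    instance_type="ml.c5.xlarge",  # vertical scaling: pick a larger instance (assumed default)
    instance_count=2,              # horizontal scaling: run more instances (assumed default)
):
    """Build a request dict for the SageMaker CreateProcessingJob API (hypothetical helper)."""
    return {
        "ProcessingJobName": job_name,
        "RoleArn": role_arn,
        "AppSpecification": {"ImageUri": image_uri},
        "ProcessingResources": {
            "ClusterConfig": {
                "InstanceCount": instance_count,
                "InstanceType": instance_type,
                "VolumeSizeInGB": 30,
            }
        },
        "ProcessingInputs": [
            {
                "InputName": "raw-docs",
                "S3Input": {
                    "S3Uri": input_s3_uri,
                    "LocalPath": "/opt/ml/processing/input",
                    "S3DataType": "S3Prefix",
                    "S3InputMode": "File",
                    # Each instance receives a different subset of the objects,
                    # rather than a full copy of the corpus.
                    "S3DataDistributionType": "ShardedByS3Key",
                },
            }
        ],
        "ProcessingOutputConfig": {
            "Outputs": [
                {
                    "OutputName": "clean-docs",
                    "S3Output": {
                        "S3Uri": output_s3_uri,
                        "LocalPath": "/opt/ml/processing/output",
                        "S3UploadMode": "EndOfJob",
                    },
                }
            ]
        },
    }
```

The resulting dict could be passed as `boto3.client("sagemaker").create_processing_job(**request)`; the SageMaker Python SDK exposes the same settings through its processor classes.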

This requires building a custom container image for the job, because the existing logic depends on poppler, which cannot be installed with pip alone. Otherwise, we could have considered using a FrameworkProcessor.
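
To illustrate why pip alone is not enough: poppler ships command-line binaries such as `pdftoppm` and `pdfinfo`, which must be present on the system `PATH`. The sketch below is a hypothetical start-up check that a processing script could run inside the container; the helper name and the list of binaries are assumptions, not code from this PR.

```python
import shutil

# Binaries provided by the poppler utilities (assumed to be the ones needed here).
REQUIRED_BINARIES = ("pdftoppm", "pdfinfo")


def missing_binaries(required=REQUIRED_BINARIES):
    """Return the names of required executables that are not found on PATH."""
    return [name for name in required if shutil.which(name) is None]


if __name__ == "__main__":
    missing = missing_binaries()
    if missing:
        raise SystemExit(f"Container image is missing system binaries: {missing}")
    print("All required system binaries found.")
```

Failing fast like this would surface a broken image build at job start, rather than partway through a large corpus.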

The image is built in the notebook via sm-docker, rather than:

- Using plain docker in the notebook, because this command may not be available in SMStudio
- Building and publishing the image in the CDK stack, because this could place additional requirements on the user's local/build environment, and because the CDK DockerImageAsset currently depends on external resources to publish to a named/non-CDK-internal ECR registry (as discussed in this issue)
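
As a rough sketch of the notebook build step, the helper below assembles an `sm-docker build` command line. The helper name, the directory, and the repository name are hypothetical, and the notebook in this PR may invoke the CLI differently (for example, directly with a `!` shell escape).

```python
import shlex


def sm_docker_build_command(context_dir, repository, tag="latest"):
    """Assemble an sm-docker build command as an argument list (hypothetical helper)."""
    return ["sm-docker", "build", context_dir, "--repository", f"{repository}:{tag}"]


# Placeholder path and repository name, for illustration only.
cmd = sm_docker_build_command("./preproc", "example-preproc-repo", tag="v1")
print(shlex.join(cmd))
# In a notebook, the list could then be run with subprocess.run(cmd, check=True).
```

Building the argument list separately from running it keeps the command easy to inspect before it starts a remote build.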


Testing done:

Redeployed the CDK stack in a (not quite fresh) environment, cleared out the relevant S3 folders, and checked that data preparation still runs okay.

By submitting this pull request, I confirm that you can use, modify, copy, and redistribute this contribution, under the terms of your choice.
