Move splitting of PDFs/TIFFs and rotation of EXIF-tagged images from notebook util to a SageMaker Processing job, to accelerate processing for large corpora via horizontal and vertical scaling.
This involves building a custom container image for the job, since the pre-existing logic depends on poppler, which can't simply be pip-installed. Otherwise we could have considered using a FrameworkProcessor.
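For illustration, here is a minimal sketch of how a job like this might be launched with the SageMaker Python SDK's ScriptProcessor. It is not the code in this PR: the function names, default instance type, and S3/script arguments are hypothetical. `instance_count` together with `ShardedByS3Key` gives horizontal scaling, and `instance_type` gives vertical scaling.

```python
from typing import Any, Dict


def processor_kwargs(
    image_uri: str,
    role_arn: str,
    instance_count: int = 2,
    instance_type: str = "ml.c5.xlarge",
) -> Dict[str, Any]:
    """Constructor arguments for sagemaker.processing.ScriptProcessor.

    All defaults here are illustrative assumptions, not values from the PR.
    """
    return {
        "image_uri": image_uri,            # custom image that bundles poppler
        "command": ["python3"],
        "role": role_arn,
        "instance_count": instance_count,  # horizontal scaling
        "instance_type": instance_type,    # vertical scaling
    }


def run_preprocessing(image_uri, role_arn, script, input_s3, output_s3, **scaling):
    """Launch the job; requires the `sagemaker` package and AWS credentials."""
    # Imported lazily so processor_kwargs() works without the SDK installed.
    from sagemaker.processing import (
        ProcessingInput,
        ProcessingOutput,
        ScriptProcessor,
    )

    processor = ScriptProcessor(**processor_kwargs(image_uri, role_arn, **scaling))
    processor.run(
        code=script,
        inputs=[
            ProcessingInput(
                source=input_s3,
                destination="/opt/ml/processing/input",
                # Each instance receives a disjoint shard of the input objects.
                s3_data_distribution_type="ShardedByS3Key",
            )
        ],
        outputs=[
            ProcessingOutput(
                source="/opt/ml/processing/output",
                destination=output_s3,
            )
        ],
    )
```

Sharding by S3 key means each instance only sees part of the corpus, so the processing script should treat its local input folder as the full set of work for that instance.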
The image is built in the notebook via sm-docker, rather than:
- Using plain docker in the notebook, because this command may not be available in SMStudio
- Building and publishing the image in the CDK stack, because this would place extra requirements on the user's local/build environment, and because the CDK DockerImageAsset currently depends on external resources to publish to a named/non-CDK-internal ECR registry (as discussed in this issue)
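As a rough sketch of the build step, the sm-docker command line could be assembled as below. The repository name and build context path are hypothetical placeholders, not the ones used in this PR; `sm-docker build <dir> --repository <name:tag>` is the CLI's documented form.

```python
import shlex
from typing import List


def sm_docker_build_command(repository: str, context_dir: str = ".") -> List[str]:
    """Return the argv for building and pushing an image with sm-docker.

    `repository` is an ECR repository name with optional tag, e.g. "name:tag".
    """
    return ["sm-docker", "build", context_dir, "--repository", repository]


# Hypothetical repository and folder names, for illustration only.
cmd = sm_docker_build_command("preproc-image:latest", "custom-containers/preproc")
print(shlex.join(cmd))
# To actually run the build from a notebook:
#   import subprocess; subprocess.run(cmd, check=True)
```

Building the argv as a list rather than a single string avoids shell quoting problems if the context path ever contains spaces.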
Testing done:
Redeployed CDK stack in (not quite fresh) environment, cleared out relevant S3 folders and checked data preparation still runs okay.
By submitting this pull request, I confirm that you can use, modify, copy, and redistribute this contribution, under the terms of your choice.
Issue #, if available: #1