Adapting the python library OCRmyPDF to run as an AWS Lambda Function.
From the OCRmyPDF readme:
OCRmyPDF adds an OCR text layer to scanned PDF files, allowing them to be searched or copy-pasted.
Purpose
The purpose of this application was to adapt the OCRmyPDF application/library to be run on AWS servers to serve as a dynamic document OCR processing service. We loved the implementation of OCRmyPDF but felt it would work well if adapted to an Amazon Lambda Function.
What's in the repo?
This repository contains all external libraries required by OCRmyPDF compiled on and extracted from an Amazon Linux EC2 instance. It also contains all python packages compiled on and extracted from an Amazon Linux EC2 instance. Lastly, it features some minor changes to the OCRmyPDF source itself to make it Lambda friendly.
Calling the Event
The event currently supports only a few, basic, parameters, which we intend on expanding. The parameters are:
Key
Description
awsRegion
The region where the S3 bucket is located
s3.bucket.name
The name of the S3 bucket
s3.object.key
The key for the object in the S3 bucket
pages
The pages parameter for OCRmyPDF. Ex: "1,3-5,8"
Installation
Download Latest Release
Download the latest release from this repository's releases page.
Create the Function
Go to S3 and upload the downloaded zip file to an S3 bucket of your choosing. Copy and paste the url of the file for later.
Access the AWS Lambda dashboard and click on Create Function.
Ensure Author from scratch is selected
Name your function as you please.
Make sure that your Runtime is set to Python 3.6
If you do not have an IAM Role set up for S3 access, set one up with Read, Write access on S3. I used AWS's AWSLambdaExecute policy as a base.
Setup the Function
Under the Function code section:
Set the Code entry type field to Upload a file from Amazon S3
Set the Runtime field to Python 3.6
Set the handler field to apply-ocr-to-s3-object.apply_ocr_to_document_handler
Copy and paste the S3 object URL from above into the Amazon S3 link URL field
Under the Environment Variables section:
Set the below environment variables to their respective paths
PATH : /var/task/bin
PYTHONPATH : /var/task/python
TESSDATA_PREFIX : /var/task/tessdata
Test the Function
The following test configuration can be added to lambda to test the functionality. Upload any pdf called input.pdf to an S3 bucket and run this test configuration:
lambda-OCRmyPDF
Adapting the python library OCRmyPDF to run as an AWS Lambda Function.
From the OCRmyPDF readme:
Purpose
The purpose of this application was to adapt the OCRmyPDF application/library to be run on AWS servers to serve as a dynamic document OCR processing service. We loved the implementation of OCRmyPDF but felt it would work well if adapted to an Amazon Lambda Function.
What's in the repo?
This repository contains all external libraries required by OCRmyPDF compiled on and extracted from an Amazon Linux EC2 instance. It also contains all python packages compiled on and extracted from an Amazon Linux EC2 instance. Lastly, it features some minor changes to the OCRmyPDF source itself to make it Lambda friendly.
Calling the Event
The event currently supports only a few, basic, parameters, which we intend on expanding. The parameters are:
Installation
Download Latest Release
Download the latest release from this repository's releases page.
Create the Function
Author from scratch
is selectedPython 3.6
AWSLambdaExecute
policy as a base.Setup the Function
Upload a file from Amazon S3
Python 3.6
apply-ocr-to-s3-object.apply_ocr_to_document_handler
/var/task/bin
/var/task/python
/var/task/tessdata
Test the Function
The following test configuration can be added to lambda to test the functionality. Upload any pdf called
input.pdf
to an S3 bucket and run this test configuration:input_test_configuration.json
To Do:
aws-cli