chronograph-pe / lambda-OCRmyPDF

Adapting the python library OCRmyPDF to run in an AWS Lambda Function
GNU Affero General Public License v3.0

README.md #1

Closed hdwatts closed 5 years ago

hdwatts commented 5 years ago


lambda-OCRmyPDF

Adapting the python library OCRmyPDF to run as an AWS Lambda Function.

From the OCRmyPDF readme:

OCRmyPDF adds an OCR text layer to scanned PDF files, allowing them to be searched or copy-pasted.

Purpose

This project adapts the OCRmyPDF application/library to run on AWS as an on-demand document OCR processing service. We loved the implementation of OCRmyPDF and felt it would work well as an AWS Lambda function.

What's in the repo?

This repository contains all external libraries required by OCRmyPDF, compiled on and extracted from an Amazon Linux EC2 instance. It also contains all required Python packages, compiled on and extracted from an Amazon Linux EC2 instance. Lastly, it includes some minor changes to the OCRmyPDF source itself to make it Lambda-friendly.

Calling the Event

The event currently supports only a few basic parameters, which we intend to expand. The parameters are:

| Key | Description |
| --- | --- |
| awsRegion | The region where the S3 bucket is located |
| s3.bucket.name | The name of the S3 bucket |
| s3.object.key | The key for the object in the S3 bucket |
| pages | The pages parameter for OCRmyPDF. Ex: "1,3-5,8" |
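
As a rough illustration of how these keys map onto the nested event, here is a minimal Python sketch of reading them inside a handler. The names `OcrRequest` and `parse_event` are hypothetical and are not taken from this repository's source. Treating `pages` as optional is also an assumption.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class OcrRequest:
    """Hypothetical container for the fields listed in the table above."""
    region: str
    bucket: str
    key: str
    pages: Optional[str] = None  # e.g. "1,3-5,8"; assumed optional here


def parse_event(event: dict) -> OcrRequest:
    """Pull the documented keys out of a Lambda event dict.

    Raises ValueError naming the first key that is missing.
    """
    try:
        return OcrRequest(
            region=event["awsRegion"],
            bucket=event["s3"]["bucket"]["name"],
            key=event["s3"]["object"]["key"],
            pages=event.get("pages"),
        )
    except KeyError as missing:
        raise ValueError(f"event is missing required key: {missing}") from None
```

The dotted names in the table (`s3.bucket.name`, `s3.object.key`) are read here as paths into nested objects, which matches the shape of the test configuration in this README.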

Installation

Download Latest Release

Download the latest release from this repository's releases page.

Create the Function

Set Up the Function

Test the Function

The following test configuration can be added to Lambda to test the function. Upload any PDF named input.pdf to an S3 bucket and run this test configuration:

input_test_configuration.json

{
  "pages": "1",
  "awsRegion": "us-east-1",
  "s3": {
    "bucket": {
      "name": "[YOUR BUCKET NAME HERE]"
    },
    "object": {
      "key": "input.pdf"
    }
  }
}
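
The same test event can also be sent from a script instead of the Lambda console. The sketch below assumes boto3 is installed and AWS credentials are configured. It also assumes the function is deployed in the same region as the bucket. The helper names `build_test_event` and `invoke_ocr` are hypothetical, and the function name passed in is whatever you chose when creating the function. The shape of the function's response is not documented here, so the body is returned as raw text.

```python
import json


def build_test_event(bucket: str, key: str = "input.pdf",
                     region: str = "us-east-1", pages: str = "1") -> dict:
    """Build an event with the same shape as input_test_configuration.json."""
    return {
        "pages": pages,
        "awsRegion": region,
        "s3": {"bucket": {"name": bucket}, "object": {"key": key}},
    }


def invoke_ocr(function_name: str, event: dict) -> dict:
    """Synchronously invoke the Lambda function and return status and raw body."""
    import boto3  # imported lazily so build_test_event works without boto3

    # Assumption: the function lives in the same region as the bucket.
    client = boto3.client("lambda", region_name=event["awsRegion"])
    response = client.invoke(
        FunctionName=function_name,
        InvocationType="RequestResponse",
        Payload=json.dumps(event).encode("utf-8"),
    )
    return {
        "status": response["StatusCode"],
        "body": response["Payload"].read().decode("utf-8"),
    }
```

For example, `invoke_ocr("my-ocr-function", build_test_event("my-bucket"))` would send the configuration above, with `my-ocr-function` and `my-bucket` standing in for your own names.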

To Do: