aws-samples / amazon-textract-serverless-large-scale-document-processing

Process documents at scale using Amazon Textract
Apache License 2.0

DEPRECATED

This repository will receive security fixes only and will be phased out by 2023/09/30.

We recommend using a solution based on https://github.com/aws-samples/amazon-textract-idp-cdk-constructs. Samples in Python are available at https://github.com/aws-samples/amazon-textract-idp-cdk-stack-samples and samples in Java are available at https://github.com/aws-samples/amazon-textract-idp-cdk-samples-java

The advantages of the new architecture are:

This workshop https://catalog.us-east-1.prod.workshops.aws/workshops/f2dd7c46-e022-4f9c-8399-dcad742be516/en-US introduces the concepts and shows ways to customize the workflows for your use case.

Large scale document processing with Amazon Textract

This reference architecture shows how you can extract text and data from documents at scale using Amazon Textract. Below are some of the key attributes of the reference architecture:

Architecture

The architecture below shows the core components.

Image pipeline (Use Sync APIs of Amazon Textract)

  1. The process starts when a message is sent to an Amazon SQS queue to analyze a document.
  2. A Lambda function is invoked synchronously with an event that contains the queue message.
  3. The Lambda function then calls Amazon Textract and stores the result in one or more datastores, for example DynamoDB, S3, or Elasticsearch.

You control the throughput of your pipeline by adjusting the batch size and the Lambda concurrency.
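The three steps above can be sketched as a small handler. This is a minimal sketch, not the repository's actual code: the message body fields `bucketName` and `objectName`, the `store_result` callback, and the function name `handle_sqs_event` are assumptions made for illustration. The `textract` argument is expected to behave like `boto3.client("textract")`, whose `detect_document_text` call is the synchronous text detection API.

```python
import json


def handle_sqs_event(event, textract, store_result):
    """Run synchronous Textract text detection for every SQS record in the event.

    `textract` is expected to behave like boto3.client("textract").
    `store_result(object_name, lines)` is a hypothetical callback that would
    write to DynamoDB, S3, or another datastore.
    """
    processed = []
    for record in event.get("Records", []):
        # Assumed message body: {"bucketName": "...", "objectName": "..."}
        message = json.loads(record["body"])
        response = textract.detect_document_text(
            Document={
                "S3Object": {
                    "Bucket": message["bucketName"],
                    "Name": message["objectName"],
                }
            }
        )
        # Keep only the text of LINE blocks; PAGE and WORD blocks are skipped.
        lines = [
            block["Text"]
            for block in response.get("Blocks", [])
            if block.get("BlockType") == "LINE"
        ]
        store_result(message["objectName"], lines)
        processed.append(message["objectName"])
    return processed
```

Because one invocation receives up to a batch of records, the batch size of the SQS event source and the function's concurrency together bound how many Textract calls are in flight at once.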

Image and PDF pipeline (Use Async APIs of Amazon Textract)

  1. The process starts when a message is sent to an SQS queue to analyze a document.
  2. A job scheduler Lambda function runs at a fixed frequency, for example every 5 minutes, and polls for messages in the SQS queue.
  3. For each message in the queue, it submits an Amazon Textract job to process the document, and it continues submitting jobs until it reaches the maximum limit of concurrent jobs in your AWS account.
  4. When Amazon Textract finishes processing a document, it sends a completion notification to an SNS topic.
  5. SNS then triggers the job scheduler Lambda function to start the next set of Amazon Textract jobs.
  6. SNS also sends a message to an SQS queue, which is then processed by a Lambda function that gets the results from Amazon Textract and stores them in a relevant datastore, for example DynamoDB, S3, or Elasticsearch.
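Step 6 can be illustrated with a short sketch of the results processor. It assumes that the SNS topic delivers to SQS without raw message delivery, so the SQS body is an SNS envelope whose `Message` field holds the Textract notification as a JSON string. The function names are hypothetical; `get_document_text_detection` with `JobId` and `NextToken` is the boto3 Textract call for fetching paged results of an asynchronous text detection job.

```python
import json


def parse_completion(sqs_body):
    """Unwrap the Textract notification from an SNS envelope stored in an SQS body."""
    envelope = json.loads(sqs_body)
    return json.loads(envelope["Message"])


def collect_text_lines(textract, job_id):
    """Fetch every result page of a finished job and return the LINE texts in order.

    `textract` is expected to behave like boto3.client("textract").
    """
    lines = []
    next_token = None
    while True:
        kwargs = {"JobId": job_id}
        if next_token:
            kwargs["NextToken"] = next_token
        page = textract.get_document_text_detection(**kwargs)
        if page.get("JobStatus") != "SUCCEEDED":
            raise RuntimeError(f"job {job_id} status is {page.get('JobStatus')}")
        for block in page.get("Blocks", []):
            if block.get("BlockType") == "LINE":
                lines.append(block["Text"])
        # Results for large documents are paged; stop when no token is returned.
        next_token = page.get("NextToken")
        if not next_token:
            return lines
```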

Your pipeline runs at the maximum throughput that the limits on your account allow. If needed, you can have the limit on concurrent jobs raised, and the pipeline automatically adapts to the new limit.
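The scheduling behavior can be sketched as a loop that keeps submitting jobs until the service refuses another one. This is an illustrative sketch under assumptions, not the repository's implementation: `JobLimitReached` is a hypothetical exception that a `start_job` wrapper would raise when Textract reports that the concurrent job limit is exceeded, and the three callables stand in for SQS `receive_message`, Textract `start_document_text_detection`, and SQS `delete_message`.

```python
import json


class JobLimitReached(Exception):
    """Hypothetical signal that Textract refused another concurrent job."""


def schedule_jobs(receive_messages, start_job, delete_message, batch_size=10):
    """Submit one Textract job per queued message until the queue is empty
    or the concurrent job limit is reached.

    receive_messages(n) returns up to n messages shaped like SQS messages
    (dicts with "Body" and "ReceiptHandle"); start_job(document) returns a job id.
    """
    started = []
    while True:
        messages = receive_messages(batch_size)
        if not messages:
            return started  # queue drained
        for message in messages:
            try:
                job_id = start_job(json.loads(message["Body"]))
            except JobLimitReached:
                # Undeleted messages stay on the queue and are picked up
                # by the next scheduler run.
                return started
            delete_message(message["ReceiptHandle"])
            started.append(job_id)
```

Because the loop stops only when the service pushes back, raising the account limit changes how far the loop gets without any change to the code, which matches the adaptive behavior described above.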

Document processing workflow

The architecture below shows the overall workflow and a few components that are used in addition to the core architecture described above to process incoming documents as well as a large backfill of existing documents.

Process incoming documents workflow

  1. A document gets uploaded to an Amazon S3 bucket. It triggers a Lambda function which writes a task to process the document to DynamoDB.
  2. Using DynamoDB streams, a Lambda function is triggered which writes to an SQS queue in one of the pipelines.
  3. Documents are processed as described above by "Image Pipeline" or "Image and PDF Pipeline".
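Step 2 of this workflow can be sketched as a router that reads DynamoDB stream records and picks a pipeline. The routing rule used here (common image extensions go to the sync pipeline, PDFs go to the async pipeline) is an assumption for illustration, as are the attribute names `bucketName` and `objectName` and the `send_to_queue` callback. The record layout (`eventName`, `dynamodb.NewImage`, typed attribute values such as `{"S": ...}`) follows the DynamoDB Streams event format.

```python
import json
import os

# Assumed routing rule: which file extensions go to which pipeline.
SYNC_EXTENSIONS = {".png", ".jpg", ".jpeg"}
ASYNC_EXTENSIONS = {".pdf"}


def choose_pipeline(object_name):
    """Return "sync", "async", or None for unsupported file types."""
    extension = os.path.splitext(object_name)[1].lower()
    if extension in SYNC_EXTENSIONS:
        return "sync"
    if extension in ASYNC_EXTENSIONS:
        return "async"
    return None


def route_stream_event(event, send_to_queue):
    """Forward newly inserted document tasks to the matching pipeline queue.

    send_to_queue(pipeline, body) is a hypothetical callback that would call
    SQS send_message on the queue belonging to that pipeline.
    """
    routed = []
    for record in event.get("Records", []):
        if record.get("eventName") != "INSERT":
            continue  # only new tasks start processing
        image = record["dynamodb"]["NewImage"]
        bucket = image["bucketName"]["S"]
        name = image["objectName"]["S"]
        pipeline = choose_pipeline(name)
        if pipeline is None:
            continue  # unsupported document type
        send_to_queue(pipeline, json.dumps({"bucketName": bucket, "objectName": name}))
        routed.append((name, pipeline))
    return routed
```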

Large backfill of existing documents workflow

  1. Documents already exist in an Amazon S3 bucket.
  2. We create a CSV file or use S3 inventory to generate a list of documents that need to be processed.
  3. We create and start an Amazon S3 batch operations job which invokes a Lambda function for each object in the list.
  4. The Lambda function writes a task to process each document to DynamoDB.
  5. Using DynamoDB streams, a Lambda function is triggered which writes to an SQS queue in one of the pipelines.
  6. Documents are processed as described above by "Image Pipeline" or "Image and PDF Pipeline".
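Steps 3 and 4 can be sketched as the Lambda function that S3 batch operations invokes. The sketch assumes the batch operations invocation schema version 1.0, in which each task carries `taskId`, `s3Key` (URL-encoded), and `s3BucketArn`, and the function must return one result per task. The `write_task` callback, which would put an item into the DynamoDB table, and the function name `handle_batch_event` are hypothetical.

```python
from urllib.parse import unquote_plus


def handle_batch_event(event, write_task):
    """Record one processing task per object listed by S3 batch operations.

    write_task(bucket, key) is a hypothetical callback that would write the
    task to DynamoDB.
    """
    results = []
    for task in event["tasks"]:
        # The bucket name is the last part of an ARN like arn:aws:s3:::my-bucket.
        bucket = task["s3BucketArn"].split(":::")[-1]
        key = unquote_plus(task["s3Key"])  # keys arrive URL-encoded
        try:
            write_task(bucket, key)
            code, message = "Succeeded", key
        except Exception as error:
            # TemporaryFailure asks batch operations to retry the task.
            code, message = "TemporaryFailure", str(error)
        results.append(
            {"taskId": task["taskId"], "resultCode": code, "resultString": message}
        )
    return {
        "invocationSchemaVersion": event["invocationSchemaVersion"],
        "treatMissingKeysAs": "PermanentFailure",
        "invocationId": event["invocationId"],
        "results": results,
    }
```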

A similar architecture can be used for other services such as Amazon Rekognition to process images and videos. Images can be routed to the sync pipeline, whereas the async pipeline can process videos.

Prerequisites

Setup

Deployment

Test incoming documents

Test existing backfill documents

Source code

Modify source code and update deployed stack

Cost

Delete stack

License

This library is licensed under the Apache 2.0 License.