Organizations across industries are increasingly required to process large volumes of semi-structured and unstructured documents with greater accuracy and speed. They need a document processing system that ingests and analyzes documents, extracts their content, identifies and redacts sensitive customer information, and creates search indexes from the analyzed data. Many industries have stringent compliance requirements to redact personally identifiable information (PII) and protected health information (PHI) from documents.

In most cases, organizations process documents manually to extract information and insights. This approach can be time consuming, expensive, and difficult to scale. Organizations need a way to rapidly extract insights from documents, and they can benefit from a smart document processing system as a foundation for automating business processes that rely on manual inputs and interventions. To help meet these needs, the Enhanced Document Understanding on AWS solution:
You can use each of these features on its own, or compose them into a custom workflow orchestration based on your use case. This solution deploys an AWS CloudFormation template that provides the capability to configure workflows for various use cases. The solution lets users define custom workflows that specify the types of documents required and the processing each document is subjected to. The solution also provides a web user interface (UI) for users to upload documents. After documents are uploaded, a backend workflow orchestrates AWS managed AI services to process documents at scale. For a detailed solution implementation guide, refer to Enhanced Document Understanding on AWS.
Deploying this solution with the default parameters deploys the following components in your AWS account.
The high-level process flow for the solution components deployed with the CloudFormation template is as follows:
1. Uploading a document emits an `s3:PutObject` event on the default Amazon EventBridge event bus.
2. The `s3:PutObject` event invokes the workflow orchestrator AWS Lambda function. This function uses the configuration stored in the *Configuration for orchestrating workflows* DynamoDB table to determine the workflows to be called.

Note: The deployment to Amazon Kendra is optional. If it is not deployed, the search feature is not available.
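The routing step described above can be sketched as follows. This is a minimal illustration, not the solution's actual code: the event shape assumes the standard EventBridge envelope for S3 object-created notifications, and the DynamoDB lookup is stubbed with an in-memory dict in place of the workflow-configuration table.

```python
# Hypothetical stand-in for the "Configuration for orchestrating workflows"
# DynamoDB table; keys and values here are assumptions for illustration.
WORKFLOW_CONFIG = {
    "default": ["textract", "entity-detection", "redaction"],
}

def handler(event, context=None):
    """Route an EventBridge-delivered S3 object-created event to workflows."""
    if event.get("source") != "aws.s3":
        raise ValueError("unexpected event source")
    key = event["detail"]["object"]["key"]
    # Look up which workflows apply to this document; fall back to the default.
    workflows = WORKFLOW_CONFIG.get(key.split("/")[0], WORKFLOW_CONFIG["default"])
    return {"document": key, "workflows": workflows}

sample_event = {
    "source": "aws.s3",
    "detail-type": "Object Created",
    "detail": {"bucket": {"name": "uploads"}, "object": {"key": "case-123/form.pdf"}},
}
print(handler(sample_event))
```

In the deployed solution this routing runs inside the workflow-orchestrator Lambda function; the sketch only shows the shape of the decision it makes.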
Note: You can also test the UI project locally by deploying the API endpoints and the rest of the infrastructure. To do so, follow either of the two options below, and then refer to the UI project for details.
There are two options for deployment into your AWS account:
Option 1: Deploy using `cdk deploy`
The following are prerequisites to build and deploy locally:
Python installations normally include `ensurepip` and `pip`; however, if you are running in an environment without these packages (for example, a minimal Docker image), you will need to install them manually. See pip's installation guide for details.

Note: Configure the AWS CLI with your AWS credentials, or have them exported in the CLI terminal environment. If the credentials are invalid or expired, running `cdk deploy` produces an error.
After cloning the repo from GitHub, complete the following steps:
```
cd <project-directory>/source/infrastructure
npm install
npm run build
cdk deploy
```
Option 2: To deploy the CloudFormation templates, follow the instructions for creating a custom build.
├── email-templates [email templates]
├── sample-documents [sample documents for different industry verticals]
├── images [images used for markdown files]
│ ├── architecture.png
│ └── ui-components.png
├── infrastructure [CDK infrastructure]
│ ├── bin
│ ├── cdk.json
│ ├── cdk.out
│ ├── coverage
│ ├── jest.config.js
│ ├── lib
│ ├── node_modules
│ ├── package-lock.json
│ ├── package.json
│ ├── test
│ └── tsconfig.json
├── lambda [Lambda functions for the application]
│ ├── create-presigned-url
│ ├── custom-resource
│ ├── entity-detection
│ ├── fetch-records
│ ├── get-inferences
│ ├── layers
│ ├── redact-content
│ ├── search
│ ├── send-notification
│ ├── text-extract
│ ├── upload-document
│ └── workflow-orchestrator
├── pre-build-jars.sh [pre-builds libraries required for the CDK infrastructure project]
├── run-all-tests.sh [shell script that can run unit tests for the entire project]
├── ui [Web App project for UI]
│ ├── README.md
│ ├── build
│ ├── node_modules
│ ├── package-lock.json
│ ├── package.json
│ ├── public
│ ├── src
│ └── tsconfig.json
└── workflow-config [provides out-of-the-box workflow configurations]
├── default.json
├── entity-detection.json
├── redaction.json
├── textract-analyze-doc.json
├── textract-to-entity-medical.json
├── textract-to-entity-pii.json
├── textract-to-entity-standard.json
├── textract-to-entity.json
└── textract.json
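The JSON files listed under `workflow-config` provide the out-of-the-box workflow definitions. Their exact schema is not reproduced here; purely as an illustration, a configuration in the spirit of `textract-to-entity-pii.json` might chain text extraction into PII entity detection. The field names in this sketch are assumptions, not the solution's actual schema:

```python
# Illustrative only: "Name", "WorkflowSequence", "Stage", and "Purpose" are
# hypothetical field names, not the actual workflow-config schema.
textract_to_entity_pii = {
    "Name": "textract-to-entity-pii",
    "WorkflowSequence": [
        {"Stage": "textract", "Purpose": "extract text from the uploaded document"},
        {"Stage": "entity-detection", "Purpose": "detect PII entities in the text"},
    ],
}

# An orchestrator could walk the sequence in order:
stages = [step["Stage"] for step in textract_to_entity_pii["WorkflowSequence"]]
print(stages)
```

The point of the sketch is the shape: each config names a workflow and an ordered sequence of processing stages that the orchestrator invokes.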
Run the following commands:
```
git clone https://github.com/aws-solutions/<repository_name>
cd <rootDir>/source/infrastructure
npm install
```
Note: To run the unit tests, Docker must be installed and running, and valid AWS credentials must be configured.
```
cd <rootDir>/source
chmod +x ./run-all-tests.sh
./run-all-tests.sh
```
```
export DIST_OUTPUT_BUCKET=my-bucket-name
export SOLUTION_NAME=my-solution-name
export VERSION=my-version
export CF_TEMPLATE_BUCKET_NAME=my-template-bucket-name

cd <rootDir>/deployment
chmod +x ./build-s3-dist.sh
./build-s3-dist.sh $DIST_OUTPUT_BUCKET $SOLUTION_NAME $VERSION $CF_TEMPLATE_BUCKET_NAME
```
Parameter details:

- `$DIST_OUTPUT_BUCKET` - The global name of the distribution. For the bucket name, the AWS Region is added to the global name (example: 'my-bucket-name-us-east-1') to create a regional bucket. The Lambda artifacts should be uploaded to the regional buckets for the CloudFormation template to pick them up for deployment.
- `$SOLUTION_NAME` - The name of this solution (example: document-understanding-solution)
- `$VERSION` - The version number of the change
- `$CF_TEMPLATE_BUCKET_NAME` - The name of the S3 bucket where the CloudFormation templates should be uploaded
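The regional-bucket naming convention described for `$DIST_OUTPUT_BUCKET` can be sketched in a few lines; this only restates the convention above (global name plus Region suffix), not the build script's implementation:

```python
def regional_bucket(dist_output_bucket: str, region: str) -> str:
    """Append the AWS Region to the global bucket name, per the convention above."""
    return f"{dist_output_bucket}-{region}"

# For example, with the sample values used in this README:
print(regional_bucket("my-bucket-name", "us-east-1"))  # my-bucket-name-us-east-1
```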
When you create and use buckets, we recommend that you:
Note: You must have the AWS CLI installed.
```
aws s3 cp ./global-s3-assets/ s3://my-bucket-name-<aws_region>/document-understanding-solution/<my-version>/ --recursive --acl bucket-owner-full-control --profile aws-cred-profile-name
aws s3 cp ./regional-s3-assets/ s3://my-bucket-name-<aws_region>/document-understanding-solution/<my-version>/ --recursive --acl bucket-owner-full-control --profile aws-cred-profile-name
```
This solution collects anonymized operational metrics to help AWS improve the quality and features of the solution. For more information, including how to disable this capability, please see the implementation guide.
Copyright Amazon.com, Inc. or its affiliates. All Rights Reserved.
Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with the License. You may obtain a copy of the License at
http://www.apache.org/licenses/LICENSE-2.0
Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the specific language governing permissions and limitations under the License.