aws-solutions / document-understanding-solution

Example of integrating and using Amazon Textract, Amazon Comprehend, Amazon Comprehend Medical, and Amazon Kendra to automate the processing of documents for use cases such as enterprise search and discovery, control and compliance, and general business process workflow.
https://aws.amazon.com/solutions/implementations/document-understanding-solution/
Apache License 2.0

Deprecation Notice

As of 09/14/2023, Document Understanding Solution has been deprecated and will not be receiving any additional features or updates. We encourage customers to explore the new solution: https://aws.amazon.com/solutions/implementations/enhanced-document-understanding-on-aws/.

Document Understanding Solution

DUS leverages the power of Amazon Textract, Amazon Comprehend, Amazon Comprehend Medical, Amazon OpenSearch Service, and Amazon Kendra to provide digitization, domain-specific data discovery, redaction controls, structural component extraction, and other document processing and understanding capabilities.


Architecture Diagram

[Figure: DUS architecture diagram]

Note

Current document formats supported: PDF, JPG, PNG

Current maximum document file size supported: 150 MB

Current concurrent document uploads (via UI) supported: 100
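The format and size limits above can be checked locally before uploading. The following is a minimal sketch, not part of the solution: the function name is hypothetical, it assumes "150 MB" means 150 * 1024 * 1024 bytes, and it accepts only the three extensions listed above.

```shell
# Hypothetical pre-upload check based on the limits listed above.
# Assumption: 150 MB is interpreted as 150 * 1024 * 1024 bytes.
MAX_BYTES=$((150 * 1024 * 1024))

is_supported_document() {
  file=$1
  # Lower-case the extension so that .PDF and .pdf are treated alike.
  ext=$(printf '%s' "${file##*.}" | tr '[:upper:]' '[:lower:]')
  case "$ext" in
    pdf|jpg|png) ;;
    *) echo "unsupported format: $file"; return 1 ;;
  esac
  size=$(wc -c < "$file" | tr -d ' ')
  if [ "$size" -gt "$MAX_BYTES" ]; then
    echo "too large ($size bytes): $file"
    return 1
  fi
  echo "ok: $file"
}

# Usage: is_supported_document ./invoice.pdf
```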

1. CICD Deploy

Requirements

Getting Started with CICD Deploy

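Note: You will have to create an S3 bucket named with the template 'my-bucket-name-<aws_region>', where aws_region is the region in which you are testing the customized solution.

For example, if you create a bucket called my-solutions-bucket-us-east-1, pass my-solutions-bucket as the bucket name when you build the distributable: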

chmod +x ./deployment/build-s3-dist.sh
./deployment/build-s3-dist.sh <bucket-name-minus-region> <solution-name> <version>

For example,

./deployment/build-s3-dist.sh my-solutions-bucket document-understanding-solution v1.0.0
aws s3 cp ./deployment/global-s3-assets/ s3://my-bucket-name-<aws_region>/<solution_name>/<my_version>/ --recursive --acl bucket-owner-full-control --profile aws-cred-profile-name
aws s3 cp ./deployment/regional-s3-assets/ s3://my-bucket-name-<aws_region>/<solution_name>/<my_version>/ --recursive --acl bucket-owner-full-control --profile aws-cred-profile-name
aws cloudformation create-stack --stack-name DocumentUnderstandingSolutionCICD --template-url https://my-bucket-name-<aws_region>.s3.amazonaws.com/<solution_name>/<my_version>/document-understanding-solution.template --parameters ParameterKey=Email,ParameterValue=<my_email> --capabilities CAPABILITY_NAMED_IAM --disable-rollback
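The bucket naming is easy to get wrong, because the build script takes the bucket name without the region while the upload and the template URL use the full name. A minimal sketch of how the names relate, using hypothetical helper functions that are not part of the solution:

```shell
# Hypothetical helpers (not part of the solution) showing how the names relate.

# Full bucket name: the base name passed to build-s3-dist.sh plus "-<aws_region>".
regional_bucket() {
  echo "$1-$2"
}

# Template URL in the same shape as the create-stack command above.
template_url() {
  base=$1 region=$2 solution=$3 version=$4
  echo "https://$(regional_bucket "$base" "$region").s3.amazonaws.com/$solution/$version/document-understanding-solution.template"
}

regional_bucket my-solutions-bucket us-east-1
template_url my-solutions-bucket us-east-1 document-understanding-solution v1.0.0
```

Keeping the base name, region, solution name, and version in one place makes it less likely that the build, upload, and create-stack steps drift apart.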

This solution will create 7 S3 buckets that need to be manually deleted when the stack is destroyed. (CloudFormation will only delete the solution-specific CDK toolkit bucket. The rest are preserved to prevent accidental data loss.)

The solution is set up to reserve Lambda concurrency quota. This both limits the scale of concurrent Lambda invocations and ensures that sufficient capacity is available for the smooth functioning of the demo. You can tweak the "API_CONCURRENT_REQUESTS" value in source/lib/cdk-textract-stack.ts to change the Lambda concurrency limits.
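Before changing the value, it can help to see where it is defined and how it compares with the account's limit. A small sketch, assuming you run it from the repository root; the function name is a hypothetical helper, and the AWS CLI call requires configured credentials:

```shell
# Hypothetical helper: list the lines that mention API_CONCURRENT_REQUESTS in a file.
show_concurrency_setting() {
  grep -n "API_CONCURRENT_REQUESTS" "$1"
}

# From the repository root (path taken from the text above):
#   show_concurrency_setting source/lib/cdk-textract-stack.ts
#
# Account-wide Lambda concurrency limit, for comparison (needs AWS credentials):
#   aws lambda get-account-settings --query 'AccountLimit.ConcurrentExecutions'
```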

Notes

2. Development Deploy

The instructions below cover installation on Unix-based operating systems such as macOS and Linux. You can use an AWS Cloud9 environment or an EC2 instance (recommended: t3.large or higher on the Amazon Linux platform) to deploy the solution.

Requirements

Please ensure you install all requirements before beginning the deployment.
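A quick way to confirm that tools are installed is to check that they are on the PATH. The sketch below checks only aws and yarn, the two commands this README invokes directly; the function name is hypothetical, and the full requirements list may include other tools.

```shell
# Hypothetical helper: print each named tool that is not found on PATH.
missing_tools() {
  for tool in "$@"; do
    command -v "$tool" >/dev/null 2>&1 || echo "$tool"
  done
}

# Only the commands used directly in this README are checked here.
missing=$(missing_tools aws yarn)
if [ -n "$missing" ]; then
  echo "Missing tools:" $missing
else
  echo "aws and yarn are on PATH"
fi
```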

To deploy using this approach, you must first set a few values inside the package.json file in the source folder.

Now switch to the source directory, and use yarn to deploy the solution:

cd ./source
yarn && yarn deploy

The CLI will prompt for approval of IAM roles and permissions twice during the full deploy: once for the backend stack and again for the client stack. The CLI will also prompt for an email address. After the deploy is complete, an email with login credentials will be sent to the address provided.

Note:

This will create 5 or 6 S3 buckets that will have to be manually deleted when the stack is destroyed (CloudFormation does not delete them, in order to avoid data loss).

The solution is set up to reserve Lambda concurrency quota. This both limits the scale of concurrent Lambda invocations and ensures that sufficient capacity is available for the smooth functioning of the demo. You can tweak the "API_CONCURRENT_REQUESTS" value in source/lib/cdk-textract-stack.ts to change the Lambda concurrency limits.

Development Deploy Commands

Development Deploy Workflow and stack naming

The package.json script node stackname sets the stack name for the deploy commands. Throughout development, it has been imperative to maintain multiple stacks so that client app development and stack architecture development can proceed without creating breaking changes. When a new stackname is merged into develop, it should have the most up-to-date deployments.

Developing Locally

Once deployed into the AWS account, you can also deploy locally for web development. This application uses next.js along with next-scss; all documentation for those packages applies here. NOTE: This application uses the static export feature of next.js. Be aware of the limited features available when using static export.

Start Dev Server

Generate Production Build

Code Quality Tools

This project uses Prettier to format code. It is recommended to install a Prettier extension for your editor and configure it to format on save. You can also run yarn prettier to auto-format all files in the project (make sure you do this on a clean working copy so you only commit formatting changes).

This project also uses ESLint and sass-lint to help find bugs and enforce code quality/consistency. Run yarn lint:js to run ESLint. Run yarn lint:css to run sass-lint. Run yarn lint to run them both.

Generating License Report

Run yarn license-report to generate a license report for all npm packages. See output in license-report.txt.

DUS Modes:

Classic Mode

This is the first release of the DUS solution. The major services in this mode are Amazon OpenSearch Service, Amazon Textract, Amazon Comprehend, and Amazon Comprehend Medical, which allow digitization, information extraction, and indexing in DUS.

Kendra-Enabled Mode

In Classic mode, DUS supports searching and indexing of documents using Amazon OpenSearch Service. In Kendra-enabled mode, Amazon Kendra is added as an additional capability and can be used to explore features such as semantic search, adding FAQs, and access control lists. To enable it, set enableKendra: "true" in package.json. Note: Amazon Kendra Developer Edition is deployed as part of this deployment.

Read-Only Mode

In this mode, you can only analyze the pre-loaded documents; you cannot upload documents from the web application UI. To enable Read-Only mode, set isROMode: "true" in package.json. By default, this mode is disabled.
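Both mode switches are string values in package.json, so they can be edited by hand or scripted. The sketch below is one hedged way to script the change with sed. The function name is hypothetical, the sample file is illustrative rather than the real package.json, and it assumes each flag appears as a "key": "value" pair on a single line.

```shell
# Hypothetical helper: print FILE with the string value of KEY replaced by VALUE.
# Assumes the flag is written as "key": "value" on one line.
set_package_flag() {
  key=$1 value=$2 file=$3
  sed -E "s/(\"$key\"[[:space:]]*:[[:space:]]*)\"[^\"]*\"/\1\"$value\"/" "$file"
}

# Illustrative sample only; the real package.json contains more fields.
sample=$(mktemp)
printf '{\n  "enableKendra": "false",\n  "isROMode": "false"\n}\n' > "$sample"
set_package_flag isROMode true "$sample"
rm -f "$sample"
```

The function writes to standard output and leaves the file untouched, so the result can be reviewed before it replaces the original.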

Notes

Document Bulk Processing

DUS supports bulk processing of documents. During deploy, an S3 bucket for document bulk processing is created. To use bulk processing, upload documents under the documentDrop/ prefix. In Kendra mode, you can also upload the corresponding access control list under the policy/ prefix in the same bucket, using the naming convention \.metadata.json. Be sure to upload the access control policy first and then the document.
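The upload order matters in Kendra mode, so it can be worth scripting. The sketch below prints the two aws s3 cp commands in the required order, policy first, without running them. The function name and bucket name are hypothetical (use the bulk processing bucket created by your deploy), and the policy file is uploaded under whatever name you give it.

```shell
# Hypothetical helper: print the upload commands for one document.
# Arguments: bucket name, document path, optional access control policy path.
bulk_upload_commands() {
  bucket=$1 doc=$2 policy=${3:-}
  # The policy, if any, goes first, as the text above requires.
  if [ -n "$policy" ]; then
    echo "aws s3 cp $policy s3://$bucket/policy/$(basename "$policy")"
  fi
  echo "aws s3 cp $doc s3://$bucket/documentDrop/$(basename "$doc")"
}

# my-bulk-bucket is a placeholder, not a name the solution creates.
bulk_upload_commands my-bulk-bucket ./report.pdf ./acl.json
```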

Other

Cost

Delete demo application

  1. CICD Deploy:

Either run aws cloudformation delete-stack --stack-name {CICD stack}, or go to CloudFormation in the AWS Console and delete the stack whose name ends with "CICD". You will also have to go to CodeCommit in the console and manually delete the repository that was created during the deploy.

  2. Development Deploy:

Make sure you are in the source directory, and then run yarn destroy.
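For a CICD deploy, the teardown steps above, plus the S3 buckets that are preserved when the stack is destroyed, can be collected into one list of commands. The sketch below only prints them for review. The function name, repository name, and bucket names are hypothetical, so substitute the names from your own account.

```shell
# Hypothetical helper: print (not run) the teardown commands for a CICD deploy.
# Arguments: stack name, CodeCommit repository name, then leftover bucket names.
teardown_commands() {
  stack=$1 repo=$2
  shift 2
  echo "aws cloudformation delete-stack --stack-name $stack"
  echo "aws cloudformation wait stack-delete-complete --stack-name $stack"
  echo "aws codecommit delete-repository --repository-name $repo"
  for bucket in "$@"; do
    # --force empties the bucket before removing it.
    echo "aws s3 rb s3://$bucket --force"
  done
}

# Repository and bucket names below are placeholders.
teardown_commands DocumentUnderstandingSolutionCICD my-dus-repo my-leftover-bucket-1 my-leftover-bucket-2
```

Printing the commands first gives you a chance to confirm each bucket name, since removing a bucket with --force deletes its contents.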

License

This project is licensed under the Apache-2.0 License. You may not use this file except in compliance with the License. A copy of the License is located at http://www.apache.org/licenses/

Additional Notes

The intended use is for users to use this application as a reference architecture to build production ready systems for their use cases. Users will deploy this solution in their own AWS accounts and own the deployment, maintenance and updates of their applications based on this solution.

Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the specific language governing permissions and limitations under the License.

The searchable PDF functionality is included as a pre-compiled jar binary. See the following README for more information: source/lambda/pdfgenerator/README.md