All classes are under active development and subject to non-backward compatible changes or removal in any future version. These are not subject to the Semantic Versioning model. This means that while you may use them, you may need to update your source code when upgrading to a newer version of this package.
This is a collection of sample workflows designed to showcase the usage of the Amazon Textract IDP CDK Constructs.
The samples use the AWS Cloud Development Kit (AWS CDK) and also require Docker.
You can spin up an AWS Cloud9 instance, which has the AWS CDK and Docker already set up.
After cloning the repository, install the dependencies:
pip install -r requirements.txt
Then deploy a stack, for example:
cdk deploy DemoQueries
At the moment there are 10 stacks available:
Deploy using
cdk deploy DocumentSplitterWorkflow
This sample includes a new component called DocumentSplitter, which takes an input document of type TIFF or PDF, outputs each individual page to an S3 location, and adds the list of filenames to an array.
That array is then used in a Step Functions Map state and processed in parallel. Each iteration classifies the page and routes it to an extraction process only if it is a W2 or a paystub. At the end, all W2s and paystubs are extracted and the Map state returns an array with the page numbers and their classification results.
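The per-page logic described above can be sketched locally in plain Python. This is only an illustration of the classify-then-route pattern, not the code of the sample stack: `classify_page`, `process_page`, `run_map`, and the file names are hypothetical, and the classifier is replaced by a file-name check so the sketch runs without AWS access.

```python
from concurrent.futures import ThreadPoolExecutor

# Document types that are routed to extraction, per the description above.
EXTRACT_TYPES = {"W2", "PAYSTUB"}


def classify_page(page_key: str) -> str:
    """Hypothetical stand-in for the classification step.

    The real workflow calls a classifier; here the type is guessed from the
    file name so that the sketch needs no AWS services.
    """
    name = page_key.lower()
    if "w2" in name:
        return "W2"
    if "paystub" in name:
        return "PAYSTUB"
    return "OTHER"


def process_page(item: tuple[int, str]) -> dict:
    """One Map iteration: classify, then extract only W2s and paystubs."""
    page_number, page_key = item
    doc_type = classify_page(page_key)
    return {
        "page": page_number,
        "classification": doc_type,
        "extracted": doc_type in EXTRACT_TYPES,
    }


def run_map(page_keys: list[str]) -> list[dict]:
    """Process all pages in parallel and return the results in page order."""
    with ThreadPoolExecutor() as pool:
        return list(pool.map(process_page, enumerate(page_keys, start=1)))


if __name__ == "__main__":
    # Hypothetical page files, as the splitter might list them.
    for result in run_map(["out/w2_1.tiff", "out/letter_2.tiff", "out/paystub_3.tiff"]):
        print(result)
```

In the deployed workflow the parallelism comes from the Step Functions Map state; the thread pool here only mimics that behavior for a local run.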
When you look at the execution in the AWS Management Console under Step Functions, the "Graph inspector" may not render correctly while the "Execution event history" is still loading, which is indicated by the spinning progress circle next to the "Execution event history" text. Wait for the loading to finish.
We are planning to have a better UI experience in the future.
Deploy using
cdk deploy PaystubAndW2Spacy
This sample showcases a number of components, including classification using Comprehend and routing based on the document type, followed by configuration based on the document types.
It is called Paystub and W2 because those are the document types configured in the RouteDocType and the DemoIDP-Configurator.
At the moment it only handles single-page documents. For multi-page documents, check the Document Splitter Workflow.
Check the API definition for the Constructs at: https://github.com/aws-samples/amazon-textract-idp-cdk-constructs/blob/main/API.md
From top to bottom:
The Aurora RDS Cluster runs in a private VPC. To access it, check the commented-out EC2 section in the sample stack and enter your own settings for security groups, AMI, and key pair. (We'll make it easier in the future.)
Simple example of a flow only calling synchronous Textract for DetectText.
Deploy using
cdk deploy PaystubAndW2Comprehend
This sample showcases a number of components, including classification using Comprehend and routing based on the document type, followed by configuration based on the document types. It is called Paystub and W2 because those are the document types configured in the RouteDocType and the DemoIDP-Configurator.
At the moment it only handles single-page documents. For multi-page documents, check the Document Splitter Workflow.
Here is the flow:
Deploy using
cdk deploy SimpleAsyncWorkflow
Very basic workflow to demonstrate AsyncProcessing. Out of the box, it only calls DetectText, generating OCR output. If you are interested in running specific queries or features like forms or tables on a set of documents, look at DemoQueries.
Deploy using
cdk deploy DemoQueries
Basic workflow to demonstrate how Sync and Async processing can be routed based on numberOfPages and numberOfQueries, and how the workflow can be triggered with queries. It calls AnalyzeDocument with the 2 sample queries; modify them to your own needs. The queries are configured in lambda/start_queries/app/start_execution.py, at the point where the Step Functions workflow is started. The GenerateCsvTask outputs one CSV file to S3 with key/value pairs, confidence scores, and bounding box information based on the forms and queries output.
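As a rough illustration of the routing idea, the sketch below picks a path from the page and query counts and builds a `QueriesConfig` dictionary in the shape accepted by the Textract AnalyzeDocument API. The thresholds, the `DocumentInfo` class, the function names, and the example query are assumptions made for this sketch and are not taken from the sample code.

```python
from dataclasses import dataclass

# Assumed thresholds, not taken from the sample code. Check them against the
# current Amazon Textract quotas and your own stack configuration.
MAX_SYNC_PAGES = 1
MAX_SYNC_QUERIES = 15


@dataclass
class DocumentInfo:
    number_of_pages: int
    number_of_queries: int


def choose_route(doc: DocumentInfo) -> str:
    """Return "sync" for small requests and "async" for everything else."""
    fits_sync = (
        doc.number_of_pages <= MAX_SYNC_PAGES
        and doc.number_of_queries <= MAX_SYNC_QUERIES
    )
    return "sync" if fits_sync else "async"


def build_queries_config(queries: list[tuple[str, str]]) -> dict:
    """Build a QueriesConfig dictionary from (alias, question) pairs."""
    return {"Queries": [{"Text": text, "Alias": alias} for alias, text in queries]}


if __name__ == "__main__":
    # Hypothetical query; replace it with the questions relevant to you.
    config = build_queries_config(
        [("PAYSTUB_YTD_GROSS", "What is the year to date gross pay?")]
    )
    doc = DocumentInfo(number_of_pages=3, number_of_queries=len(config["Queries"]))
    print(choose_route(doc), config)
```

Keeping the thresholds as named constants makes it easy to adjust the decision without touching the routing function itself.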
Deploy using
cdk deploy InsuranceStack
Simple flow including A2I
Deploy using
cdk deploy SimpleAsyncWorkflow
Simple flow calling the Textract AnalyzeID API.
Deploy using
cdk deploy AnalyzeID
Simple flow calling the Textract AnalyzeExpense API.
Deploy using
cdk deploy AnalyzeExpense
Example of using the Amazon Textract Analyze Lending API to extract information from mortgage documents and generate a CSV. Pages that the Analyze Lending API marks as UNCLASSIFIED are processed in a separate branch, which also extracts information and generates a CSV.
Deploy using
cdk deploy LendingWorkflow
The workflow uses a custom classification model to identify the HOMEOWNERS_INSURANCE_APPLICATION and CONTACT_FORM. The classifier is trained only on the sample images and is for demo purposes only.
aws s3 cp s3://amazon-textract-public-content/idp-cdk-samples/lending_console_demo_with_contacts.pdf $(aws cloudformation list-exports --query 'Exports[?Name==`LendingWorkflow-DocumentUploadLocation`].Value' --output text)
Then open the Step Functions flow:
aws cloudformation list-exports --query 'Exports[?Name==`LendingWorkflow-StepFunctionFlowLink`].Value' --output text
This is an example of how to populate an OpenSearch service with data from documents. The index pattern includes:
Deploy using
cdk deploy OpenSearchWorkflow
The workflow first splits the document into chunks of at most 3000 pages, because that is the limit of the Textract service for asynchronous processing. Each chunk is then sent to StartDocumentAnalysis, extracting the OCR information from the pages. The metadata added to the context of the Step Functions workflow includes information required for creating the OpenSearch bulk import file, including ORIGIN_FILE_NAME and START_PAGE_NUMBER.
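The chunking step can be pictured with a small calculation. The sketch below plans page ranges for a large document and attaches the two metadata fields named above. The `plan_chunks` function, the `END_PAGE_NUMBER` field, and the 1-based page numbering are assumptions of this sketch; the actual workflow may count pages differently.

```python
MAX_PAGES_PER_CHUNK = 3000  # chunk size named in the description above


def plan_chunks(origin_file_name: str, total_pages: int,
                max_pages: int = MAX_PAGES_PER_CHUNK) -> list[dict]:
    """Plan page ranges so that no chunk exceeds max_pages pages."""
    if total_pages < 1 or max_pages < 1:
        raise ValueError("total_pages and max_pages must be positive")
    chunks = []
    for start in range(1, total_pages + 1, max_pages):
        chunks.append({
            "ORIGIN_FILE_NAME": origin_file_name,
            "START_PAGE_NUMBER": start,  # assumed to be 1-based
            # Hypothetical helper field, not mentioned in the sample.
            "END_PAGE_NUMBER": min(start + max_pages - 1, total_pages),
        })
    return chunks


def original_page(chunk: dict, page_in_chunk: int) -> int:
    """Map a 1-based page number inside a chunk back to the source document."""
    return chunk["START_PAGE_NUMBER"] + page_in_chunk - 1


if __name__ == "__main__":
    # Hypothetical 7000-page document.
    for chunk in plan_chunks("big-report.pdf", 7000):
        print(chunk)
```

Carrying the start page with each chunk is what lets a later step translate chunk-local page numbers back to positions in the original file when building the bulk import file.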
Take a look at the sample workflows. Copy one as a starting point and go from there.