Run the following command to install all dependencies into a virtualenv, build the CloudFormation stack from the template, and deploy the stack to your AWS account with interactive guidance.
make ready-and-deploy-guided
Run the following commands to install pipenv, aws-sam-toolkit, and dependencies, set up the virtualenv, etc.
make bootstrap
make activate
Run the following command to run basic style checks, cfn-lint, etc., and build the CloudFormation template.
make build
Run the following command to package the CloudFormation template so that it is ready for CloudFormation deployment, and follow the interactive guidance for deployment. This CloudFormation stack will manage the created Lambdas, IAM roles and S3 bucket. (IMPORTANT: Keep note of the CloudFormation stack name, as it will be used later.)
make deploy-guided
Note:
Run make deploy instead if there is already a local samconfig.toml file.
The deploy targets also accept variables on the command line, for example:
make deploy-guided AWS_PROFILE=<profile-name> AWS_REGION=<aws-region-name>
make deploy PRE_HUMAN_LAMBDA_TIMEOUT_IN_SECONDS=600 CONSOLIDATION_LAMBDA_TIMEOUT_IN_SECONDS=600
Run make update and continue to Step 2: Build.
Please refer to the official SageMaker Ground Truth Guide to create a private workforce, and record the corresponding workteam ARN (e.g. arn:aws:sagemaker:{AWS_REGION}:{AWS_ACCOUNT_ID}:workteam/private-crowd/{WORKFORCE_TEAM_ARN}).
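The CLI described later takes a work team name rather than the full ARN. The following is a minimal sketch of pulling the trailing name out of a recorded workteam ARN, assuming the name expected by --work-team-name is the last path segment of the ARN. The function name and the example ARN are hypothetical and not part of this tool.

```python
def work_team_name_from_arn(arn: str) -> str:
    """Return the trailing segment of a workteam ARN (assumed to be the team name)."""
    prefix, sep, resource = arn.partition(":workteam/")
    if not sep or not prefix.startswith("arn:"):
        raise ValueError(f"not a workteam ARN: {arn!r}")
    # resource is expected to look like "private-crowd/<team-name>"
    _crowd, slash, team = resource.partition("/")
    if not slash or not team:
        raise ValueError(f"workteam ARN has no team segment: {arn!r}")
    return team


# Hypothetical ARN, following the pattern shown above.
example_arn = "arn:aws:sagemaker:us-west-2:123456789012:workteam/private-crowd/my-team"
print(work_team_name_from_arn(example_arn))
```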
From the How to Build, Package and Deploy section, a CloudFormation stack has been deployed, which contains an S3 bucket. This bucket stores all data needed for the labeling job, and the Lambda IAM execution role policy refers to it so that the Lambda functions have the necessary permissions to access the data. The S3 bucket name can be found in the CloudFormation stack outputs under the key SemiStructuredDocumentsS3Bucket.
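If you prefer to read the bucket name programmatically, the lookup could be sketched as below. The helper name and the sample response are illustrative assumptions; the dictionary mirrors the shape returned by the CloudFormation DescribeStacks API (for example via boto3's describe_stacks), and the bucket value is a placeholder.

```python
def get_stack_output(describe_stacks_response: dict, key: str) -> str:
    """Find an output value by key in a DescribeStacks-style response."""
    for stack in describe_stacks_response.get("Stacks", []):
        for output in stack.get("Outputs", []):
            if output.get("OutputKey") == key:
                return output["OutputValue"]
    raise KeyError(f"stack output not found: {key}")


# Placeholder response. In practice it could come from something like:
#   boto3.client("cloudformation").describe_stacks(StackName="<cfn-name>")
sample_response = {
    "Stacks": [
        {
            "StackName": "sam-app",
            "Outputs": [
                {
                    "OutputKey": "SemiStructuredDocumentsS3Bucket",
                    "OutputValue": "comprehend-semi-structured-docs-us-west-2-123456789012",
                }
            ],
        }
    ]
}

bucket_name = get_stack_output(sample_response, "SemiStructuredDocumentsS3Bucket")
print(bucket_name)
```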
You need to upload the source semi-structured documents into this bucket. Here is a sample AWS CLI command you can use to upload source documents from a local directory to the S3 bucket:
AWS_REGION=`aws configure get region`;
AWS_ACCOUNT_ID=`aws sts get-caller-identity | jq -r '.Account'`;
aws s3 cp --recursive <local-path-to-source-docs> s3://comprehend-semi-structured-docs-${AWS_REGION}-${AWS_ACCOUNT_ID}/<source-folder-name>/
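The destination in the command above follows a fixed naming pattern. As a small sketch, the same URI can be assembled in Python, assuming your bucket follows the comprehend-semi-structured-docs-{region}-{account-id} pattern used in the sample command. The function name and the example values are hypothetical.

```python
def source_docs_s3_uri(region: str, account_id: str, folder: str) -> str:
    """Build the S3 URI for a source folder, mirroring the sample CLI command."""
    bucket = f"comprehend-semi-structured-docs-{region}-{account_id}"
    # Strip stray slashes so the result always ends with exactly one "/".
    return f"s3://{bucket}/{folder.strip('/')}/"


uri = source_docs_s3_uri("us-west-2", "123456789012", "source-docs")
print(uri)
```

The same value can later be passed as the input S3 path when creating a labeling job.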
Prerequisites:
Confirm that you are in the pipenv shell by running pipenv shell. You should see something like: Shell for /Users/<user>/.local/share/virtualenvs <package_name>-zsZ94mSG already activated. No action taken to avoid nested environments.
Otherwise, run make bootstrap to enter the pipenv shell.
comprehend-ssie-annotation-tool-cli.py under the bin/ directory is a simple wrapper command that can be used to streamline the creation of a SageMaker Ground Truth labeling job. Under the hood, this CLI script reads the source documents from the S3 path specified as an argument and creates a corresponding input manifest file with a single page of one source document per line. The input manifest file is then used as input to the labeling job. In addition, you provide a list of entity types that you define, which become the visible types in the Ground Truth UI for annotators to label each page of the document. The following is an example of using the CLI to start a labeling job:
input-s3-path: S3 URI of the source documents you copied earlier in Upload Source Semi-Structured Documents to S3 bucket
cfn-name: The name of the CloudFormation stack entered in the Package and Deploy step
work-team-name: The workforce name created in [One-time setup] Create a Private Workforce for Future Labeling Jobs
region: The AWS region, e.g. us-west-2
job-name-prefix: The prefix for the SageMaker Ground Truth labeling job name (LIMIT: 29 characters). Note: Extra text will be appended to the job name prefix, e.g. -labeling-job-task-20210902T232116, unless --no-suffix is used (see Additional customizable options (3)).
entity-types: The entities you would like to use during the labeling job (separated by commas)
AWS_REGION=`aws configure get region`;
AWS_ACCOUNT_ID=`aws sts get-caller-identity | jq -r '.Account'`;
python bin/comprehend-ssie-annotation-tool-cli.py \
--input-s3-path s3://comprehend-semi-structured-docs-${AWS_REGION}-${AWS_ACCOUNT_ID}/<source-folder-name>/ \
--cfn-name sam-app \
--work-team-name <private-work-team-name> \
--region ${AWS_REGION} \
--job-name-prefix "${USER}-job" \
--entity-types "EntityTypeA, EntityTypeB, EntityTypeC"
The job has now been created and can be accessed in the SageMaker Ground Truth labeling portal.
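To check a prefix against the 29-character limit before submitting, and to preview roughly what the final job name may look like, a sketch along these lines can help. The suffix format here is an assumption inferred only from the single example -labeling-job-task-20210902T232116; the tool's actual suffix logic may differ. The function and constant names are hypothetical.

```python
from datetime import datetime

MAX_PREFIX_LENGTH = 29  # limit stated for the job name prefix


def preview_job_name(prefix: str, created_at: datetime, no_suffix: bool = False) -> str:
    """Validate the prefix length and return an approximate job name."""
    if len(prefix) > MAX_PREFIX_LENGTH:
        raise ValueError(
            f"prefix is {len(prefix)} characters; the limit is {MAX_PREFIX_LENGTH}"
        )
    if no_suffix:
        # Mirrors the --no-suffix option: the prefix is used as the whole name.
        return prefix
    # Assumed suffix format, inferred from the documented example only.
    return f"{prefix}-labeling-job-task-{created_at:%Y%m%dT%H%M%S}"


name = preview_job_name("alice-job", datetime(2021, 9, 2, 23, 21, 16))
print(name)
```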
For more information about the CLI options, use the -h option, e.g.:
python bin/comprehend-ssie-annotation-tool-cli.py -h
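When the labeling job is launched from another script, the argument list can be assembled in Python instead of a shell string. The following is a sketch that uses only the flags documented above; the helper name and all example values are placeholders, and joining the entity types with a comma and a space simply mirrors the example invocation.

```python
import shlex


def build_cli_command(input_s3_path, cfn_name, work_team_name, region,
                      job_name_prefix, entity_types):
    """Return the argv list for starting a labeling job with the wrapper CLI."""
    return [
        "python", "bin/comprehend-ssie-annotation-tool-cli.py",
        "--input-s3-path", input_s3_path,
        "--cfn-name", cfn_name,
        "--work-team-name", work_team_name,
        "--region", region,
        "--job-name-prefix", job_name_prefix,
        # The CLI expects a single comma-separated string of entity types.
        "--entity-types", ", ".join(entity_types),
    ]


command = build_cli_command(
    input_s3_path="s3://comprehend-semi-structured-docs-us-west-2-123456789012/source-docs/",
    cfn_name="sam-app",
    work_team_name="my-team",
    region="us-west-2",
    job_name_prefix="alice-job",
    entity_types=["EntityTypeA", "EntityTypeB", "EntityTypeC"],
)

# Print a shell-quoted form for inspection; the list itself could be passed
# to subprocess.run(command, check=True) from the repository root.
print(shlex.join(command))
```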
Additional customizable options:
1. Use the --use-textract-only flag to instruct the annotation tool to use only the Amazon Textract AnalyzeDocument API to parse the PDF document. By default, the tool tries to auto-detect the format of the source PDF document and uses either PDFPlumber or Textract to parse it.
2. Use the --annotator-metadata parameter to reveal key-value information to annotators. Default metadata about the document is already revealed to the annotator within the UI side panel.
3. Use the --no-suffix parameter to use only --job-name-prefix for the job name. By default, a unique ID is suffixed to the --job-name-prefix.
4. Use the --task-time-limit parameter to set a time limit in seconds for each task in the SageMaker Ground Truth labeling job. By default, the time limit is 3600 seconds (1 hour).
5. To base a job on a previous annotation job, omit --input-s3-path and --entity-types from the labeling job creation script, and include --blind1-labeling-job-name, which should be the name of the previous annotation job.
AWS_REGION=`aws configure get region`;
AWS_ACCOUNT_ID=`aws sts get-caller-identity | jq -r '.Account'`;
python bin/comprehend-ssie-annotation-tool-cli.py \
--cfn-name sam-app \
--work-team-name <private-work-team-name> \
--region ${AWS_REGION} \
--job-name-prefix "${USER}-job" \
--blind1-labeling-job-name <sagemaker-blind1-labeling-job-name>
6. To start a verification job, omit --input-s3-path and --entity-types, and include --blind1-labeling-job-name and --blind2-labeling-job-name. The verification job will use the documents and entity types from the first annotation (blind1) job.
AWS_REGION=`aws configure get region`;
AWS_ACCOUNT_ID=`aws sts get-caller-identity | jq -r '.Account'`;
python bin/comprehend-ssie-annotation-tool-cli.py \
--cfn-name sam-app \
--work-team-name <private-work-team-name> \
--region ${AWS_REGION} \
--job-name-prefix "${USER}-job" \
--blind1-labeling-job-name <sagemaker-blind1-labeling-job-name> \
--blind2-labeling-job-name <sagemaker-blind2-labeling-job-name>
7. Use the --task-availability-time-limit parameter to set a time limit in seconds for the SageMaker Ground Truth labeling job. By default, the time limit is 864000 seconds (10 days).
8. Use the --only-include-expired-tasks flag in conjunction with --blind1-labeling-job-name to include only tasks that had expired (only available for use in a verification job).
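Both time-limit options take values in seconds, which are easy to mistype. As a small sketch, the values can be derived from readable durations; the helper name is hypothetical, and the two durations shown are the documented defaults.

```python
from datetime import timedelta


def seconds_flag(flag: str, limit: timedelta) -> list:
    """Return a [flag, value] pair with the duration expressed in whole seconds."""
    return [flag, str(int(limit.total_seconds()))]


# Documented defaults: 1 hour per task, 10 days of task availability.
task_limit_args = seconds_flag("--task-time-limit", timedelta(hours=1))
availability_args = seconds_flag("--task-availability-time-limit", timedelta(days=10))

print(task_limit_args + availability_args)
```

The resulting pairs can be appended to the argument list of the CLI invocation.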