aws-samples / amazon-comprehend-semi-structured-documents-annotation-tools

Duplicate documents when starting a labeling job on SageMaker GroundTruth #12

Closed jetsonearth closed 2 years ago

jetsonearth commented 2 years ago

Hi! Love the great work you guys have put into this project :) I am running into something weird: when I run the annotation tool CLI Python file from my terminal, I am able to create the labeling job. However, when I check the job status in the SageMaker UI, the labeling job includes duplicate documents, so the number of dataset objects goes from 284 to 830. The S3 bucket where I store the data contains only 284 documents, but the labeling job has 830. I am pretty befuddled... any idea why this happened?

This is the command I typed to invoke the labeling job:

AWS_REGION=$(aws configure get region);
AWS_ACCOUNT_ID=$(aws sts get-caller-identity | jq -r '.Account');
python bin/comprehend-ssie-annotation-tool-cli.py \
    --input-s3-path s3://comprehend-semi-structured-docs-${AWS_REGION}-${AWS_ACCOUNT_ID}/electricity_training_data \
    --cfn-name eb-ner-pipeline \
    --work-team-name unstructured-dt-team \
    --region ${AWS_REGION} \
    --job-name-prefix "${USER}-job" \
    --entity-types "Entity, ActivityStartDate, ActivityEndDate, EnergyUsed, EnergyUOM"

jetsonearth commented 2 years ago

See how these two files have 3 to 4 duplicates?

[Screenshot: Screen Shot 2022-07-11 at 5 14 24 PM]

dnlen commented 2 years ago

During labeling in the SageMaker Ground Truth UI, each task is a single page, so each dataset object refers to one page within a document. For example, from your screenshot, the 4 ENGIE_Sample Bill Template_Blank1.pdf dataset objects most likely refer to the 4 pages in that document.
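
As a rough illustration of the one-task-per-page behavior, here is a minimal Python sketch that expands documents into per-page entries and counts them per document. It is not the tool's actual code: `expand_to_page_tasks`, the second file name, and its page count are made up for illustration, and the 4-page figure for the ENGIE file is only the likely value mentioned above.

```python
from collections import Counter

# Hypothetical page counts. Only the first file name comes from this thread,
# and its 4 pages are the "most likely" figure, not a confirmed one.
page_counts = {
    "ENGIE_Sample Bill Template_Blank1.pdf": 4,
    "example_bill_2.pdf": 3,  # made-up document for illustration
}


def expand_to_page_tasks(page_counts):
    """Return one (document, page_number) pair per page.

    This mimics a labeling job that creates one dataset object per page.
    """
    return [
        (doc, page)
        for doc, n_pages in page_counts.items()
        for page in range(1, n_pages + 1)
    ]


tasks = expand_to_page_tasks(page_counts)

# Group the per-page entries back by document name.
objects_per_document = Counter(doc for doc, _ in tasks)

print(len(page_counts), "documents ->", len(tasks), "dataset objects")
for doc, count in objects_per_document.items():
    print(f"{doc}: {count} objects")
```

By the same arithmetic, 830 dataset objects across 284 documents would correspond to an average of roughly 2.9 pages per document, so the object count matching the total page count is the thing to check.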

jetsonearth commented 2 years ago

@dnlen got it, thank you for the response!