aws-samples / amazon-comprehend-semi-structured-documents-annotation-tools

Documents Appear to be "Read Only" #13

Closed by jetsonearth 2 years ago

jetsonearth commented 2 years ago

Hi developers! I was able to create a labeling job and open the annotation tool to perform labeling. However, I could only annotate one page of each document, while all the other pages of the same document are "read only". The entities will not necessarily all appear on the same page, so I would like to be able to annotate every page of a document. How can I fix that? Thanks!

jetsonearth commented 2 years ago

Can someone please take a look and help? This is really urgent and I have been having trouble with the annotation tool. I would really appreciate some assistance. :(

dnlen commented 2 years ago

The read-only page navigation is there mainly to provide page context. Each labeling job task is a single page. When that page is complete, clicking Submit loads another page. All the pages will eventually be loaded one by one for annotation.

Note: SageMaker Ground Truth may not load pages in sequential order.
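
To make the one-page-per-task behavior concrete, here is a toy Python model of it. This is only an illustration and not the tool's actual code: the document names, page counts, and both helper functions are made up, and the shuffle is just a stand-in for a non-sequential serving order.

```python
import random
from collections import defaultdict


def build_page_tasks(page_counts):
    """Return one (document, page) task per page; pages are numbered from 1."""
    return [
        (doc, page)
        for doc, count in page_counts.items()
        for page in range(1, count + 1)
    ]


def annotated_pages(submitted_tasks):
    """Group submitted tasks by document, regardless of arrival order."""
    done = defaultdict(set)
    for doc, page in submitted_tasks:
        done[doc].add(page)
    return dict(done)


# Hypothetical input: two documents with 3 and 2 pages.
page_counts = {"bill.pdf": 3, "invoice.pdf": 2}

tasks = build_page_tasks(page_counts)
random.shuffle(tasks)  # stand-in for pages being served out of order

# Submitting every task once covers every page of every document.
coverage = annotated_pages(tasks)
```

Whatever order the tasks come in, each page appears as its own task exactly once, so every page of a document can be annotated by the time the job finishes.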

jetsonearth commented 2 years ago

Thanks for your response @dnlen !

However, there is a more serious issue with the annotation tool: it does not let me select certain text within a document. For instance, in this file, when I tried to select the text segment "Total Usage 95 kWh", I could only select "Total Usage" but not "95 kWh". For my particular task, "95 kWh" is where the important information is, but the tool does not let me label it. Is there a fix for this?

If it only happens to this particular document, then it is fine; nevertheless, I am afraid that this would happen down the road, and it wouldn't be acceptable.

Again, thank you for your support and thank you and the other developers for putting so much work into engineering this annotation tool!

dnlen commented 2 years ago

By default, some text on a PDF page may be detected as part of an image, and image content is not parsed. To parse as much text as possible, pass the --use-textract-only option when creating the labeling job. This tells the tool to parse the PDFs with Textract's DetectDocumentText API.

Note that this could increase cost as Textract API calls are being made.
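
To check ahead of time whether Textract picks up a value such as "95 kWh", you can call DetectDocumentText directly on a page. The sketch below is independent of the annotation tool and makes some assumptions: boto3 is installed, AWS credentials are configured, and the page is passed as raw bytes of a single-page image. The helper names, the default region, and the sample response are made up for illustration; only the `Blocks`, `BlockType`, and `Text` keys follow the documented response shape.

```python
def extract_lines(response):
    """Return the text of every LINE block in a DetectDocumentText response."""
    return [
        block["Text"]
        for block in response.get("Blocks", [])
        if block.get("BlockType") == "LINE"
    ]


def detect_lines(page_bytes, region_name="us-east-1"):
    """Send one page to Textract and return the detected lines.

    Each call is billed, which is the cost mentioned above.
    The region default is an assumption; use your own.
    """
    import boto3  # imported lazily so extract_lines works without boto3

    client = boto3.client("textract", region_name=region_name)
    response = client.detect_document_text(Document={"Bytes": page_bytes})
    return extract_lines(response)


# Hand-written sample with the same shape as a real response (not real output).
sample_response = {
    "Blocks": [
        {"BlockType": "PAGE"},
        {"BlockType": "LINE", "Text": "Total Usage 95 kWh"},
        {"BlockType": "WORD", "Text": "Total"},
        {"BlockType": "WORD", "Text": "Usage"},
        {"BlockType": "WORD", "Text": "95"},
        {"BlockType": "WORD", "Text": "kWh"},
    ]
}

lines = extract_lines(sample_response)
```

If the value you need shows up in the lines returned for a real page, Textract-based parsing should expose it for selection. If it does not, the text is probably not recoverable from that page image.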

jetsonearth commented 2 years ago

@dnlen Got it, thank you!