aws-samples / amazon-textract-enhancer

This workshop demonstrates how to build a Document parser and query engine with Amazon Textract and other services, such as ElasticSearch and DynamoDB.
MIT No Attribution

KeyError: 'Relationships' thrown when parsing Textract page results in textract_util.py #4

Open matwerber1 opened 5 years ago

matwerber1 commented 5 years ago

Hi,

I'm running the demo project with a PDF and receiving the error below. I'm working on investigating, but wanted to open the issue for tracking.

START RequestId: 0b976833-8688-42b4-9865-fe3e2955f984 Version: $LATEST
1 messages recieved
JobId = d1069774cc8edcbf66b12ff7497476fab0f653e80368d24de569e5e135a4e56b
Status = SUCCEEDED
Timestamp = 1567367135
API = StartDocumentTextDetection
JobTag = TextractTextDetectionJob-e925e04aeb58efc79afb74f5c6953cc2
S3ObjectName = My Test.pdf
S3Bucket = 544941453660-scanned-documents
upload_prefix = d1069774cc8edcbf66b12ff7497476fab0f653e80368d24de569e5e135a4e56b
Retrieved 1000 Blocks from Textract Text Detection response
Retrieved 1000 Blocks from Textract Text Detection response
Retrieved 1000 Blocks from Textract Text Detection response
Retrieved 1000 Blocks from Textract Text Detection response
Retrieved 1000 Blocks from Textract Text Detection response
Retrieved 890 Blocks from Textract Text Detection response
5890 Blocks retrieved
Extracted Block Types:
PAGE = 10
LINE = 2353
WORD = 3527
Page-1 contains 184 Lines
Page-2 contains 388 Lines
Page-3 contains 487 Lines
Page-4 contains 251 Lines
Page-5 contains 262 Lines
Page-6 contains 352 Lines
Page-7 contains 371 Lines
Page-8 contains 23 Lines
Page-9 contains 35 Lines
'Relationships': KeyError
Traceback (most recent call last):
File "/var/task/detect-text-postprocess-page.py", line 74, in lambda_handler
document_text, num_lines = extractTextBody(blocks)
File "/var/task/textract_util.py", line 423, in extractTextBody
print("Page-{} contains {} Lines".format(page['Page'], len(page['Relationships'][0]['Ids'])))
KeyError: 'Relationships'

END RequestId: 0b976833-8688-42b4-9865-fe3e2955f984
matwerber1 commented 5 years ago

Found the issue...

Function extractTextBody() in textract_util.py assumes that every PAGE block has child blocks (e.g. LINE, WORD) listed under its Relationships key. However, if a scanned document contains a blank page, that page's block has no Relationships key at all.

Here's the PAGE block that triggers the error, taken from a blank scanned page:

{  
   'BlockType':'PAGE',
   'Geometry':{  
      'BoundingBox':{  
         'Width':1.0,
         'Height':1.0,
         'Left':0.0,
         'Top':0.0
      },
      'Polygon':[  
         {  
            'X':1.0,
            'Y':0.622184693813324
         },
         {  
            'X':0.3762473165988922,
            'Y':1.0
         },
         {  
            'X':0.0,
            'Y':0.3771035075187683
         },
         {  
            'X':0.6218673586845398,
            'Y':0.0
         }
      ]
   },
   'Id':'25bfd28a-ddea-40db-9d90-9c3237510dc5',
   'Page':10
}

In my case, I am testing with documents provided by financial institutions, which occasionally have blank pages at the end of a document or between sections. There's no easy way to avoid these blank pages, so my suggestion is that extractTextBody() be modified to check whether the Relationships key is present before accessing it.
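
For illustration, here is a minimal sketch of that kind of guard. It is not the actual code in textract_util.py: the helper names count_child_ids() and summarize_pages() are hypothetical, and the second test block (with its Id and child Ids) is made up for the example. It only shows one way to treat a missing Relationships key as "zero lines" instead of raising KeyError.

```python
def count_child_ids(block):
    """Return how many child IDs a Textract block lists, or 0 if it has none.

    Hypothetical helper, not part of textract_util.py.
    """
    total = 0
    # Blank pages omit 'Relationships' entirely, so fall back to an empty list.
    for relationship in block.get("Relationships") or []:
        if relationship.get("Type") == "CHILD":
            total += len(relationship.get("Ids", []))
    return total


def summarize_pages(blocks):
    """Map each PAGE block's page number to its child count (hypothetical helper)."""
    return {
        block["Page"]: count_child_ids(block)
        for block in blocks
        if block.get("BlockType") == "PAGE"
    }


# The blank PAGE block from this issue, trimmed to the relevant keys.
blank_page = {
    "BlockType": "PAGE",
    "Id": "25bfd28a-ddea-40db-9d90-9c3237510dc5",
    "Page": 10,
}

# A made-up non-blank page for comparison (Id and child Ids are placeholders).
text_page = {
    "BlockType": "PAGE",
    "Id": "placeholder-page-id",
    "Page": 9,
    "Relationships": [{"Type": "CHILD", "Ids": ["a", "b", "c"]}],
}

for page_number, line_count in summarize_pages([text_page, blank_page]).items():
    print("Page-{} contains {} Lines".format(page_number, line_count))
```

With a check like this, a blank page would be reported as containing 0 Lines rather than aborting the whole Lambda invocation. Whether a blank page should be logged, skipped, or still written out as an empty page is a separate decision for the maintainers.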