aws-samples / amazon-textract-response-parser

Parse JSON response of Amazon Textract
Apache License 2.0

KeyError Geometry on Textract queries #141

Closed aarif1996 closed 1 year ago

aarif1996 commented 1 year ago

Traceback (most recent call last):
  File "/home/ubuntu/sample.py", line 52, in textract_output
    document = extractor.analyze_document(
  File "/usr/local/lib/python3.8/site-packages/textractor/textractor.py", line 438, in analyze_document
    document = response_parser.parse(response)
  File "/usr/local/lib/python3.8/site-packages/textractor/parsers/response_parser.py", line 906, in parse
    return parse_document_api_response(response)
  File "/usr/local/lib/python3.8/site-packages/textractor/parsers/response_parser.py", line 770, in parse_document_api_response
    queries = _create_query_objects(
  File "/usr/local/lib/python3.8/site-packages/textractor/parsers/response_parser.py", line 381, in _create_query_objects
    query_results = _create_query_result_objects(
  File "/usr/local/lib/python3.8/site-packages/textractor/parsers/response_parser.py", line 419, in _create_query_result_objects
    block["Geometry"]["BoundingBox"], spatial_object=page
KeyError: 'Geometry'
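The last frame indexes block["Geometry"] directly, which raises when a block carries no Geometry key. As a rough illustration only, a tolerant lookup on the raw Textract JSON could look like the sketch below. It assumes blocks are plain dicts as returned by the API; the helper name bounding_box_or_none is hypothetical and is not part of textractor or of this parser.

```python
from typing import Any, Dict, Optional


def bounding_box_or_none(block: Dict[str, Any]) -> Optional[Dict[str, float]]:
    """Return the block's BoundingBox dict, or None when Geometry is absent.

    Hypothetical helper: handles both a missing "Geometry" key and an
    explicit None value.
    """
    geometry = block.get("Geometry") or {}
    return geometry.get("BoundingBox")


# Hand-written sample blocks for illustration (not real Textract output).
with_geometry = {
    "BlockType": "QUERY_RESULT",
    "Text": "page1",
    "Geometry": {
        "BoundingBox": {"Left": 0.1, "Top": 0.2, "Width": 0.3, "Height": 0.05}
    },
}
without_geometry = {"BlockType": "QUERY_RESULT", "Text": "GRS, GRS"}

print(bounding_box_or_none(with_geometry))
print(bounding_box_or_none(without_geometry))
```

Returning None instead of raising leaves it to the caller to decide how to treat answers that have no coordinates.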

anyaovi commented 1 year ago

I have a similar issue. When I get a straight answer, I do have coordinates, e.g. "What is the title of this doc?" on page 1. However, when I get an 'interpreted' answer, e.g. "What are the standards of this doc?" on page 1, the geometry is set to None:

query is TBlock(geometry=None, id='d1a1bac6-8c00-4b8b-91ef-72ff7d3398d9', block_type='QUERY', relationships=[TRelationship(type='ANSWER', ids=['d3c0611d-a7ba-48ed-9d4a-031e64a3d4f3'])], confidence=None, text=None, column_index=None, column_span=None, entity_types=None, page=1, row_index=None, row_span=None, selection_status=None, text_type=None, custom=None, query=TQuery(text='what are the standards of the certified weight?', alias='tc_certified_shipping_standards'))

rels is TRelationship(type='ANSWER', ids=['d3c0611d-a7ba-48ed-9d4a-031e64a3d4f3']) [TBlock(geometry=None, id='d3c0611d-a7ba-48ed-9d4a-031e64a3d4f3', block_type='QUERY_RESULT', relationships=None, confidence=43.0, text='GRS, GRS', column_index=None, column_span=None, entity_types=None, page=1, row_index=None, row_span=None, selection_status=None, text_type=None, custom=None, query=None)]

I have a fairly large chunk of code that depends on the coordinates, and for 5 months straight I had no issues. I also checked with the other Textract-related libraries set back to their old versions, and tested on old git branches.

So, is this a new way for Textract to answer questions?

schadem commented 1 year ago

@aarif1996 Your issue is with the textractor package, not the amazon-textract-response-parser.

@anyaovi: Does your text 'GRS, GRS' exist on the page, or is it inferred? Query results may not include coordinates when the text is inferred. You do not get an exception, correct?
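For code that depends on coordinates, one way to cope with this is to pair each QUERY block with its QUERY_RESULT blocks and record whether the answer has geometry at all. The following is a sketch only, working on the raw response's block list as plain dicts; the function name query_answers and the sample blocks are made up for illustration.

```python
from typing import Any, Dict, List, Tuple


def query_answers(blocks: List[Dict[str, Any]]) -> List[Tuple[str, str, bool]]:
    """Return (query text, answer text, has_geometry) for every answer.

    Hypothetical helper: follows ANSWER relationships from QUERY blocks to
    QUERY_RESULT blocks and flags answers that carry no Geometry.
    """
    by_id = {block["Id"]: block for block in blocks}
    results = []
    for block in blocks:
        if block.get("BlockType") != "QUERY":
            continue
        for rel in block.get("Relationships") or []:
            if rel.get("Type") != "ANSWER":
                continue
            for answer_id in rel.get("Ids", []):
                answer = by_id.get(answer_id)
                if answer is None:
                    continue
                has_geometry = bool(answer.get("Geometry"))
                results.append(
                    (block["Query"]["Text"], answer.get("Text", ""), has_geometry)
                )
    return results


# Hand-written sample blocks for illustration (not real Textract output).
sample_blocks = [
    {"Id": "q1", "BlockType": "QUERY",
     "Query": {"Text": "What is the title of this doc?"},
     "Relationships": [{"Type": "ANSWER", "Ids": ["a1"]}]},
    {"Id": "a1", "BlockType": "QUERY_RESULT", "Text": "page1",
     "Geometry": {"BoundingBox": {"Left": 0.1, "Top": 0.1,
                                  "Width": 0.2, "Height": 0.05}}},
    {"Id": "q2", "BlockType": "QUERY",
     "Query": {"Text": "what are the standards of the certified weight?"},
     "Relationships": [{"Type": "ANSWER", "Ids": ["a2"]}]},
    {"Id": "a2", "BlockType": "QUERY_RESULT", "Text": "GRS, GRS"},
]

answers = query_answers(sample_blocks)
for question, text, has_geometry in answers:
    print(question, "->", text, "(has coordinates)" if has_geometry else "(no coordinates)")
```

Answers flagged as having no coordinates can then be routed around any code path that needs a bounding box.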

schadem commented 1 year ago

I will close this one; aws-samples/amazon-textract-textractor#195 is the ticket for the KeyError: 'Geometry'.