aws-samples / amazon-textract-response-parser

Parse JSON response of Amazon Textract
Apache License 2.0

Error KeyError: 'TABLE_TITLE' #153

Open OGiesecke opened 1 year ago

OGiesecke commented 1 year ago

I run:

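# Imports as shown in the library's README (assumed; not part of the original post)
from textractcaller.t_call import call_textract, Textract_Features
from trp.trp2 import TDocumentSchema
from trp.t_pipeline import order_blocks_by_geo
import trp

j = call_textract(input_document=f"{awspath}/images/{newfile}_table.jpeg", features=[Textract_Features.TABLES])
# t_doc is not yet ordered
t_doc = TDocumentSchema().load(j)
# ordered_doc has its elements ordered by y-coordinate (top to bottom of page)
ordered_doc = order_blocks_by_geo(t_doc)
# pass to trp for further processing
trp_doc = trp.Document(TDocumentSchema().dump(ordered_doc))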

And get the following error:

  File "/var/folders/6t/kcngxw3s50z4zg416dhcckjc0000gn/T/ipykernel_59247/1703830676.py", line 1, in <cell line: 1>
    t_doc = TDocumentSchema().load(j)

  File "/Users/olivergiesecke/opt/anaconda3/envs/labelstudioenv/lib/python3.9/site-packages/marshmallow/schema.py", line 719, in load
    return self._do_load(

  File "/Users/olivergiesecke/opt/anaconda3/envs/labelstudioenv/lib/python3.9/site-packages/marshmallow/schema.py", line 892, in _do_load
    result = self._invoke_load_processors(

  File "/Users/olivergiesecke/opt/anaconda3/envs/labelstudioenv/lib/python3.9/site-packages/marshmallow/schema.py", line 1090, in _invoke_load_processors
    data = self._invoke_processors(

  File "/Users/olivergiesecke/opt/anaconda3/envs/labelstudioenv/lib/python3.9/site-packages/marshmallow/schema.py", line 1220, in _invoke_processors
    data = processor(data, many=many, **kwargs)

  File "/Users/olivergiesecke/opt/anaconda3/envs/labelstudioenv/lib/python3.9/site-packages/trp/trp2.py", line 848, in make_tdocument
    return TDocument(**data)

  File "<string>", line 14, in __init__

  File "/Users/olivergiesecke/opt/anaconda3/envs/labelstudioenv/lib/python3.9/site-packages/trp/trp2.py", line 468, in __post_init__
    self._block_id_maps[blk.block_type][blk.id] = blk_i

KeyError: 'TABLE_TITLE'

athewsey commented 4 months ago

For the immediate error, it appears that _block_id_maps is only initialized for the block types that are present in the document. I believe this is a bug, because block_id_map(block_type) and block_map(block_type) are documented, exposed functions.

However, I suspect initialising the map alone won't solve the issue, because there must be some reason the loader is searching for a TABLE_TITLE block when the TDocument state hasn't seen any.
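To illustrate the failure mode in the traceback, here is a minimal sketch. It is not the library's code: the Block class, both helper functions, and the assumption that the maps are pre-created for a fixed list of block types are hypothetical. It only shows how a nested write such as maps[blk.block_type][blk.id] raises KeyError when the outer key is missing, and how a defaultdict avoids that.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass
class Block:
    """Hypothetical stand-in for a Textract block (not the trp2 class)."""
    id: str
    block_type: str


def build_maps_fragile(blocks, known_types):
    """Pre-create one map per known type; any other type raises KeyError."""
    maps = {block_type: {} for block_type in known_types}
    for index, blk in enumerate(blocks):
        maps[blk.block_type][blk.id] = index
    return maps


def build_maps_tolerant(blocks):
    """Create the per-type map on first use, so unseen types are accepted."""
    maps = defaultdict(dict)
    for index, blk in enumerate(blocks):
        maps[blk.block_type][blk.id] = index
    return maps


blocks = [Block("a", "TABLE"), Block("b", "TABLE_TITLE")]

fragile_error = None
try:
    # "TABLE_TITLE" is deliberately missing from the known types.
    build_maps_fragile(blocks, known_types=["TABLE", "CELL"])
except KeyError as err:
    fragile_error = err

tolerant = build_maps_tolerant(blocks)
print(repr(fragile_error))
print(dict(tolerant))
```

With a defaultdict, a lookup for a block type that never occurred returns an empty map instead of raising, which is one way to keep lookups by block type safe for any type.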

Would you be able to share a non-confidential document/image that reproduces this issue?

athewsey commented 4 months ago

The _block_id_maps initialization was addressed in the linked PR, and the fix is now released on PyPI as v1.0.3.

I appreciate this issue was originally reported quite some time ago. If anybody is able to share a document that reproduces it (or, even better, to test on v1.0.3+ and confirm whether the fix has helped), we can dive deeper. Otherwise, we'll probably close it out.
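For anyone retesting, a quick way to confirm that the installed release is v1.0.3 or later might look like the sketch below. The helper names are hypothetical, and the distribution name "amazon-textract-response-parser" is assumed to be the name the package was installed under.

```python
from importlib.metadata import PackageNotFoundError, version


def version_tuple(text):
    """Turn '1.0.3' into (1, 0, 3), stopping at the first non-numeric part."""
    parts = []
    for piece in text.split("."):
        digits = ""
        for ch in piece:
            if not ch.isdigit():
                break
            digits += ch
        if not digits:
            break
        parts.append(int(digits))
    return tuple(parts)


def has_fix(installed, minimum="1.0.3"):
    """True if the installed version is at least the release with the fix."""
    return version_tuple(installed) >= version_tuple(minimum)


try:
    # Assumed distribution name; adjust if the package was installed differently.
    installed = version("amazon-textract-response-parser")
    print(installed, "includes the fix:", has_fix(installed))
except PackageNotFoundError:
    print("amazon-textract-response-parser is not installed")
```

If the check reports an older version, upgrading the package before rerunning the original snippet would show whether the KeyError still occurs.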