aws-samples / amazon-textract-response-parser

Parse JSON response of Amazon Textract
Apache License 2.0
212 stars 95 forks

Improve error messages for missing blocks when parsing incomplete JSON #150

Open kkhator-aws opened 1 year ago

kkhator-aws commented 1 year ago

Hi, my customer is receiving the error below when using Textractor with a large multi-page PDF file.

899858907a773d1d5932a263c039a8fced6b281b0e716fbd31366bff7c4392c
Traceback (most recent call last):
  File "C:\Users\YADAVA66\PycharmProjects\pythonProject\main.py", line 80, in <module>
    doc = Document(response)
  File "C:\Users\YADAVA66\PycharmProjects\Textract\lib\site-packages\trp\__init__.py", line 633, in __init__
    self._parse()
  File "C:\Users\YADAVA66\PycharmProjects\Textract\lib\site-packages\trp\__init__.py", line 667, in _parse
    page = Page(documentPage["Blocks"], self._blockMap)
  File "C:\Users\YADAVA66\PycharmProjects\Textract\lib\site-packages\trp\__init__.py", line 516, in __init__
    self._parse(blockMap)
  File "C:\Users\YADAVA66\PycharmProjects\Textract\lib\site-packages\trp\__init__.py", line 530, in _parse
    l = Line(item, blockMap)
  File "C:\Users\YADAVA66\PycharmProjects\Textract\lib\site-packages\trp\__init__.py", line 142, in __init__
    if(blockMap[cid]["BlockType"] == "WORD"):
KeyError: '5e06e009-03ac-42cc-9abf-4df8f606c2af'

schadem commented 1 year ago

This is not a bug. The JSON passed to trp is incomplete, so it is missing a block ID that another block references. This usually happens when an asynchronous API (Start*) is called, the result is paginated, and only the first JSON response page is used. Use get_full_json_from_output_config or get_full_json from https://pypi.org/project/amazon-textract-caller/ to get the full JSON object, and pass that to the textract-response-parser. I am keeping this issue open as a reminder to update the error message so that it points here and recommends getting the full JSON.
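
To illustrate the pagination problem described above, here is a minimal sketch, not the amazon-textract-caller implementation. It assumes `client` behaves like a boto3 Textract client whose `get_document_text_detection` returns `Blocks` and, on all but the last page, a `NextToken` (for jobs started with StartDocumentAnalysis, `get_document_analysis` would be the corresponding call). The helper names `collect_all_blocks` and `find_missing_block_ids` are hypothetical and exist only for this example.

```python
from typing import Any, Dict, List


def collect_all_blocks(client: Any, job_id: str) -> Dict[str, Any]:
    """Follow NextToken until every page of an async result has been fetched.

    Assumption: `client` exposes get_document_text_detection(JobId=..., NextToken=...)
    like a boto3 Textract client does.
    """
    merged: Dict[str, Any] = {}
    blocks: List[Dict[str, Any]] = []
    next_token = None
    while True:
        kwargs = {"JobId": job_id}
        if next_token:
            kwargs["NextToken"] = next_token
        part = client.get_document_text_detection(**kwargs)
        if not merged:
            # Keep top-level metadata from the first page only.
            merged = {k: v for k, v in part.items() if k not in ("Blocks", "NextToken")}
        blocks.extend(part.get("Blocks", []))
        next_token = part.get("NextToken")
        if not next_token:
            break
    merged["Blocks"] = blocks
    return merged


def find_missing_block_ids(response: Dict[str, Any]) -> List[str]:
    """Return IDs that are referenced in Relationships but absent from Blocks."""
    all_blocks = response.get("Blocks", [])
    known = {block["Id"] for block in all_blocks}
    missing: List[str] = []
    for block in all_blocks:
        for rel in block.get("Relationships") or []:
            for cid in rel.get("Ids", []):
                if cid not in known and cid not in missing:
                    missing.append(cid)
    return missing
```

Running `find_missing_block_ids` on a response before handing it to `Document(response)` should return a non-empty list when only the first page was used, which is the situation that produces the `KeyError` in the traceback above. For real use, the `get_full_json` helpers mentioned in this comment remain the recommended route.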