aws-samples / amazon-textract-response-parser

Parse JSON response of Amazon Textract
Apache License 2.0

KeyError exception in Python trp package when parsing a page that doesn't have a Polygon element #79

Open paultipper opened 2 years ago

paultipper commented 2 years ago

A 12-page PDF document was processed by Textract, and I'm trying to use this package to parse the resulting response.json. The very first block is a PAGE block whose Geometry element contains only a BoundingBox and no Polygon:

{
    "DocumentMetadata": { "Pages": 12 },
    "JobStatus": "SUCCEEDED",
    "NextToken": "RYAd635ujGFqn4t5XLy4H+7BT1mguxFfHvBA8pGfJ3C9FnC8Pv7Cz/+qj+v/MisnIcNR7fwh+/CfJVGIdHn/sSplCQcE2ra4ZXjtDJ9SIp6Z9v5ICHmkzGNrVtS4m4GG",
    "Blocks": [
      {
        "BlockType": "PAGE",
        "Geometry": {
          "BoundingBox": {
            "Width": 1.0,
            "Height": 1.0,
            "Left": 0.0,
            "Top": 0.0
          }
        },
        "Id": "e5413485-55aa-405c-b547-25d6f3db1251",
       "...","...."
  }]}

I've loaded the response into a dictionary and then tried to instantiate the Document class, passing the document dictionary to the constructor; when I do so, I get the following error:

./tests/TextractOutputProcessor_test.py::test_processResponseJson Failed: [undefined]KeyError: 'Polygon'
responseJsonFile = './tests/textract/response.json'

    def test_processResponseJson(responseJsonFile):
        """Test the processResponseJson method"""

        assert isinstance(responseJsonFile, str)
        processor = TextractOutputProcessor()

        try:
>           processor.loadResponseJson(responseJsonFile)

tests/TextractOutputProcessor_test.py:17: 
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 
TextractOutputProcessor.py:24: in loadResponseJson
    self.document = Document(self.metadata)
venv/lib/python3.8/site-packages/trp/__init__.py:638: in __init__
    self._parse()
venv/lib/python3.8/site-packages/trp/__init__.py:675: in _parse
    page = Page(documentPage["Blocks"], self._blockMap)
venv/lib/python3.8/site-packages/trp/__init__.py:522: in __init__
    self._parse(blockMap)
venv/lib/python3.8/site-packages/trp/__init__.py:533: in _parse
    self._geometry = Geometry(item['Geometry'])
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 

self = <trp.Geometry object at 0x7fe2a06e1910>
geometry = {'BoundingBox': {'Height': 1.0, 'Left': 0.0, 'Top': 0.0, 'Width': 1.0}}

    def __init__(self, geometry):
        boundingBox = geometry["BoundingBox"]
>       polygon = geometry["Polygon"]
E       KeyError: 'Polygon'

venv/lib/python3.8/site-packages/trp/__init__.py:111: KeyError

It seems that the Geometry class expects there to be a Polygon element within every Geometry element in the response JSON, even though Textract did not create such an element when it processed my PDF document.
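Until the parser tolerates a missing Polygon, one possible workaround is to patch the response before handing it to Document. The sketch below is not part of trp; `backfill_polygons` is a hypothetical helper, and it assumes that a four-corner polygon derived from the BoundingBox (top-left first, then clockwise) is an acceptable stand-in for the missing element.

```python
import copy


def backfill_polygons(response):
    """Return a copy of a Textract response dict in which every block's
    Geometry has a Polygon. Hypothetical helper, not part of trp."""
    patched = copy.deepcopy(response)  # leave the caller's dict untouched
    for block in patched.get("Blocks", []):
        geometry = block.get("Geometry")
        # Skip blocks that already have a Polygon or have nothing to derive it from.
        if not geometry or "Polygon" in geometry or "BoundingBox" not in geometry:
            continue
        box = geometry["BoundingBox"]
        left, top = box["Left"], box["Top"]
        right, bottom = left + box["Width"], top + box["Height"]
        # Assumption: corners listed from top-left, going clockwise.
        geometry["Polygon"] = [
            {"X": left, "Y": top},
            {"X": right, "Y": top},
            {"X": right, "Y": bottom},
            {"X": left, "Y": bottom},
        ]
    return patched
```

With this in place, the call in the traceback above would become something like `Document(backfill_polygons(self.metadata))`. If the response is stored as a list of pages rather than a single dictionary, the helper would need to be applied to each page.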

schadem commented 1 year ago

Can you share the document? You are correct: according to https://docs.aws.amazon.com/textract/latest/dg/API_Geometry.html, Polygon is optional, so a Geometry without it should be accepted. I have never seen a response without a Polygon, though, so that would be very interesting to see. @paultipper
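If the document itself cannot be shared, a short scan of the response can at least show which blocks lack a Polygon. This is a minimal sketch that assumes the response has been loaded into a dictionary as described above; `blocks_without_polygon` is a hypothetical name, not a trp function.

```python
def blocks_without_polygon(response):
    """List (Id, BlockType) for every block whose Geometry has no Polygon.
    Hypothetical diagnostic helper, not part of trp."""
    missing = []
    for block in response.get("Blocks", []):
        geometry = block.get("Geometry") or {}
        if "Polygon" not in geometry:
            missing.append((block.get("Id"), block.get("BlockType")))
    return missing
```

Running it over each page of the job output and reporting the resulting block types would show whether only PAGE blocks are affected or other block types as well.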