aws-samples / amazon-textract-textractor

Analyze documents with Amazon Textract and generate output in multiple formats.
Apache License 2.0

KeyError: 'Text' - on documents with tables #343

Open dzmitry-kankalovich opened 6 months ago

dzmitry-kankalovich commented 6 months ago

Hello,

I have a fairly normal-looking document (unfortunately I cannot share the original file, as it's a proprietary doc) that textractprettyprinter.t_pretty_print.get_text_from_layout_json fails to parse, raising KeyError: 'Text'.

We've traced it to the following problem:

The document in question contains a screenshot of a table with a selection in one of its cells:

[image: problem_root_cause]

We suspect this in turn triggers the error at this line:

File "/app/.venv/lib/python3.11/site-packages/textractprettyprinter/t_pretty_print_layout.py", line 111, in _dfs
    cell_text = " ".join([id2block[line_id]['Text'] for line_id in cell_block["Relationships"][0]['Ids']])
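For anyone who wants to reproduce this without the original file, here is a minimal sketch with hand-built blocks. It assumes the cell references a SELECTION_ELEMENT block, which carries SelectionStatus but no Text key; the IDs, values, and the join_cell_text helper are hypothetical and only wrap the expression from the traceback.

```python
# Hypothetical, hand-built blocks; not taken from the failing document.
id2block = {
    "w1": {"Id": "w1", "BlockType": "WORD", "Text": "Approved"},
    # Assumption: the selection is a SELECTION_ELEMENT block with no 'Text' key.
    "s1": {"Id": "s1", "BlockType": "SELECTION_ELEMENT", "SelectionStatus": "SELECTED"},
}
cell_block = {
    "BlockType": "CELL",
    "Relationships": [{"Type": "CHILD", "Ids": ["w1", "s1"]}],
}


def join_cell_text(cell_block, id2block):
    # Same expression as the failing line in the traceback above.
    return " ".join([id2block[line_id]['Text'] for line_id in cell_block["Relationships"][0]['Ids']])


try:
    join_cell_text(cell_block, id2block)
    missing_key = None
except KeyError as exc:
    # The lookup fails on the block that has no 'Text' entry.
    missing_key = exc.args[0]

print(f"KeyError on key: {missing_key!r}")
```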

Inspecting the root cause (I added a try/except to the original source file):

[image: error]

It appears that the branch of _dfs() that handles tables should check that the blocks a cell references actually contain a Text property (or, alternatively, use something like .get('Text', '')).
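A rough sketch of what the .get('Text', '') variant could look like. This is not the library's code and not necessarily what the fix will contain; safe_cell_text and the sample blocks are hypothetical, and skipping empty strings is my own choice to avoid stray spaces in the joined text.

```python
def safe_cell_text(cell_block, id2block):
    """Join the text of a cell's children, skipping blocks without 'Text'."""
    relationships = cell_block.get("Relationships") or []
    if not relationships:
        # An empty cell has no relationships at all.
        return ""
    child_ids = relationships[0].get("Ids", [])
    # .get('Text', '') tolerates blocks that carry no text of their own.
    texts = [id2block.get(child_id, {}).get("Text", "") for child_id in child_ids]
    # Drop empty entries so the result has no doubled or trailing spaces.
    return " ".join(text for text in texts if text)


# Hypothetical blocks: one word, plus one block that has no 'Text' key.
id2block = {
    "w1": {"Id": "w1", "BlockType": "WORD", "Text": "Approved"},
    "s1": {"Id": "s1", "BlockType": "SELECTION_ELEMENT", "SelectionStatus": "SELECTED"},
}
cell_block = {
    "BlockType": "CELL",
    "Relationships": [{"Type": "CHILD", "Ids": ["w1", "s1"]}],
}

cell_text = safe_cell_text(cell_block, id2block)
empty_text = safe_cell_text({"BlockType": "CELL"}, id2block)
print(repr(cell_text), repr(empty_text))
```

Note that this silently drops the selection state; if the output should show whether the cell is selected, the block's SelectionStatus would need to be rendered separately.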

dzmitry-kankalovich commented 6 months ago

Opened PR with a fix: https://github.com/aws-samples/amazon-textract-textractor/pull/344