aws-samples / amazon-textract-textractor

Analyze documents with Amazon Textract and generate output in multiple formats.
Apache License 2.0

Fix linearize layout when Block Entity Types are `None` #398

Closed BPDanek closed 2 weeks ago

BPDanek commented 1 month ago

A block's entity types are occasionally `None`, which causes layout linearization to fail.

This may happen with multi-page documents.

Issue #, if available:

Description of changes: Check whether entity_types is None before attempting to iterate over it. Without the check, iteration raises "TypeError: 'NoneType' object is not iterable".
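For illustration, the kind of guard described above could look like the following minimal sketch. It is not the actual diff from this pull request: the helper names `normalize_entity_types` and `is_column_header` are hypothetical, and the only assumption about the library is that a block's entity types arrive either as an iterable of strings or as `None`.

```python
from typing import Iterable, List, Optional


def normalize_entity_types(entity_types: Optional[Iterable[str]]) -> List[str]:
    """Return entity types as a list, treating None as 'no entity types'.

    Hypothetical helper, not part of the Textractor API.
    """
    return list(entity_types) if entity_types is not None else []


def is_column_header(entity_types: Optional[Iterable[str]]) -> bool:
    """Check for the COLUMN_HEADER entity type without failing on None."""
    # `"COLUMN_HEADER" in None` would raise
    # TypeError: argument of type 'NoneType' is not iterable,
    # so normalize to a list before the membership test.
    return "COLUMN_HEADER" in normalize_entity_types(entity_types)
```

Normalizing once at the point where the value is read keeps the later membership tests and loops unchanged, at the cost of hiding a response that may itself be malformed.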

By submitting this pull request, I confirm that you can use, modify, copy, and redistribute this contribution, under the terms of your choice.

andrewkowalik commented 1 month ago

Is there a better failure case that can be replicated? EntityTypes should only exist for Forms and Tables, and I am wondering whether this path is being reached incorrectly. Your specific change is in the context of a table: does the page in question have a table, or should this code path not even be reachable in your test scenario?

Within a table, I believe EntityTypes should always be populated. There is probably another issue at play in your test case.

Belval commented 2 weeks ago

+1 to what @andrewkowalik wrote. EntityTypes should always exist in table output and has been returned by the Textract Tables API for roughly two years. If you are processing older responses, I would advise simply updating the responses themselves to include EntityTypes, but I don't see a need to support this in mainline.

Let me know if you have an example of a recent response that does not have the EntityTypes field, as that would be a Textract bug. Thanks!
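One way to check whether a saved response is affected is to scan its blocks for a missing `EntityTypes` field. The sketch below assumes the standard Textract response shape (a top-level `Blocks` list whose items carry `BlockType`, `Id`, and `Page`). The function name `blocks_missing_entity_types` and the set of block types it checks are assumptions made for illustration, not something confirmed in this thread.

```python
from typing import Any, Dict, List

# Assumption: these are the block types expected to carry EntityTypes.
# Adjust the set to match the responses being inspected.
CHECKED_BLOCK_TYPES = {"TABLE", "CELL", "KEY_VALUE_SET"}


def blocks_missing_entity_types(response: Dict[str, Any]) -> List[Dict[str, Any]]:
    """List table/form blocks whose EntityTypes field is absent or None."""
    missing = []
    for block in response.get("Blocks", []):
        if block.get("BlockType") not in CHECKED_BLOCK_TYPES:
            continue
        # dict.get returns None both when the key is absent and when
        # its value is explicitly null in the JSON.
        if block.get("EntityTypes") is None:
            missing.append(
                {
                    "Id": block.get("Id"),
                    "BlockType": block.get("BlockType"),
                    "Page": block.get("Page"),
                }
            )
    return missing
```

Running this over a response loaded with `json.load` gives a short list of block IDs and page numbers that can be attached to a bug report or used to decide which older responses need updating.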