aws-samples / amazon-textract-response-parser

Parse JSON response of Amazon Textract
Apache License 2.0
218 stars, 95 forks

`get_blocks_by_type` does not correctly handle pages without relationships (e.g. blank pages) (Python) #155

Closed MattExact closed 3 months ago

MattExact commented 1 year ago

If you call `TDocument.get_blocks_by_type` for a page that has no relationships, it returns blocks as if you had called it for the whole document. For example, calling `TDocument.tables` on a blank page returns all tables in the document. I believe this is unwanted and unintended behaviour.

This is due to the condition `if page and page.relationships:`. When the page has no relationships, the condition evaluates to `False`, so the blocks returned are those of the whole document.

`TDocument.relationships_recursive` is used to get the list of blocks on the page, and it should already handle the case where the page block has no relationships. Therefore I think this condition can just be `if page:`?

https://github.com/aws-samples/amazon-textract-response-parser/blob/541c07a12d603deed70699357f865d6974369c7b/src-python/trp/trp2.py#L660-L680
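
To illustrate the difference, here is a minimal, self-contained sketch. It does not use the real `trp` classes: `Block`, `Document`, `child_ids`, and the two `blocks_by_type_*` methods are hypothetical stand-ins, and the child lookup is one level deep rather than recursive. It only shows how the two conditions behave for a page without relationships.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Block:
    """Hypothetical stand-in for a Textract block (not the trp2 class)."""
    id: str
    block_type: str
    # None mimics a blank PAGE block that has no relationships at all
    child_ids: Optional[List[str]] = None


@dataclass
class Document:
    """Hypothetical stand-in for the document wrapper."""
    blocks: List[Block] = field(default_factory=list)

    def _children(self, page: Block) -> List[Block]:
        # Tolerates a page without relationships by returning an empty list
        ids = set(page.child_ids or [])
        return [b for b in self.blocks if b.id in ids]

    def blocks_by_type_current(self, block_type: str, page: Optional[Block] = None) -> List[Block]:
        # Mirrors the reported condition: a page with no relationships
        # fails the check and falls through to the whole document
        if page and page.child_ids:
            candidates = self._children(page)
        else:
            candidates = self.blocks
        return [b for b in candidates if b.block_type == block_type]

    def blocks_by_type_proposed(self, block_type: str, page: Optional[Block] = None) -> List[Block]:
        # Proposed condition: any page that is passed in restricts the search
        if page:
            candidates = self._children(page)
        else:
            candidates = self.blocks
        return [b for b in candidates if b.block_type == block_type]


# Page 1 contains one table; page 2 is blank (no relationships)
table = Block(id="t1", block_type="TABLE")
page_with_table = Block(id="p1", block_type="PAGE", child_ids=["t1"])
blank_page = Block(id="p2", block_type="PAGE")
doc = Document(blocks=[page_with_table, table, blank_page])

# Current condition: the blank page still yields the table from page 1
tables_current = doc.blocks_by_type_current("TABLE", page=blank_page)

# Proposed condition: the blank page yields no tables
tables_proposed = doc.blocks_by_type_proposed("TABLE", page=blank_page)
```

With the stand-in data above, `tables_current` contains the table from page 1 while `tables_proposed` is empty, which is the behaviour I would expect for a blank page.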