aws-samples / amazon-textract-response-parser

Parse JSON response of Amazon Textract
Apache License 2.0
218 stars, 95 forks

trp2.get_block_by_id #176

Open Shayndee opened 5 months ago

Shayndee commented 5 months ago

If `self.find_block_by_id` returns no block, `get_block_by_id` raises `ValueError` ("no block for id") and parsing of the rest of the page then fails.

athewsey commented 5 months ago

I'm not deep on the Python version of the library, but from what I understand this may be by design... What's your expected behaviour for missing blocks referenced in the response, @Shayndee?

e.g.

  1. Raise a validation error at load/parse time?
  2. Gracefully ignore missing blocks at load/parse time, but raise an error when attempting to access them later?
  3. Gracefully ignore missing blocks altogether, wherever they're referenced?

athewsey commented 3 months ago

Following up on this after diving a bit deeper:

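TDocument provides two alternative methods depending on your desired error handling behaviour: `find_block_by_id`, which returns no block when the ID is not found, and `get_block_by_id`, which raises a `ValueError` in that case.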
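
As a rough illustration of that distinction, here is a minimal, self-contained sketch of the two lookup styles. `MiniDocument` is a hypothetical stand-in written for this comment, not the actual `TDocument` implementation; only the two method names and the `ValueError` behaviour are taken from this thread.

```python
from typing import Dict, List, Optional


class MiniDocument:
    """Hypothetical stand-in for a document holding Textract-style blocks."""

    def __init__(self, blocks: List[dict]) -> None:
        # Index blocks by their "Id" field for direct lookup.
        self._by_id: Dict[str, dict] = {b["Id"]: b for b in blocks}

    def find_block_by_id(self, id: str) -> Optional[dict]:
        """Lenient lookup: return None when the ID is unknown."""
        return self._by_id.get(id)

    def get_block_by_id(self, id: str) -> dict:
        """Strict lookup: raise ValueError when the ID is unknown."""
        block = self.find_block_by_id(id)
        if block is None:
            raise ValueError(f"no block for id: {id}")
        return block


doc = MiniDocument([{"Id": "a", "BlockType": "LINE"}])

# Lenient style: the caller gets None and decides what to do next.
missing = doc.find_block_by_id("b")

# Strict style: the caller has to handle (or propagate) the exception.
try:
    doc.get_block_by_id("b")
except ValueError as err:
    error_message = str(err)
```

With this split, code that can tolerate a missing block calls the lenient method and checks for `None`, while code that treats a missing block as a malformed document calls the strict one.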

From your original description, I understand the issue is that TRP throws an error when initially loading/parsing a JSON that references (i.e. somewhere in a block's Relationships) a block ID that does not exist?

I understand (unless @Belval wants to correct me) that throwing an error when loading a document with missing block(s) is by design, and that the ability to handle malformed JSON gracefully would be a feature request.

  1. If I'm right, could you help by sharing some extra details on what type of block is missing from your JSON / where it's referenced?
  2. If I'm wrong and you're seeing an actual bug with find_block_by_id itself throwing an error, please let us know!
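
To help answer the first question, a small script along these lines could list the dangling references in the raw JSON before it is handed to TRP. It is only a sketch: it assumes the usual Textract response layout (a top-level `Blocks` list whose entries carry `Id`, `BlockType`, and an optional `Relationships` list with `Type` and `Ids`), and `find_dangling_references` is a hypothetical helper, not part of the library.

```python
from typing import Dict, List


def find_dangling_references(response: dict) -> List[Dict[str, str]]:
    """List every relationship that points at a block ID not in the response."""
    blocks = response.get("Blocks", [])
    known_ids = {block["Id"] for block in blocks}

    dangling = []
    for block in blocks:
        # "Relationships" may be absent or null on leaf blocks.
        for rel in block.get("Relationships") or []:
            for ref_id in rel.get("Ids") or []:
                if ref_id not in known_ids:
                    dangling.append({
                        "source_id": block["Id"],
                        "source_type": block.get("BlockType", ""),
                        "relationship": rel.get("Type", ""),
                        "missing_id": ref_id,
                    })
    return dangling


# Tiny made-up response: the PAGE block references "line-2", which is absent.
sample = {
    "Blocks": [
        {
            "Id": "page-1",
            "BlockType": "PAGE",
            "Relationships": [{"Type": "CHILD", "Ids": ["line-1", "line-2"]}],
        },
        {"Id": "line-1", "BlockType": "LINE"},
    ]
}
report = find_dangling_references(sample)
```

Each entry in `report` names the referencing block, its type, the relationship type, and the missing ID, which is the detail asked for above. For a real file, the response could be loaded with `json.load` and passed to the same function.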