aws-samples / amazon-textract-response-parser

Parse JSON response of Amazon Textract
Apache License 2.0

Better *InReadingOrder APIs #12

Closed athewsey closed 5 months ago

athewsey commented 3 years ago

Hi folks & thanks for your work maintaining TRP.

Using the tool to post-process Textract results, I find the idea of the getLinesInReadingOrder function really useful... but the returned data model today is frustratingly unhelpful!

What I'd really like is methods that return the actual Line or Word objects (rather than just text), so I can still access things like the block IDs and geometries.

Today, the getTextInReadingOrder() method just returns a text string and the getLinesInReadingOrder() method returns a (particularly un-intuitive) list of [ColumnId, LineText] pairs.

  1. It doesn't make sense to me that just text is returned instead of the full objects, given the method name is getLines... and not e.g. getLineText...
  2. The concept of columns is an implementation detail of getLinesInReadingOrder() and should either be:
     a. Explicitly committed to by docstring and method renaming, e.g. getLineTextsByColumn(), or
     b. Recognised as an internal heuristic and hidden from the output.
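
To make the request concrete, here is a minimal sketch of the kind of return value I have in mind. It works directly on raw Textract `Blocks` dicts rather than on TRP's own classes, so `LineRef`, `lines_in_reading_order` and `row_tolerance` are hypothetical names, not part of TRP. The ordering is a naive single-column one (group by similar `Top`, then sort by `Left`) and is only there to show objects being returned with their IDs and geometry intact.

```python
from dataclasses import dataclass
from typing import Any, Dict, List


@dataclass
class LineRef:
    """Hypothetical wrapper (not a TRP class) keeping block ID and geometry."""
    id: str
    text: str
    left: float
    top: float
    width: float
    height: float


def lines_in_reading_order(
    blocks: List[Dict[str, Any]], row_tolerance: float = 0.01
) -> List[LineRef]:
    """Return LINE blocks as objects, top-to-bottom then left-to-right.

    Assumption: lines whose Top values differ by at most `row_tolerance`
    (a fraction of page height) are treated as sharing a row. No column
    detection is attempted or exposed.
    """
    lines = []
    for block in blocks:
        if block.get("BlockType") != "LINE":
            continue  # skip PAGE, WORD, etc.
        box = block["Geometry"]["BoundingBox"]
        lines.append(
            LineRef(
                id=block["Id"],
                text=block.get("Text", ""),
                left=box["Left"],
                top=box["Top"],
                width=box["Width"],
                height=box["Height"],
            )
        )

    # Group into rows by vertical position, comparing against each row's first line.
    lines.sort(key=lambda line: line.top)
    rows: List[List[LineRef]] = []
    for line in lines:
        if rows and abs(line.top - rows[-1][0].top) <= row_tolerance:
            rows[-1].append(line)
        else:
            rows.append([line])

    # Within each row, read left to right.
    return [line for row in rows for line in sorted(row, key=lambda l: l.left)]
```

With something like this, a caller who only wants text can still do `[line.text for line in lines_in_reading_order(blocks)]`, while anyone who needs block IDs or bounding boxes has them on the same objects.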

I also see that the column detection, as implemented so far, seems pretty simple and is likely to do some weird things on documents like forms or posters, where the column layout may not stay fixed down the page.

So would ask:

schadem commented 3 years ago

I just added a serializer/deserializer for the Textract JSON response, with an example that orders the items in the response transparently on an object and serializes it back to the Textract JSON format (see https://github.com/aws-samples/amazon-textract-response-parser/blob/master/src-python/README.md). Does that help?

athewsey commented 5 months ago

Closing this stale request: