Better *InReadingOrder APIs

athewsey commented 3 years ago

Hi folks & thanks for your work maintaining TRP.

Using the tool to post-process Textract results, I find the idea of the getLinesInReadingOrder function really useful... but the returned data model today is frustratingly unhelpful!

What I'd really like is methods that return the actual Line or Word objects (rather than just text), so I can still access things like the block IDs and geometries.

Today, the getTextInReadingOrder() method just returns a text string and the getLinesInReadingOrder() method returns a (particularly un-intuitive) list of [ColumnId, LineText] pairs.

It doesn't make sense to me that just text, rather than the full objects, is returned, given the method name is getLines... and not e.g. getLineText...
The concept of columns is an implementation detail of getLinesInReadingOrder() and should be either: (a) explicitly committed to by docstring and method renaming, e.g. getLineTextsByColumn(), or (b) recognised as an internal heuristic and hidden from the output.

I also see that the column detection is pretty simple as implemented so far, and is likely to do some weird things on documents like forms or posters whose column layout changes down the page.
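To make the request concrete, here is a minimal sketch of a reading-order function that returns the line objects themselves rather than [ColumnId, LineText] pairs. It is not TRP code: the `Line` dataclass and `lines_in_reading_order` are hypothetical names, and the column test (a line joins the first column whose horizontal span contains the line's centre) is only an assumed stand-in for a simple column heuristic.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Line:
    """Hypothetical stand-in for a parsed line: keeps ID and geometry with the text."""
    id: str
    text: str
    left: float   # bounding-box left edge, as a fraction of page width
    top: float    # bounding-box top edge, as a fraction of page height
    width: float  # bounding-box width, as a fraction of page width


def lines_in_reading_order(lines: List[Line]) -> List[Line]:
    """Return the Line objects (not just their text), grouped column by column."""
    # Each column records its horizontal span and the lines assigned to it.
    columns = []
    for line in lines:
        centre = line.left + line.width / 2
        for col in columns:
            if col["left"] < centre < col["right"]:
                col["lines"].append(line)
                break
        else:
            # No existing column contains this line's centre: start a new column.
            columns.append(
                {"left": line.left, "right": line.left + line.width, "lines": [line]}
            )
    # Columns stay in first-seen order; lines keep their input order within a column.
    # The column bookkeeping never leaks into the return value.
    return [line for col in columns for line in col["lines"]]
```

Because full objects come back, a caller can still read `line.id` or the geometry fields after ordering, and the column heuristic stays an internal detail that could be swapped out without changing the return type.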

So I would ask:

How open/resistant would we be to making breaking changes to the existing getLinesInReadingOrder API, to try and bring the naming and functionality closer together?
What's the perspective on documents with more advanced not-quite-columns structure: Is the raw order of tokens output from Textract likely to be a better approximation of the reading order? Is there appetite to develop more sophisticated rules in TRP or not really as the complexity makes it a bit of a losing battle?

schadem commented 3 years ago

I just added a serializer/deserializer for the Textract JSON response, with an example of ordering the items in the response transparently on an object and serializing back to the Textract JSON format (see https://github.com/aws-samples/amazon-textract-response-parser/blob/master/src-python/README.md). Does that help?

athewsey commented 5 months ago

Closing this stale request:

In JS/TS we've had heuristic reading order extraction APIs for a while already.
Across both JS and Python, the canonical best way to extract content in reading order is now to run Textract with the layout feature enabled and refer to the returned order of layout objects to guide reading order (which also includes useful metadata like paragraph segmentation, headings, etc).
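As a rough illustration of the layout-based approach, the sketch below walks a raw Textract response in the order its layout blocks are returned and yields the LINE blocks under each one. It assumes the document was analysed with the LAYOUT feature type, that layout blocks have a `BlockType` starting with `LAYOUT_`, and that they reference their lines (or nested layout blocks) through `CHILD` relationships. The function name `lines_in_layout_order` is hypothetical and not part of TRP.

```python
from typing import Dict, Iterator, List


def lines_in_layout_order(blocks: List[dict]) -> Iterator[dict]:
    """Yield LINE blocks following the returned order of LAYOUT_* blocks."""
    by_id: Dict[str, dict] = {b["Id"]: b for b in blocks}
    seen = set()  # IDs already visited, so nested layout blocks are not emitted twice

    def walk(block: dict) -> Iterator[dict]:
        for rel in block.get("Relationships") or []:
            if rel["Type"] != "CHILD":
                continue
            for child_id in rel["Ids"]:
                child = by_id.get(child_id)
                if child is None or child_id in seen:
                    continue
                seen.add(child_id)
                if child["BlockType"] == "LINE":
                    yield child
                elif child["BlockType"].startswith("LAYOUT_"):
                    # Assumed case: a layout block (e.g. a list) nesting other layout blocks.
                    yield from walk(child)

    for block in blocks:
        if block["BlockType"].startswith("LAYOUT_") and block["Id"] not in seen:
            seen.add(block["Id"])
            yield from walk(block)
```

Each yielded item is the full block dictionary, so the ID, geometry, and text all remain available to the caller.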

aws-samples / amazon-textract-response-parser

Better *InReadingOrder APIs #12