Closed athewsey closed 5 months ago
I just added a serializer/deserializer for the Textract JSON response with an example of ordering the items in the response transparent on an object and serializing back to the Textract JSON format (see https://github.com/aws-samples/amazon-textract-response-parser/blob/master/src-python/README.md). Does that help?
Closing this stale request:
Hi folks & thanks for your work maintaining TRP.
Using the tool to post-process Textract results, I find the idea of the `getLinesInReadingOrder` function really useful... but the returned data model today is frustratingly unhelpful!

What I'd really like is methods that return the actual `Line` or `Word` objects (rather than just text), so I can still access things like the block IDs and geometries.

Today, the `getTextInReadingOrder()` method just returns a text string and the `getLinesInReadingOrder()` method returns a (particularly un-intuitive) list of `[ColumnId, LineText]`
pairs.

- This is confusing because the method is named `getLines...` and not e.g. `getLineText...`
- The column ID looks like an internal detail of `getLinesInReadingOrder()` and should either be:
  a. Explicitly committed to by docstring and method renaming, e.g. `getLineTextsByColumn()`, or
  b. Recognised as an internal heuristic and hidden from the output.

I also see that the column detection seems pretty simple as it's implemented so far, and is likely to do some weird things on documents like forms or posters that might have less vertically-static column layouts down the page.
So I would ask: could we revisit the `getLinesInReadingOrder` API, to try and bring the naming and functionality closer together?
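
To make the request concrete, here is a minimal sketch of the kind of method I have in mind. Note that `Line`, `BoundingBox` and `get_line_objects_in_reading_order` are hypothetical stand-ins made up for illustration, not TRP's actual classes or API, and the column grouping is a simplified overlap heuristic assumed for the example rather than the library's implementation. The point is only the return type: whole objects come back, and the column index stays internal.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class BoundingBox:
    # Hypothetical stand-in: ratios of page size, like Textract geometry.
    left: float
    top: float
    width: float
    height: float


@dataclass
class Line:
    # Hypothetical stand-in for a parsed LINE block: keeps ID and geometry.
    id: str
    text: str
    bbox: BoundingBox


def get_line_objects_in_reading_order(lines: List[Line]) -> List[Line]:
    """Return Line objects (not just text) grouped by a simple column heuristic."""
    columns: List[Tuple[float, float]] = []  # (left, right) of each column seen
    assigned: List[Tuple[int, Line]] = []    # (column index, line)

    for line in lines:
        left = line.bbox.left
        right = left + line.bbox.width
        centre = left + line.bbox.width / 2
        for index, (col_left, col_right) in enumerate(columns):
            col_centre = (col_left + col_right) / 2
            # Same column if either centre falls within the other's span.
            if col_left < centre < col_right or left < col_centre < right:
                assigned.append((index, line))
                break
        else:
            # No existing column matched: this line starts a new one.
            columns.append((left, right))
            assigned.append((len(columns) - 1, line))

    # Columns left to right, then top to bottom within each column.
    assigned.sort(key=lambda pair: (columns[pair[0]][0], pair[1].bbox.top))
    # The column index is an internal detail, so it is not returned.
    return [line for _, line in assigned]


page_lines = [
    Line("a", "Left 1", BoundingBox(0.05, 0.10, 0.40, 0.02)),
    Line("b", "Right 1", BoundingBox(0.55, 0.10, 0.40, 0.02)),
    Line("c", "Left 2", BoundingBox(0.05, 0.14, 0.40, 0.02)),
    Line("d", "Right 2", BoundingBox(0.55, 0.14, 0.40, 0.02)),
]
ordered = get_line_objects_in_reading_order(page_lines)
for line in ordered:
    print(line.id, line.text, line.bbox.left, line.bbox.top)
```

With something like this, callers who only want text can still do `[l.text for l in ordered]`, while anyone who needs block IDs or geometries keeps access to them.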