aws-samples / amazon-textract-response-parser

Parse JSON response of Amazon Textract
Apache License 2.0

Support Layout API responses in TRP.js #164

Closed athewsey closed 7 months ago

athewsey commented 11 months ago

Amazon Textract has launched a new LAYOUT analysis capable of returning a range of layout features such as titles, paragraphs, headers, and footers.

Today TRP.js provides basic heuristic options for sorting text (paragraphs) into reading order and segmenting headers and footers from main content.

We should extend TRP.js to make use of Amazon Textract's native layout analysis where possible, and maybe(?) keep these old heuristic methods around in case users want to continue using them to save on API costs or re-ingestion.

Not sure yet when I'll have a chance to look at this, but raising it here to reflect that it's on our radar. If you're waiting on this feature or have particular feedback on how you'd like it to work in the JS/TS version of TRP, please do let us know!

pags commented 11 months ago

I'd love the ability to take the raw JSON from the Textract API response provided by LAYOUT and turn it into an ordered CSV like the one returned from the bulk document processor in the Textract console. As far as I can tell, using the console is the only way to get the stitched-together layout data.
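For illustration only, here is a rough sketch of how the raw `Blocks` array could be flattened into ordered rows without any library. It assumes the documented response shape (`Id`, `BlockType`, `Page`, `Text`, `Confidence`, and `CHILD` entries under `Relationships`) and assumes the `LAYOUT_*` blocks already arrive in reading order. `layoutRows` is a hypothetical helper, not part of TRP.js, and the output is not claimed to match the console's CSV.

```typescript
// Minimal shapes for the parts of a Textract response used in this sketch.
interface Relationship { Type: string; Ids: string[]; }
interface Block {
  Id: string;
  BlockType: string;
  Page?: number; // may be absent, e.g. for single-image synchronous calls
  Text?: string;
  Confidence?: number;
  Relationships?: Relationship[];
}
interface LayoutRow {
  page: number;
  readingOrder: number;
  blockType: string;
  confidence: number | null;
  text: string;
}

// Hypothetical helper (not a TRP.js API): one row per LAYOUT_* block.
function layoutRows(blocks: Block[]): LayoutRow[] {
  const byId = new Map<string, Block>();
  for (const b of blocks) byId.set(b.Id, b);

  // Gather a block's text by following CHILD relationships down to LINE/WORD.
  const textOf = (block: Block): string => {
    if (block.BlockType === "LINE" || block.BlockType === "WORD") {
      return block.Text ?? "";
    }
    const parts: string[] = [];
    for (const rel of block.Relationships ?? []) {
      if (rel.Type !== "CHILD") continue;
      for (const id of rel.Ids) {
        const child = byId.get(id);
        const childText = child ? textOf(child) : "";
        if (childText) parts.push(childText);
      }
    }
    return parts.join(" ");
  };

  const nextIndex = new Map<number, number>(); // per-page running counter
  const rows: LayoutRow[] = [];
  for (const b of blocks) {
    if (!b.BlockType.startsWith("LAYOUT_")) continue;
    const page = b.Page ?? 1; // assumption: treat a missing Page as page 1
    const readingOrder = nextIndex.get(page) ?? 0;
    nextIndex.set(page, readingOrder + 1);
    rows.push({
      page,
      readingOrder,
      blockType: b.BlockType,
      confidence: b.Confidence ?? null,
      text: textOf(b),
    });
  }
  return rows;
}

// Tiny made-up response fragment to exercise the helper.
const sampleBlocks: Block[] = [
  { Id: "l1", BlockType: "LINE", Page: 1, Text: "Annual Report" },
  { Id: "l2", BlockType: "LINE", Page: 1, Text: "First line," },
  { Id: "l3", BlockType: "LINE", Page: 1, Text: "second line." },
  { Id: "t1", BlockType: "LAYOUT_TITLE", Page: 1, Confidence: 99,
    Relationships: [{ Type: "CHILD", Ids: ["l1"] }] },
  { Id: "x1", BlockType: "LAYOUT_TEXT", Page: 1, Confidence: 95,
    Relationships: [{ Type: "CHILD", Ids: ["l2", "l3"] }] },
];
const sampleLayoutRows = layoutRows(sampleBlocks);
```

One caveat with this sketch: a container such as `LAYOUT_LIST` has other layout blocks as children, so nested text would show up both in the parent's row and in each child's own row unless you filter one of them out.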

kizaonline commented 10 months ago

This feature would be hugely beneficial to us, as we are building out functionality to convert PDFs into our forms using Textract and need the LAYOUT analysis to interpret how inputs relate to each other.

athewsey commented 7 months ago

Hi all and sorry for the wait - this turned out to be a bigger effort than initially expected!

A draft release is now published at v0.4.0-alpha.3 on NPM as detailed in the attached PR:

Please consider trying it out and sharing your feedback, so we can fix any important issues before moving to stable release!


For users looking to feed Textract results into LLMs in a semantic way, the hope is that doc.html() will generate useful HTML that reflects the analyzed structure to the extent we can.

(Super open to feedback if you disagree with the initial choices of what HTML elements/classes/etc to use for each type of element - I haven't had any chance to benchmark different options with different LLMs yet to find what performs best.)


CSVs as @pags requested are a bit of a sticking point at the moment: until feedback suggests otherwise, it seems like everybody would have a different opinion about exactly which columns they want when tabularizing the data for their use-case.

IMO a core principle of TRP is that there's a huge amount of interrelated information available in Textract results - so builders need help to simplify the process of traversing the connections and materializing whatever view is important for different use-cases.

Hopefully it's at least reasonably convenient that you can now do something like:

const layData = doc
  .pageNumber(1)
  .layout.listItems()
  .map((item, ixItem) => ({
    blockType: item.blockType,
    readingOrder: ixItem,
    confidence: item.confidence,
    text: item.text,
  }));
// ..Save in JSON/CSV/etc as desired
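As a rough sketch of the "save as CSV" step, the rows built above are plain objects, so a small serializer is enough. `csvCell` and `toCsv` below are hypothetical helpers, not TRP.js APIs, and the column choice is only an example. Quoting follows the usual RFC 4180 convention of wrapping cells that contain commas, quotes, or line breaks and doubling any embedded quotes.

```typescript
type Cell = string | number | boolean | null | undefined;

// Hypothetical helper: render one cell, quoting only when needed.
function csvCell(value: Cell): string {
  const s = value === null || value === undefined ? "" : String(value);
  return /[",\r\n]/.test(s) ? `"${s.replace(/"/g, '""')}"` : s;
}

// Hypothetical helper: header row plus one line per object, in column order.
function toCsv(rows: Record<string, Cell>[], columns: string[]): string {
  const lines = [columns.map(csvCell).join(",")];
  for (const row of rows) {
    lines.push(columns.map((col) => csvCell(row[col])).join(","));
  }
  return lines.join("\r\n");
}

// Made-up rows in the same shape as the layData example.
const sampleRows = [
  { blockType: "LAYOUT_TITLE", readingOrder: 0, confidence: 99.1, text: "Annual Report" },
  { blockType: "LAYOUT_TEXT", readingOrder: 1, confidence: 95.4, text: 'Revenue grew, "modestly".' },
];
const sampleCsv = toCsv(sampleRows, ["readingOrder", "blockType", "confidence", "text"]);
console.log(sampleCsv);
```

Passing the column list explicitly keeps the choice of columns with the caller, which fits the point above that different use-cases will want different views.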

athewsey commented 7 months ago

amazon-textract-response-parser v0.4.0 is now released, and we believe it should address the core of this issue.

For specific extra follow-ups like a CSV export, or .markdown() as well as .html(), please raise separate issues for us to prioritise.

In addition, the new src-js/examples/ folder should give us a place to try and address more use-case-specific examples (like CSV) that users think would be useful but where it's not yet clear exactly how the library should standardise.

Thanks all for your feedback, and hope the new release is useful!