aws-samples / amazon-textract-response-parser

Parse JSON response of Amazon Textract
Apache License 2.0

How to get where a Table occurs in the document relative to Lines #162

Open sam-goodwin opened 12 months ago

sam-goodwin commented 12 months ago

Take, for example, the table below. It occurs after "Key: Value" and before "Another Line".

Some Line
Key: Value

| Some Header | Another Header |
| - | - |
| Some Value | Another Value |

Another Line

I'd like to be able to iterate through a page and see the following items in order:

  1. Line ("Some Line")
  2. KeyValue ("Key", "Value")
  3. Table
  4. Line ("Another Line")

Is this possible?

schadem commented 12 months ago

Hi @sam-goodwin , are you referring to Python, JS or .NET?

sam-goodwin commented 12 months ago

JavaScript. I ended up doing it by sorting by the bounding box. It'd be nice if there were a way to traverse all the elements in the document in reading order, not just lines. But I understand that may not be easy to generalize in a way that works for all documents.
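
A minimal sketch of the bounding-box sort mentioned above, assuming a single-column page where top-to-bottom position is a good proxy for reading order. The `BoxedItem` shape, `sortByBoundingBox` and the sample coordinates are hypothetical and only for illustration; mapping real parsed lines, fields and tables into this shape is left out.

```typescript
// Hypothetical minimal shape: just a label plus the top-left corner of the
// item's bounding box, in page-relative coordinates (0 = top/left edge).
interface BoxedItem {
  label: string;
  box: { top: number; left: number };
}

// Return a sorted copy: top-to-bottom first, then left-to-right on ties.
// Assumption: single-column layout. Multi-column pages need more than this.
function sortByBoundingBox<T extends BoxedItem>(items: readonly T[]): T[] {
  return items
    .slice()
    .sort((a, b) => a.box.top - b.box.top || a.box.left - b.box.left);
}

// Made-up positions for the example document in this issue, deliberately
// listed out of order.
const pageItems: BoxedItem[] = [
  { label: "Another Line", box: { top: 0.4, left: 0.1 } },
  { label: "Table", box: { top: 0.22, left: 0.1 } },
  { label: "Some Line", box: { top: 0.1, left: 0.1 } },
  { label: "Key: Value", box: { top: 0.15, left: 0.1 } },
];

console.log(
  sortByBoundingBox(pageItems)
    .map((item) => item.label)
    .join(" | ")
);
// prints "Some Line | Key: Value | Table | Another Line"
```

This works for simple pages but breaks down on multi-column or skewed documents, which is the generalization problem noted above.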

athewsey commented 11 months ago

Understand that this could be useful but agree it may be difficult to generalise... Since Textract's new native layout analysis feature may help give a more generalisable basis than our current pseudo-paragraph heuristic, I'll probably suggest to park this until we've tackled #164

athewsey commented 7 months ago

With today's release of amazon-textract-response-parser v0.4.0, users can run the source document through Amazon Textract with Layout analysis enabled, then use TRP.js to loop through the content elements. These are returned in estimated reading order and can map to FORMS and TABLES items. Something like:

import { ApiBlockType, LayoutKeyValue, LayoutTable } from "amazon-textract-response-parser";

// layout.listItems() are implicitly in human reading order:
page.layout.listItems().forEach((layItem) => {
  if (layItem.blockType === ApiBlockType.LayoutKeyValue) {
    // *Usually* multiple K-V fields per LayoutKeyValue block:
    const fields = (layItem as LayoutKeyValue).listFields();
    fields.forEach((field) => console.log(field.key.text));
  } else if (layItem.blockType === ApiBlockType.LayoutTable) {
    // *Usually* just one table per LayoutTable block:
    const tables = (layItem as LayoutTable).listTables();
    tables.forEach((table) => console.log(table.nCells));
  } else {
    // Other items e.g. title, section header, paragraph, etc
    layItem.listTextLines().forEach((line) => console.log(line.text));
  }
});

I tentatively believe this should support the original use-case of linking from human reading order to not just LINEs of text but also other analyses' results - subject to the caveats that:

  1. It depends on the Textract-side LAYOUT analysis feature being enabled, and
  2. Due to the nature of the Layout analysis feature, there's no guarantee of an exact 1-to-1 correspondence from text LINEs in the LayoutTable to text in the Table.
    • A simple approach could be to only use layoutKeyValue.listFields() and layoutTable.listTables() and not pay any attention to what text LINEs those blocks contain.
    • Today we work around this in our LayoutTableGeneric.html() method by scanning through the LayoutTable's linked LINEs and WORDs, inserting the full representation of the TABLE wherever we first see an overlap, and then continuing the LINEs scan but omitting any content that's already been rendered.
    • ...But didn't expose this logic yet because it seemed a bit early/experimental.
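
On the first caveat, a minimal sketch of a request that asks Textract for Layout alongside tables and forms. The field names (`Document`, `S3Object`, `Bucket`, `Name`, `FeatureTypes`) follow the Textract AnalyzeDocument API; the `buildAnalyzeInput` helper, the narrowed types and the bucket/key values are hypothetical placeholders.

```typescript
// Narrowed, hand-written types for illustration only. The real API accepts
// more feature types and other ways of supplying the document.
type FeatureType = "TABLES" | "FORMS" | "LAYOUT";

interface AnalyzeDocumentInput {
  Document: { S3Object: { Bucket: string; Name: string } };
  FeatureTypes: FeatureType[];
}

// Hypothetical helper: build the request parameters with LAYOUT enabled so
// the response contains the layout blocks that TRP.js reads.
function buildAnalyzeInput(bucket: string, key: string): AnalyzeDocumentInput {
  return {
    Document: { S3Object: { Bucket: bucket, Name: key } },
    FeatureTypes: ["LAYOUT", "TABLES", "FORMS"],
  };
}

// Placeholder bucket and object key, not real resources.
const analyzeInput = buildAnalyzeInput("my-example-bucket", "docs/sample.png");
console.log(JSON.stringify(analyzeInput.FeatureTypes));
// prints ["LAYOUT","TABLES","FORMS"]
```

An object like this would typically be passed to `AnalyzeDocumentCommand` from `@aws-sdk/client-textract`, and the JSON response handed to TRP.js.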

@sam-goodwin (or others) it'd be great to hear whether this method already solves your needs.

We could consider exposing some kind of LayoutTable.listXYZ() API that tries to do the .html()-like reconciliation and return a linear list of {TABLE, LINE, and/or WORD}? I'm just nervous about edge cases e.g. I think in some example docs I've even seen TABLEs that overlap with each other.
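
To make the idea concrete, here is a rough sketch of what such a reconciliation could look like. It is not the library's `.html()` logic: every type and name below (`TextLine`, `TableRegion`, `FlowItem`, `linearize`) is hypothetical, and it simplifies by working at whole-LINE level, so a line that only partly overlaps a table is dropped entirely once the table has been emitted.

```typescript
// Hypothetical, simplified shapes: lines and tables both list the IDs of
// the WORDs they contain, so overlap means "shares at least one word ID".
interface TextLine {
  text: string;
  wordIds: string[];
}
interface TableRegion {
  id: string;
  wordIds: string[];
}
type FlowItem =
  | { type: "LINE"; line: TextLine }
  | { type: "TABLE"; table: TableRegion };

// Walk the lines in order. Emit each table once, at the first line that
// overlaps it, and skip any later lines whose content the table covers.
function linearize(lines: TextLine[], tables: TableRegion[]): FlowItem[] {
  const out: FlowItem[] = [];
  const emittedTableIds: string[] = [];
  for (const line of lines) {
    const owners = tables.filter((t) =>
      line.wordIds.some((w) => t.wordIds.indexOf(w) >= 0)
    );
    if (owners.length === 0) {
      out.push({ type: "LINE", line });
      continue;
    }
    for (const table of owners) {
      if (emittedTableIds.indexOf(table.id) < 0) {
        emittedTableIds.push(table.id);
        out.push({ type: "TABLE", table });
      }
    }
  }
  return out;
}

// Made-up word IDs for the example document in this issue.
const exampleLines: TextLine[] = [
  { text: "Key: Value", wordIds: ["w1", "w2"] },
  { text: "Some Header", wordIds: ["w3", "w4"] },
  { text: "Some Value", wordIds: ["w5", "w6"] },
  { text: "Another Line", wordIds: ["w7", "w8"] },
];
const exampleTables: TableRegion[] = [
  { id: "t1", wordIds: ["w3", "w4", "w5", "w6"] },
];

console.log(
  linearize(exampleLines, exampleTables)
    .map((item) => item.type)
    .join(", ")
);
// prints "LINE, TABLE, LINE"
```

Overlapping tables are handled only in the crudest way here: a line shared by two tables emits both, in the order they were passed in.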

I probably wouldn't want to dive into a big project extending the heuristic getLineClustersInReadingOrder to also account for tables/forms when they're present but Layout wasn't enabled, because Layout should be the canonical source for reading-order information on non-trivial docs: it's AI-powered and should usually perform much better than our TRP-side heuristics.