Open sam-goodwin opened 1 year ago
Hi @sam-goodwin, are you referring to Python, JS or .NET?
JavaScript. I ended up doing it by sorting by the bounding box. It'd be nice if there were a way to traverse all the elements in the document in reading order, not just lines. But I understand that may not be easy to generalize in a way that works for all documents.
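For anyone landing here with the same need, the bounding-box workaround might look roughly like the sketch below. It assumes each block carries Textract-style `Geometry.BoundingBox` ratios; the `sortByBoundingBox` helper and its `rowHeight` parameter are hypothetical names for illustration, not part of TRP.js. It will not cope with multi-column pages.

```typescript
// Minimal shape of the geometry Textract attaches to each block
// (ratios of page width/height, origin at the top-left corner).
interface BoxLike {
  Top: number;
  Left: number;
  Height: number;
  Width: number;
}

interface BlockLike {
  Text: string;
  Geometry: { BoundingBox: BoxLike };
}

// Hypothetical helper: bucket blocks into rows of height `rowHeight`,
// then order by row (top to bottom) and by Left within a row.
function sortByBoundingBox<T extends BlockLike>(blocks: T[], rowHeight = 0.01): T[] {
  const rowOf = (b: T) => Math.floor(b.Geometry.BoundingBox.Top / rowHeight);
  // Copy first so the caller's array is not mutated.
  return blocks.slice().sort((a, b) => {
    const dRow = rowOf(a) - rowOf(b);
    if (dRow !== 0) return dRow;
    return a.Geometry.BoundingBox.Left - b.Geometry.BoundingBox.Left;
  });
}
```

Bucketing into rows (rather than comparing `Top` values with a tolerance) keeps the comparator consistent, at the cost of occasionally splitting one visual row across two buckets.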
Understand that this could be useful but agree it may be difficult to generalise... Since Textract's new native layout analysis feature may help give a more generalisable basis than our current pseudo-paragraph heuristic, I'll probably suggest to park this until we've tackled #164
With today's release of amazon-textract-response-parser v0.4.0, users can run the source document through Amazon Textract with Layout analysis enabled and then use TRP.js to loop through the content elements, which are returned in estimated reading order and can map to FORMS and TABLES items... Something like:
import { ApiBlockType, LayoutKeyValue, LayoutTable } from "amazon-textract-response-parser";

// layout.listItems() are implicitly in human reading order:
page.layout.listItems().forEach((layItem) => {
  if (layItem.blockType === ApiBlockType.LayoutKeyValue) {
    // *Usually* multiple K-V fields per LayoutKeyValue block:
    const fields = (layItem as LayoutKeyValue).listFields();
    fields.forEach((field) => console.log(field.key.text));
  } else if (layItem.blockType === ApiBlockType.LayoutTable) {
    // *Usually* just one table per LayoutTable block:
    const tables = (layItem as LayoutTable).listTables();
    tables.forEach((table) => console.log(table.nCells));
  } else {
    // Other items e.g. title, section header, paragraph, etc
    layItem.listTextLines().forEach((line) => console.log(line.text));
  }
});
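For the loop above to find any layout items, the Textract request itself has to ask for Layout analysis alongside FORMS and TABLES. A minimal sketch of the request input follows, assuming the document sits in S3; `buildLayoutRequest` and the `AnalyzeRequestSketch` type are hypothetical names, and the bucket and key are placeholders. The resulting object has the shape the AnalyzeDocument API expects, so it could be passed to the AWS SDK's `AnalyzeDocumentCommand`.

```typescript
// Feature types used in this thread (the API accepts others too).
type FeatureTypeSketch = "FORMS" | "TABLES" | "LAYOUT";

// Hypothetical local type mirroring the AnalyzeDocument request shape.
interface AnalyzeRequestSketch {
  Document: { S3Object: { Bucket: string; Name: string } };
  FeatureTypes: FeatureTypeSketch[];
}

// Build a request that enables Layout so reading-order items are returned,
// plus FORMS and TABLES so layout items can link to fields and tables.
function buildLayoutRequest(bucket: string, key: string): AnalyzeRequestSketch {
  return {
    Document: { S3Object: { Bucket: bucket, Name: key } },
    FeatureTypes: ["LAYOUT", "FORMS", "TABLES"],
  };
}

// Placeholder bucket and key, for illustration only.
const exampleRequest = buildLayoutRequest("my-example-bucket", "docs/sample.pdf");
```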
I tentatively believe this should support the original use-case of linking from human reading order to not just LINEs of text but also other analyses' results - subject to the caveats that:
[...] LAYOUT analysis feature being enabled, and
[...] LayoutTable to text in the Table.

[...] layoutKeyValue.listFields() and layoutTable.listTables() and not pay any attention to what text LINEs those blocks contain.
[...] LayoutTable's linked LINEs and WORDs, inserting the full representation of the TABLE wherever we first see an overlap, and then continuing the LINEs scan but omitting any content that's already been rendered.

@sam-goodwin (or others) it'd be great to hear if this method already solves your needs?
We could consider exposing some kind of LayoutTable.listXYZ() API that tries to do the .html()-like reconciliation and return a linear list of {TABLE, LINE, and/or WORD}? I'm just nervous about edge cases e.g. I think in some example docs I've even seen TABLEs that overlap with each other.
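To make the idea concrete, here is a rough sketch of what such a reconciliation could do. It is not TRP.js code: `SketchLine`, `SketchTable`, `LinearItem` and `linearize` are hypothetical names, and it assumes we already know which LINE IDs each table covers (it works at LINE level only and ignores WORDs).

```typescript
// Hypothetical, simplified stand-ins for TRP.js objects.
interface SketchLine { id: string; text: string }
interface SketchTable { id: string; lineIds: string[] } // LINEs covered by the table

type LinearItem =
  | { kind: "LINE"; line: SketchLine }
  | { kind: "TABLE"; table: SketchTable };

// Walk LINEs in reading order. On the first LINE that overlaps a table not yet
// rendered, emit the whole table and mark all of its LINEs as already rendered.
function linearize(lines: SketchLine[], tables: SketchTable[]): LinearItem[] {
  const result: LinearItem[] = [];
  const renderedLines: Record<string, boolean> = {};
  const renderedTables: Record<string, boolean> = {};

  lines.forEach((line) => {
    if (renderedLines[line.id]) return; // content already emitted via a table
    const hit = tables.filter(
      (t) => !renderedTables[t.id] && t.lineIds.indexOf(line.id) >= 0
    )[0];
    if (hit) {
      result.push({ kind: "TABLE", table: hit });
      renderedTables[hit.id] = true;
      hit.lineIds.forEach((id) => { renderedLines[id] = true; });
    } else {
      result.push({ kind: "LINE", line });
    }
  });
  return result;
}
```

With overlapping tables, a LINE shared by two tables is attributed to whichever table is rendered first, and the second table only appears if it also covers a LINE that has not yet been emitted. That is exactly the kind of edge case that makes a general API awkward.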
I probably wouldn't want to dive into a big project extending the heuristic getLineClustersInReadingOrder to also account for tables/forms when they're present but Layout wasn't enabled, because Layout should be the canonical source for Reading Order information on non-trivial docs: it's AI-powered and should usually perform much better than our TRP-side heuristics.
Take, for example, the below table. It occurs after "Key: Value" and before "Another Line".
I'd like to be able to iterate through a page and see the following items in order:
Is this possible?