aws-samples / amazon-textract-response-parser

Parse JSON response of Amazon Textract
Apache License 2.0
218 stars 95 forks source link

When iterating over a layoutList, the page iterator moves on to layoutText after iterating over each individual list item #177

Closed rsanaie closed 3 months ago

rsanaie commented 4 months ago

When I'm traversing a layout object, after visiting a layout list, the iterator then moves on to individual list items, causing duplication of nodes visited. Is this intended?

This happens in .html() function as well. I know that you can use the childBlockIds of LayoutList to mark down which nodes have been visited and skip them later, but wanted to confirm this behaviour.

Thank you, you guys are awesome!

const ignoredBlocks: ApiBlockType[] = [
    ApiBlockType.LayoutHeader, 
    ApiBlockType.LayoutFooter,
    ApiBlockType.LayoutPageNumber
]

for (const page of doc.iterPages()) {
    for (const layItem of page.layout.iterItems()) {
        if (!ignoredBlocks.includes(layItem.blockType)) {
            console.log(layItem.blockType)
            console.log(layItem.text + "\n")
        }
    }
}
athewsey commented 3 months ago

Hi, thanks for raising this!

Need to double-check I understand correctly here, but at first pass this is sounding like something we should fix...

It seems like the current state is:

  1. page.layout.iterItems() (and the corresponding listItems()) includes both the LAYOUT_LIST and any LAYOUT_TEXT children it links to (because they all seem to appear as CHILD blocks of the parent PAGE, which is where this list is derived from)
  2. This is confusing for anybody iterating through layout items, because they'd probably assume their code should deal with the LAYOUT_LIST all at once when it's seen - like it's iterating the top level of a tree and not a flat collection of all content nodes
  3. It's also causing page.html() to duplicate content because the .html() representation of the LAYOUT_LIST already includes the LAYOUT_TEXT items that the iterator picks up next.

Is that matching what you're seeing @rsanaie?

IMO 3/ would definitely be a bug. Agreeing on how to better handle 2/ might involve a few more edge cases...

This single-level List->Text tree is the only structure I'm aware of in current layout results, but at the service API level it seems like there's nothing to prevent adding more in future or defining deep nesting or more general graphs. I'd be really surprised if the service ever started returning circular relationships, but probably wouldn't completely rule out the possibility of shared children one day.

As a general rule, I've also been trying to drive as much parsing and iteration behaviour as possible directly from block Relationship lists (minimizing internal state like LayoutGeneric._items) to unblock future mutability of loaded documents at the expense of some runtime performance for certain repeated read operations.

...So I'm initially thinking yes we should change the behaviour of the page.layout.iterItems() iterator, and probably even make the new behaviour the default, but long-term might need to be a bit more formal in defining e.g. single-level & recursive layout child iterators across the different kinds of LayoutItem? For e.g. maybe every layout item could have child iterator methods like iterLayoutItems({ deep = false, skipBlockIds = {}}) or something - which would today always return empty except for LAYOUT_LIST, and which the top-level page.layout.iterItems() could use (in deep mode) to prevent exposing any nested children

rsanaie commented 3 months ago

Hi @athewsey yes that's the correct behavior I'm seeing.

I agree with you that #3 is a bug that needs to be fixed as the .html() is supposed to output usable data, but currently it's duplicating unnecessarily.

Using a tree analogy makes sense, maybe we should have an iterator that iterates through all nodes in some specific order, and then another iterator that visits children only (and leverage recursion)?

athewsey commented 3 months ago

Hi @rsanaie

We have a draft bugfix release available at v0.4.1-alpha.1 to fix the serialization bug and provide some basic ability to:

  1. toggle page.layout.{list/iter}Items() between recursive and top-level-only behaviour
  2. access {list/iter}LayoutChildren() from any LayoutItem (although in practice today only LAYOUT_LIST should have anything, I think - rest will always be empty lists.

Maybe you'd be able to try it out and share thoughts?

Originally I was thinking we should make deep: false the default behaviour of page.layout.{iter/list}Items(), but maybe it's a breaking change for some folks... Might be worth revisiting for a v0.5 down the road?

rsanaie commented 3 months ago

It's working great as expected, thank you! It'd be ideal to specify which blocks should be skipped from html such as page number, header/footer) I'll create a new issue for that!

athewsey commented 3 months ago

🙇 Thanks for the feedback!

The fix is now released to mainline version v0.4.1 on NPM with the merge of #178, so closing this issue & will refer to #179 for the other ask 👍