Open adityachandak287 opened 1 month ago
I've made the proposed changes here. I'd be happy to create a PR!
I also updated the minimal repro repo to use this latest version. The comparison between the fixed branch and main shows the change in the sample output, i.e. list content is no longer duplicated.
Found a couple other cases where `LAYOUT_LIST` had `LAYOUT_SECTION_HEADER` and `LAYOUT_TITLE` as its children. Created a PR with a fix that excludes all `LAYOUT*` elements which are children of `LAYOUT_LIST` elements.
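To illustrate the idea of dropping layout blocks that are nested under a list, here is a minimal sketch that works on plain Textract-style block dictionaries (`Id`, `BlockType`, `Relationships`). It is not the code from the PR: the helper names `child_ids` and `top_level_layout_blocks`, the `exclude_all_layout_children` flag, and the sample IDs are all hypothetical and only meant to show the two filtering variants.

```python
from typing import Any

Block = dict[str, Any]


def child_ids(block: Block) -> list[str]:
    """Collect the IDs listed in a block's CHILD relationships."""
    ids: list[str] = []
    for rel in block.get("Relationships") or []:
        if rel.get("Type") == "CHILD":
            ids.extend(rel.get("Ids", []))
    return ids


def top_level_layout_blocks(
    blocks: list[Block], exclude_all_layout_children: bool = True
) -> list[Block]:
    """Return LAYOUT* blocks, skipping those nested under a LAYOUT_LIST.

    Hypothetical helper: with exclude_all_layout_children=False only
    LAYOUT_TEXT children of a list are skipped.
    """
    by_id = {b["Id"]: b for b in blocks}
    nested: set[str] = set()
    for block in blocks:
        if block.get("BlockType") != "LAYOUT_LIST":
            continue
        for cid in child_ids(block):
            child = by_id.get(cid)
            if child is None:
                continue
            ctype = child.get("BlockType", "")
            if exclude_all_layout_children:
                if ctype.startswith("LAYOUT"):
                    nested.add(cid)
            elif ctype == "LAYOUT_TEXT":
                nested.add(cid)
    return [
        b
        for b in blocks
        if b.get("BlockType", "").startswith("LAYOUT") and b["Id"] not in nested
    ]


# Made-up sample: a list whose children are a text block and a section header.
SAMPLE_BLOCKS: list[Block] = [
    {
        "Id": "list-1",
        "BlockType": "LAYOUT_LIST",
        "Relationships": [{"Type": "CHILD", "Ids": ["text-1", "header-1"]}],
    },
    {
        "Id": "text-1",
        "BlockType": "LAYOUT_TEXT",
        "Relationships": [{"Type": "CHILD", "Ids": ["line-1"]}],
    },
    {"Id": "header-1", "BlockType": "LAYOUT_SECTION_HEADER"},
    {"Id": "line-1", "BlockType": "LINE"},
    {"Id": "text-2", "BlockType": "LAYOUT_TEXT"},
]

if __name__ == "__main__":
    for flag in (False, True):
        kept = top_level_layout_blocks(SAMPLE_BLOCKS, exclude_all_layout_children=flag)
        print(flag, [b["Id"] for b in kept])
```

With the flag off, the section header nested in the list is still returned as its own layout, so its text would be emitted twice. With the flag on, only the list and the standalone text block remain.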
Regarding that last message, do you have an example of `LAYOUT_LIST` containing `LAYOUT_SECTION_HEADER` or `LAYOUT_TITLE`?
Sure! I can't share the original document, but here's another sample document which recreates the `LAYOUT_LIST` -> `LAYOUT_SECTION_HEADER` scenario. JSON output for reference.
You can check this comparison to see the difference between ignoring only `LAYOUT_TEXT` children versus ignoring all layout children.
Current Behavior
While trying to create markdown or text files from AWS Textract JSON output using the `get_text_from_layout_json` function, the contents of ALL the list items are duplicated in the output.
Expected Behavior
Each list item's contents should be included in the output only once.
Related Issues
274
Possible Solution
The AWS docs on Textract Layout Response Objects mention that in the case of `LAYOUT_LIST` elements, their children can point to `LAYOUT_TEXT` elements, which is the case here.
Due to this, when getting all layouts from the Textract JSON output (`LinearizeLayout._get_layout_blocks`), the `LAYOUT_LIST` as well as its child `LAYOUT_TEXT` layout elements are included, which leads to the duplication in output text.
The `get_text_from_layout_json` function is a wrapper over the `LinearizeLayout.get_text` function, which loops over all layouts (blocks with `LAYOUT.*` type) from the Textract JSON output and collects the text contents from their children blocks.
The fix lies in the `LinearizeLayout._get_layout_blocks` function, where we can exclude the `LAYOUT_TEXT` elements which are children of `LAYOUT_LIST` elements.
Steps to Reproduce
Minimal reproduction repo: adityachandak287/textractprettyprinter-list-duplication-bug-repro
The repository contains the following for reference:
Environment
```
amazon-textract-caller==0.2.4
amazon-textract-prettyprinter==0.1.10
amazon-textract-response-parser==0.1.48
boto3==1.35.6
botocore==1.35.6
```
Edit: Added related issues section.