aws-samples / amazon-textract-textractor

Analyze documents with Amazon Textract and generate output in multiple formats.
Apache License 2.0
389 stars 142 forks

Merged rows yes, merged columns no #220

Open fascani opened 1 year ago

fascani commented 1 year ago

Hello, I am having an issue with merged columns, and I noticed that the example in the documentation has the same problem. If you look at the "Consolidated Statement of Cash Flows" example at https://aws-samples.github.io/amazon-textract-textractor/notebooks/table_data_to_various_formats.html#Calling-Textract you will see that the columns "Three Month Ended June 30", "Six Month Ended June 30" and "Twelve Month Ended June 30" are split in the Excel output, even though I believe the information needed to merge them is there. (I say this because, looking at the relationships in the JSON file, it appears that cells can be linked and merged across columns.)

Is this a bug, or is there existing functionality to handle merged columns?

fascani commented 1 year ago

In the meantime, I helped myself: I built a small Python helper package that merges columns correctly. See https://github.com/fascani/textract_json_to_df/tree/main

fascani commented 1 year ago

A follow-up on this: the JSON file from Textract itself is correct; it is the textractor functionality that creates the CSV that has issues with merged COLUMNS.
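
To illustrate what "the information is in the JSON" means, here is a minimal sketch of reading column spans from the raw Textract blocks. It relies on the `MERGED_CELL` block type and the `RowIndex`, `ColumnIndex`, `RowSpan` and `ColumnSpan` fields that Textract returns for tables. The `build_grid` function and the sample blocks are hypothetical, written only for this example. This is not textractor's code, nor the helper package linked above, and it assumes the block list describes a single table.

```python
def build_grid(blocks):
    """Build a 2D grid of cell text, repeating merged-cell text across its span.

    Hypothetical helper: assumes `blocks` holds the blocks of ONE table.
    """
    by_id = {b["Id"]: b for b in blocks}

    def children(block):
        # Follow CHILD relationships to the referenced blocks, in order.
        return [
            by_id[i]
            for rel in block.get("Relationships", [])
            if rel["Type"] == "CHILD"
            for i in rel["Ids"]
            if i in by_id
        ]

    def text_of(cell):
        # A CELL's text is the concatenation of its child WORD blocks.
        return " ".join(c["Text"] for c in children(cell) if c["BlockType"] == "WORD")

    cells = [b for b in blocks if b["BlockType"] == "CELL"]
    n_rows = max(c["RowIndex"] for c in cells)
    n_cols = max(c["ColumnIndex"] for c in cells)
    grid = [["" for _ in range(n_cols)] for _ in range(n_rows)]

    # First pass: plain cells (Textract indices are 1-based).
    for c in cells:
        grid[c["RowIndex"] - 1][c["ColumnIndex"] - 1] = text_of(c)

    # Second pass: each MERGED_CELL overwrites every position it spans
    # with the combined text of the CELL blocks it groups.
    for m in (b for b in blocks if b["BlockType"] == "MERGED_CELL"):
        parts = (text_of(p) for p in children(m))
        merged_text = " ".join(t for t in parts if t)
        row0, col0 = m["RowIndex"] - 1, m["ColumnIndex"] - 1
        for r in range(row0, row0 + m["RowSpan"]):
            for c in range(col0, col0 + m["ColumnSpan"]):
                grid[r][c] = merged_text
    return grid


def _word(id_, text):
    return {"Id": id_, "BlockType": "WORD", "Text": text}


def _cell(id_, row, col, word_ids):
    return {
        "Id": id_, "BlockType": "CELL", "RowIndex": row, "ColumnIndex": col,
        "RowSpan": 1, "ColumnSpan": 1,
        "Relationships": [{"Type": "CHILD", "Ids": word_ids}] if word_ids else [],
    }


# Hypothetical sample: a 2x3 table whose header spans columns 2 and 3.
sample_blocks = [
    _word("w1", "Three"), _word("w2", "Months"), _word("w3", "Ended"),
    _word("w4", "June"), _word("w5", "30"), _word("w6", "Item"),
    _word("w7", "2021"), _word("w8", "2020"),
    _cell("c11", 1, 1, []),
    _cell("c12", 1, 2, ["w1", "w2"]),
    _cell("c13", 1, 3, ["w3", "w4", "w5"]),
    _cell("c21", 2, 1, ["w6"]),
    _cell("c22", 2, 2, ["w7"]),
    _cell("c23", 2, 3, ["w8"]),
    {
        "Id": "m1", "BlockType": "MERGED_CELL", "RowIndex": 1, "ColumnIndex": 2,
        "RowSpan": 1, "ColumnSpan": 2,
        "Relationships": [{"Type": "CHILD", "Ids": ["c12", "c13"]}],
    },
]

grid = build_grid(sample_blocks)
```

Without the second pass, the header row in this sample would come out as two separate fragments ("Three Months" and "Ended June 30"), which is the kind of split described above. Repeating the merged text in every spanned column is only one possible design choice; a real exporter could instead write the text once and emit a merge range for Excel.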