aws-samples / amazon-textract-response-parser

Parse JSON response of Amazon Textract
Apache License 2.0

Merge_cell call does not work as expected #166

Closed sawasume closed 10 months ago

sawasume commented 10 months ago

Hello, I am using this library to parse the JSON generated by Textract. Many of my tables contain merged cells, and I need to extract the text from those cells and then create a CSV file from each table.

Below is the piece of code I am using:

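import json
from trp import Document

# table_case_3 is the path to the saved Textract JSON response
with open(table_case_3, 'r') as pfz_doc:
    pfz_textract_json = json.load(pfz_doc)

tdoc = Document(pfz_textract_json)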

When I use this call to print the contents of a merged cell, some of the text is omitted:

table.rows[i].cells[j].mergedText

Below is an image showing a cell-by-cell comparison of a table where the text of interest was "3 or 4 days", but the above call extracted only "3 4".

[Image: diff_between_mergedcell_text]

Another example where the merged cell text was 'supplied by'

[Image: output]

As you can see in the image, cells 1 and 2 of row 0 both display the same word, "supplied", when using the mergedText call,

whereas row 0 cell 2 displays the word "by" when using the .text call.

My expectation is that both cell 1 and cell 2 of row 0 should display "supplied by" when using the mergedText call.
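
The expected behaviour described above can be illustrated with a minimal sketch that works directly on the raw Textract JSON rather than through this library. It assumes the response contains MERGED_CELL blocks whose CHILD relationships point to CELL blocks, which in turn point to WORD blocks. The function name merged_cell_texts and the sample data are hypothetical, made up for illustration; this is not how the library implements mergedText, and it does not address the "3 or 4 days" case if those words are missing from the response itself.

```python
def merged_cell_texts(response):
    """Map each CELL id that belongs to a MERGED_CELL to the merged cell's full text."""
    blocks = {b["Id"]: b for b in response["Blocks"]}

    def child_ids(block):
        # Collect the ids listed under CHILD relationships (may be absent or None).
        return [
            i
            for rel in block.get("Relationships") or []
            if rel["Type"] == "CHILD"
            for i in rel["Ids"]
        ]

    def cell_text(cell):
        # Join the WORD children of a single CELL block.
        children = [blocks[i] for i in child_ids(cell) if i in blocks]
        return " ".join(c["Text"] for c in children if c["BlockType"] == "WORD")

    result = {}
    for block in response["Blocks"]:
        if block["BlockType"] != "MERGED_CELL":
            continue
        cells = [blocks[i] for i in child_ids(block) if i in blocks]
        # Read the constituent cells in row-major order.
        cells.sort(key=lambda c: (c["RowIndex"], c["ColumnIndex"]))
        text = " ".join(t for t in map(cell_text, cells) if t)
        for c in cells:
            result[c["Id"]] = text
    return result


# Hypothetical, hand-built response fragment (not real Textract output).
sample = {"Blocks": [
    {"Id": "m1", "BlockType": "MERGED_CELL",
     "Relationships": [{"Type": "CHILD", "Ids": ["c1", "c2"]}]},
    {"Id": "c1", "BlockType": "CELL", "RowIndex": 1, "ColumnIndex": 1,
     "Relationships": [{"Type": "CHILD", "Ids": ["w1"]}]},
    {"Id": "c2", "BlockType": "CELL", "RowIndex": 1, "ColumnIndex": 2,
     "Relationships": [{"Type": "CHILD", "Ids": ["w2"]}]},
    {"Id": "w1", "BlockType": "WORD", "Text": "supplied"},
    {"Id": "w2", "BlockType": "WORD", "Text": "by"},
]}

print(merged_cell_texts(sample))
```

In this sketch both constituent cells map to the same joined string, which is the result I would expect from mergedText for cells 1 and 2 of row 0.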

schadem commented 10 months ago

@sawasume can you share a JSON example?

sawasume commented 10 months ago

@schadem I have sent it internally via Slack.