aws-samples / amazon-textract-textractor

Analyze documents with Amazon Textract and generate output in multiple formats.
Apache License 2.0
360 stars 134 forks

Table cell incorrectly does not pick up the cell text/words, while Page -> Line picks up the words as in the Textract output #358

Open raidken opened 2 months ago

raidken commented 2 months ago

59766-textract-table.json

In the Textract output file, cell ID 3f98227c-2981-4cd5-b23c-bee82e96bb54 references three words, but the code below returns null for the words in that cell.

from textractor.entities.document import Document
document = Document.open(r"c:\temp\59766-textract-table.json")

Query for the line ID that references those same three words:

line_list = list(filter(lambda line: line.id == "3f98227c-2981-4cd5-b23c-bee82e96bb54", document.pages[6].lines))
print(line_list[0].words)

This returns the three words [Operating, Segment, Information].

The cell in the Textract output references the same three words, but the words and text incorrectly return null for the cell.

table_n = document.pages[6].tables[1]

Find the cell and output its words:

for cell in table_n.table_cells:
    if cell.id == "c23b7b9e-7b90-42d4-ad94-41caa8931417":
        print(cell.words)

This returns null.
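To cross-check the raw Textract JSON independently of the library, the CHILD relationships of a block can be followed by hand. The following is a minimal sketch, assuming the file is a single response dict with a top-level "Blocks" list (a file holding several paginated responses would need to be merged first). The helper name child_words and the miniature sample response with its IDs are hypothetical and only illustrate the shape of the data; they are not taken from the attached file.

```python
import json


def child_words(response, block_id):
    """Return the Text of every WORD block listed as a CHILD of block_id."""
    # Index all blocks by Id so that relationships can be resolved quickly.
    blocks = {b["Id"]: b for b in response.get("Blocks", [])}
    block = blocks.get(block_id)
    if block is None:
        return []
    words = []
    # "Relationships" may be absent (or None) on blocks without children.
    for rel in block.get("Relationships") or []:
        if rel.get("Type") != "CHILD":
            continue
        for child_id in rel.get("Ids", []):
            child = blocks.get(child_id)
            if child is not None and child.get("BlockType") == "WORD":
                words.append(child.get("Text", ""))
    return words


# Hypothetical miniature response; the IDs are invented for illustration.
sample = {
    "Blocks": [
        {"Id": "cell-1", "BlockType": "CELL",
         "Relationships": [{"Type": "CHILD", "Ids": ["w1", "w2", "w3"]}]},
        {"Id": "w1", "BlockType": "WORD", "Text": "Operating"},
        {"Id": "w2", "BlockType": "WORD", "Text": "Segment"},
        {"Id": "w3", "BlockType": "WORD", "Text": "Information"},
    ]
}

print(child_words(sample, "cell-1"))

# With a real file, the response would be loaded first, for example:
# with open(path_to_json) as f:
#     response = json.load(f)
```

If this prints the expected words for a cell ID while cell.words is empty for the same ID, the words are present in the JSON and the discrepancy lies in how the cell is populated from it.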

Belval commented 1 month ago

I am able to reproduce the issue. Could you provide the original document for that response? It would make it easier to troubleshoot.