aws-samples / amazon-textract-textractor

Analyze documents with Amazon Textract and generate output in multiple formats.

Incorrect table cell word and line order #369

Open wessens opened 1 month ago

wessens commented 1 month ago

Hello, this issue seems very similar to #136, but I can't get it to work: the word and line order inside table cells is not preserved when calling the get_text method.

The attached JSON is the result of running Textract start_document_analysis with parameters [TextractFeatures.TABLES, TextractFeatures.LAYOUT].

When running

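import json

import textractor
from textractor.entities.document import Document

# Load the saved Textract response; the with-block closes the file afterwards
with open('../data/processed/6e2ab4b2a234e0410205db117803203a1be55a3fc766d56083c62512d71e556e.json') as f:
    j = json.load(f)

# Build a Document from the parsed response and print the second table
doc = Document.open(j)
print(doc.tables[1].get_text())

print(textractor.__version__)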

I get output such as

...
of adolescent and girls

6.1.2.4 the Ensure
...

But the actual lines are "of adolescent girls and" and "6.1.2.4 Ensure the", and the line order within the cell is also different from the document.

The blocks seem fine, and the child order in "Relationships" also seems correct.

What am I doing wrong?
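
For reference, here is a minimal sketch of how the child order can be checked directly against the raw JSON, without going through Textractor. It assumes the file is a single dict with a top-level "Blocks" list in the standard Textract response shape (a paginated result saved as a list of responses would need merging first). The helper name cell_words_in_child_order is hypothetical and not part of Textractor.

```python
def cell_words_in_child_order(response, table_index):
    """Map (row, column) to the cell text, joining WORD children in listed order.

    Assumes `response` is one Textract response dict with a "Blocks" list.
    """
    blocks = response["Blocks"]
    by_id = {block["Id"]: block for block in blocks}

    def child_ids(block):
        # "Relationships" may be missing or null on blocks without children
        ids = []
        for rel in block.get("Relationships") or []:
            if rel["Type"] == "CHILD":
                ids.extend(rel["Ids"])
        return ids

    # TABLE blocks in the order they appear in the response
    tables = [block for block in blocks if block["BlockType"] == "TABLE"]
    table = tables[table_index]

    result = {}
    for cell_id in child_ids(table):
        cell = by_id.get(cell_id)
        if cell is None or cell["BlockType"] != "CELL":
            continue  # skip merged cells, titles, footers and the like
        words = [
            by_id[word_id]["Text"]
            for word_id in child_ids(cell)
            if word_id in by_id and by_id[word_id]["BlockType"] == "WORD"
        ]
        result[(cell["RowIndex"], cell["ColumnIndex"])] = " ".join(words)
    return result
```

Calling cell_words_in_child_order(j, 1) on the loaded response and comparing the result with doc.tables[1].get_text() should show whether the reordering is already present in the JSON or is introduced when the text is rendered. Note that the table order in the raw blocks is assumed, not guaranteed, to match the order of doc.tables.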

6e2ab4b2a234e0410205db117803203a1be55a3fc766d56083c62512d71e556e.json

wessens commented 1 month ago

Sorry, I am using Textractor version 1.7.11

Belval commented 1 month ago

I'll try to reproduce the issue on our side and get back to you on this. Thanks!