Incorrect table cell word and line order

wessens commented 1 month ago

Hello, this issue seems very similar to #136, but I just can't make it work: the word and line order inside table cells is not preserved when calling the get_text method.

The attached JSON is the result of running Textract start_document_analysis with the features [TextractFeatures.TABLES, TextractFeatures.LAYOUT].

When running

import json

import textractor
from textractor.entities.document import Document

j = json.load(open('../data/processed/6e2ab4b2a234e0410205db117803203a1be55a3fc766d56083c62512d71e556e.json'))

doc = Document.open(j)
print(doc.tables[1].get_text())

print(textractor.__version__)

I get as output for example

...
of adolescent and girls

6.1.2.4 the Ensure
...

But the actual lines are "of adolescent girls and" and "6.1.2.4 Ensure the", and the line order is different as well.

The blocks seem fine, and the child order in "Relationships" also seems correct.
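For reference, here is a minimal sketch of how that check can be done directly on the raw JSON, without going through Textractor. It assumes the usual Textract block layout ("Id", "BlockType", "Text", and "Relationships" entries of type "CHILD"). The helper name cell_texts_in_child_order and the sample blocks are made up for illustration and are not taken from the attached file.

```python
from typing import List, Tuple


def cell_texts_in_child_order(blocks: List[dict]) -> List[Tuple[int, int, str]]:
    """Return (RowIndex, ColumnIndex, text) for every CELL block.

    The text is built by following the cell's CHILD ids in the order
    they are listed, so it reflects the order stored in the raw response.
    """
    by_id = {block["Id"]: block for block in blocks}
    cells = []
    for block in blocks:
        if block.get("BlockType") != "CELL":
            continue
        words = []
        for relationship in block.get("Relationships") or []:
            if relationship.get("Type") != "CHILD":
                continue
            for child_id in relationship.get("Ids", []):
                child = by_id.get(child_id)
                # Only WORD children carry text; other child types are skipped.
                if child is not None and child.get("BlockType") == "WORD":
                    words.append(child["Text"])
        cells.append((block["RowIndex"], block["ColumnIndex"], " ".join(words)))
    return cells


# Hand-made sample: the WORD blocks are listed out of order on purpose,
# the CELL's CHILD ids define the reading order.
sample_blocks = [
    {"Id": "w3", "BlockType": "WORD", "Text": "girls"},
    {"Id": "w1", "BlockType": "WORD", "Text": "of"},
    {"Id": "w4", "BlockType": "WORD", "Text": "and"},
    {"Id": "w2", "BlockType": "WORD", "Text": "adolescent"},
    {
        "Id": "c1",
        "BlockType": "CELL",
        "RowIndex": 1,
        "ColumnIndex": 1,
        "Relationships": [{"Type": "CHILD", "Ids": ["w1", "w2", "w3", "w4"]}],
    },
]

for row, column, text in cell_texts_in_child_order(sample_blocks):
    print(row, column, text)
```

On a real file, the list passed in would be the "Blocks" list of the loaded response. If the saved output holds several paginated responses, their "Blocks" lists would need to be concatenated first.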

What am I doing wrong?

6e2ab4b2a234e0410205db117803203a1be55a3fc766d56083c62512d71e556e.json

wessens commented 1 month ago

Sorry, I am using Textractor version 1.7.11

Belval commented 1 month ago

I'll try to reproduce the issue on our side and get back to you on this. Thanks!

aws-samples / amazon-textract-textractor

Incorrect table cell word and line order #369