Open keitaf opened 3 months ago
Thank you for sharing the problematic sample. I will need to reproduce the issue first. The code snippet that you highlighted is used to reconcile lines and ensure that the words within a given line are ordered by their x coordinate. It should not result in what you are seeing, even though `compare_bounding_box` is indeed the culprit.
This line https://github.com/aws-samples/amazon-textract-textractor/blob/master/textractor/utils/text_utils.py#L20 creates a new line if the y distance from the center of word a to the center of word b is too large. This is likely what is happening here.
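For illustration, a check of that kind could look roughly like the sketch below. It is not the code from `text_utils.py`: the `Box` type, the `same_line` helper and the `tolerance` factor are hypothetical names and values chosen for the example, and the library's actual threshold may differ.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Box:
    """Hypothetical stand-in for a word bounding box (normalized coordinates)."""
    x: float
    y: float
    width: float
    height: float

    @property
    def center_y(self) -> float:
        return self.y + self.height / 2


def same_line(a: Box, b: Box, tolerance: float = 0.5) -> bool:
    """Treat a and b as one line when their vertical centers are closer
    than `tolerance` times the smaller box height (assumed rule)."""
    return abs(a.center_y - b.center_y) <= tolerance * min(a.height, b.height)


# Made-up boxes: two words on nearly the same baseline, one clearly lower.
word_a = Box(x=0.10, y=0.500, width=0.10, height=0.02)
word_b = Box(x=0.30, y=0.502, width=0.10, height=0.02)
word_c = Box(x=0.30, y=0.530, width=0.10, height=0.02)
print(same_line(word_a, word_b), same_line(word_a, word_c))
```

On a tilted page the vertical centers drift along a line, so any fixed tolerance of this kind can split one visual line or merge two adjacent ones.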
Same issue as #369
Here is the Textract response JSON generated from this PDF file.
I can reproduce it by running the following code.
from textractor.entities.document import Document
document = Document.open('P010032A.pdf.json')
text = document.get_text()
print(text)
DEPARTMENT OF HEALTH & HUMAN SERVICES
Public Health Service
...
Dear Mr. Johnson:
Administration (FDA) has completed its review of your premarket approval application the The Center for Devices and Radiological Health (CDRH) of the Food and Drug This device is indicated as an aid in the management of chronic intractable pain programmer, the Model 1232 programming wand and the Model 1210 patient magnet. of the following components: the Model 3608 pulse generator, the Model 3850 patient (PMA) for the Genesis Neurostimulation (IPG) System. The System includes trunk and/or limbs, including unilateral or bilateral pain associated with failed back that surgery the PMA is approved subject to the conditions described below and in the syndrome, intractable low back pain and leg pain. We are pleased to inform you "Conditions of Approval" (enclosed). You may begin commercial distribution of the device upon receipt of this letter.
...
When I send a PDF with the following paragraph (which is a bit tilted, part of this PDF file) and use `Document.get_text()`, I get the following text, in which the order of the lines is shuffled.

I debugged the code, and it looks like it's due to `text_util.compare_bounding_box()`, which is called from `layout.group_elements_horizontally()`.

`group_elements_horizontally()` receives a list of elements, which are the layout texts for this paragraph.

The first element has `BoundingBox` as `x: 0.08591524511575699, y: 0.4836207926273346, width: 0.6273355484008789, height: 0.03193599358201027` and `text` as `'The Center for Devices and Radiological Health (CDRH) of the Food and Drug'`.

The second element has `BoundingBox` as `x: 0.08505144715309143, y: 0.5002045631408691, width: 0.6902255415916443, height: 0.03553390130400658` and `text` as `'Administration (FDA) has completed its review of your premarket approval application the'`.

`group_elements_horizontally()` sorts the elements by using `compare_bounding_box()`, and due to the following block, `compare_bounding_box()` sorts the elements by x axis instead of y axis. Because of that, the second element comes before the first element after the sort.
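To make the effect concrete, here is a minimal sketch that reproduces the swap with the two bounding boxes quoted above. The comparator `overlap_then_x` is a hypothetical stand-in, not the actual `compare_bounding_box()` implementation: it assumes that any vertical overlap between two boxes makes the comparison fall back to the x axis.

```python
from functools import cmp_to_key

# The two layout elements quoted above.
first = {
    "x": 0.08591524511575699, "y": 0.4836207926273346,
    "width": 0.6273355484008789, "height": 0.03193599358201027,
    "text": "The Center for Devices and Radiological Health (CDRH) of the Food and Drug",
}
second = {
    "x": 0.08505144715309143, "y": 0.5002045631408691,
    "width": 0.6902255415916443, "height": 0.03553390130400658,
    "text": "Administration (FDA) has completed its review of your premarket approval application the",
}


def vertical_overlap(a, b):
    """Height of the vertical span shared by boxes a and b (0.0 if none)."""
    top = max(a["y"], b["y"])
    bottom = min(a["y"] + a["height"], b["y"] + b["height"])
    return max(0.0, bottom - top)


def overlap_then_x(a, b):
    """Hypothetical heuristic: boxes that overlap vertically are treated
    as the same row and ordered by x; otherwise they are ordered by y."""
    if vertical_overlap(a, b) > 0:
        return (a["x"] > b["x"]) - (a["x"] < b["x"])
    return (a["y"] > b["y"]) - (a["y"] < b["y"])


shuffled = sorted([first, second], key=cmp_to_key(overlap_then_x))
for element in shuffled:
    print(element["text"])
```

Because the page is tilted, the two boxes overlap vertically by roughly 0.015, and the second box starts slightly further left (x of about 0.0851 versus 0.0859), so this sketch puts the "Administration (FDA) ..." line first. A comparator of this shape is also not guaranteed to be transitive across more than two elements, which can make the result of the sort depend on the input order.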
`compare_bounding_box()` was introduced in this commit, but it's unclear to me what the heuristic behind the logic was. Could you please improve / fix the logic of `compare_bounding_box()`, and/or add an option to not use the heuristic and simply order the elements by y axis?
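As a sketch of what the plain y-axis ordering requested above could look like, assuming a simple `Element` record and a `sort_top_to_bottom` helper (both hypothetical, not part of Textractor):

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Element:
    """Hypothetical record holding a text and the top-left corner of its box."""
    text: str
    x: float
    y: float


def sort_top_to_bottom(elements):
    """Order elements by the top edge of their box, using x only as a tie-break."""
    return sorted(elements, key=lambda e: (e.y, e.x))


# The two elements from this issue, given in the wrong order on purpose.
elements = [
    Element(
        "Administration (FDA) has completed its review of your premarket approval application the",
        x=0.08505144715309143, y=0.5002045631408691,
    ),
    Element(
        "The Center for Devices and Radiological Health (CDRH) of the Food and Drug",
        x=0.08591524511575699, y=0.4836207926273346,
    ),
]
ordered = sort_top_to_bottom(elements)
for element in ordered:
    print(element.text)
```

This restores the expected order for the two lines here, but a pure y sort would likely not be a good default for multi-column pages, which is presumably why having it as an opt-in option would be enough for my use case.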