aws-samples / amazon-textract-textractor

Analyze documents with Amazon Textract and generate output in multiple formats.
Apache License 2.0

Incorrect order of text layouts due to compare_bounding_box() used in group_elements_horizontally() #389

Open keitaf opened 3 months ago

keitaf commented 3 months ago

When I send a PDF containing the following paragraph (which is slightly tilted; part of this PDF file)

[image: the tilted paragraph]

and use Document.get_text(), I get the following text, in which the order of the lines is shuffled.

Administration (FDA) has completed its review of your premarket approval application the The Center for Devices and Radiological Health (CDRH) of the Food and Drug This device is indicated as an aid in the management of chronic intractable pain programmer, the Model 1232 programming wand and the Model 1210 patient magnet. of the following components: the Model 3608 pulse generator, the Model 3850 patient (PMA) for the Genesis Neurostimulation (IPG) System. The System includes trunk and/or limbs, including unilateral or bilateral pain associated with failed back that surgery the PMA is approved subject to the conditions described below and in the syndrome, intractable low back pain and leg pain. We are pleased to inform you "Conditions of Approval" (enclosed). You may begin commercial distribution of the device upon receipt of this letter. 

I debugged the code, and it looks like it's due to text_util.compare_bounding_box(), which is called from layout.group_elements_horizontally().

group_elements_horizontally() receives a list of elements, which are layout texts for this paragraph.

The first element has BoundingBox as x: 0.08591524511575699, y: 0.4836207926273346, width: 0.6273355484008789, height: 0.03193599358201027 and text as 'The Center for Devices and Radiological Health (CDRH) of the Food and Drug'.

The second element has BoundingBox as x: 0.08505144715309143, y: 0.5002045631408691, width: 0.6902255415916443, height: 0.03553390130400658 and text as 'Administration (FDA) has completed its review of your premarket approval application the'.

group_elements_horizontally() sorts the elements using compare_bounding_box(). Because of the following block, compare_bounding_box() orders these two elements by their x coordinate instead of their y coordinate.

    if abs(ay_mid - by_mid) < delta:
        if a.bbox.x > b.bbox.x:
            return 1
        else:
            return -1

Because of that, the second element comes before the first element after the sort.
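To make the effect concrete, here is a minimal, self-contained sketch that applies the quoted branch to the two bounding boxes above. It is not the textractor implementation: `BBox`, `Element`, and `compare_sketch` are hypothetical stand-ins, the value of `delta` (the smaller of the two heights) is an assumption, and the fallback to a y-midpoint comparison is also an assumption. Only the `if abs(ay_mid - by_mid) < delta` branch is taken from the snippet quoted in this issue.

```python
from dataclasses import dataclass
from functools import cmp_to_key


@dataclass
class BBox:
    x: float
    y: float
    width: float
    height: float


@dataclass
class Element:
    bbox: BBox
    text: str


def compare_sketch(a: Element, b: Element) -> int:
    """Hypothetical comparator mirroring the branch quoted in this issue."""
    ay_mid = a.bbox.y + a.bbox.height / 2
    by_mid = b.bbox.y + b.bbox.height / 2
    # ASSUMPTION: the real delta in textractor may be computed differently.
    delta = min(a.bbox.height, b.bbox.height)
    if abs(ay_mid - by_mid) < delta:
        # Treated as "same line": order by x only (the quoted branch).
        return 1 if a.bbox.x > b.bbox.x else -1
    # ASSUMPTION: otherwise fall back to top-to-bottom order.
    return 1 if ay_mid > by_mid else -1


first = Element(
    BBox(0.08591524511575699, 0.4836207926273346,
         0.6273355484008789, 0.03193599358201027),
    "The Center for Devices and Radiological Health (CDRH) of the Food and Drug",
)
second = Element(
    BBox(0.08505144715309143, 0.5002045631408691,
         0.6902255415916443, 0.03553390130400658),
    "Administration (FDA) has completed its review of your premarket approval application the",
)

# Vertical distance between the two line centers (about 0.018).
mid_gap = abs(
    (first.bbox.y + first.bbox.height / 2)
    - (second.bbox.y + second.bbox.height / 2)
)

ordered = sorted([first, second], key=cmp_to_key(compare_sketch))
for element in ordered:
    print(element.text)
```

Under these assumptions the gap between the line centers is smaller than either line's height, so the two lines are treated as one row and the tiny difference in x (about 0.0009) decides the order, which puts the "Administration ..." line first.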

compare_bounding_box() was introduced in this commit, but the heuristic behind the logic is unclear to me.

Could you please improve / fix the logic of compare_bounding_box(), and/or add an option to not use the heuristic and simply order the elements by y axis?
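As a rough illustration of the "simply order by y axis" option, the sketch below sorts layout elements by the vertical midpoint of their bounding boxes, using x only as a tie-breaker. `BBox`, `Element`, and `sort_by_y` are hypothetical names for this example and are not part of textractor; whether such an option would fit the library's design is for the maintainers to judge.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class BBox:
    x: float
    y: float
    width: float
    height: float


@dataclass
class Element:
    bbox: BBox
    text: str


def sort_by_y(elements: List[Element]) -> List[Element]:
    """Hypothetical helper: order top-to-bottom by vertical midpoint,
    then left-to-right for elements with an identical midpoint."""
    return sorted(
        elements,
        key=lambda e: (e.bbox.y + e.bbox.height / 2, e.bbox.x),
    )


# The two layout text elements described in this issue.
line_1 = Element(
    BBox(0.08591524511575699, 0.4836207926273346,
         0.6273355484008789, 0.03193599358201027),
    "The Center for Devices and Radiological Health (CDRH) of the Food and Drug",
)
line_2 = Element(
    BBox(0.08505144715309143, 0.5002045631408691,
         0.6902255415916443, 0.03553390130400658),
    "Administration (FDA) has completed its review of your premarket approval application the",
)

# Input order is deliberately reversed to show that the sort restores it.
result = sort_by_y([line_2, line_1])
for element in result:
    print(element.text)
```

A pure y sort like this would keep full-width lines of a tilted paragraph in reading order, but it gives up left-to-right ordering for elements that sit side by side with slightly different y values, so it is probably only suitable as an opt-in behavior.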

Belval commented 3 months ago

Thank you for sharing the problematic sample. I will need to reproduce the issue first. The code snippet that you highlighted is used to reconcile lines and ensure that the words within a given line are ordered by their x coordinate. It should not result in what you are seeing, even though compare_bounding_box is indeed the culprit.

This line https://github.com/aws-samples/amazon-textract-textractor/blob/master/textractor/utils/text_utils.py#L20 creates new lines if the y distance from the center of word a to the center of word b is too large. This is likely what is happening here.

Belval commented 3 months ago

Same issue as #369

keitaf commented 3 months ago

Here is the Textract response JSON generated from this PDF file.

I can reproduce it by running the following code.

    from textractor.entities.document import Document

    document = Document.open('P010032A.pdf.json')
    text = document.get_text()
    print(text)

Output:

DEPARTMENT OF HEALTH & HUMAN SERVICES 

Public Health Service 
...
Dear Mr. Johnson: 

Administration (FDA) has completed its review of your premarket approval application the The Center for Devices and Radiological Health (CDRH) of the Food and Drug This device is indicated as an aid in the management of chronic intractable pain programmer, the Model 1232 programming wand and the Model 1210 patient magnet. of the following components: the Model 3608 pulse generator, the Model 3850 patient (PMA) for the Genesis Neurostimulation (IPG) System. The System includes trunk and/or limbs, including unilateral or bilateral pain associated with failed back that surgery the PMA is approved subject to the conditions described below and in the syndrome, intractable low back pain and leg pain. We are pleased to inform you "Conditions of Approval" (enclosed). You may begin commercial distribution of the device upon receipt of this letter. 
...