aws-samples / amazon-textract-response-parser

Parse JSON response of Amazon Textract
Apache License 2.0

How to print all multi-column variable text in reading order #65

Open wilianuhlmann opened 2 years ago

wilianuhlmann commented 2 years ago

Hi, I've been trying to extract only the text, in reading order, from multi-column documents with the code below. The problem is that the number of columns has to be set manually. I've been trying to use the response parser in this code but couldn't get it to work. Could you give an example of how to do it? What I've achieved so far with amazon-textract-response-parser keeps mixing up the reading order.

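import boto3

# Document
s3BucketName = "your-bucket-name"
documentName = "your-image.png"

# Call Amazon Textract (assumes AWS credentials and a default region are configured)
textract = boto3.client('textract')
response = textract.detect_document_text(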
    Document={
        'S3Object': {
            'Bucket': s3BucketName,
            'Name': documentName
        }
    })

print(response)

# Detect columns and print lines
columns = []
lines = []
for item in response["Blocks"]:
    if item["BlockType"] == "LINE":
        column_found=False
        for index, column in enumerate(columns):
            bbox_left = item["Geometry"]["BoundingBox"]["Left"]
            bbox_right = item["Geometry"]["BoundingBox"]["Left"] + item["Geometry"]["BoundingBox"]["Width"]
            # Divide by the number of existing columns
            # manual input that I need to resolve to sort either with 1, 2, 3, 4 or 5 columns
            bbox_centre = item["Geometry"]["BoundingBox"]["Left"] + item["Geometry"]["BoundingBox"]["Width"]/2
            # Midpoint of the column's left and right edges
            column_centre = (column['left'] + column['right'])/2

            if (bbox_centre > column['left'] and bbox_centre < column['right']) or (column_centre > bbox_left and column_centre < bbox_right):
                #Bbox appears inside the column
                lines.append([index, item["Text"]])
                column_found=True
                break
        if not column_found:
            columns.append({'left':item["Geometry"]["BoundingBox"]["Left"], 'right':item["Geometry"]["BoundingBox"]["Left"] + item["Geometry"]["BoundingBox"]["Width"]})
            lines.append([len(columns)-1, item["Text"]])

lines.sort(key=lambda x: x[0])
for line in lines:
    print(line[1])

tb102122 commented 2 years ago

Can you add an example document so that the issue can be reproduced?

wilianuhlmann commented 2 years ago

Sure! The language in this example is Brazilian Portuguese. A single PDF file can contain several different column layouts. The number of columns varies in my example: a page can have one, two, three, four, or more columns. For this reason, the columns need to be identified automatically.

multcolumns_variable.pdf

one_column decreto_2012page-5 decreto_santospage-1

tb102122 commented 2 years ago

Hey @wilianuhlmann, did you try using the following package to get the text sorted? https://github.com/aws-samples/amazon-textract-textractor/blob/d7c6488d6a707647171641958dfe8f05d6ffbc62/src/trp.py#L526

wilianuhlmann commented 2 years ago

I think I'm doing something wrong. I tried this code:

json = 'decreto.json'
with open(json) as j:
    all_json = j.read()

def getLinesInReadingOrder(self):
    starting_point_tolerance = 0.01
    height_tolerance = 3
    same_line_top_tolerance = 0.001
    same_line_spacing_tolerance = 5
    columns = []
    lines = []
    for item in self._lines:
        column_found = False
        for index, column in enumerate(columns):
            bbox_left = item.geometry.boundingBox.left
            bbox_right = item.geometry.boundingBox.left + item.geometry.boundingBox.width
            bbox_centre = item.geometry.boundingBox.left + item.geometry.boundingBox.width / 2
            bbox_top = item.geometry.boundingBox.top
            bbox_height = item.geometry.boundingBox.height

            # new logic:  
            # if the starting point is within starting_point_tolerance (first_condition) and 
            # the top location is within height_tolerance * bbox_height (second_condition), or
            # the new line appeared to be broken by Textract mistake and should be of the same line 
            # by looking at the top (third_condition) and 
            # the left of the new line appears right next to the right of the last line (fourth_condition)
            # then consider the new line as part of said column
            first_condition = abs(bbox_left - column['left']) < starting_point_tolerance
            second_condition = abs(bbox_top - column['top']) < height_tolerance * bbox_height
            third_condition = abs(bbox_top - column['top']) < same_line_top_tolerance  # appeared to be in the same line
            fourth_condition = abs(bbox_left - column['right']) < same_line_spacing_tolerance * starting_point_tolerance
            if (first_condition and second_condition) or (third_condition and fourth_condition):
                # Bbox appears inside the column
                lines.append([index, item.text])
                # update the top and right with the new line added.
                columns[index]['top'] = bbox_top
                columns[index]['right'] = bbox_right
                column_found = True
                break
        if not column_found:
            columns.append({'left': item.geometry.boundingBox.left,
                            'right': item.geometry.boundingBox.left + item.geometry.boundingBox.width,
                            'top': item.geometry.boundingBox.top})
            lines.append([len(columns) - 1, item.text])

    lines.sort(key=lambda x: x[0])
    return lines

getLinesInReadingOrder(all_json)

But it raises an error:

---------------------------------------------------------------------------
AttributeError                            Traceback (most recent call last)
Input In [6], in <cell line: 51>()
     48     lines.sort(key=lambda x: x[0])
     49     return lines
---> 51 getLinesInReadingOrder(all_json)

Input In [6], in getLinesInReadingOrder(self)
     12 columns = []
     13 lines = []
---> 14 for item in self._lines:
     15     column_found = False
     16     for index, column in enumerate(columns):

AttributeError: 'str' object has no attribute '_lines'

My json file:

decreto.json.zip

tb102122 commented 2 years ago

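You missed the step that converts the Textract JSON into a Document class object. The function is a method of the page object, so it cannot be called on the raw JSON string.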


from trp import Document
import json

# Raw string so the backslashes in the Windows path are not treated as escapes
json_path = r"C:\GIT\cardess-email-lead-automation\decreto.json"
# Opening JSON file
with open(json_path) as json_file:
    textract_result = json.load(json_file)

#convert Textract response to Class Object
t_doc = Document(textract_result)
#show result of one page
page_ordered = t_doc.pages[0].getLinesInReadingOrder()
print(page_ordered)

wilianuhlmann commented 2 years ago

Thanks for helping me with the code, but the extraction still doesn't work for me. You can see in this screenshot that the reading order gets changed. It happens in any language.

2022-06-22_23-36

The original img

decreto_pg_1_manualp

tb102122 commented 2 years ago

I think the approach works, but you have to extend it to handle the footer, the headers, and the changes in alignment. Sample image:

Screenshot 2022-06-23 131315

Output: [[0, 'III - Secretaria Municipal de Cultura;'], [0, 'IV - Secretaria Municipal de Educação;'], [0, 'V - Secretaria Municipal de Saúde;'], [0, 'VI - Secretaria Municipal de Empreendedoris-'], [0, 'mo, Economia Criativa e Turismo;'], [0, 'VII - Secretaria Municipal de Desenvolvimento'], [0, 'Social;'], [0, 'VIII - Secretaria Municipal de Segurança;'], [0, 'IX - Secretaria Municipal de Esportes;'], [0, 'X - Ouvidoria, Transparência e Controle;'], [0, 'XI - Secretaria Municipal de Gestão.'], [0, '§ 1° Os membros titulares e suplentes do Grupo'], [0, 'Técnico de Trabalho serão os chefes de departa-'], [0, 'mento, coordenadores, técnicos ou funcionários'], [0, 'indicados pelas Secretarias mencionadas neste'], [0, 'artigo.'], [0, '§ 2° o coordenador do Grupo Técnico de Traba-'], [0, 'lho poderá convidar para participar das reuniões,'], [0, 'representantes da Administração Pública direta e'], [0, 'indireta, federal, estadual e de outros órgãos da'], [0, 'administração municipal.'], [0, '§ 3° A Secretaria Executiva do Grupo Técnico de'], [0, 'Trabalho será exercida pelo Departamento de Ci-'], [0, 'dadania da Secretaria Municipal de Governo.'], [0, 'Art. 4° Os membros indicados pelas Secretarias'], [0, 'Municipais mencionadas no artigo anterior serão'], [0, 'nomeados pelo Secretário Municipal de Governo,'], [0, 'por meio de portaria específica.'], [0, 'Art. 5° As funções exercidas pelos membros'], [0, 'do Grupo Técnico de Trabalho constituído por'], [0, 'este decreto não serão remuneradas, sendo po-'], [0, 'rém consideradas como de relevante interesse'], [0, 'público.'], [0, 'Art. 6° Fica revogado o Decreto n° 6.116, de 27'], [0, 'de abril de 2012.'], [0, 'Art. 7° Este decreto entra em vigor na data da'], [0, 'publicação.'], [1, 'DECRETO N° 9.674'], [1, 'DE 05 DE MAIO DE 2022'], [1, 'NOMEIA o PRESIDENTE E o VICE-PRESIDENTE DO'], [1, 'CONSELHO DE DEFESA DO PATRIMÔNIO CULTURAL'], [1, 'DE SANTOS - CONDEPASA, E DA OUTRAS PROVI-'], [1, 'ROGÉRIO SANTOS, Prefeito Municipal de San-'], [1, 'tos, no uso das atribuições que lhe são conferidas'], [1, 'por lei, e em conformidade com o disposto no ar-'], [1, 'tigo 4°, "caput" e parágrafo 1°, da Lei n° 753, de 08'], [1, 'DECRETA:'], [1, 'Art. 1° Ficam nomeados Presidente e Vice-Presi-'], [1, 'dente do Conselho de Defesa do Patrimônio Cul-'], [1, 'tural de Santos - CONDEPASA, respectivamente,'], [1, 'o Engenheiro Marcio Borchia Nacif e a Arquiteta'], [1, 'Art. 2° Este decreto entra em vigor na data da'], [1, 'Palácio "José Bonifácio", em 05 de maio de 2022.'], [1, 'ROGÉRIO SANTOS'], [1, 'PREFEITO MUNICIPAL'], [1, 'Registrado no livro competente.'], [1, 'Departamento de Registro de Atos Oficiais do'], [1, 'Gabinete do Prefeito Municipal, em 05 de maio de'], [1, 'RODRIGO SALES'], [1, 'CHEFE DO DEPARTAMENTO'], [1, 'DECRETO N° 9.675'], [1, 'DE 05 DE MAIO DE 2022'], [2, 'DÊNCIAS.'], [2, 'de julho de 1991,'], [2, 'publicação.'], [2, '2022.'], [3, 'Fernanda Rodrigues Alarcon.'], [3, 'Registre-se e publique-se.']]
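
One way to handle the headers and footers could be to drop them before grouping lines into columns. The snippet below is only a rough sketch of that idea, working on the raw Textract `Blocks` list. The function name `strip_header_footer` and the margin values (0.08 and 0.92) are assumptions made up for illustration, not part of trp, and the sample blocks are invented; the margins would need tuning per document layout.

```python
def strip_header_footer(blocks, top_margin=0.08, bottom_margin=0.92):
    """Keep only LINE blocks whose vertical centre lies between the margins.

    Margins are fractions of the page height (Textract bounding boxes are
    normalised to 0..1). The default values are guesses, not recommendations.
    """
    kept = []
    for block in blocks:
        if block.get("BlockType") != "LINE":
            continue  # skip PAGE and WORD blocks
        box = block["Geometry"]["BoundingBox"]
        centre_y = box["Top"] + box["Height"] / 2
        if top_margin <= centre_y <= bottom_margin:
            kept.append(block)
    return kept


def _line(text, top):
    # Helper that builds a made-up LINE block for the example below.
    return {
        "BlockType": "LINE",
        "Text": text,
        "Geometry": {"BoundingBox": {"Left": 0.1, "Top": top, "Width": 0.3, "Height": 0.02}},
    }


sample_blocks = [
    {"BlockType": "PAGE", "Geometry": {"BoundingBox": {"Left": 0, "Top": 0, "Width": 1, "Height": 1}}},
    _line("page header", 0.02),
    _line("body line", 0.50),
    _line("page footer", 0.96),
]

body_only = strip_header_footer(sample_blocks)
print([block["Text"] for block in body_only])
```

The remaining blocks could then be passed to the column grouping logic, so a full-width header no longer starts a column of its own.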

You could try tuning the tolerance values defined at the top of the function a bit.

https://github.com/aws-samples/amazon-textract-textractor/blob/d7c6488d6a707647171641958dfe8f05d6ffbc62/src/trp.py#L526-L530
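
Since the tolerances are local variables inside the function quoted earlier in this thread, experimenting with them probably means copying the logic. Below is a minimal sketch that follows the same four conditions but takes the tolerances as keyword arguments. The function name `lines_in_reading_order` and the plain-dict input format (`text`, `left`, `top`, `width`, `height`) are assumptions for illustration, not part of trp, and the sample coordinates are invented.

```python
def lines_in_reading_order(lines, starting_point_tolerance=0.01, height_tolerance=3,
                           same_line_top_tolerance=0.001, same_line_spacing_tolerance=5):
    """Group lines into columns and return (column_index, text) pairs.

    `lines` is a list of dicts with keys text, left, top, width, height
    (all coordinates normalised to 0..1, as in Textract bounding boxes).
    """
    columns = []
    ordered = []
    for line in lines:
        left, top, height = line["left"], line["top"], line["height"]
        right = left + line["width"]
        for index, column in enumerate(columns):
            same_start = abs(left - column["left"]) < starting_point_tolerance
            close_below = abs(top - column["top"]) < height_tolerance * height
            same_row = abs(top - column["top"]) < same_line_top_tolerance
            adjacent = abs(left - column["right"]) < same_line_spacing_tolerance * starting_point_tolerance
            if (same_start and close_below) or (same_row and adjacent):
                ordered.append((index, line["text"]))
                # Track the most recent line added to this column.
                column["top"] = top
                column["right"] = right
                break
        else:
            # No existing column matched, so start a new one.
            columns.append({"left": left, "right": right, "top": top})
            ordered.append((len(columns) - 1, line["text"]))
    # list.sort is stable, so lines keep their original order within a column.
    ordered.sort(key=lambda pair: pair[0])
    return ordered


# Made-up two-column page, listed row by row as Textract might return it.
sample_lines = [
    {"text": "A1", "left": 0.10, "top": 0.10, "width": 0.35, "height": 0.02},
    {"text": "B1", "left": 0.55, "top": 0.10, "width": 0.35, "height": 0.02},
    {"text": "A2", "left": 0.10, "top": 0.13, "width": 0.35, "height": 0.02},
    {"text": "B2", "left": 0.55, "top": 0.13, "width": 0.35, "height": 0.02},
]

default_result = lines_in_reading_order(sample_lines)
tight_result = lines_in_reading_order(sample_lines, height_tolerance=1)
print(default_result)
print(tight_result)
```

With the defaults, the sample groups into two columns. With `height_tolerance=1`, each line ends up in a column of its own, which looks similar to the stray columns 2 and 3 in the output above. That suggests a larger `height_tolerance` may help when a column contains blank gaps between paragraphs, though the right value depends on the document.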