aws-samples / amazon-textract-textractor

Analyze documents with Amazon Textract and generate output in multiple formats.
Apache License 2.0

The key property of the KeyValue class does not return Line instance #317

Open oonisim opened 7 months ago

oonisim commented 7 months ago

The documentation for the key property of the KeyValue class says that it returns a Line.

Return type. Line

However, it actually returns a plain Python List[Word], so the text property that the Line class is supposed to have is not available.

Environment

import textractor
textractor.__version__
-----
'1.7.4'

from platform import python_version
print(python_version())
-----
3.10.10

Reproduction

import json
import os
import pathlib
from PIL import Image
from textractor import Textractor
from textractor.data.constants import TextractFeatures
from textractor.entities.line import Line
from textractor.entities.value import Value
from textractor.entities.word import Word
from textractor.entities.key_value import KeyValue
from textractor.visualizers.entitylist import EntityList

import textractor
textractor.__version__

from platform import python_version
print(python_version())

DATA_DIR=pathlib.Path.home().joinpath("home/repository/data/ml/medical_report/pdf")
FILEPATH=DATA_DIR.joinpath("MedicalExaminerReportExample_13.pdf")

extractor = Textractor(profile_name="eml-ap-southeast-2")
document = extractor.analyze_document(
    file_source=str(FILEPATH),
    features=[
        TextractFeatures.LAYOUT, 
        TextractFeatures.FORMS, 
        TextractFeatures.TABLES
    ],
    save_image=True,  # Required to use the images property and visualize() on the document instance.
)

document.key_values
-----
[Number : 24043,
 Year: : 2012,
 Decedent: : Martin, Trayvon,
 ECC contacted FI Malphurs of an apparent death in Sanford in the courtyard behind Retreal View Circle. Person of contact (POC) was SPD Inv. Serino. POC advised of an unknown B/M who had been shot by a resident of the complex POC stated the following: : ,
 On : 02/28/2012,
 DOB: : 02/05/1995.,
 Page : 2 of 2]

forms: EntityList[KeyValue] = document.key_values
form: KeyValue = None

for form in forms:
    key: Line = form.key
    value: Value = form.value

    print(f"type of value is {type(value)}.")
    print(f"type of key is {type(key)}.")
    print(f"type of value is {type(value)}.")

    key_text: str = key.text
    value_text = ' '.join([
         word.text for word in value.words   
    ])

    print(
        f"key:{key_text}, value:{value_text}, page:{form.page}, page_id:{form.page_id}"
    )
-----
type of value is <class 'textractor.entities.value.Value'>.
type of key is <class 'list'>.
type of value is <class 'textractor.entities.value.Value'>.
---------------------------------------------------------------------------
AttributeError                            Traceback (most recent call last)
Cell In[120], line 12
      9 print(f"type of key is {type(key)}.")
     10 print(f"type of value is {type(value)}.")
---> 12 key_text: str = key.text
     13 value_text = ' '.join([
     14      word.txt for word in value   
     15 ])
     17 print(
     18     f"key:{key_text}, value:{value_text}, page:{kv.page}, page_id:{kv.page_id}"
     19 )

AttributeError: 'list' object has no attribute 'text'


Belval commented 7 months ago

In this case the docstring is incorrect: Key is a list of Words, as per the Textract API documentation: https://docs.aws.amazon.com/textract/latest/dg/how-it-works-kvp.html

I can turn .words into an EntityList object so that you should be able to call get_text() on it. Let me create a PR to update the docstring.
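
Until such a change is released, one possible workaround on 1.7.4 is to join the text of the words yourself, since the key is a plain list of Word objects. The following is only a minimal sketch: StubWord is a hypothetical stand-in so the snippet runs without AWS credentials, and join_word_text is an illustrative helper name, not part of textractor. It assumes only that each word exposes a text attribute, as the reproduction code above does for value.words.

```python
from dataclasses import dataclass
from typing import Any, Iterable


@dataclass
class StubWord:
    """Hypothetical stand-in for textractor's Word; only .text is assumed."""
    text: str


def join_word_text(words: Iterable[Any]) -> str:
    """Join the .text of each word with single spaces (illustrative helper)."""
    return " ".join(word.text for word in words)


# With a real document this would be: join_word_text(form.key)
key_words = [StubWord("Page"), StubWord("2"), StubWord("of"), StubWord("2")]
print(join_word_text(key_words))
```

In the reproduction loop, this would replace the failing `key.text` access with `join_word_text(form.key)`.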

Belval commented 7 months ago

It still won't return a Line, but #320 should address your use case.

for kv in document.key_values:
    key = kv.key
    value = kv.value
    key_text = key.text
    value_text = value.text
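
If the same script has to run both on 1.7.4 (where the key is a plain list) and on a version that includes #320 (where `key.text` works), a duck-typed helper is one option. This is a sketch under assumptions: entity_text, StubWord, and StubKey are hypothetical names used for illustration, and the exact type returned by the key property after #320 is not stated in this thread, so the helper only checks for a string-valued text attribute.

```python
from dataclasses import dataclass
from typing import Any, List


@dataclass
class StubWord:
    """Hypothetical stand-in for textractor's Word; only .text is assumed."""
    text: str


@dataclass
class StubKey:
    """Hypothetical stand-in for a key object that exposes .text."""
    words: List[StubWord]

    @property
    def text(self) -> str:
        return " ".join(word.text for word in self.words)


def entity_text(entity: Any) -> str:
    """Return entity.text if it is a string, else join the words' text."""
    text = getattr(entity, "text", None)
    if isinstance(text, str):
        return text
    # Fall back to the 1.7.4 shape: an iterable of word-like objects.
    return " ".join(word.text for word in entity)


print(entity_text([StubWord("Year:")]))         # list-shaped key
print(entity_text(StubKey([StubWord("Page")])))  # object with .text
```

The fallback branch is only reached when no string-valued `text` attribute is present, so the helper never masks a real `text` value.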

oonisim commented 7 months ago

Thank you for the update.