ValidationError in TDocumentSchema().load for json from start_document_analysis LAYOUT

Risho92 commented 11 months ago

I used the command below to extract text from a PDF using textractor:

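import boto3

client = boto3.client('textract')

# Bucket, Name, S3Bucket, S3Prefix and KMSKeyId are defined elsewhere.
response = client.start_document_analysis(
    DocumentLocation={
        'S3Object': {
            'Bucket': Bucket,
            'Name': Name
        }
    },
    FeatureTypes=['LAYOUT', 'FORMS'],
    OutputConfig={
        'S3Bucket': S3Bucket,
        'S3Prefix': S3Prefix
    },
    KMSKeyId=KMSKeyId
)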

I took the output file and added a ".json" extension (an optional step). Then I tried to run the CSV data-extraction example from the page below.

https://github.com/aws-samples/amazon-textract-textractor/tree/master/prettyprinter

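import csv
import io
import json

from textractprettyprinter import get_layout_csv_from_trp2
from trp.trp2 import TDocument, TDocumentSchema

with open(<some_test_file>) as input_fp:
    # Deserialize the raw Textract JSON into a TDocument.
    trp2_doc: TDocument = TDocumentSchema().load(json.load(input_fp))
    layout_csv = get_layout_csv_from_trp2(trp2_doc)
    csv_output = io.StringIO()
    csv_writer = csv.writer(csv_output, delimiter=",", quotechar='"', quoting=csv.QUOTE_MINIMAL)
    for page in layout_csv:
        csv_writer.writerows(page)
    # getvalue() returns the CSV text; printing the StringIO object only shows its repr.
    print(csv_output.getvalue())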

json.load(input_fp) works fine, but TDocumentSchema().load(json.load(input_fp)) raises a "ValidationError":

Cell In[13], line 4
    1 with open("1.json") as input_fp:
    2   TDocumentSchema().load(json.load(input_fp))

File ~\.conda\envs\python310\lib\site-packages\marshmallow\schema.py:722, in Schema.load(self, data, many, partial, unknown)
    691 def load(
    692     self,
    693     data: (
    (...)
    700 unknown: str | None = None,
    701 ):
    702         """Deserialize a data structure to an object defined by this schema's fields.
    703
    704         :param data: The data to deserialize.
    (...)
    720             if invalid data are passed.
    721         """
    722     return self._do_load(
    723         data, many=many, partial=partial, unknown=unknown, postprocess=True
    724     )

File ~\.conda\envs\python310\lib\site-packages\marshmallow\schema.py:909, in Schema._do_load(self, data, many, partial, unknown, postprocess)
    907     exc = ValidationError(errors, data=data, valid_data=result)
    908     self.handle_error(exc, data, many=many, partial=partial)
    909     raise exc
    911 return result

ValidationError: {'Blocks': {0: {'Confidence': ['Field may not be null.'], 'Text': ['Field may not be null.'], 'ColumnIndex': ['Field may not be null.'].........
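
Because the error dictionary above is truncated, it can help to tally which fields are reported across all blocks. The following is only a minimal sketch: the helper name count_failing_fields and the sample dictionary are hypothetical, and the shape of the input is assumed to match the message shown above.

```python
from collections import Counter


def count_failing_fields(messages: dict) -> Counter:
    """Count how often each field name appears in the per-block errors."""
    counts: Counter = Counter()
    for block_errors in messages.get("Blocks", {}).values():
        # Each entry is expected to map a field name to a list of messages.
        if isinstance(block_errors, dict):
            counts.update(block_errors.keys())
    return counts


# Hypothetical excerpt shaped like the error message above.
messages = {
    "Blocks": {
        0: {
            "Confidence": ["Field may not be null."],
            "Text": ["Field may not be null."],
        },
        1: {
            "ColumnIndex": ["Field may not be null."],
            "Text": ["Field may not be null."],
        },
    }
}

field_counts = count_failing_fields(messages)
print(field_counts.most_common())
```

With marshmallow, the same dictionary should be available as the messages attribute of the caught ValidationError, so the helper could be called from an except block around TDocumentSchema().load(...).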

I tried with both a multi-page PDF and a single-page PDF, but I always get this error.
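
marshmallow reports "Field may not be null." when a key is present with a null value and the field does not allow None, so one possible workaround is to drop null-valued keys before loading. The sketch below assumes that the output file contains explicit nulls, which has not been confirmed for this file; the helper name strip_nulls and the sample data are hypothetical.

```python
import json
from typing import Any


def strip_nulls(value: Any) -> Any:
    """Return a copy of value with None-valued dict keys removed, recursively."""
    if isinstance(value, dict):
        return {k: strip_nulls(v) for k, v in value.items() if v is not None}
    if isinstance(value, list):
        # List items are kept; only dict keys holding None are dropped.
        return [strip_nulls(item) for item in value]
    return value


# Hypothetical miniature of a response that contains explicit nulls.
sample = json.loads(
    '{"Blocks": [{"BlockType": "PAGE", "Id": "a", "Text": null, '
    '"Confidence": null, "Relationships": [{"Type": "CHILD", "Ids": ["b"]}]}]}'
)
cleaned = strip_nulls(sample)
print(cleaned["Blocks"][0])
```

The cleaned dictionary could then be passed to TDocumentSchema().load(...) in place of the raw json.load(...) result. Whether that resolves the error for this particular file is untested.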

I also uploaded a document through the console (Amazon Textract --> Bulk Document Uploader --> Upload Documents --> AnalyzeDocument - Layout), took the output 'analyzeDocResponse.json' and followed the same procedure. TDocumentSchema().load(json.load(input_fp)) worked fine on that file, but get_layout_csv_from_trp2(trp2_doc) failed with the error "AttributeError: 'NoneType' object has no attribute 'ids'". I am not looking for a fix for this second error, as I will not be using the console output in production; I only tested it.

Given below are the environment details

Operating System: Windows 11 Pro
Python Version: 3.10.12

amazon-textract-caller==0.2.1
amazon-textract-pipeline-pagedimensions==0.0.9
amazon-textract-prettyprinter==0.1.8
amazon-textract-textractor==1.4.5
amazon-textract-response-parser==1.0.2
marshmallow==3.20.1
textract-trp==0.1.3

Any help to get this error resolved is highly appreciated.

Belval commented 11 months ago

If the asset is not confidential, please attach the .json file to the issue; it helps a lot when debugging. If you do not feel comfortable sharing the JSON on GitHub, you can also send it directly to belvae[at]amazon.com and I'll take a look.

Thanks

Risho92 commented 11 months ago

Actually, the data is confidential, so unfortunately I will not be able to share it. The PDF had tables, hyperlinks, links and lists.

aws-samples / amazon-textract-response-parser

ValidationError in TDocumentSchema().load for json from start_document_analysis LAYOUT #169