aws-samples / amazon-textract-response-parser

Parse JSON response of Amazon Textract
Apache License 2.0

ValidationError in TDocumentSchema().load for json from start_document_analysis LAYOUT #169

Open Risho92 opened 8 months ago

Risho92 commented 8 months ago

I used the command below to extract text from a PDF using textractor:

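response = client.start_document_analysis(
    DocumentLocation={
        'S3Object': {
            'Bucket': Bucket,
            'Name': Name
        }
    },
    FeatureTypes=['LAYOUT'],  # required by the API; LAYOUT per the issue title
    OutputConfig={
        'S3Bucket': S3Bucket,
        'S3Prefix': S3Prefix
    })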
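
For comparison, the result of the job started above can also be read back through the API rather than from the S3 output file. The following is a minimal sketch, assuming a boto3 Textract client and the `JobId` returned by `start_document_analysis`. The helper name `collect_analysis_blocks` is hypothetical, and waiting for the job to finish is left out.

```python
def collect_analysis_blocks(client, job_id):
    """Gather all Blocks of a finished analysis job, following NextToken pagination.

    `client` is assumed to be a boto3 Textract client (or anything exposing
    a compatible get_document_analysis method).
    """
    blocks = []
    next_token = None
    while True:
        kwargs = {"JobId": job_id}
        if next_token:
            kwargs["NextToken"] = next_token
        page = client.get_document_analysis(**kwargs)
        # This sketch does not poll: it stops if the job has not succeeded.
        if page.get("JobStatus") != "SUCCEEDED":
            raise RuntimeError(f"job not finished: {page.get('JobStatus')}")
        blocks.extend(page.get("Blocks", []))
        next_token = page.get("NextToken")
        if not next_token:
            return blocks
```

Whether a response assembled this way loads cleanly with TDocumentSchema has not been verified here; it is only a way to compare the API output with the file written to S3.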

I took the output file and added a ".json" extension (an optional step). Then I tried to run the example that extracts the layout data in CSV format from the page below.

import csv
import io
import json

from trp.trp2 import TDocument, TDocumentSchema
from textractprettyprinter import get_layout_csv_from_trp2

with open(<some_test_file>) as input_fp:
    trp2_doc: TDocument = TDocumentSchema().load(json.load(input_fp))
    layout_csv = get_layout_csv_from_trp2(trp2_doc)
    csv_output = io.StringIO()
    csv_writer = csv.writer(csv_output, delimiter=",", quotechar='"', quoting=csv.QUOTE_MINIMAL)
    for page in layout_csv:
        ...  # loop body truncated in the post

json.load(input_fp) works fine, but TDocumentSchema().load(json.load(input_fp)) raises a ValidationError:
Cell In[13], line 4
    1 with open("1.json") as input_fp:
    2   TDocumentSchema().load(json.load(input_fp))

File ~\.conda\envs\python310\lib\site-packages\marshmallow\, in Schema.load(self, data, many, partial, unknown)
    691 def load(
    692     self,
    693     data: (
    700 unknown: str | None = None,
    701 ):
    702         """Deserialize a data structure to an object defined by this schema's fields.
    704         :param data: The data to deserialize.
    720             if invalid data are passed.
    721         """
    722     return self._do_load(
    723         data, many=many, partial=partial, unknown=unknown, postprocess=True
    724     )

File ~\.conda\envs\python310\lib\site-packages\marshmallow\, in Schema._do_load(self, data, many, partial, unknown, postprocess)
    907     exc = ValidationError(errors, data=data, valid_data=result)
    908     self.handle_error(exc, data, many=many, partial=partial)
    909     raise exc
    911 return result

ValidationError: {'Blocks': {0: {'Confidence': ['Field may not be null.'], 'Text': ['Field may not be null.'], 'ColumnIndex': ['Field may not be null.'], ...
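
Because the full error dictionary is long, it can help to condense it into a count per field. The sketch below is plain Python and assumes the nested shape shown above (field names as string keys, block indices as integer keys, lists of messages at the leaves). The function name `summarize_validation_errors` is hypothetical.

```python
from collections import Counter


def summarize_validation_errors(messages):
    """Count (field, message) pairs in a nested marshmallow-style error dict."""
    counts = Counter()

    def walk(node, field=None):
        if isinstance(node, dict):
            for key, value in node.items():
                # Integer keys are list indices (e.g. the block number),
                # so the enclosing field name is kept for them.
                walk(value, field if isinstance(key, int) else key)
        elif isinstance(node, list):
            for message in node:
                counts[(field, str(message))] += 1

    walk(messages)
    return counts
```

With marshmallow, the dictionary is available as `exc.messages` on the caught `ValidationError`, so the summary could be produced in an `except ValidationError as exc:` block without exposing any document content.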

I tried with both a multi-page PDF and a single-page PDF, but I always get this error.
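
The "Field may not be null." messages suggest that the output file contains keys with explicit `null` values. One possible workaround, sketched here under the unverified assumption that the schema accepts those keys being absent, is to drop them before loading. The helper name `strip_nulls` is hypothetical.

```python
def strip_nulls(node):
    """Return a copy of a decoded JSON structure without dict keys whose value is None.

    None items inside lists are kept, since removing them would shift positions.
    """
    if isinstance(node, dict):
        return {key: strip_nulls(value) for key, value in node.items() if value is not None}
    if isinstance(node, list):
        return [strip_nulls(item) for item in node]
    return node
```

It would be applied as `TDocumentSchema().load(strip_nulls(json.load(input_fp)))`. This has not been tested against a real Textract output file, so treat it as a starting point for debugging rather than a confirmed fix.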

I also uploaded a document through the console (Amazon Textract --> Bulk Document Uploader --> Upload Documents --> AnalyzeDocument - Layout), took the output file 'analyzeDocResponse.json', and followed the same procedure. On that file, TDocumentSchema().load(json.load(input_fp)) worked fine, but get_layout_csv_from_trp2(trp2_doc) failed with "AttributeError: 'NoneType' object has no attribute 'ids'". I am not looking for a fix for this second error, as I will not be using this path in production; I only tested it.

The environment details are given below.

Operating System: Windows 11 Pro
Python Version: 3.10.12

amazon-textract-caller==0.2.1
amazon-textract-pipeline-pagedimensions==0.0.9
amazon-textract-prettyprinter==0.1.8
amazon-textract-textractor==1.4.5
amazon-textract-response-parser==1.0.2
marshmallow==3.20.1
textract-trp==0.1.3
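
A version list like this one can be collected programmatically with the standard library, which avoids copy errors when reporting an environment. A small sketch (the function name `installed_versions` is hypothetical):

```python
from importlib import metadata


def installed_versions(packages):
    """Map each distribution name to its installed version, or None if not installed."""
    result = {}
    for name in packages:
        try:
            result[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            result[name] = None
    return result
```

For example, `installed_versions(["marshmallow", "amazon-textract-response-parser"])` returns a dictionary that can be pasted into a bug report.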

Any help in resolving this error would be highly appreciated.

Belval commented 8 months ago

If the asset is not confidential, please attach the .json file to the issue; it helps a lot when debugging. If you do not feel comfortable sharing the JSON on GitHub, you can also send it directly to belvae[at] and I'll take a look.


Risho92 commented 8 months ago

Unfortunately, the data is confidential, so I will not be able to share it. The PDF had tables, hyperlinks, links, and lists.
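
When the content is confidential, one option is to share a masked copy of the JSON that keeps its structure, including any null values, so the validation error can still be reproduced. The sketch below masks every string stored under a `Text` key; the helper name `redact_text` is hypothetical, and other fields may also carry sensitive information, so the result should be reviewed before it is shared.

```python
def redact_text(node, placeholder="X"):
    """Copy a decoded JSON structure, masking every string stored under a 'Text' key.

    Non-string values (including None) are left unchanged.
    """
    if isinstance(node, dict):
        return {
            key: (
                placeholder * len(value)
                if key == "Text" and isinstance(value, str)
                else redact_text(value, placeholder)
            )
            for key, value in node.items()
        }
    if isinstance(node, list):
        return [redact_text(item, placeholder) for item in node]
    return node
```

The masked structure can then be written back with `json.dump` and attached to the issue.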