aws-samples / amazon-textract-textractor

Analyze documents with Amazon Textract and generate output in multiple formats.
Apache License 2.0

Loading Existing JSON Files from S3 #235

Open ccrosland opened 1 year ago

ccrosland commented 1 year ago

Per this example reference:

There are two ways to parse an existing JSON. The simplest one, reminiscent of PIL.Image.open() is Document.open() which takes either a path or file-like object and parses it automatically. The path can be an S3 path.

However, the following Lambda function:

from textractor.entities.document import Document
def lambda_handler(event, context):
    document = Document.open("s3://cc-lambda-textract/textract_output/988e569c7e520e2376e6dda52c93a1151d9d3d72980928cbb1d338a13972da8f/1")
    print(document)

    return {
        'statusCode': 200
    }

results in the following error:

Response

{
  "errorMessage": "Expecting value: line 1 column 1 (char 0)",
  "errorType": "JSONDecodeError",
  "stackTrace": [
    "  File \"/var/task/lambda_function.py\", line 6, in lambda_handler\n    document = Document.open(\"s3://cc-lambda-textract/textract_output/988e569c7e520e2376e6dda52c93a1151d9d3d72980928cbb1d338a13972da8f/1\")\n",
    "  File \"/opt/python/textractor/entities/document.py\", line 68, in open\n    return response_parser.parse(json.load(download_from_s3(client, fp)))\n",
    "  File \"/var/lang/lib/python3.8/json/__init__.py\", line 293, in load\n    return loads(fp.read(),\n",
    "  File \"/var/lang/lib/python3.8/json/__init__.py\", line 357, in loads\n    return _default_decoder.decode(s)\n",
    "  File \"/var/lang/lib/python3.8/json/decoder.py\", line 337, in decode\n    obj, end = self.raw_decode(s, idx=_w(s, 0).end())\n",
    "  File \"/var/lang/lib/python3.8/json/decoder.py\", line 355, in raw_decode\n    raise JSONDecodeError(\"Expecting value\", s, err.value) from None\n"
  ]
}

Here, "s3://cc-lambda-textract/textract_output/988e569c7e520e2376e6dda52c93a1151d9d3d72980928cbb1d338a13972da8f/1" is the first output file saved by Textract via OutputConfig, and it is valid JSON.
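"Expecting value: line 1 column 1 (char 0)" is the message the json module produces when its input is empty or does not start with a JSON value, so one way to narrow this down is to inspect the raw bytes before they reach any parser. The helper below is only a diagnostic sketch; describe_payload is a hypothetical name, not part of textractor, and it does not claim to identify the cause in this issue.

```python
import json


def describe_payload(raw: bytes) -> str:
    """Classify raw object bytes before handing them to a JSON parser.

    Hypothetical helper for debugging; not part of textractor or boto3.
    """
    if not raw.strip():
        # An empty body gives exactly "Expecting value: line 1 column 1 (char 0)".
        return "empty"
    if raw[:2] == b"\x1f\x8b":
        # gzip magic number: the object would need decompressing first.
        return "gzip"
    try:
        json.loads(raw)
    except ValueError:
        # JSONDecodeError and UnicodeDecodeError are both ValueError subclasses.
        return "not-json"
    return "json"
```

In the Lambda above, the bytes could be obtained with s3_client.get_object(Bucket=..., Key=...)["Body"].read() and passed to this function; a result of "json" would suggest the problem lies in how the stream is read rather than in the file itself.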

Trying another approach (looping over the folder and loading each JSON file):

import boto3
import json
import os
from textractor.entities.document import Document

def lambda_handler(event, context):
    # Define the S3 bucket and folder where the Textract files are located
    bucket_name = 'cc-lambda-textract'
    folder_path = 'textract_output/988e569c7e520e2376e6dda52c93a1151d9d3d72980928cbb1d338a13972da8f/'

    # Create an S3 client
    s3_client = boto3.client('s3')

    # List all the files in the specified folder
    response = s3_client.list_objects_v2(Bucket=bucket_name, Prefix=folder_path)
    files = [item['Key'] for item in response['Contents'] if item['Key'] != folder_path + '.s3_access_check']

    # Collect the JSON content from all files into a list
    merged_json_content = []
    for file_key in files:
        print(file_key)
        s3_object = s3_client.get_object(Bucket=bucket_name, Key=file_key)
        file_content = json.load(s3_object['Body'])
        print(file_content)
        document = Document.open(file_content)
        extracted_text = ' '.join([page.text for page in document.pages])
        #merged_json_content.append(file_content)

    return {
        'statusCode': 200,
        'message': f'Processed {len(files)} files.'
    }

This results in the following error:

Response
{
  "errorMessage": "{'StatusMessage': ['Field may not be null.'], 'NextToken': ['Field may not be null.'], 'Warnings': ['Field may not be null.'], 'Blocks': {0: {'RowSpan': ['Field may not be null.'], 'Confidence': ['Field may not be null.'], 'ColumnSpan': ['Field may not be null.'], 'RowIndex': ['Field may not be null.'], 'SelectionStatus': ['Field may not be null.'], 'Query': ['Field may not be null.'], 'Text': ['Field may not be null.'], 'ColumnIndex': ['Field may not be null.'], 'TextType': ['Field may not be null.'], 'EntityTypes': ['Field may not be null.'], 'PageClassification': ['Unknown field.'], 'Hint': ['Unknown field.']}, 1: {'RowSpan': ['Field may not be null.'], 'ColumnSpan': ['Field may not be null.'], 'RowIndex': ['Field may not be null.'], 'SelectionStatus': ['Field may not be null.'], 'Query': ['Field may not be null.'], 'ColumnIndex': ['Field may not be null.'], 'TextType': ['Field may not be null.'], 'EntityTypes': ['Field may not be null.'], 'PageClassification': ['Unknown field.'], 'Hint': ['Unknown field.']}, 2: {'RowSpan': ['Field may not be null.'], 'ColumnSpan': ['Field may not be null.'], 'RowIndex': ['Field may not be null.'], 'SelectionStatus': ['Field may not be null.'], 'Query': ['Field may not be null.'], 'ColumnIndex': ['Field may not be null.'], 'TextType': ['Field may not be null.'], 'EntityTypes': ['Field may not be null.'], 'PageClassification': ['Unknown field.'], 'Hint': ['Unknown field.']}, 3: {'RowSpan': ['Field may not be null.'], 'ColumnSpan': ['Field may not be null.'], 'RowIndex': ['Field may not be null.'], 'SelectionStatus': ['Field may not be null.'], 'Query': ['Field may not be null.'], 'ColumnIndex': ['Field may not be null.'], 'TextType': ['Field may not be null.'], 'EntityTypes': ['Field may not be null.'], 'PageClassification': ['Unknown field.'], 'Hint': ['Unknown field.']}, 4: {'RowSpan': ['Field may not be null.'], 'ColumnSpan': ['Field may not be null.'], 'RowIndex': ['Field may not be null.'], 
'SelectionStatus': ['Field may not be null.'], 'Query': ['Field may not be null.'], 'ColumnIndex': ['Field may not be null.'], 'TextType': ['Field may not be null.'], 'EntityTypes': ['Field may not be null.'], 'PageClassification': ['Unknown field.'], 'Hint': ['Unknown field.']}, 5: {'RowSpan': ['Field may not be null.'], 'ColumnSpan': ['Field may not be null.'], 'RowIndex': ['Field may not be null.'], 'SelectionStatus': ['Field may not be null.'], 'Query': ['Field may not be null.'], 'ColumnIndex': ['Field may not be null.'], 'TextType': ['Field may not be null.'], 'EntityTypes': ['Field may not be null.'], 'PageClassification': ['Unknown field.'], 'Hint': ['Unknown field.']}, 6: {'RowSpan': ['Field may not be null.'], 'ColumnSpan': ['Field may not be null.'], 'RowIndex': ['Field may not be null.'], 'SelectionStatus': ['Field may not be null.'], 'Query': ['Field may not be null.'], 'ColumnIndex': ['Field may not be null.'], 'TextType': ['Field may not be null.'], 'EntityTypes': ['Field may not be null.'], 'PageClassification': ['Unknown field.'], 'Hint': ['Unknown field.']}, 7: {'RowSpan': ['Field may not be null.'], 'ColumnSpan': ['Field may not be null.'], 'RowIndex': ['Field may not be null.'], 'SelectionStatus': ['Field may not be null.'], 'Query': ['Field may not be null.'], 'ColumnIndex': ['Field may not be null.'], 'TextType': ['Field may not be null.'], 'EntityTypes': ['Field may not be null.'], 'PageClassification': ['Unknown field.'], 'Hint': ['Unknown field.']}, 8: {'RowSpan': ['Field may not be null.'], 'ColumnSpan': ['Field may not be null.'], 'RowIndex': ['Field may not be null.'], 'SelectionStatus': ['Field may not be null.'], 'Query': ['Field may not be null.'], 'ColumnIndex': ['Field may not be null.'], 'TextType': ['Field may not be null.'], 'EntityTypes': ['Field may not be null.'], 'PageClassification': ['Unknown field.'], 'Hint': ['Unknown field.']}, 9: {'RowSpan': ['Field may not be null.'], 'ColumnSpan': ['Field may not be null.'], 
'RowIndex': ['Field may not be null.'], 'SelectionStatus': ['Field may not be null.'], 'Query': ['Field may not be null.'], 'ColumnIndex': ['Field may not be null.'], 'TextType': ['Field may not be null.'], 'EntityTypes': ['Field may not be null.'], 'PageClassification': ['Unknown field.'], 'Hint': ['Unknown field.']}, 10: {'RowSpan': ['Field may not be null.'], 'ColumnSpan': ['Field may not be null.'], 'RowIndex': ['Field may not be null.'], 'SelectionStatus': ['Field may not be null.'], 'Query': ['Field may not be null.'], 'ColumnIndex': ['Field may not be null.'], 'TextType': ['Field may not be null.'], 'EntityTypes': ['Field may not be null.'], 'PageClassification': ['Unknown field.'], 'Hint': ['Unknown field.']}, 11: {'RowSpan': ['Field may not be null.'], 'ColumnSpan': ['Field may not be null.'], 'RowIndex': ['Field may not be null.'], 'SelectionStatus': ['Field may not be null.'], 'Query': ['Field may not be null.'], 'ColumnIndex': ['Field may not be null.'], 'Relationships': ['Field may not be null.'], 'EntityTypes': ['Field may not be null.'], 'PageClassification': ['Unknown field.'], 'Hint': ['Unknown field.']}, 12: {'RowSpan': ['Field may not be null.'], 'ColumnSpan': ['Field may not be null.'], 'RowIndex': ['Field may not be null.'], 'SelectionStatus': ['Field may not be null.'], 'Query': ['Field may not be null.'], 'ColumnIndex': ['Field may not be null.'], 'Relationships': ['Field may not be null.'], 'EntityTypes': ['Field may not be null.'], 'PageClassification': ['Unknown field.'], 'Hint': ['Unknown field.']}, 13: {'RowSpan': ['Field may not be null.'], 'ColumnSpan': ['Field may not be null.'], 'RowIndex': ['Field may not be null.'], 'SelectionStatus': ['Field may not be null.'], 'Query': ['Field may not be null.'], 'ColumnIndex': ['Field may not be null.'], 'Relationships': ['Field may not be null.'], 'EntityTypes': ['Field may not be null.'], 'PageClassification': ['Unknown field.'], 'Hint': ['Unknown field.']}, 14: {'RowSpan': ['Field may not be 
null.'], 'ColumnSpan': ['Field may not be null.'], 'RowIndex': ['Field may not be null.'], 'SelectionStatus': ['Field may not be null.'], 'Query': ['Field may not be null.'], 'ColumnIndex': ['Field may not be null.'], 'Relationships': ['Field may not be null.'], 'EntityTypes': ['Field may not be null.'], 'PageClassification': ['Unknown field.'], 'Hint': ['Unknown field.']}, 15: {'RowSpan': ['Field may not be null.'], 'ColumnSpan': ['Field may not be null.'], 'RowIndex': ['Field may not be null.'], 'SelectionStatus': ['Field may not be null.'], 'Query': ['Field may not be null.'], 'ColumnIndex': ['Field may not be null.'], 'Relationships': ['Field may not be null.'], 'EntityTypes': ['Field may not be null.'], 'PageClassification': ['Unknown field.'], 'Hint': ['Unknown field.']}, 16: {'RowSpan': ['Field may not be null.'], 'ColumnSpan': ['Field may not be null.'], 'RowIndex': ['Field may not be null.'], 'SelectionStatus': ['Field may not be null.'], 'Query': ['Field may not be null.'], 'ColumnIndex': ['Field may not be null.'], 'Relationships': ['Field may not be null.'], 'EntityTypes': ['Field may not be null.'], 'PageClassification': ['Unknown field.'], 'Hint': ['Unknown field.']}, 17: {'RowSpan': ['Field may not be null.'], 'ColumnSpan': ['Field may not be null.'], 'RowIndex': ['Field may not be null.'], 'SelectionStatus': ['Field may not be null.'], 'Query': ['Field may not be null.'], 'ColumnIndex': ['Field may not be null.'], 'Relationships': ['Field may not be null.'], 'EntityTypes': ['Field may not be null.'], 'PageClassification': ['Unknown field.'], 'Hint': ['Unknown field.']}, 18: {'RowSpan': ['Field may not be null.'], 'ColumnSpan': ['Field may not be null.'], 'RowIndex': ['Field may not be null.'], 'SelectionStatus': ['Field may not be null.'], 'Query': ['Field may not be null.'], 'ColumnIndex': ['Field may not be null.'], 'Relationships': ['Field may not be null.'], 'EntityTypes': ['Field may not be null.'], 'PageClassification': ['Unknown field.'], 
'Hint': ['Unknown field.']}, 19: {'RowSpan': ['Field may not be null.'], 'ColumnSpan': ['Field may not be null.'], 'RowIndex': ['Field may not be null.'], 'SelectionStatus': ['Field may not be null.'], 'Query': ['Field may not be null.'], 'ColumnIndex': ['Field may not be null.'], 'Relationships': ['Field may not be null.'], 'EntityTypes': ['Field may not be null.'], 'PageClassification': ['Unknown field.'], 'Hint': ['Unknown field.']}, 20: {'RowSpan': ['Field may not be null.'], 'ColumnSpan': ['Field may not be null.'], 'RowIndex': ['Field may not be null.'], 'SelectionStatus': ['Field may not be null.'], 'Query': ['Field may not be null.'], 'ColumnIndex': ['Field may not be null.'], 'Relationships': ['Field may not be null.'], 'EntityTypes': ['Field may not be null.'], 'PageClassification': ['Unknown field.'], 'Hint': ['Unknown field.']}, 21: {'RowSpan': ['Field may not be null.'], 'ColumnSpan': ['Field may not be null.'], 'RowIndex': ['Field may not be null.'], 'SelectionStatus': ['Field may not be null.'], 'Query': ['Field may not be null.'], 'ColumnIndex': ['Field may not be null.'], 'Relationships': ['Field may not be null.'], 'EntityTypes': ['Field may not be null.'], 'PageClassification': ['Unknown field.'], 'Hint': ['Unknown field.']}, 22: {'RowSpan': ['Field may not be null.'], 'ColumnSpan': ['Field may not be null.'], 'RowIndex': ['Field may not be null.'], 'SelectionStatus': ['Field may not be null.'], 'Query': ['Field may not be null.'], 'ColumnIndex': ['Field may not be null.'], 'Relationships': ['Field may not be null.'], 'EntityTypes': ['Field may not be null.'], 'PageClassification': ['Unknown field.'], 'Hint': ['Unknown field.']}, 23: {'RowSpan': ['Field may not be null.'], 'ColumnSpan': ['Field may not be null.'], 'RowIndex': ['Field may not be null.'], 'SelectionStatus': ['Field may not be null.'], 'Query': ['Field may not be null.'], 'ColumnIndex': ['Field may not be null.'], 'Relationships': ['Field may not be null.'], 'EntityTypes': ['Field 
may not be null.'], 'PageClassification': ['Unknown field.'], 'Hint': ['Unknown field.']}, 24: {'RowSpan': ['Field may not be null.'], 'ColumnSpan': ['Field may not be null.'], 'RowIndex': ['Field may not be null.'], 'SelectionStatus': ['Field may not be null.'], 'Query': ['Field may not be null.'], 'ColumnIndex': ['Field may not be null.'], 'Relationships': ['Field may not be null.'], 'EntityTypes': ['Field may not be null.'], 'PageClassification': ['Unknown field.'], 'Hint': ['Unknown field.']}, 25: {'RowSpan': ['Field may not be null.'], 'ColumnSpan': ['Field may not be null.'], 'RowIndex': ['Field may not be null.'], 'SelectionStatus': ['Field may not be null.'], 'Query': ['Field may not be null.'], 'ColumnIndex': ['Field may not be null.'], 'Relationships': ['Field may not be null.'], 'EntityTypes': ['Field may not be null.'], 'PageClassification': ['Unknown field.'], 'Hint': ['Unknown field.']}, 26: {'RowSpan': ['Field may not be null.'], 'ColumnSpan': ['Field may not be null.'], 'RowIndex': ['Field may not be null.'], 'SelectionStatus': ['Field may not be null.'], 'Query': ['Field may not be null.'], 'ColumnIndex': ['Field may not be null.'], 'Relationships': ['Field may not be null.'], 'EntityTypes': ['Field may not be null.'], 'PageClassification': ['Unknown field.'], 'Hint': ['Unknown field.']}, 27: {'RowSpan': ['Field may not be null.'], 'ColumnSpan': ['Field may not be null.'], 'RowIndex': ['Field may not be null.'], 'SelectionStatus': ['Field may not be null.'], 'Query': ['Field may not be null.'], 'ColumnIndex': ['Field may not be null.'], 'Relationships': ['Field may not be null.'], 'EntityTypes': ['Field may not be null.'], 'PageClassification': ['Unknown field.'], 'Hint': ['Unknown field.']}, 28: {'RowSpan': ['Field may not be null.'], 'ColumnSpan': ['Field may not be null.'], 'RowIndex': ['Field may not be null.'], 'SelectionStatus': ['Field may not be null.'], 'Query': ['Field may not be null.'], 'ColumnIndex': ['Field may not be null.'], 
'Relationships': ['Field may not be null.'], 'EntityTypes': ['Field may not be null.'], 'PageClassification': ['Unknown field.'], 'Hint': ['Unknown field.']}, 29: {'RowSpan': ['Field may not be null.'], 'ColumnSpan': ['Field may not be null.'], 'RowIndex': ['Field may not be null.'], 'SelectionStatus': ['Field may not be null.'], 'Query': ['Field may not be null.'], 'ColumnIndex': ['Field may not be null.'], 'Relationships': ['Field may not be null.'], 'EntityTypes': ['Field may not be null.'], 'PageClassification': ['Unknown field.'], 'Hint': ['Unknown field.']}, 30: {'RowSpan': ['Field may not be null.'], 'ColumnSpan': ['Field may not be null.'], 'RowIndex': ['Field may not be null.'], 'SelectionStatus': ['Field may not be null.'], 'Query': ['Field may not be null.'], 'ColumnIndex': ['Field may not be null.'], 'Relationships': ['Field may not be null.'], 'EntityTypes': ['Field may not be null.'], 'PageClassification': ['Unknown field.'], 'Hint': ['Unknown field.']}, 31: {'RowSpan': ['Field may not be null.'], 'ColumnSpan': ['Field may not be null.'], 'RowIndex': ['Field may not be null.'], 'SelectionStatus': ['Field may not be null.'], 'Query': ['Field may not be null.'], 'ColumnIndex': ['Field may not be null.'], 'Relationships': ['Field may not be null.'], 'EntityTypes': ['Field may not be null.'], 'PageClassification': ['Unknown field.'], 'Hint': ['Unknown field.']}, 32: {'RowSpan': ['Field may not be null.'], 'ColumnSpan': ['Field may not be null.'], 'RowIndex': ['Field may not be null.'], 'SelectionStatus': ['Field may not be null.'], 'Query': ['Field may not be null.'], 'ColumnIndex': ['Field may not be null.'], 'Relationships': ['Field may not be null.'], 'EntityTypes': ['Field may not be null.'], 'PageClassification': ['Unknown field.'], 'Hint': ['Unknown field.']}, 33: {'RowSpan': ['Field may not be null.'], 'ColumnSpan': ['Field may not be null.'], 'RowIndex': ['Field may not be null.'], 'SelectionStatus': ['Field may not be null.'], 'Query': ['Field 
may not be null.'], 'ColumnIndex': ['Field may not be null.'], 'Relationships': ['Field may not be null.'], 'EntityTypes': ['Field may not be null.'], 'PageClassification': ['Unknown field.'], 'Hint': ['Unknown field.']}, 34: {'RowSpan': ['Field may not be null.'], 'ColumnSpan': ['Field may not be null.'], 'RowIndex': ['Field may not be null.'], 'SelectionStatus': ['Field may not be null.'], 'Query': ['Field may not be null.'], 'Text': ['Field may not be null.'], 'ColumnIndex': ['Field may not be null.'], 'TextType': ['Field may not be null.'], 'PageClassification': ['Unknown field.'], 'Hint': ['Unknown field.']}, 35: {'SelectionStatus': ['Field may not be null.'], 'Query': ['Field may not be null.'], 'Text': ['Field may not be null.'], 'TextType': ['Field may not be null.'], 'PageClassification': ['Unknown field.'], 'Hint': ['Unknown field.']}, 36: {'SelectionStatus': ['Field may not be null.'], 'Query': ['Field may not be null.'], 'Text': ['Field may not be null.'], 'TextType': ['Field may not be null.'], 'PageClassification': ['Unknown field.'], 'Hint': ['Unknown field.']}, 37: {'SelectionStatus': ['Field may not be null.'], 'Query': ['Field may not be null.'], 'Text': ['Field may not be null.'], 'TextType': ['Field may not be null.'], 'Relationships': ['Field may not be null.'], 'EntityTypes': ['Field may not be null.'], 'PageClassification': ['Unknown field.'], 'Hint': ['Unknown field.']}, 38: {'SelectionStatus': ['Field may not be null.'], 'Query': ['Field may not be null.'], 'Text': ['Field may not be null.'], 'TextType': ['Field may not be null.'], 'Relationships': ['Field may not be null.'], 'EntityTypes': ['Field may not be null.'], 'PageClassification': ['Unknown field.'], 'Hint': ['Unknown field.']}, 39: {'SelectionStatus': ['Field may not be null.'], 'Query': ['Field may not be null.'], 'Text': ['Field may not be null.'], 'TextType': ['Field may not be null.'], 'EntityTypes': ['Field may not be null.'], 'PageClassification': ['Unknown field.'], 'Hint': 
['Unknown field.']}, 40: {'SelectionStatus': ['Field may not be null.'], 'Query': ['Field may not be null.'], 'Text': ['Field may not be null.'], 'TextType': ['Field may not be null.'], 'EntityTypes': ['Field may not be null.'], 'PageClassification': ['Unknown field.'], 'Hint': ['Unknown field.']}, 41: {'SelectionStatus': ['Field may not be null.'], 'Query': ['Field may not be null.'], 'Text': ['Field may not be null.'], 'TextType': ['Field may not be null.'], 'Relationships': ['Field may not be null.'], 'EntityTypes': ['Field may not be null.'], 'PageClassification': ['Unknown field.'], 'Hint': ['Unknown field.']}, 42: {'SelectionStatus': ['Field may not be null.'], 'Query': ['Field may not be null.'], 'Text': ['Field may not be null.'], 'TextType': ['Field may not be null.'], 'Relationships': ['Field may not be null.'], 'EntityTypes': ['Field may not be null.'], 'PageClassification': ['Unknown field.'], 'Hint': ['Unknown field.']}, 43: {'RowSpan': ['Field may not be null.'], 'ColumnSpan': ['Field may not be null.'], 'RowIndex': ['Field may not be null.'], 'SelectionStatus': ['Field may not be null.'], 'Query': ['Field may not be null.'], 'Text': ['Field may not be null.'], 'ColumnIndex': ['Field may not be null.'], 'TextType': ['Field may not be null.'], 'PageClassification': ['Unknown field.'], 'Hint': ['Unknown field.']}, 44: {'RowSpan': ['Field may not be null.'], 'ColumnSpan': ['Field may not be null.'], 'RowIndex': ['Field may not be null.'], 'SelectionStatus': ['Field may not be null.'], 'Query': ['Field may not be null.'], 'Text': ['Field may not be null.'], 'ColumnIndex': ['Field may not be null.'], 'TextType': ['Field may not be null.'], 'PageClassification': ['Unknown field.'], 'Hint': ['Unknown field.']}, 45: {'RowSpan': ['Field may not be null.'], 'ColumnSpan': ['Field may not be null.'], 'RowIndex': ['Field may not be null.'], 'SelectionStatus': ['Field may not be null.'], 'Query': ['Field may not be null.'], 'Text': ['Field may not be null.'], 
'ColumnIndex': ['Field may not be null.'], 'TextType': ['Field may not be null.'], 'PageClassification': ['Unknown field.'], 'Hint': ['Unknown field.']}, 46: {'RowSpan': ['Field may not be null.'], 'ColumnSpan': ['Field may not be null.'], 'RowIndex': ['Field may not be null.'], 'SelectionStatus': ['Field may not be null.'], 'Query': ['Field may not be null.'], 'Text': ['Field may not be null.'], 'ColumnIndex': ['Field may not be null.'], 'TextType': ['Field may not be null.'], 'PageClassification': ['Unknown field.'], 'Hint': ['Unknown field.']}, 47: {'RowSpan': ['Field may not be null.'], 'ColumnSpan': ['Field may not be null.'], 'RowIndex': ['Field may not be null.'], 'SelectionStatus': ['Field may not be null.'], 'Query': ['Field may not be null.'], 'Text': ['Field may not be null.'], 'ColumnIndex': ['Field may not be null.'], 'TextType': ['Field may not be null.'], 'PageClassification': ['Unknown field.'], 'Hint': ['Unknown field.']}, 48: {'RowSpan': ['Field may not be null.'], 'ColumnSpan': ['Field may not be null.'], 'RowIndex': ['Field may not be null.'], 'SelectionStatus': ['Field may not be null.'], 'Query': ['Field may not be null.'], 'Text': ['Field may not be null.'], 'ColumnIndex': ['Field may not be null.'], 'TextType': ['Field may not be null.'], 'PageClassification': ['Unknown field.'], 'Hint': ['Unknown field.']}, 49: {'RowSpan': ['Field may not be null.'], 'ColumnSpan': ['Field may not be null.'], 'RowIndex': ['Field may not be null.'], 'SelectionStatus': ['Field may not be null.'], 'Query': ['Field may not be null.'], 'Text': ['Field may not be null.'], 'ColumnIndex': ['Field may not be null.'], 'TextType': ['Field may not be null.'], 'PageClassification': ['Unknown field.'], 'Hint': ['Unknown field.']}, 50: {'RowSpan': ['Field may not be null.'], 'ColumnSpan': ['Field may not be null.'], 'RowIndex': ['Field may not be null.'], 'SelectionStatus': ['Field may not be null.'], 'Query': ['Field may not be null.'], 'Text': ['Field may not be null.'], 
'ColumnIndex': ['Field may not be null.'], 'TextType': ['Field may not be null.'], 'PageClassification': ['Unknown field.'], 'Hint': ['Unknown field.']}}}",
  "errorType": "ValidationError",
  "stackTrace": [
    "  File \"/var/task/lambda_function.py\", line 26, in lambda_handler\n    document = Document.open(file_content)\n",
    "  File \"/opt/python/textractor/entities/document.py\", line 63, in open\n    return response_parser.parse(fp)\n",
    "  File \"/opt/python/textractor/parsers/response_parser.py\", line 979, in parse\n    t_doc = TDocumentSchema().load(response)\n",
    "  File \"/opt/python/marshmallow/schema.py\", line 722, in load\n    return self._do_load(\n",
    "  File \"/opt/python/marshmallow/schema.py\", line 909, in _do_load\n    raise exc\n"
  ]
}

I can parse this file manually, and it is a simple JSON file (attached as a zip so that GitHub accepts the upload): 1.zip

ccrosland commented 1 year ago

This might be user error. I did get this to work using the plain textractor-response-parser, as follows:

from trp import Document

doc = Document("s3 file contents")
for page in doc.pages:
    print(page.text)

maxx2097 commented 6 months ago

What is "s3 file contents" in this context? I'm facing the same problem parsing these multiple JSON files into ONE document.

onejgordon commented 5 months ago

@maxx2097 & @ccrosland, have you had any success with this? I'm also trying to load multiple JSON outputs from a Textract run by concatenating all the Blocks, but I am getting marshmallow.exceptions.ValidationError when I run response_parser.parse(doc_data) on the resulting merged dictionary:

{'Blocks': {0: {'Confidence': ['Field may not be null.'], 'Text': ['Field may not be null.'], 'ColumnIndex': ['Field may not be null.'], 'ColumnSpan': ['Field may not be null.'], 'EntityTypes': ['Field may not be null.'], 'RowIndex': ['Field may not be null.'], 'RowSpan': ['Field may not be null.'], 'SelectionStatus': ['Field may not be null.'], 'TextType': ['Field may not be null.'], 'Query': ['Field may not be null.'], ...

Document(doc_data) seems to work, but I have a similar problem when I instantiate TGeoFinder with the same data.
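For reference, the merge step described here can be sketched roughly as follows. It assumes the output files are named with consecutive integers ("1", "2", ...) under the job prefix, as the path earlier in the thread suggests, and that each file has the shape of a Textract response with a top-level Blocks list. numeric_part_keys and merge_parts are hypothetical names. Merging alone does not remove the null fields that marshmallow rejects.

```python
import re
from typing import Any, Dict, Iterable, List


def numeric_part_keys(keys: Iterable[str]) -> List[str]:
    """Keep keys whose last path segment is an integer, sorted numerically.

    This skips markers such as ".s3_access_check" and orders "10" after "2".
    """
    parts = [k for k in keys if re.fullmatch(r"\d+", k.rsplit("/", 1)[-1])]
    return sorted(parts, key=lambda k: int(k.rsplit("/", 1)[-1]))


def merge_parts(parts: Iterable[Dict[str, Any]]) -> Dict[str, Any]:
    """Concatenate the Blocks of each part into one response-shaped dict."""
    merged: Dict[str, Any] = {"Blocks": []}
    for part in parts:
        for name, value in part.items():
            if name == "Blocks":
                merged["Blocks"].extend(value or [])
            elif name != "NextToken" and value is not None:
                # Keep the first non-null value of other top-level fields.
                merged.setdefault(name, value)
    return merged
```

The numeric sort matters because a plain string sort would place "10" before "2" and scramble the block order.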

maxx2097 commented 5 months ago

@onejgordon I succeeded with the following approach:

from textractor.entities.document import Document
from textractcaller.t_call import get_full_json_from_output_config, OutputConfig

txoc = OutputConfig(s3_bucket="your bucket", s3_prefix="prefix")
result = get_full_json_from_output_config(txoc, "jobid")
document = Document.open(result)

onejgordon commented 5 months ago

Thank you! I had replicated that dict sanitization but using the version in the textractor lib is way simpler.
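For anyone who needs to do the sanitization by hand, a minimal sketch is below. It assumes the "Field may not be null." errors come from keys stored with explicit null values in the OutputConfig files, and it simply drops them recursively. drop_nulls is a hypothetical name and is not claimed to match the library's own implementation.

```python
from typing import Any


def drop_nulls(value: Any) -> Any:
    """Recursively remove None entries from nested dicts and lists."""
    if isinstance(value, dict):
        return {k: drop_nulls(v) for k, v in value.items() if v is not None}
    if isinstance(value, list):
        return [drop_nulls(v) for v in value if v is not None]
    return value
```

Applied to a parsed file (or to a merged dictionary of Blocks), this leaves only the keys that carry values. Fields reported as "Unknown field." would only disappear if they are also null in the file.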