Open ccrosland opened 1 year ago
This might be user error. I did get this to work when using the pure textractor-response-parser as follows:
from trp import Document
doc = Document("s3 file contents")
for page in doc.pages:
    print(page.text)
What is "s3 file contents" in this context? I'm facing the same problem with parsing these multiple JSON files into ONE document.
@maxx2097 & @ccrosland, have you had any success with this? I'm also trying to load multiple JSON outputs from a Textract run by concatenating all the Blocks, but am getting marshmallow.exceptions.ValidationError when I run response_parser.parse(doc_data) on the resultant merged dictionary:
{'Blocks': {0: {'Confidence': ['Field may not be null.'], 'Text': ['Field may not be null.'], 'ColumnIndex': ['Field may not be null.'], 'ColumnSpan': ['Field may not be null.'], 'EntityTypes': ['Field may not be null.'], 'RowIndex': ['Field may not be null.'], 'RowSpan': ['Field may not be null.'], 'SelectionStatus': ['Field may not be null.'], 'TextType': ['Field may not be null.'], 'Query': ['Field may not be null.'], ...
Document(doc_data) seems to work, but I have a similar problem when I instantiate TGeoFinder with the same data.
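For anyone merging the part files by hand as described above, here is a minimal sketch of the concatenation step plus a null-stripping pass. The 'Field may not be null.' messages suggest the merged blocks carry keys set to null, so the sketch drops those keys before parsing. Whether that alone satisfies the marshmallow schema is an assumption, not something confirmed in this thread. The names strip_nulls and merge_parts are hypothetical helpers, and the sample blocks are made up for illustration.

```python
from typing import Any, Dict, Iterable


def strip_nulls(value: Any) -> Any:
    """Recursively drop dict keys whose value is None."""
    if isinstance(value, dict):
        return {k: strip_nulls(v) for k, v in value.items() if v is not None}
    if isinstance(value, list):
        return [strip_nulls(v) for v in value]
    return value


def merge_parts(parts: Iterable[Dict[str, Any]]) -> Dict[str, Any]:
    """Concatenate the Blocks of several partial responses into one dict."""
    merged: Dict[str, Any] = {"Blocks": []}
    for part in parts:
        # Keep the first DocumentMetadata encountered (assumption: every
        # part of the same job reports the same metadata).
        if "DocumentMetadata" in part and "DocumentMetadata" not in merged:
            merged["DocumentMetadata"] = part["DocumentMetadata"]
        merged["Blocks"].extend(strip_nulls(b) for b in part.get("Blocks", []))
    return merged


# Made-up sample parts, standing in for the parsed JSON of each output file.
part_1 = {
    "DocumentMetadata": {"Pages": 2},
    "Blocks": [{"BlockType": "PAGE", "Id": "a", "Text": None}],
}
part_2 = {"Blocks": [{"BlockType": "LINE", "Id": "b", "Text": "hello", "RowIndex": None}]}

doc_data = merge_parts([part_1, part_2])
```

The resulting doc_data is a single dictionary with one Blocks list, which is the shape passed to response_parser.parse(doc_data) above.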
@onejgordon I succeeded with the following approach:
from textractor.entities.document import Document
from textractcaller.t_call import get_full_json_from_output_config, OutputConfig
txoc = OutputConfig(s3_bucket="your bucket", s3_prefix="prefix")
result = get_full_json_from_output_config(txoc, "jobid")
document = Document.open(result)
Thank you! I had replicated that dict sanitization but using the version in the textractor lib is way simpler.
Per this example reference:
However, the following Lambda function:
Results in the following Error:
Response
Wherein: "s3://cc-lambda-textract/textract_output/988e569c7e520e2376e6dda52c93a1151d9d3d72980928cbb1d338a13972da8f/1" is the first output saved by Textract using OutputConfig and is valid JSON.
Trying another approach (looping the directory and trying to load each JSON file):
Results in the following error:
I can parse this file manually, and it is a simple JSON file (attached as a zip to allow uploading to GitHub): 1.zip
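For the loop-over-the-directory approach, one thing worth checking is which objects the loop actually tries to load, and in what order. Below is a rough sketch of a filter that keeps only keys whose last path segment is an integer (like the .../1 object quoted above) and sorts them numerically, so that 10 comes after 2. It assumes the parts are numbered this way and that the prefix may also hold objects that are not JSON parts. Whether that explains the error above is not known from this thread. ordered_part_keys is a hypothetical helper, and the keys in the example are placeholders.

```python
from typing import Iterable, List


def ordered_part_keys(keys: Iterable[str]) -> List[str]:
    """Return keys whose last path segment is an integer, in numeric order."""
    numbered = []
    for key in keys:
        # Take the final path segment, e.g. "1" from "prefix/jobid/1".
        name = key.rstrip("/").rsplit("/", 1)[-1]
        if name.isdecimal():
            numbered.append((int(name), key))
    return [key for _, key in sorted(numbered)]


# Placeholder keys; a real listing would come from the S3 prefix.
example_keys = [
    "textract_output/jobid/10",
    "textract_output/jobid/.s3_access_check",
    "textract_output/jobid/2",
    "textract_output/jobid/1",
]
part_keys = ordered_part_keys(example_keys)
```

Each key in part_keys can then be downloaded and passed through json.loads before the blocks are merged.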