aws-samples / amazon-textract-textractor

Analyze documents with Amazon Textract and generate output in multiple formats.

parse an existing JSON - from textract.start_document_analysis() throws AssertionError #188

Open sankalp-wns opened 1 year ago

sankalp-wns commented 1 year ago

I am parsing an existing JSON response from the asynchronous call textract.start_document_analysis(), but it fails to parse. I have a multi-page PDF. I get an AssertionError:

from textractor.parsers import response_parser
document = response_parser.parse(textract_response)
.
.
venv/lib/python3.9/site-packages/textractor/parsers/response_parser.py", line 733, in parse_document_api_response
    assert len(pages) == response["DocumentMetadata"]["Pages"]
AssertionError

While debugging response_parser.py, I found that it identified only 4 pages, whereas my document has 13.

Also, I tried textract.analyze_document on each page of the PDF and combined the results, and that works perfectly. Please help.
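
For reference, the working synchronous version is roughly the following (a simplified sketch; pdf2image and the way the per-page responses are merged here are just for illustration, the actual code has more error handling):

import io

import boto3
from pdf2image import convert_from_path
from textractor.parsers import response_parser

textract = boto3.client("textract")

# Rasterize each PDF page so it can be sent to the synchronous API
pages = convert_from_path("my_document.pdf", dpi=300)

combined = {"DocumentMetadata": {"Pages": len(pages)}, "Blocks": []}
for page_number, image in enumerate(pages, start=1):
    buffer = io.BytesIO()
    image.save(buffer, format="PNG")
    response = textract.analyze_document(
        Document={"Bytes": buffer.getvalue()},
        FeatureTypes=["TABLES"],
    )
    # analyze_document handles one page at a time, so tag each block manually
    for block in response["Blocks"]:
        block["Page"] = page_number
        combined["Blocks"].append(block)

document = response_parser.parse(combined)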

Belval commented 1 year ago

Would you be willing to share the asset (or even just the Textract response) you used so that we are better able to reproduce the issue on our side? If so, send an email to belvae[AT]amazon.com.

sankalp-wns commented 1 year ago

I have sent you an email referencing this issue. Thanks.

Belval commented 1 year ago

Thank you for sending me the assets; this helps a lot when debugging.

I was able to reproduce the issue using the .json file you provided; however, I was not able to reproduce it by running Textract on the original PDF. Going through the JSON file, it seems there are indeed missing pages, with the metadata reporting 13 pages and the actual file containing only 4.

Could you share the code snippet that was used to produce the response you shared?

EDIT: Here is my code, which did not reproduce the issue:

import json

from textractor import Textractor
from textractor.data.constants import TextractFeatures

doc = Textractor(
    region_name="us-west-2"
).start_document_analysis(
    "your_file.pdf",
    features=[TextractFeatures.TABLES],
    s3_upload_path="s3://my-bucket/my-prefix",
    save_image=False
)
json.dump(doc.response, open("sample.json", "w")) # That file does contain all 13 pages.

sankalp-wns commented 1 year ago

The Textract service is called via the boto3 module. Below is similar scratch code:


import time

import boto3


def start_document_analysis(bucket, document_key, role_arn, sns_topic_arn):
    textract = boto3.client('textract')
    document_location = {'S3Object': {'Bucket': bucket, 'Name': document_key}}
    notification_channel = {'RoleArn': role_arn, 'SNSTopicArn': sns_topic_arn}

    response = textract.start_document_analysis(
        DocumentLocation=document_location,
        FeatureTypes=['TABLES'],
        JobTag='test_Job',
        NotificationChannel=notification_channel
    )

    return response['JobId']

def get_document_analysis_results(bucket, job_id):
    textract = boto3.client('textract')
    response = textract.get_document_analysis(JobId=job_id)

    while response['JobStatus'] == 'IN_PROGRESS':
        print('Job status: IN_PROGRESS')
        time.sleep(5)
        response = textract.get_document_analysis(JobId=job_id)

    if response['JobStatus'] == 'FAILED':
        print(f"Job failed: {response['StatusMessage']}")
        return None

    return response

This is done by two Lambdas, which save the OCR response as a .json file in an S3 bucket. Another Lambda then processes this output.
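
Roughly, the saving step looks like this (a simplified sketch; the bucket, key, and helper name are placeholders, not the real code):

import json

import boto3


def save_analysis_to_s3(response, bucket, key):
    # Serialize the get_document_analysis() result so the next Lambda can read it
    s3 = boto3.client('s3')
    s3.put_object(Bucket=bucket, Key=key, Body=json.dumps(response).encode('utf-8'))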

sankalp-wns commented 1 year ago

Hi @Belval, were you able to look into this? I might have to change the implementation from async to sync as a work-around.

Righs commented 9 months ago

Hello @Belval,

I seem to have the same issue when writing the JSON output like this:

textractor.start_document_analysis(
    file_source="s3://mysource.pdf",
    features=[TextractFeatures.LAYOUT, TextractFeatures.TABLES],
    s3_output_path="s3://mysink",
    save_image=True)

This calls Textract asynchronously and stores the output per call, e.g. under the path "s3://mysink/1" for the first call. However, when reading a single JSON back, it throws the assertion error because the parser compares the DocumentMetadata page count against the pages actually present in that single JSON.

This is how I try to reload the object:

client = boto3.client('s3')
result = download_from_s3(client, URI)
# Note: the JSON also contains None values, which fail validation; I strip all
# None values so that I can parse it with .parse() on the line below.
jresult = json.loads(result.getvalue().decode())

document = response_parser.parse(jresult)

jresult['DocumentMetadata']['Pages'] is 29, but the pages object only has 3 pages, since this JSON is just the first async result.

Question: is there another way to load the async outputs so this doesn't happen?

Belval commented 9 months ago

I think you are seeing this issue because you are only loading part of the response (even in S3, the response is paginated). This is something we are planning to release an update for soon, but in the meantime you can use this function: https://github.com/aws-samples/amazon-textract-textractor/blob/master/caller/textractcaller/t_call.py#L325
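
Roughly speaking, the idea is to keep calling GetDocumentAnalysis with NextToken until every block has been collected. A minimal sketch of that (illustration only, not the library code, and assuming the job has already reached SUCCEEDED):

import boto3


def get_full_analysis(job_id):
    # Merge every paginated GetDocumentAnalysis response into a single dict
    textract = boto3.client("textract")
    full_response = textract.get_document_analysis(JobId=job_id)
    next_token = full_response.get("NextToken")
    while next_token:
        page = textract.get_document_analysis(JobId=job_id, NextToken=next_token)
        full_response["Blocks"].extend(page["Blocks"])
        next_token = page.get("NextToken")
    full_response.pop("NextToken", None)  # merged dict is no longer paginated
    return full_response

The merged dict should then go through response_parser.parse() without tripping the page-count assertion.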

Righs commented 9 months ago

Thanks for the quick response; reading the entire S3 prefix with all the paginated results does indeed do the trick.

For now I solved it by writing the complete JSON to a single, non-paginated S3 key via the method you showed above:

client = boto3.client('s3')
full_json = json.dumps(doc.response).encode()
upload_to_s3(client, 's3://mysource', full_json)

Then I load it again via response_parser.parse(), which seems to work. It also feels a bit cleaner to me because you don't have all the paginated responses sitting in S3.

Do you see any downsides solving it this way?

Belval commented 9 months ago

If you can afford to wait for the Textract response, then no, there are no downsides to your approach. However, if you want to launch a large job (say, a 1000+ page PDF), the server running the code above will have to wait for Textract to finish processing the asset before saving it to S3, whereas s3_output_path would do it automatically (albeit in the paginated format).

It really depends on how your infrastructure is built. If you are running on AWS Lambda, it's definitely worth using s3_output_path; if you are using an EC2 instance that's always on anyway, it does not matter as much.

Hopefully that helps.