aws-samples / amazon-comprehend-semi-structured-documents-annotation-tools


Support/fix nested folders of source documents #28

Open athewsey opened 1 year ago

Today I tried to use the CLI to create a labelling job from an S3 prefix with nested folders - for example:

s3://comprehend-semi-structured-docs-${AWS_REGION}-${AWS_ACCOUNT_ID}/
    docs/
        folder-1/
            doc1.pdf
        folder-2/
            doc2.pdf
            doc3.pdf

I created the job using the base folder as the input path, with a command along these lines:

python bin/comprehend-ssie-annotation-tool-cli.py --input-s3-path s3://{bucket}/docs/

However, the job failed with NoSuchKey errors in the pre-annotation Lambda function:

[ERROR] NoSuchKey: An error occurred (NoSuchKey) when calling the GetObject operation: The specified key does not exist.
Traceback (most recent call last):
  File "/var/task/lambdas/pre_human_task_lambda.py", line 346, in lambda_handler
    pdf_s3_resp = s3_client.get_object_response_from_s3(source_ref)
  File "/var/task/utils/s3_helper.py", line 36, in get_object_response_from_s3
    return self.s3_client.get_object(Bucket=bucket, Key=path)
  File "/var/task/botocore/client.py", line 514, in _api_call
    return self._make_api_call(operation_name, kwargs)
  File "/var/task/botocore/client.py", line 938, in _make_api_call
    raise error_class(parsed_response, operation_name)
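
For context, the traceback shows get_object being called with a Bucket and Key derived from source_ref. A minimal sketch of that split, assuming the usual s3://bucket/key URI form; split_s3_uri and the bucket name my-bucket are hypothetical and not taken from the tool's s3_helper code:

```python
from urllib.parse import urlparse


def split_s3_uri(uri: str) -> tuple[str, str]:
    """Split an s3://bucket/key URI into (bucket, key).

    Hypothetical helper for illustration only. Keys containing '?' or '#'
    would need different handling, since urlparse treats those specially.
    """
    parsed = urlparse(uri)
    if parsed.scheme != "s3" or not parsed.netloc:
        raise ValueError(f"Not an S3 URI: {uri!r}")
    # The URI path starts with '/', but S3 object keys do not.
    return parsed.netloc, parsed.path.lstrip("/")


bucket, key = split_s3_uri("s3://my-bucket/docs/doc1.pdf")
print(bucket, key)
```

If no object exists at exactly that key (for example, because the file actually lives at docs/folder-1/doc1.pdf), GetObject raises NoSuchKey.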

Looking at the generated input manifest, it seems the preparation step hasn't handled the nested folders correctly: every document is listed as a file directly under the input-s3-path, with the intermediate folder names dropped. For example:

{"source-ref": "s3://{bucket}/docs/doc1.pdf", "page": "1", "metadata": {...}, "annotator-metadata": null, "primary-annotation-ref": null, "secondary-annotation-ref": null}
{"source-ref": "s3://{bucket}/docs/doc2.pdf", "page": "1", "metadata": {...}, "annotator-metadata": null, "primary-annotation-ref": null, "secondary-annotation-ref": null}
{"source-ref": "s3://{bucket}/docs/doc3.pdf", "page": "1", "metadata": {...}, "annotator-metadata": null, "primary-annotation-ref": null, "secondary-annotation-ref": null}

I didn't see anything in the README suggesting that nested folders aren't supported in the input prefix, so I'm not sure whether this is a bug or a feature request.