Open matwerber1 opened 5 years ago
Tested the following addition:
from urllib.parse import unquote_plus
...
document = unquote_plus(record['s3']['object']['key'])
That worked fine but ran into a new issue: the JobTag and ClientRequestToken parameters of textract.start_document_text_detection() do not allow spaces or plusses (though they do allow dashes).
Maybe the S3 object's MD5 should instead be used as the value of ClientRequestToken?
Not sure about JobTag... haven't looked at the rest of the code yet, so not sure if JobTag matters?
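A rough sketch of how both values could be derived from the decoded key so that they never contain spaces or plus signs. The helper names make_client_request_token and make_job_tag are hypothetical and not part of the existing Lambda. The allow-list (letters, digits, underscore, dash) and the 64-character cap are assumptions based on my reading of the Textract limits, so check them against the API reference. Note that this hashes the bucket/key string rather than the object's content MD5 suggested above, since that avoids an extra S3 call.

```python
import hashlib
import re


def make_client_request_token(bucket, key):
    """Hypothetical helper: stable token for a given bucket/key pair.

    An MD5 hex digest is 32 characters drawn from 0-9 and a-f, so it
    contains no spaces or plus signs. It is used here only as an
    idempotency token, not for security.
    """
    return hashlib.md5(f"{bucket}/{key}".encode("utf-8")).hexdigest()


def make_job_tag(key, max_len=64):
    """Hypothetical helper: replace disallowed characters with dashes.

    Assumption: letters, digits, underscore and dash are safe, and the
    tag is limited to 64 characters.
    """
    tag = re.sub(r"[^A-Za-z0-9_-]", "-", key)
    return tag[:max_len] or "untitled"


token = make_client_request_token("my-bucket", "my test.pdf")
job_tag = make_job_tag("my test.pdf")
print(token, job_tag)
```

The same bucket and key always produce the same token, which is what you want if retries of the same upload event should not start duplicate jobs.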
Created this PR to address the issue: #3
Hi,
Per S3 docs, the object key in an S3 event notification is URL-encoded.
The current extract-Enhancer-TextractAsyncJobSubmitFunction Lambda does not URL-decode the S3 key received in the event JSON, so I'm receiving errors when trying to process objects whose keys contain spaces or other characters that get URL-encoded.
For example, if I upload a document named my test.pdf, the S3 event sent to the extract-Enhancer-TextractAsyncJobSubmitFunction function contains the key Records[0].s3.object.key = my+test.pdf. The Textract API calls textract.start_document_analysis() and textract.start_document_text_detection() then fail because the DocumentLocation parameter has a value of my+test.pdf when it should instead be my test.pdf.
Can you add URL decoding to the S3 key name in the received events?
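For illustration, a minimal sketch of the decoding step, using a trimmed-down event that contains only the fields mentioned above. The function name decoded_objects and the bucket name are made up for this example; the actual handler in the Lambda may be structured differently.

```python
from urllib.parse import unquote_plus


def decoded_objects(event):
    """Return (bucket, key) pairs with each S3 key URL-decoded.

    unquote_plus turns '+' into a space and also decodes %XX escapes,
    which matches how keys arrive in S3 event notifications.
    """
    return [
        (record["s3"]["bucket"]["name"],
         unquote_plus(record["s3"]["object"]["key"]))
        for record in event.get("Records", [])
    ]


# Trimmed sample event; a real notification carries many more fields.
sample_event = {
    "Records": [
        {"s3": {"bucket": {"name": "my-bucket"},
                "object": {"key": "my+test.pdf"}}}
    ]
}

print(decoded_objects(sample_event))
```

The decoded key (my test.pdf) is what should be passed as the S3 object name in DocumentLocation.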