aws-samples / amazon-textract-textractor

Analyze documents with Amazon Textract and generate output in multiple formats.
Apache License 2.0

start_document_analysis does not support List of Images #184

Open tarunn2799 opened 1 year ago

tarunn2799 commented 1 year ago

The documentation for start_document_analysis says it supports a list of PIL images, but the source code only accepts a string, a bytearray, or a PIL Image: https://github.com/aws-samples/amazon-textract-textractor/blob/e40f5b0378f9ee24d0a757de414505fb06a4471f/textractor/textractor.py#L488

How do I pass multiple images to this API?

ThomasDelteil commented 1 year ago

You are right, this seems to be a leftover from a previous implementation. The best way to pass multiple PIL images would simply be to use a for loop with the sync API, like this:

documents = [extractor.analyze_document(file_source=image, features=[TextractFeatures.FORMS]) for image in images]
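For anyone who wants the loop above as something reusable, here is a minimal sketch. `analyze_images` is a hypothetical helper name, not part of textractor; it only assumes the extractor object exposes `analyze_document(file_source=..., features=...)` as used in the one-liner above.

```python
def analyze_images(extractor, images, features):
    """Run the synchronous analysis once per image, preserving input order.

    `extractor` is assumed to be a textractor `Textractor` instance and
    `images` a list of PIL images (hypothetical helper, not library API).
    """
    return [
        extractor.analyze_document(file_source=image, features=features)
        for image in images
    ]


# Usage (needs AWS credentials; the profile name is a placeholder):
# from textractor import Textractor
# from textractor.data.constants import TextractFeatures
# extractor = Textractor(profile_name="default")
# documents = analyze_images(extractor, images, [TextractFeatures.FORMS])
```

Each call is a separate synchronous request, so the result is one document per image rather than a single multi-page document.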

Alternatively, you can combine your images into a single PDF file and use the async start_document_analysis API.
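The PDF route could look roughly like the sketch below. `images_to_pdf_bytes` is a hypothetical helper, not something textractor provides; it relies on Pillow's multi-page PDF saving (`save_all=True` with `append_images`). The file name, the bucket path, and the use of `s3_upload_path` in the commented usage are assumptions, so check them against the textractor version you have installed.

```python
import io


def images_to_pdf_bytes(images):
    """Combine PIL images into one multi-page PDF and return its bytes.

    Hypothetical helper: each image becomes one page, in input order.
    """
    if not images:
        raise ValueError("at least one image is required")
    # Convert to RGB to avoid modes the PDF writer may reject (for example RGBA).
    pages = [image.convert("RGB") for image in images]
    buffer = io.BytesIO()
    # The first page drives the save; the remaining pages are appended.
    pages[0].save(buffer, format="PDF", save_all=True, append_images=pages[1:])
    return buffer.getvalue()


# Usage (placeholders throughout; needs AWS credentials and an S3 bucket):
# with open("pages.pdf", "wb") as f:
#     f.write(images_to_pdf_bytes(images))
# document = extractor.start_document_analysis(
#     file_source="pages.pdf",
#     features=[TextractFeatures.FORMS],
#     s3_upload_path="s3://your-bucket/some-prefix/",  # assumed upload location
# )
```

Unlike the for loop, this produces a single multi-page document from one asynchronous job.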

Belval commented 1 year ago

I opened PR #190 to update the documentation, but I will keep this issue open as a feature enhancement, since supporting List[PIL.Image] as input would improve usability.