aws-samples / amazon-textract-textractor

Analyze documents with Amazon Textract and generate output in multiple formats.
Apache License 2.0
360 stars 134 forks

Use pypdfium2 for PDF rasterizing when possible #376

Closed Belval closed 1 week ago

Belval commented 1 week ago

Issue #, if available: N/A

Description of changes: This is probably one of the most requested features. With this change, Textractor supports PDF rasterization with pypdfium2, which is simpler to install than pdf2image. For backward compatibility, both will be supported for the foreseeable future. Additionally, this removes the PDF page count check in the synchronous functions and simply catches the exception if a multipage PDF is sent to the wrong API.
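To illustrate the "pypdfium2 when possible" fallback, here is a minimal sketch. It is not the code from this PR: `pick_rasterizer` and `rasterize_pdf` are hypothetical names, and the default of 200 DPI is an assumption. It only relies on `pypdfium2.PdfDocument`, `PdfPage.render(scale=...)`, `PdfBitmap.to_pil()`, and `pdf2image.convert_from_path`.

```python
import importlib.util


def pick_rasterizer(is_available=None):
    """Return the name of the first usable backend, preferring pypdfium2.

    `is_available` is a hypothetical hook (name -> bool) so the choice can
    be exercised without either package installed.
    """
    if is_available is None:
        def is_available(name):
            return importlib.util.find_spec(name) is not None
    for name in ("pypdfium2", "pdf2image"):
        if is_available(name):
            return name
    raise ImportError("Install pypdfium2 or pdf2image to rasterize PDFs")


def rasterize_pdf(path, dpi=200):
    """Render every page of the PDF at `path` to a PIL image."""
    backend = pick_rasterizer()
    if backend == "pypdfium2":
        import pypdfium2 as pdfium

        pdf = pdfium.PdfDocument(path)
        # PDF user space is 72 units per inch, so scale = dpi / 72.
        return [pdf[i].render(scale=dpi / 72).to_pil() for i in range(len(pdf))]
    # Fallback kept for backward compatibility; pdf2image needs poppler.
    from pdf2image import convert_from_path

    return convert_from_path(path, dpi=dpi)
```

Calling `rasterize_pdf("document.pdf")` would return one image per page with whichever backend is installed; the file name is only a placeholder.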
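The second change, catching the service error instead of counting pages up front, could look roughly like the sketch below. This is an assumption-laden illustration, not the PR's implementation: `call_sync_api` is a hypothetical helper, and it assumes the service rejects a multipage PDF on a synchronous call with the error code `UnsupportedDocumentException`, exposed through the `response` dictionary that botocore's `ClientError` carries.

```python
def call_sync_api(call, document_bytes):
    """Invoke a synchronous Textract-style call and translate the
    unsupported-document error into a clearer message.

    `call` is any callable accepting `Document=...`, for example a boto3
    Textract client's `detect_document_text` method.
    """
    try:
        return call(Document={"Bytes": document_bytes})
    except Exception as exc:
        # botocore.exceptions.ClientError stores the service error code here;
        # exceptions without a `response` attribute are re-raised untouched.
        response = getattr(exc, "response", None) or {}
        code = response.get("Error", {}).get("Code")
        if code == "UnsupportedDocumentException":  # assumed error code
            raise ValueError(
                "Document rejected by the synchronous API; a multipage PDF "
                "must be sent to the asynchronous API instead."
            ) from exc
        raise
```

The design trade-off is that no local PDF parsing is needed before the request, at the cost of one failed round trip when the wrong API is used.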

By submitting this pull request, I confirm that you can use, modify, copy, and redistribute this contribution, under the terms of your choice.