aws-samples / amazon-textract-textractor

Analyze documents with Amazon Textract and generate output in multiple formats.
Apache License 2.0

Save image doesn't work with S3 path - TypeError: Invalid input type 'bytearray' #382

Closed: steffeng closed this issue 1 week ago

steffeng commented 1 week ago

Saving images fails for files that are already uploaded to S3 when pypdfium2 is installed (and pdf2image is not).

This code works:

document = extractor.start_document_analysis(
    file_source="/Users/<redacted path>/paper.pdf",
    features=[TextractFeatures.LAYOUT],
    s3_upload_path="s3://<redacted bucket>/test/",
)

This code doesn't work:

document = extractor.start_document_analysis(
    file_source="s3://<redacted bucket>/texttract-test/paper.pdf",
    features=[TextractFeatures.LAYOUT],
)

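The code works when I remove the extra bytearray() wrapper around the bytes in the S3 code path (the rasterize_pdf(bytearray(file_obj)) call in Textractor._get_document_images_from_path, shown in the traceback below) and pass the file_obj bytes directly, as the code path for local files does.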
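The underlying reason is that bytearray is not a subclass of bytes in Python, so a loader that checks isinstance(data, bytes) rejects it. The following is a minimal sketch of that behavior and of a defensive conversion, not the actual Textractor patch. The names as_immutable_bytes and strict_loader are hypothetical, and strict_loader only imitates the type check reported in the traceback; it is not pypdfium2 code.

```python
from typing import Union

BytesLike = Union[bytes, bytearray, memoryview]


def as_immutable_bytes(data: BytesLike) -> bytes:
    """Return data as bytes, copying only when it is not already bytes."""
    if isinstance(data, bytes):
        return data
    if isinstance(data, (bytearray, memoryview)):
        return bytes(data)
    raise TypeError(f"Invalid input type '{type(data).__name__}'")


def strict_loader(data) -> int:
    """Hypothetical stand-in for a loader that accepts bytes but not bytearray."""
    if not isinstance(data, bytes):
        raise TypeError(f"Invalid input type '{type(data).__name__}'")
    return len(data)


if __name__ == "__main__":
    payload = bytearray(b"%PDF-1.7 example")
    # Passing the bytearray directly would raise TypeError in strict_loader;
    # converting first avoids that.
    print(strict_loader(as_immutable_bytes(payload)))
```

Converting at the call site keeps the loader strict while letting callers hand over whatever bytes-like object they received from S3.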

Error:

{
    "name": "TypeError",
    "message": "Invalid input type 'bytearray'",
    "stack": "---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
Cell In[7], line 9
      5 extractor = Textractor(region_name=\"eu-central-1\")
      7 s3_path = \"s3://<redacted bucket>/texttract-test/paper.pdf\"
----> 9 document = extractor.start_document_analysis(
     10     file_source=s3_path, features=[TextractFeatures.LAYOUT]
     11 )

File ~/venvs/p3/lib/python3.9/site-packages/textractor/textractor.py:579, in Textractor.start_document_analysis(self, file_source, features, s3_output_path, s3_upload_path, queries, client_request_token, job_tag, save_image)
    577         images = original_file_source
    578     else:
--> 579         images = self._get_document_images_from_path(original_file_source)
    581 return LazyDocument(
    582     response[\"JobId\"],
    583     TextractAPI.ANALYZE,
   (...)
    586     output_config=output_config,
    587 )

File ~/venvs/p3/lib/python3.9/site-packages/textractor/textractor.py:138, in Textractor._get_document_images_from_path(self, filepath)
    136 if filepath.lower().endswith(\".pdf\"):
    137     if IS_PDF_RENDERING_ENABLED:
--> 138         images = rasterize_pdf(bytearray(file_obj))
    139     else:
    140         raise MissingDependencyException(
    141             \"pdf2image is not installed. If you do not plan on using visualizations you can skip image generation using save_image=False in your function call.\"
    142         )

File ~/venvs/p3/lib/python3.9/site-packages/textractor/utils/pdf_utils.py:23, in rasterize_pdf(pdf)
     19 \"\"\"
     20 Convert a pdf into a list of images
     21 \"\"\"
     22 if PYPDFIUM2_IS_INSTALLED:
---> 23     pdf = pypdfium2.PdfDocument(pdf)
     24     return [page.render(scale=250 / 72).to_pil() for page in pdf]
     25 elif PDF2IMAGE_IS_INSTALLED:

File ~/venvs/p3/lib/python3.9/site-packages/pypdfium2/_helpers/document.py:78, in PdfDocument.__init__(self, input, password, autoclose)
     76     self.raw = self._input
     77 else:
---> 78     self.raw, to_hold, to_close = _open_pdf(self._input, self._password, self._autoclose)
     79     self._data_holder += to_hold
     80     self._data_closer += to_close

File ~/venvs/p3/lib/python3.9/site-packages/pypdfium2/_helpers/document.py:674, in _open_pdf(input_data, password, autoclose)
    672     pdf = pdfium_c.FPDF_LoadCustomDocument(bufaccess, password)
    673 else:
--> 674     raise TypeError(f\"Invalid input type '{type(input_data).__name__}'\")
    676 if pdfium_c.FPDF_GetPageCount(pdf) < 1:
    677     err_code = pdfium_c.FPDF_GetLastError()

TypeError: Invalid input type 'bytearray'"
}

Belval commented 1 week ago

Thank you for reporting this issue. We should have a fix released to PyPI by end of day.

Belval commented 1 week ago

Textractor v1.8.2 is now available on PyPI (it may take up to 30 minutes for the cache to refresh).
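One way to check whether an environment already has this release is to compare the installed version against 1.8.2. This is a rough sketch under two assumptions: that the package is installed under the distribution name amazon-textract-textractor, and that versions are plain dotted numbers (suffixes such as ".post1" are not handled). The helpers parse_release and has_fix are hypothetical names introduced for illustration.

```python
from importlib.metadata import PackageNotFoundError, version

FIXED_VERSION = "1.8.2"  # release named in this thread
DIST_NAME = "amazon-textract-textractor"  # assumed PyPI distribution name


def parse_release(text: str) -> tuple:
    """Parse 'X.Y.Z' into a tuple of ints; assumes plain numeric versions."""
    return tuple(int(part) for part in text.split("."))


def has_fix(installed: str, fixed: str = FIXED_VERSION) -> bool:
    """Return True if the installed version is at least the fixed version."""
    return parse_release(installed) >= parse_release(fixed)


if __name__ == "__main__":
    try:
        installed = version(DIST_NAME)
    except PackageNotFoundError:
        print(f"{DIST_NAME} is not installed")
    else:
        status = "includes the fix" if has_fix(installed) else "predates the fix"
        print(f"{DIST_NAME} {installed} {status}")
```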

steffeng commented 1 week ago

Thanks, @Belval. I can confirm the fix on my end. Closing the issue.