aws-samples / amazon-textract-textractor

Analyze documents with Amazon Textract and generate output in multiple formats.
Apache License 2.0

Lambda layers for Python 3.12 PDF raising an exception on missing libpng16.so.16 #373

Viajante80 closed this issue 4 months ago

Viajante80 commented 5 months ago

Using the textractor-lambda-p312-pdf artifact from lambda-layers run 50 (https://github.com/aws-samples/amazon-textract-textractor/actions/runs/9550648081), I get the following error:

{
  "errorMessage": "Unable to get page count.\npdfinfo: error while loading shared libraries: libpng16.so.16: cannot open shared object file: No such file or directory\n",
  "errorType": "PDFPageCountError",
  "stackTrace": [
    "  File \"/var/task/lambda_function.py\", line 27, in lambda_handler\n    textract = extractor.start_document_analysis(\n",
    "  File \"/opt/python/textractor/textractor.py\", line 575, in start_document_analysis\n    images = self._get_document_images_from_path(original_file_source)\n",
    "  File \"/opt/python/textractor/textractor.py\", line 133, in _get_document_images_from_path\n    images = convert_from_bytes(bytearray(file_obj))\n",
    "  File \"/opt/python/pdf2image/pdf2image.py\", line 359, in convert_from_bytes\n    return convert_from_path(\n",
    "  File \"/opt/python/pdf2image/pdf2image.py\", line 127, in convert_from_path\n    page_count = pdfinfo_from_path(\n",
    "  File \"/opt/python/pdf2image/pdf2image.py\", line 611, in pdfinfo_from_path\n    raise PDFPageCountError(\n"
  ]
}
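
For anyone triaging a similar report, here is a minimal sketch of how the missing library name could be pulled out of the errorMessage string programmatically. The function name missing_library and the regular expression are illustrative assumptions, not part of textractor or pdf2image; the pattern is only modeled on the loader message shown above.

```python
import re
from typing import Optional

# Pattern modeled on the dynamic loader message in the errorMessage above.
# This is an assumption about the message shape, not a documented format.
_MISSING_LIB = re.compile(
    r"error while loading shared libraries: (?P<lib>[^\s:]+): "
    r"cannot open shared object file"
)


def missing_library(error_message: str) -> Optional[str]:
    """Return the shared library named in a loader error, or None if absent.

    Hypothetical helper for illustration only.
    """
    match = _MISSING_LIB.search(error_message)
    return match.group("lib") if match else None
```

Passing the errorMessage from this report should yield libpng16.so.16, which is the file the layer would need to ship.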
Belval commented 5 months ago

Probably the same issue as #372, but with a different library. It seems a newer version of the Lambda environment includes the version number in the library file names.

Change would be here: https://github.com/aws-samples/amazon-textract-textractor/blob/master/.github/workflows/lambda_layers.yml#L355

We will address this issue by the end of the day, thank you for flagging it.

Viajante80 commented 5 months ago

Thank you, @Belval. I tested build 51 and got a new error:

Response
{
  "errorMessage": "Unable to get page count.\npdfinfo: error while loading shared libraries: libplc4.so: cannot open shared object file: No such file or directory\n",
  "errorType": "PDFPageCountError",
  "requestId": "5626e07d-6d35-4698-a0d9-c01447b43502",
  "stackTrace": [
    "  File \"/var/task/lambda_function.py\", line 27, in lambda_handler\n    textract = extractor.start_document_analysis(\n",
    "  File \"/opt/python/textractor/textractor.py\", line 575, in start_document_analysis\n    images = self._get_document_images_from_path(original_file_source)\n",
    "  File \"/opt/python/textractor/textractor.py\", line 133, in _get_document_images_from_path\n    images = convert_from_bytes(bytearray(file_obj))\n",
    "  File \"/opt/python/pdf2image/pdf2image.py\", line 359, in convert_from_bytes\n    return convert_from_path(\n",
    "  File \"/opt/python/pdf2image/pdf2image.py\", line 127, in convert_from_path\n    page_count = pdfinfo_from_path(\n",
    "  File \"/opt/python/pdf2image/pdf2image.py\", line 611, in pdfinfo_from_path\n    raise PDFPageCountError(\n"
  ]
}
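
Since the loader reports only the first library it cannot find, each fix can surface another missing file. A rough sketch of listing all unresolved libraries in one pass, assuming ldd is available in the environment where the layer is tested and that pdfinfo is on PATH (neither is confirmed for the Lambda runtime). The function names here are hypothetical.

```python
import shutil
import subprocess
from typing import List


def parse_missing_libraries(ldd_output: str) -> List[str]:
    """Collect library names that ldd reports as 'not found'."""
    missing = []
    for line in ldd_output.splitlines():
        # ldd lines look like "libfoo.so.1 => /path (0x...)" or
        # "libfoo.so.1 => not found"; lines without "=>" are skipped.
        name, sep, target = line.strip().partition(" => ")
        if sep and target.strip() == "not found":
            missing.append(name)
    return missing


def missing_libraries_for(binary: str = "pdfinfo") -> List[str]:
    """Run ldd on a binary found on PATH and return its unresolved libraries.

    Assumes ldd exists in this environment; illustrative helper only.
    """
    path = shutil.which(binary)
    if path is None:
        raise FileNotFoundError(f"{binary} not found on PATH")
    result = subprocess.run(
        ["ldd", path], capture_output=True, text=True, check=False
    )
    return parse_missing_libraries(result.stdout)
```

Dependencies of a library that is itself missing will not show up until that library is present, so this may still need more than one pass.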

Belval commented 5 months ago

This is fixed in the latest lambda layer version.