aws-samples / amazon-textract-textractor

Analyze documents with Amazon Textract and generate output in multiple formats.
Apache License 2.0

Lambda layers for Python 3.12 PDF raising an exception on missing libpng16.so.16 #373

Open Viajante80 opened 2 weeks ago

Viajante80 commented 2 weeks ago

Using lambda-layers build 50 (https://github.com/aws-samples/amazon-textract-textractor/actions/runs/9550648081), artifact textractor-lambda-p312-pdf, I get the following error:

"errorMessage": "Unable to get page count.\npdfinfo: error while loading shared libraries: libpng16.so.16: cannot open shared object file: No such file or directory\n",
"errorType": "PDFPageCountError",
"stackTrace": [
    "  File \"/var/task/lambda_function.py\", line 27, in lambda_handler\n    textract = extractor.start_document_analysis(\n",
    "  File \"/opt/python/textractor/textractor.py\", line 575, in start_document_analysis\n    images = self._get_document_images_from_path(original_file_source)\n",
    "  File \"/opt/python/textractor/textractor.py\", line 133, in _get_document_images_from_path\n    images = convert_from_bytes(bytearray(file_obj))\n",
    "  File \"/opt/python/pdf2image/pdf2image.py\", line 359, in convert_from_bytes\n    return convert_from_path(\n",
    "  File \"/opt/python/pdf2image/pdf2image.py\", line 127, in convert_from_path\n    page_count = pdfinfo_from_path(\n",
    "  File \"/opt/python/pdf2image/pdf2image.py\", line 611, in pdfinfo_from_path\n    raise PDFPageCountError(\n"
  ]
Belval commented 2 weeks ago

This is probably the same issue as #372, but with a different library. It seems that the newer version of the Lambda environment includes the version number in the library file names.

Change would be here: https://github.com/aws-samples/amazon-textract-textractor/blob/master/.github/workflows/lambda_layers.yml#L355

We will address this issue by the end of the day, thank you for flagging it.

Viajante80 commented 2 weeks ago

Thank you @Belval. I tested build 51 and got a new error:

Response:
{
  "errorMessage": "Unable to get page count.\npdfinfo: error while loading shared libraries: libplc4.so: cannot open shared object file: No such file or directory\n",
  "errorType": "PDFPageCountError",
  "requestId": "5626e07d-6d35-4698-a0d9-c01447b43502",
  "stackTrace": [
    "  File \"/var/task/lambda_function.py\", line 27, in lambda_handler\n    textract = extractor.start_document_analysis(\n",
    "  File \"/opt/python/textractor/textractor.py\", line 575, in start_document_analysis\n    images = self._get_document_images_from_path(original_file_source)\n",
    "  File \"/opt/python/textractor/textractor.py\", line 133, in _get_document_images_from_path\n    images = convert_from_bytes(bytearray(file_obj))\n",
    "  File \"/opt/python/pdf2image/pdf2image.py\", line 359, in convert_from_bytes\n    return convert_from_path(\n",
    "  File \"/opt/python/pdf2image/pdf2image.py\", line 127, in convert_from_path\n    page_count = pdfinfo_from_path(\n",
    "  File \"/opt/python/pdf2image/pdf2image.py\", line 611, in pdfinfo_from_path\n    raise PDFPageCountError(\n"
  ]
}

Belval commented 1 week ago

This is fixed in the latest lambda layer version.
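
For anyone who runs into a similar error with another layer build, here is a minimal sketch for listing which shared libraries are missing. It is not part of Textractor or pdf2image. The names `missing_shared_libs` and `check_binary` are hypothetical helpers made up for this example, and `check_binary` assumes that `ldd` and `pdfinfo` are on the `PATH`, which may not hold inside the Lambda runtime itself.

```python
import re
import shutil
import subprocess

# Matches the dynamic loader message seen in the error reports in this thread.
_LOADER_RE = re.compile(
    r"error while loading shared libraries: (\S+): cannot open shared object file"
)
# Matches `ldd` output lines such as "libpng16.so.16 => not found".
_LDD_RE = re.compile(r"^\s*(\S+)\s+=>\s+not found", re.MULTILINE)


def missing_shared_libs(text):
    """Return library names reported missing in loader errors or ldd output."""
    names = _LOADER_RE.findall(text) + _LDD_RE.findall(text)
    # Drop duplicates while keeping first-seen order.
    return list(dict.fromkeys(names))


def check_binary(name="pdfinfo"):
    """Run ldd on a binary found on PATH and list its unresolved libraries.

    Returns None when either the binary or ldd is unavailable (an assumption
    of this sketch: both may be absent in some environments).
    """
    path = shutil.which(name)
    ldd = shutil.which("ldd")
    if path is None or ldd is None:
        return None
    result = subprocess.run(
        [ldd, path], capture_output=True, text=True, check=False
    )
    return missing_shared_libs(result.stdout + result.stderr)
```

Passing the `errorMessage` string from a failed invocation to `missing_shared_libs` returns the name of the library the loader could not find. Running `check_binary()` in an environment that mirrors the layer may reveal several missing libraries at once instead of one per failed invocation.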