Open azaylamba opened 1 day ago
One observation: the issue seems to affect PDF files generated via the print function on Windows. For the files where I am getting the issue, the PDF producer is `Microsoft: Print to PDF`.
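To triage which uploads are likely affected, the producer can be read from the file itself. The following is a rough sketch rather than a real PDF parser: `guess_pdf_producer` is a hypothetical helper name, and it assumes the `/Producer` entry is stored as an uncompressed literal string in the document info dictionary. It will return `None` if the entry is missing, hex-encoded, or stored inside a compressed object stream.

```python
import re
from typing import Optional

# Matches a literal-string /Producer entry such as "/Producer (Some Tool)".
# Escaped characters (e.g. "\)") inside the string are tolerated.
_PRODUCER_RE = re.compile(rb"/Producer\s*\(((?:\\.|[^\\()])*)\)")


def guess_pdf_producer(data: bytes) -> Optional[str]:
    """Best-effort lookup of the /Producer value in raw PDF bytes.

    Hypothetical helper: a heuristic scan, not a spec-compliant parser.
    """
    match = _PRODUCER_RE.search(data)
    if match is None:
        return None
    raw = match.group(1)
    # PDF text strings may be UTF-16BE with a leading byte order mark.
    if raw.startswith(b"\xfe\xff"):
        return raw[2:].decode("utf-16-be", errors="replace")
    return raw.decode("latin-1")


if __name__ == "__main__":
    import sys

    with open(sys.argv[1], "rb") as fh:
        print(guess_pdf_producer(fh.read()))
```

Running it against a failing file and a working file should show whether the producer value really is the distinguishing factor.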
@azaylamba ,
It looks like it needs the training data to convert these files.
Removing this line might fix the problem but the docker image will be bigger (and processing slower). Note it's not the same folder.
https://github.com/aws-samples/aws-genai-llm-chatbot/blob/main/lib/shared/file-import-dockerfile#L5
An alternative solution, listed here (https://github.com/Unstructured-IO/unstructured/issues/3290#issue-2371970753), would be to run `apk add tesseract-eng` in the Dockerfile (but that issue seems resolved; maybe it's using an older base image?).
Sample file to reproduce the issue FileUploadErrorSample.pdf
@charles-marion I tried with the latest version of `unstructured` (0.16.9), but the issue still persisted.
The issue is resolved after adding `RUN apk add --no-cache tesseract-eng` in https://github.com/aws-samples/aws-genai-llm-chatbot/blob/main/lib/shared/file-import-dockerfile
So it seems tesseract-eng is required to process such PDF files.
@charles-marion Please let me know if you think this is the correct approach to fix this and you want me to raise a PR.
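One way to confirm the fix inside the built image is to check whether Tesseract reports the English language data. This is a minimal sketch, assuming the `tesseract` binary is on `PATH` in the container; `parse_tesseract_langs` and `has_tesseract_lang` are hypothetical helper names, not part of the project or of `unstructured`.

```python
import shutil
import subprocess
from typing import Set


def parse_tesseract_langs(output: str) -> Set[str]:
    """Turn the text printed by `tesseract --list-langs` into language codes."""
    langs = set()
    for line in output.splitlines():
        line = line.strip()
        # Skip blank lines and the "List of available languages ..." header.
        if not line or line.lower().startswith("list of"):
            continue
        langs.add(line)
    return langs


def has_tesseract_lang(lang: str = "eng") -> bool:
    """Return True if the local Tesseract install lists `lang`."""
    if shutil.which("tesseract") is None:
        return False
    result = subprocess.run(
        ["tesseract", "--list-langs"],
        capture_output=True,
        text=True,
        check=False,
    )
    # Some versions may print the list to stderr, so inspect both streams.
    return lang in parse_tesseract_langs(result.stdout + "\n" + result.stderr)


if __name__ == "__main__":
    print("eng available:", has_tesseract_lang("eng"))
```

If this prints `False` in the image built without `tesseract-eng` and `True` after adding it, that would support the conclusion that the missing language data is the cause.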
I am getting the following error while uploading certain PDF files. It is reproducible every time with those files, while uploads work fine for most PDF files.