Open azaylamba opened 17 hours ago
One observation: the issue seems to affect PDF files generated via the print function on Windows. For the files where I am getting the issue, the PDF producer is "Microsoft: Print to PDF".
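For anyone who wants to check whether their failing files share this producer, here is a minimal sketch. It assumes the third-party `pypdf` package is installed; the helper names `is_print_to_pdf` and `read_producer` are made up for illustration and are not part of the chatbot project.

```python
from typing import Optional


def is_print_to_pdf(producer: Optional[str]) -> bool:
    """Return True if the producer string looks like Microsoft Print to PDF."""
    if not producer:
        return False
    # Compare case-insensitively, since the capitalization may vary.
    return "microsoft: print to pdf" in producer.lower()


def read_producer(path: str) -> Optional[str]:
    """Read the Producer field from a PDF's document info (assumes pypdf)."""
    from pypdf import PdfReader  # imported lazily; third-party dependency

    metadata = PdfReader(path).metadata
    # metadata is None when the file has no document info dictionary.
    return metadata.producer if metadata is not None else None
```

Usage would be something like `is_print_to_pdf(read_producer("failing.pdf"))`, where `failing.pdf` is a placeholder path.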
@azaylamba,
It looks like it needs the training data to convert these files.
Removing the line linked below might fix the problem, but the Docker image will be bigger (and processing slower). Note that it's not the same folder.
https://github.com/aws-samples/aws-genai-llm-chatbot/blob/main/lib/shared/file-import-dockerfile#L5
An alternative solution, listed here: https://github.com/Unstructured-IO/unstructured/issues/3290#issue-2371970753, would be to run apk add tesseract-eng
in the Dockerfile (but that issue seems resolved, so maybe this is using an older base image?)
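To confirm whether the English training data is actually missing inside the container, a quick check like the one below could help. This is only a sketch: it assumes the `tesseract` binary is on the PATH and relies on its `--list-langs` option; the function names are made up for illustration.

```python
import shutil
import subprocess


def parse_langs(output: str) -> list[str]:
    """Extract language codes from `tesseract --list-langs` output."""
    langs = []
    for line in output.splitlines():
        line = line.strip()
        # Skip blank lines and the "List of available languages ..." header.
        if not line or line.lower().startswith("list of available languages"):
            continue
        langs.append(line)
    return langs


def tesseract_has_language(lang: str = "eng") -> bool:
    """Return True if tesseract is installed and reports the given language."""
    if shutil.which("tesseract") is None:
        return False
    result = subprocess.run(
        ["tesseract", "--list-langs"],
        capture_output=True,
        text=True,
        check=False,
    )
    # Some versions print the list to stderr, so inspect both streams.
    return lang in parse_langs(result.stdout + "\n" + result.stderr)
```

If `tesseract_has_language("eng")` returns False in the file-import container, that would be consistent with the missing training data theory above.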
I am getting the following error while uploading certain PDF files. It is reproducible every time with those files.
Uploads work fine for most PDF files.