Open DarioBernardo opened 7 months ago
Hi @DarioBernardo, can you please share the PDF document (c_20230111133942393_2525540.pdf)?
Hi @christinestraub, thank you for looking into my issue. Unfortunately, I can't share the document, but I am sure the issue can be reproduced with most Greek documents. One thing worth mentioning: the document is a scan of a paper document, so its pages are images rather than text.
I'd like to provide some additional context regarding the issue. I searched online for publicly available PDF documents that could help replicate the problem, and I've confirmed that the issue arises when the API performs OCR on characters in images embedded in PDFs. Specifically, when the PDF is a scan of a document, the OCR tool behind the API fails to recognize Greek characters and substitutes ASCII characters for them. However, when the text can be read directly from the PDF, the Greek characters are returned correctly, as non-ASCII Unicode escape sequences. This may be due to a limitation in Tesseract, which I believe is the OCR tool behind the API.
For instance, you can test this using the document available here. The document title, being part of an image, is not recognized correctly, whereas the rest of the document, which is text-based, is accurately processed.
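One way to check which extracted elements are affected is to measure how much of their text is actually Greek. The following is a minimal sketch using only the Python standard library; the helper name `greek_ratio` is hypothetical and is not part of any Unstructured API.

```python
import unicodedata


def greek_ratio(text: str) -> float:
    """Return the fraction of alphabetic characters in `text` that are Greek letters."""
    letters = [ch for ch in text if ch.isalpha()]
    if not letters:
        return 0.0
    # Unicode names of Greek letters contain the word "GREEK",
    # e.g. "GREEK SMALL LETTER ALPHA WITH TONOS".
    greek = sum(1 for ch in letters if "GREEK" in unicodedata.name(ch, ""))
    return greek / len(letters)


if __name__ == "__main__":
    # Text read directly from the PDF should score close to 1.0,
    # while OCR output with ASCII substitutions should score close to 0.0.
    for sample in ["Ελληνικά", "EAAHNIKA"]:
        print(sample, greek_ratio(sample))
```

Running this over the text of each element returned by the API should separate the image-based parts (low ratio) from the text-based parts (high ratio) of a Greek document.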
Describe the bug
I am evaluating the UnstructuredClient for processing PDF documents and have run into an issue with Greek text extraction. When I extract text from PDF documents in Greek, the output appears in a non-Greek alphabet and is unreadable, which makes it unusable for my purposes.
To Reproduce
This is the code I am using; running it on any Greek document reproduces the error:
Expected behavior
I expect the extracted text to accurately represent the original Greek characters from the PDF document.
Actual results
The extracted text contains characters that are not in the Greek alphabet, rendering the text unreadable. Here's a snippet of what I get:
Additional context
Could this issue be due to a missing OCR plugin for the Greek language? Since I am utilizing the API, I would expect such components to be managed server-side.
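This thread does not confirm whether the server has the Greek Tesseract language data installed. If the request lets the caller choose the OCR language, passing the Greek code explicitly would be one thing to try. The sketch below only builds the request parameters. It assumes the API accepts `strategy` and `languages` fields, as the open-source `unstructured` partition functions do. The helper name `build_partition_params` and the small code table are hypothetical; `ell` and `eng` are Tesseract's language codes for modern Greek and English.

```python
# Assumption: the API accepts "strategy" and "languages" request fields.
# The helper and mapping below are illustrative only, not part of any client library.
TESSERACT_CODES = {"el": "ell", "en": "eng"}  # ISO 639-1 -> Tesseract language code


def build_partition_params(iso_languages, strategy="ocr_only"):
    """Build request parameters that ask for OCR in the given languages."""
    try:
        codes = [TESSERACT_CODES[lang] for lang in iso_languages]
    except KeyError as exc:
        raise ValueError(f"no Tesseract code known for language {exc.args[0]!r}") from exc
    return {"strategy": strategy, "languages": codes}


if __name__ == "__main__":
    # Request OCR with Greek first and English as a fallback.
    print(build_partition_params(["el", "en"]))
```

If the output is still ASCII with the language set explicitly, that would point to missing language data on the server rather than a client-side setting.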