alfonsrv closed this 1 year ago
This is probably better handled over on the Tika user mailing list or on our JIRA (https://issues.apache.org/jira/projects/TIKA). Are you able to share the file? Are you able to try with a newer version of tika-app, e.g. java -jar tika-app-2.5.0.jar R22118600.pdf (download tika-app: https://dlcdn.apache.org/tika/2.5.0/tika-app-2.5.0.jar).
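For anyone who wants to script that check, here is a minimal sketch that wraps the tika-app call above in Python. The helper names `build_tika_command` and `extract_text` are made up for illustration, and the `--text` flag (plain-text output) is an addition that is not in the command quoted above. It assumes `java` is on the PATH and that the jar and the PDF sit in the working directory.

```python
import shutil
import subprocess
from pathlib import Path
from typing import List, Optional


def build_tika_command(jar: str, pdf: str) -> List[str]:
    """Return the argv for a plain-text extraction with tika-app (hypothetical helper)."""
    # --text asks tika-app for plain text; it is an addition to the command quoted above.
    return ["java", "-jar", jar, "--text", pdf]


def extract_text(jar: str, pdf: str) -> Optional[str]:
    """Run tika-app and return its stdout, or None if java, the jar, or the PDF is missing."""
    if shutil.which("java") is None:
        return None
    if not Path(jar).is_file() or not Path(pdf).is_file():
        return None
    result = subprocess.run(
        build_tika_command(jar, pdf),
        capture_output=True,  # collect stdout/stderr instead of printing them
        text=True,            # decode output using the locale's encoding
        check=True,           # raise CalledProcessError on a non-zero exit code
    )
    return result.stdout


if __name__ == "__main__":
    # File names are taken from the comment above.
    text = extract_text("tika-app-2.5.0.jar", "R22118600.pdf")
    print(text if text is not None else "java, the jar, or the PDF was not found")
```

If the output is still gibberish with a current tika-app, the problem most likely sits in the underlying PDF parser rather than in the Python wrapper.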
Thanks for the pointer. Output stays unchanged with PDFBox.
For anyone interested, I created an issue with Apache PDFBox: https://issues.apache.org/jira/browse/PDFBOX-5540. It also includes the files in question.
Closed – was an issue in PDFBox's Unicode encoding.
Tika works fine for most PDFs – however, for some files Tika simply returns gibberish in the content.
I'm not sure why, since the parser interface doesn't seem to allow for more elaborate configuration. In Acrobat or the browser the text is selectable without any issues, and a simple pdf2text tool returns the content as expected too.
The file is protected with the PDF/A-3b standard; however, when protecting another file with PDF/A-3b, its contents are returned fine – so I don't think it is related.