4teamwork / ftw.tika

This product integrates Apache Tika for full text indexing with Plone.

TikaException: Unable to extract PDF content #10

Closed by jone 10 years ago

jone commented 10 years ago

While reindexing a bunch of files, I got this exception multiple times:

WARNING ftw.tika Exception in thread "main" org.apache.tika.exception.TikaException: Unable to extract PDF content
    at org.apache.tika.parser.pdf.PDF2XHTML.process(PDF2XHTML.java:88)
    at org.apache.tika.parser.pdf.PDFParser.parse(PDFParser.java:154)
    at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:242)
    at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:242)
    at org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:120)
    at org.apache.tika.cli.TikaCLI$OutputType.process(TikaCLI.java:141)
    at org.apache.tika.cli.TikaCLI.process(TikaCLI.java:417)
    at org.apache.tika.cli.TikaCLI.main(TikaCLI.java:111)
Caused by: org.apache.pdfbox.exceptions.WrappedIOException: Error decrypting document, details:
    at org.apache.pdfbox.util.PDFTextStripper.writeText(PDFTextStripper.java:327)
    at org.apache.tika.parser.pdf.PDF2XHTML.process(PDF2XHTML.java:72)
    ... 7 more
Caused by: org.apache.pdfbox.exceptions.CryptographyException: Error: The supplied password does not match either the owner or user password in the document.
    at org.apache.pdfbox.pdmodel.encryption.StandardSecurityHandler.prepareForDecryption(StandardSecurityHandler.java:262)
    at org.apache.pdfbox.pdmodel.encryption.StandardSecurityHandler.decryptDocument(StandardSecurityHandler.java:154)
    at org.apache.pdfbox.pdmodel.PDDocument.openProtection(PDDocument.java:1504)
    at org.apache.pdfbox.pdmodel.PDDocument.decrypt(PDDocument.java:914)
    at org.apache.pdfbox.util.PDFTextStripper.writeText(PDFTextStripper.java:323)
    ... 8 more

It seems that Tika does not handle PDFs well. Plone also ships a standard transform, pdf_to_text. Is there a reason why we override the standard transform with the Tika transform?

lukasgraf commented 10 years ago

@jone "Error: The supplied password does not match either the owner or user password in the document" seems pretty clear to me: those are encrypted PDFs. No other implementation would be able to extract text from any of those PDFs either.

However, since this seems to be a common occurrence, ftw.tika could watch for that specific Java exception and, instead of spamming the entire traceback into the logs, just log a warning/info along the lines of "Couldn't extract text from encrypted PDF, skipping...".
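
A minimal sketch of what that check could look like, assuming the converter has the Tika CLI's stderr output available as a string. The function names, the `filename` parameter, and the approach of matching on the exception class name from the traceback above are all hypothetical; this is not ftw.tika's actual code.

```python
import logging

logger = logging.getLogger('ftw.tika')

# Exception class name as it appears in the traceback above.
# Matching on this substring is an assumption, not an official Tika contract.
ENCRYPTED_PDF_MARKER = 'org.apache.pdfbox.exceptions.CryptographyException'


def is_encrypted_pdf_error(stderr_output):
    """Return True if the Tika stderr output points to an encrypted PDF."""
    return ENCRYPTED_PDF_MARKER in stderr_output


def log_tika_failure(stderr_output, filename='<unknown>'):
    """Log one short line for encrypted PDFs, the full output otherwise.

    Returns True if the failure was recognized as an encrypted PDF.
    """
    if is_encrypted_pdf_error(stderr_output):
        logger.warning(
            "Couldn't extract text from encrypted PDF %s, skipping...",
            filename)
        return True
    # Unknown failure: keep the full Java traceback for debugging.
    logger.warning('Tika conversion failed for %s:\n%s',
                   filename, stderr_output)
    return False
```

Any other failure still gets the full traceback logged, so only the known, harmless case is shortened.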

jone commented 10 years ago

:+1:

maethu commented 10 years ago

:+1: