chrismattmann / tika-python

Tika-Python is a Python binding to the Apache Tika™ REST services allowing Tika to be called natively in the Python community.
Apache License 2.0

Content returns gibberish for some PDFs #376

Closed: alfonsrv closed this issue 1 year ago

alfonsrv commented 2 years ago

Tika works fine for most PDFs; however, for some files it returns only gibberish in the content.

I'm not sure why, since the parser interface doesn't seem to allow more elaborate configuration. In Acrobat or the browser the text is selectable without any issues, and a simple pdf2text tool also returns the content as expected.

The file is saved as PDF/A-3b; however, another file saved as PDF/A-3b returns its contents fine, so I don't think that is related.
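Output like the sample below could be flagged automatically. Note that the metadata reports `pdf:unmappedUnicodeCharsPerPage` as "0" for this file, so that field alone would not catch it. The following is only a rough sketch of a content-based check, assuming the dict shape that `parser.from_file` returns; the helper names and the 0.3 threshold are hypothetical and would need tuning on real documents.

```python
def garbled_ratio(text: str) -> float:
    """Fraction of non-whitespace characters that are neither letters nor digits."""
    visible = [ch for ch in text if not ch.isspace()]
    if not visible:
        return 0.0
    odd = sum(1 for ch in visible if not ch.isalnum())
    return odd / len(visible)


def looks_garbled(parsed: dict, threshold: float = 0.3) -> bool:
    """Heuristic only: True when punctuation and symbols dominate the content.

    `parsed` is assumed to be a dict with a "content" key, like the one
    shown in this issue. The threshold is an arbitrary starting point.
    """
    content = parsed.get("content") or ""
    return garbled_ratio(content) > threshold


if __name__ == "__main__":
    sample = {"content": "!\"#$%&%'(+,*\"%"}
    print(looks_garbled(sample))
```

A check like this only says that something looks off; it cannot tell which stage of the extraction went wrong.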

>>> path=r'/mnt/files/R22118600.pdf'
>>> from tika import parser
>>> parser.from_file(path)
2022-11-01 15:08:41,876 [MainThread  ] [WARNI]  Failed to see startup log message; retrying...
{
    "metadata": {
        "Author": "CVS",
        "Content-Type": "application/pdf",
        "Creation-Date": "2022-07-23T06:16:52Z",
        "Keywords": "FooBar Company",
        "Last-Modified": "2022-07-23T06:16:52Z",
        "Last-Save-Date": "2022-07-23T06:16:52Z",
        "X-Parsed-By": ["org.apache.tika.parser.DefaultParser", "org.apache.tika.parser.pdf.PDFParser"],
        "X-TIKA:content_handler": "ToTextContentHandler",
        "X-TIKA:embedded_depth": "0",
        "X-TIKA:parse_time_millis": "724",
        "access_permission:assemble_document": "true",
        "access_permission:can_modify": "true",
        "access_permission:can_print": "true",
        "access_permission:can_print_degraded": "true",
        "access_permission:extract_content": "true",
        "access_permission:extract_for_accessibility": "true",
        "access_permission:fill_in_form": "true",
        "access_permission:modify_annotations": "true",
        "cp:subject": "Rechnungen",
        "created": "2022-07-23T06:16:52Z",
        "creator": "CVS",
        "date": "2022-07-23T06:16:52Z",
        "dc:creator": "CVS",
        "dc:description": "Rechnungen",
        "dc:format": ["application/pdf; version=\"A - 3 b\"", "application/pdf; version=1.7"
        ],
        "dc:subject": "FooBar Company",
        "dc:title": "Rechnung",
        "dcterms:created": "2022-07-23T06:16:52Z",
        "dcterms:modified": "2022-07-23T06:16:52Z",
        "description": "Rechnungen",
        "meta:author": "CVS",
        "meta:creation-date": "2022-07-23T06:16:52Z",
        "meta:keyword": "FooBar Company",
        "meta:save-date": "2022-07-23T06:16:52Z",
        "modified": "2022-07-23T06:16:52Z",
        "pdf:PDFVersion": "1.7",
        "pdf:charsPerPage": "1569",
        "pdf:docinfo:created": "2022-07-23T06:16:52Z",
        "pdf:docinfo:creator": "CVS",
        "pdf:docinfo:creator_tool": "ALPHAPLAN 5 - Version 5.2.4600.490",
        "pdf:docinfo:keywords": "FooBar Company",
        "pdf:docinfo:modified": "2022-07-23T06:16:52Z",
        "pdf:docinfo:producer": "Amyuni Document Converter version 6.0.2.9; Amyuni PDF Creator 6.5.0.5 rev 12855 CDIntf",
        "pdf:docinfo:subject": "Rechnungen",
        "pdf:docinfo:title": "Rechnung",
        "pdf:encrypted": "false",
        "pdf:hasMarkedContent": "false",
        "pdf:hasXFA": "false",
        "pdf:hasXMP": "true",
        "pdf:unmappedUnicodeCharsPerPage": "0",
        "pdfa:PDFVersion": "A-3b",
        "pdfaid:conformance": "B",
        "pdfaid:part": "3",
        "producer": "Amyuni Document Converter version 6.0.2.9; Amyuni PDF Creator 6.5.0.5 rev 12855 CDIntf",
        "resourceName": "b\'R221186500.pdf\'",
        "subject": "Rechnungen",
        "title": "Rechnung",
        "xmp:CreatorTool": "ALPHAPLAN 5 - Version 5.2.4600.490",
        "xmpTPg:NPages": "1"
    },
    "content": "\\n\\n\\n\\n\\n\\n\\n\\n\\n\\n\\n\\n\\n\\n\\n\\n\\n\\n\\n\\n\\n\\n\\n\\n\\n\\n\\n\\n\\n\\n\\n\\n\\n\\n\\n\\n\\n\\n\\n\\n\\n\\n\\n\\n\\n\\n\\n\\n\\n\\n\\n\\n\\n\\n\\n\\n\\nRechnung\\n\\n\\n!\\"
    # $ % & % \'\\n!\\"\\"##$%&%\'\\n\\n*R13370815*\\n(\\")*\\" # ( #\\n\\n!\\"#$%&%\'(+,*\\"%\\n!)*+,-,./,012 !\\"\\"##$%&%\'\\n!)*+,-,./345-62 \\"\\"1781\\"7\\"\\"\\n9+0 :-;504. <=62 \\"#1781\\"7\\"\\"\\n>)/5)??,-66)02 777&@#\'@&7\\n\\n!4- AB/5)6C)045-,. D6CE\\n\\n(,#$-\\",!-\\")*\\"!\\nF)050G)C/G,,),3G),/5\\n\\n.)\\"/\\"!,%(#$!)/*\\n!4- AB/5)6C)045-,. D6CE\\nH+)6,G5I)0 A504J) #%\\nKL87&\'8 A5-55.405\\n\\nH+)6,G5I)0 A504J) #%\\n\\nKL 87&\'8 A5-55.405\\n\\n)$!\\" +,*\\"%\\n9+0) M31LM5=N012 8777#@# ( K9O8#$@744 F)0C4,32 APN:QRN\\n9+0) SA5L931N012 KT\\"#8@#78## 9+0) UG5.?G)3/,-66)02 :MHO7$7&(\\"O\'8@87\'\\n\\nVG0 ?G);)0, 9+,), I- -,/)0),W 6G5 9+,), <)0)G,C405),  F)0X4-;/L -,3 D)/*+Y;5/C)3G,.-,.), Z[[[1+)0[)*X13)(4.C\\\\1\\n\\n-\\"0\\")#$%&%\' 1 *\\"2*34()*)4% 5\\"%\'\\" ,!*)6\\".7\\n%&55\\"!\\n\\n\\")%0\\".7\\n3!\\")(\\n\\n(*\\"&\\"! !,-,** -\\"*!,\'\\n\\n:-/ ]G);)0/*+)G, ]\\"\\"#\\"O%$$@ <=6 \\"\\"1781\\"7\\"\\"2\\n\\n# \\" \\"&77@& ^4C04 T<=?<)\\" $&W SH A5)0)= >4/G//545G=, >?4*XW SA>LH\\n:-/ :-;504.2 >\\"\\"#\\"\\"&&7$ (\\n:-/ ]G);)0/*+)G,2 ]\\"\\"#\\"O%$$@ (  .)?G);)05 462 \\"\\"1781\\"7\\"\\"\\nL_ M=66G//G=,2 A`R!a K!9F9ND\\nE)0/51 :051 N012 \\"$&\'\'L\'$\'L$$\'\\n]G);)04,/*+0G;52 !4- AB/5)6C)045-,. D6CEW H+)6,G5I)0 A504J)\\n#%W 87&\'8 A5-55.405\\n>)/5)??,-66)02 777&@#\'@&7\\n\\n&O8W\'7 b #\'c SA5 L@&c &\'#W%\' b\\n\\n8\\"!(,%+,!* F]RD 3+? N45G=,4?\\n\\n0,$.&%\'(,!* dC)0[)G/-,. %\\"**4-,()( (*\\"&\\"! -\\"*!,\' \'\\"(,5*-\\"*!,\'\\n\\n0,$.&%\' -)( \\"#17$1\\"7\\"\\" S6/45I/5)-)0 &\'#W%\' b #\'W77 c ##\\"W@\\" b 87@W## b\\n\\n0,$.&%\'(7\\n-\\"+)%\'&%\'\\n\\ne),504?0).-?G)0-,. AB,4f=,W\\n]G);)04,5),LN-66)02@$\\"@ -!&**47!\\"#$%&%\'(-\\"*!,\'9 87@W## b\\n\\nE)0[)*X :D\\nD)G/5XG0*+)0 A501 #$\\nKL%%O$% A51 9,.C)05\\n7%$\'@ ( O$$O L 7\\n[[[1+)0[)*X13)\\n\\nF=0/54,32\\n^g0. 
E)0[)*X\\nKG)5)0 `+G?GhhG\\nE4,/L^i0.), VG5;)?3\\n:-;/G*+5/045 ZF=0/1\\\\2\\n`0=;1 K01 !-3=?; U=+0\\n\\n]4,3)/C4,X A440\\n>9H2 A:]: KT &&QQQ\\n9>:N2 KT&O &\'7& 7777 77\\"7 7#@7 @\'\\n\\nSA51L9K2 KT#O$#7\\"&O@\\n9]N @7 7&\'O$ 77777 &\\n:65/.)0G*+5 A440C0i*X),\\nE!> O$OO\\n\\n\\n\\n\\n", 
    "status": 200}

tballison commented 2 years ago

This is probably better handled on the Tika user mailing list or in our JIRA (https://issues.apache.org/jira/projects/TIKA). Are you able to share the file? Can you also try a newer version of tika-app, e.g. java -jar tika-app-2.5.0.jar R22118600.pdf (download tika-app: https://dlcdn.apache.org/tika/2.5.0/tika-app-2.5.0.jar)?

See also: https://cwiki.apache.org/confluence/display/TIKA/Troubleshooting+Tika#TroubleshootingTika-PDFTextProblems
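The tika-app check suggested above can also be scripted from Python to compare results across versions. This is a minimal sketch, assuming `java` is on the PATH and the jar has been downloaded locally. The function names are hypothetical, and the `--text` option (plain-text output) is an addition that is not part of the command quoted above.

```python
import subprocess
from typing import List


def tika_app_command(jar: str, pdf: str) -> List[str]:
    """Build the tika-app invocation; --text asks for plain-text output."""
    return ["java", "-jar", jar, "--text", pdf]


def extract_with_tika_app(jar: str, pdf: str) -> str:
    """Run tika-app and return whatever it writes to stdout."""
    result = subprocess.run(
        tika_app_command(jar, pdf),
        capture_output=True,
        text=True,
        check=True,  # raise CalledProcessError if java exits non-zero
    )
    return result.stdout


if __name__ == "__main__":
    # Only prints the command; call extract_with_tika_app() to run it.
    print(" ".join(tika_app_command("tika-app-2.5.0.jar", "R22118600.pdf")))
```

Running the jar directly takes tika-python and its server out of the picture, which helps narrow down where the problem sits.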

alfonsrv commented 2 years ago

Thanks for the pointer. The output is unchanged with PDFBox.

For anyone interested, I created an issue with Apache PDFBox: https://issues.apache.org/jira/browse/PDFBOX-5540. It also includes the files in question.

alfonsrv commented 1 year ago

Closed. This was an issue in PDFBox's Unicode encoding.
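For readers wondering how text can display correctly yet extract as symbols: one general mechanism is that a PDF font stores its own character codes, and an extractor that maps those codes to the wrong Unicode values produces a consistent substitution. The toy sketch below is not a reproduction of the actual PDFBox bug, whose details are in PDFBOX-5540 rather than in this thread. It assumes a hypothetical font that numbers its glyphs from 0x21 in order of first use. Under that assumption the word "Rechnung" comes out with the same repetition pattern as the fragment in the output above that unescapes to `!"#$%&%'`.

```python
def subset_codes(text: str, first_code: int = 0x21) -> dict:
    """Assign each distinct character a code in order of first use.

    This numbering scheme is an assumption made for illustration.
    """
    codes = {}
    for ch in text:
        if ch not in codes:
            codes[ch] = first_code + len(codes)
    return codes


def naive_extract(text: str, codes: dict) -> str:
    """Read the stored codes as if they were plain ASCII code points."""
    return "".join(chr(codes[ch]) for ch in text)


word = "Rechnung"
print(naive_extract(word, subset_codes(word)))  # prints !"#$%&%'
```

A viewer draws the right glyph for each code, so the page looks fine on screen; only the code-to-Unicode step is affected.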