chrismattmann / tika-python

Tika-Python is a Python binding to the Apache Tika™ REST services allowing Tika to be called natively in the Python community.
Apache License 2.0

TIKA mistakes RTF message for email #322

Closed altinp closed 4 years ago

altinp commented 4 years ago

An RTF message similar to the one below is wrongly detected as message/rfc822 and handed off to the email parser, even though it starts with an RTF tag. This is likely due to the email-header-like lines that begin with "Sent: " and "Received: ". My guess is that the detection phase tries email first and accepts that match, rather than getting a confidence score from each detector/parser and choosing the one with the highest score.
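
To make the score-based idea concrete, here is a toy sketch. It is not how Tika's detection works; the scorer functions, their weights, and the `SCORERS` table are all hypothetical and only illustrate picking the highest-scoring candidate.

```python
from typing import Callable, Dict

def score_rtf(data: bytes) -> float:
    # Strong evidence: the RTF signature at the very start of the content.
    return 1.0 if data.startswith(b"{\\rtf") else 0.0

def score_email(data: bytes) -> float:
    # Weak evidence: header-like lines in the first 1 KiB (hypothetical weights).
    head = data[:1024].decode("latin-1")
    prefixes = ("Sent: ", "Received: ", "From: ", "To: ")
    hits = sum(1 for line in head.splitlines() if line.startswith(prefixes))
    return min(0.2 * hits, 0.8)

# Hypothetical scorer table: media type -> scoring function.
SCORERS: Dict[str, Callable[[bytes], float]] = {
    "application/rtf": score_rtf,
    "message/rfc822": score_email,
}

def detect(data: bytes) -> str:
    # Choose the media type whose scorer reports the highest confidence.
    return max(SCORERS, key=lambda mime: SCORERS[mime](data))
```

With this scheme, a few header-like lines inside an RTF body cannot outweigh the RTF signature at offset 0.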

Can we tell Tika from Python to use a specific parser (RTF in this case), since we already know the message type, rather than running detection at all? If not, can this be done on the Java side? Note that we are running with an external Java server, so tika-python is in client-only mode.
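
For context, this is the kind of client-side step I have in mind: a minimal sketch that confirms the RTF signature locally and builds a content-type hint to send with the request. The helper names are hypothetical, and whether the server would honor such a hint over its own detection is an assumption, not something I have confirmed.

```python
from typing import Dict, Optional

# RTF documents begin with this signature at offset 0.
RTF_MAGIC = b"{\\rtf"

def looks_like_rtf(data: bytes) -> bool:
    # Check only the leading bytes; body content is ignored on purpose.
    return data.startswith(RTF_MAGIC)

def content_type_hint(data: bytes) -> Optional[Dict[str, str]]:
    # Hypothetical helper: returns request headers naming the known type,
    # or None when the content does not carry the RTF signature.
    if looks_like_rtf(data):
        return {"Content-Type": "application/rtf"}
    return None
```

The returned dict could be passed as request headers if the client and server support that; if the server ignores the hint, the fix would have to happen on the Java side.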

Thanks.

The Tika server returns this metadata:

realJson = [{
        "Content-Type": "message/rfc822",
        "Message:Raw-Header:Received": " (4): 1\\par",
        "Message:Raw-Header:Sent": "(4):\\par",
        "Message:Raw-Header:[host": "port\u003domitted,accessKeyId\xxx] Sent/Received sequence number is 4/4.\\par",
        "X-Parsed-By": ["org.apache.tika.parser.DefaultParser", "org.apache.tika.parser.mail.RFC822Parser"],
        "X-TIKA:embedded_depth": "0",
        "X-TIKA:parse_time_millis": "5"
    }
]
{\rtf1\fbidis\ansi\ansicpg1252\deff0\nouicompat\deflang1033{\fonttbl{\f0\fnil Segoe UI;}{\f1\fnil\fcharset0 Segoe UI;}}
{\colortbl ;\red0\green0\blue0;}
{\*\generator Riched20 16.0.4732}\viewkind4\uc1 
\pard\cf1\outl\f0\fs20 2020-10-06 04:04:54,156 - INFO - Java classpath is : omitted.jar\par
 UUID 1601970609838951126 offset/capacity8/4194304host:portomitted\par
 UUID 1601971058328861560 offset/capacity56/4194304host:portomitted\par
 UUID 1601971058306151897 offset/capacity104/4194304host:portomittedaccessKeyId xxxxxx\par
[host:port=69.50.113.49:62777,accessKeyId=xxxxx] Sent/Received sequence number is 4/4.\par
Sent: (4):\par
Received:  (4): 1\par
Updated Sent/Received sequence number to 4/1.\par
Not able to find sequenceNumbers for sessionId: 1601971058328861560 Exception :Can not find sequence number for session 1601971058328861560\par
Not able to find sequenceNumbers for sessionId: 1601971058306151897 Exception :Can not find sequence number for session 1601971058306151897\outl0\f1\lang18441  \f0\lang1033\par
{\*\lyncstorytitle No Title}{\*\lyncflags<rtf=1>}}

chrismattmann commented 4 years ago

You can tell Tika which parser to use by passing a custom Tika configuration file, which the Tika server reads on startup. See the docs on setting a custom Tika config file; that should take care of it.