An RTF message similar to the one below is wrongly detected as message/rfc822 and handed off to the email parser, even though it starts with an RTF tag. This is likely due to the email-header-like lines "Sent: " and "Received: ". My guess is that the detection phase tries email first and accepts the match, rather than getting a confidence score from each detector/parser and choosing the one with the highest score.
Since we know the message type, can we tell Tika from Python to use a specific parser (RTF in this case) and skip detection entirely? If not, can this be done on the Java side?
Note that we are running against an external Java server, so tika-python is in client-only mode.
{\rtf1\fbidis\ansi\ansicpg1252\deff0\nouicompat\deflang1033{\fonttbl{\f0\fnil Segoe UI;}{\f1\fnil\fcharset0 Segoe UI;}}
{\colortbl ;\red0\green0\blue0;}
{\*\generator Riched20 16.0.4732}\viewkind4\uc1
\pard\cf1\outl\f0\fs20 2020-10-06 04:04:54,156 - INFO - Java classpath is : omitted.jar\par
UUID 1601970609838951126 offset/capacity8/4194304host:portomitted\par
UUID 1601971058328861560 offset/capacity56/4194304host:portomitted\par
UUID 1601971058306151897 offset/capacity104/4194304host:portomittedaccessKeyId xxxxxx\par
[host:port=,accessKeyId=xxxxx] Sent/Received sequence number is 4/4.\par
Sent: (4):\par
Received: (4): 1\par
Updated Sent/Received sequence number to 4/1.\par
Not able to find sequenceNumbers for sessionId: 1601971058328861560 Exception :Can not find sequence number for session 1601971058328861560\par
Not able to find sequenceNumbers for sessionId: 1601971058306151897 Exception :Can not find sequence number for session 1601971058306151897\outl0\f1\lang18441 \f0\lang1033\par
{\*\lyncstorytitle No Title}{\*\lyncflags<rtf=1>}}
You can tell Tika which parser to use by passing a custom Tika configuration file, which the Tika server reads on startup. See the docs on setting a custom Tika config file; that should take care of it.
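As a possible client-side complement to the server config, here is a minimal sketch that checks for the RTF signature before sending and, when it is present, attaches a Content-Type header to the request. It uses only the Python standard library. The helper names (looks_like_rtf, build_tika_request) and the default server URL are hypothetical, and whether the server treats the header as a hint or as a hard override depends on the Tika server version, so that part is an assumption to verify against your setup.

```python
import urllib.request

# RTF documents begin with this signature, as in the sample above.
RTF_MAGIC = b"{\\rtf"


def looks_like_rtf(data: bytes) -> bool:
    """Return True if the payload starts with the RTF signature (leading whitespace ignored)."""
    return data.lstrip()[:len(RTF_MAGIC)] == RTF_MAGIC


def build_tika_request(data: bytes, server: str = "http://localhost:9998") -> urllib.request.Request:
    """Build (but do not send) a PUT request for the Tika server's /tika endpoint.

    Assumption: the server takes the Content-Type header into account when
    choosing a parser. Check this against the Tika server version in use.
    """
    headers = {"Accept": "text/plain"}
    if looks_like_rtf(data):
        headers["Content-Type"] = "application/rtf"
    return urllib.request.Request(
        server.rstrip("/") + "/tika",
        data=data,
        headers=headers,
        method="PUT",
    )
```

Sending the request would then be something like urllib.request.urlopen(build_tika_request(payload)).read(). If your tika-python version lets you pass extra request headers, the same header could be supplied there instead of calling the server directly.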
