Open vsessink opened 2 months ago
This PR (previously WIP, now abandoned) mentions the same problems: https://github.com/alephdata/ingest-file/pull/20
Where is the MIME detection done? I think it could work to fix the output of readpst, by adding transport headers or otherwise, and to fix the RTF parts of the e-mails, too. As I already walk over all e-mails to fix the RTF parts, adding the headers required for MIME detection (message/rfc822 instead of text/html) could be done in the same pass.
Messages can pretty easily be "tricked" into being message/rfc822, by simply adding Received: from localhost (127.0.0.1) at the top of the message. IMHO, as a workaround for the current state of things, this could be done right after readpst. I will investigate.
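A minimal sketch of that workaround, assuming the files written by readpst start directly with the message headers. The names add_transport_header and fix_file are made up for illustration, and whether the detector then really reports message/rfc822 still has to be checked against the actual ingest pipeline:

```python
from pathlib import Path

# The header line mentioned above; its presence is what is assumed to tip
# the MIME detection towards message/rfc822.
RECEIVED = b"Received: from localhost (127.0.0.1)\n"


def add_transport_header(raw: bytes) -> bytes:
    """Prepend a Received: header unless the message already starts with one."""
    if raw[:9].lower() == b"received:":
        return raw
    return RECEIVED + raw


def fix_file(path: str) -> None:
    """Rewrite one message file in place (hypothetical helper)."""
    p = Path(path)
    p.write_bytes(add_transport_header(p.read_bytes()))
```

The function is idempotent, so running it twice over the same readpst output directory does not stack up headers.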
In order to fix messages that have an RTF-only message body, I'm manually starting a Python script:
#!/usr/bin/python3
import subprocess
import sys
from email.parser import BytesParser
from email.policy import default

# Do not refold headers when the message is written back.
plcy = default.clone(refold_source='none')

for fname in sys.argv[1:]:
    try:
        with open(fname, 'rb') as mail:
            msg = BytesParser(policy=plcy).parse(mail)
    except OSError:
        print(fname, "not found.")
        continue

    totaal = list(msg.walk())
    if len(totaal) < 2:
        continue

    # Only the first sub-part is checked for an RTF body.
    if totaal[1].get_content_type() == 'application/rtf':
        print("Converting", fname)
        # Feed the RTF bytes to unrtf on stdin and capture what it writes to stdout.
        html = subprocess.run(['/usr/bin/unrtf'], input=totaal[1].get_content(),
                              capture_output=True).stdout
        totaal[1].set_content(html, maintype='text', subtype='html')
        try:
            with open(fname, 'w') as mail:
                print(totaal[0], file=mail)
        except OSError:
            print("Error writing")
            continue
It's a hack. But it works and it really helps the search process. This could be run right after readpst but I really don't think this is production quality. Anyway, maybe it helps someone make a proper fix.
OK, here's more analysis and an awful corner case. I'm documenting it here because I don't think there's a better place. I unpacked a pst file with the regular readpst -e -D -8 -cv. One of the resulting messages contains two message/rfc822 attachments that themselves have rtf-body.rtf as their body part. So the script above should be made recursive.
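A recursive variant could look roughly like the sketch below. It relies on walk() from Python's email package also descending into message/rfc822 attachments once they parse correctly. The function names are made up for illustration, and the converter is passed in as a parameter so that something other than /usr/bin/unrtf can be substituted:

```python
import subprocess
from email.message import EmailMessage


def unrtf_to_html(rtf: bytes) -> str:
    """Convert RTF to HTML with the external unrtf tool (path as in the script above)."""
    result = subprocess.run(["/usr/bin/unrtf"], input=rtf,
                            capture_output=True, check=True)
    return result.stdout.decode("utf-8", errors="replace")


def convert_rtf_parts(msg: EmailMessage, convert=unrtf_to_html) -> int:
    """Replace every application/rtf part, at any nesting depth, with text/html.

    Returns the number of parts that were converted.
    """
    converted = 0
    # Take a snapshot of the parts first, because they are modified in place.
    for part in list(msg.walk()):
        if part.get_content_type() == "application/rtf":
            # get_content() returns bytes for application/* parts.
            part.set_content(convert(part.get_content()), subtype="html")
            converted += 1
    return converted
```

Unlike the original script, this converts every RTF part it finds, not only the first sub-part, so a genuine RTF attachment would be converted as well. Restricting it to parts named rtf-body.rtf would be a reasonable refinement.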
The awful part of the finding is that my attachments begin with:
Content-Type: message/rfc822
>From "mailaddress@example.com" Tue Oct 4 14:22:48 2023
That shouldn't happen: the readpst man page says for -e that "This format has no from quoting" (where from quoting means prepending the word From with a > character). However, it apparently does. You must remove the >, otherwise the EmailMessage is wrongly interpreted.
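As a stopgap, the quoting can be undone before parsing. This is only a sketch (unquote_from_lines is a made-up name): it strips one leading > from every line that starts with >From, which assumes a single level of quoting and will also touch body lines that legitimately start that way:

```python
import re

# A line that begins with ">From " (one level of mbox-style From quoting).
_QUOTED_FROM = re.compile(rb"^>(From )", re.MULTILINE)


def unquote_from_lines(raw: bytes) -> bytes:
    """Remove the leading '>' from every '>From ' line in a raw message."""
    return _QUOTED_FROM.sub(rb"\1", raw)
```

Once the > is gone, Python's parser treats the From line at the top of the embedded message as an envelope line and reads the headers after it normally.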
Looking briefly, the From quoting problem is in readpst, libpst/src/readpst.c, where write_embedded_message writes out messages with the "From" quoting parameter ("embedding") being 1:
write_normal_email(f_output, "", item, MODE_NORMAL, 0, pf, save_rtf,
1, extra_mime_headers);
While importing an e-mail archive in the (IMHO cursed) .PST format, I came across a mailbox having all application/rtf for body type. Yep, that's right: Content-Disposition: attachment, but still this is the actual e-mail body.
Now in Aleph, these messages will show up as empty, with an rtf-body.rtf document as attachment.
I tried to work around it by unpacking the mail archive manually with readpst, then fixing the messages with a small Python script (essentially replacing the rtf part with an html part. I used Python's email.parser and simply checked if the first content_type would be application/rtf; if so, pipe that through unrtf and repack the message. Filthy, but working for the mailbox itself).
This workaround would not help in Aleph, because the MIME detection wizardry afterwards recognized text/html for MIME type, instead of message/rfc822, and actual attachments of the message would not be recognized anymore.
The latter may count as a separate bug: a message that starts with the following should IMHO not be detected as text/html?