alephdata / aleph

Search and browse documents and data; find the people and companies you look for.
http://docs.aleph.occrp.org
MIT License

BUG: Aleph doesn't process all .msg email formats correctly #3733

Open brrttwrks opened 2 months ago

brrttwrks commented 2 months ago

Describe the bug The .msg email file format has had several versions, and Aleph does not seem to parse all of them correctly. As a result, we have to convert these files to .eml format before ingesting them into Aleph. The tool I've been using for the conversion is msgconvert (https://www.matijs.net/software/msgconv/). The current state is problematic because Aleph gives the impression that it processes .msg files, but while some are processed correctly, others show only part of the email body and none of the attachments. If it is possible to detect the different versions and parse them accordingly, we wouldn't need to pre-process the files, and journalists wouldn't be surprised by the results.
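As a rough illustration of what "detecting" could look like at the lowest level, here is a minimal sketch that only inspects the file header. It assumes the .msg files are OLE2 / Compound File Binary (CFB) containers, which is how Outlook normally stores them, and it reads the container's major version from the header. The function names `cfb_major_version` and `sniff_msg` are hypothetical and not part of Aleph or ingest-file. The container version is also not necessarily the same thing as the ".msg versions" described above, so this is a first triage step rather than a fix.

```python
import struct
from typing import Optional

# Signature that starts every OLE2 / Compound File Binary container.
CFB_MAGIC = bytes.fromhex("D0CF11E0A1B11AE1")


def cfb_major_version(header: bytes) -> Optional[int]:
    """Return the CFB major version (typically 3 or 4), or None if the
    bytes do not start with the CFB signature.

    Header layout assumed here: 8-byte signature, 16-byte CLSID,
    then minor version (uint16 LE) at offset 24 and major version
    (uint16 LE) at offset 26.
    """
    if len(header) < 28 or header[:8] != CFB_MAGIC:
        return None
    _minor, major = struct.unpack_from("<HH", header, 24)
    return major


def sniff_msg(path: str) -> Optional[int]:
    """Hypothetical helper: read the first 512 bytes of a file and
    report its CFB major version, or None if it is not a CFB file."""
    with open(path, "rb") as fh:
        return cfb_major_version(fh.read(512))
```

A file for which `sniff_msg` returns None is not an OLE2 container at all (for example, an .eml that was renamed to .msg), so it could be routed to a different parser or flagged for manual conversion instead of being silently half-parsed.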

To Reproduce Steps to reproduce the behavior:

  1. I will share sample files with you separately, as the only examples I have are sensitive.

Expected behavior All msg versions get parsed and ingested properly in Aleph.

Aleph version 4.0.0rc1

Screenshots Cannot share.

Additional context None