Unstructured-IO / unstructured

Open source libraries and APIs to build custom preprocessing pipelines for labeling, training, or production machine learning pipelines.
https://www.unstructured.io/
Apache License 2.0

fix(msg): use python-oxmsg for MSG email parsing #3142

Closed scanny closed 4 months ago

scanny commented 4 months ago

Summary

partition_msg() previously used the msg_parser library to parse Outlook MSG email files (.msg). That library is unmaintained and has several major shortcomings: it cannot parse MSG files with 8-bit encoded strings, and it does not reliably extract attachments.

Use the new, permissively licensed python-oxmsg library instead.
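For orientation, here is a caller-side sketch of the function this PR touches. It assumes `unstructured` is installed with MSG support and that the public `partition_msg(filename=...)` call is the same for callers after the parser swap. `summarize_elements` and `summarize_msg` are hypothetical helpers written for this illustration; they are not part of the library or of this PR.

```python
from collections import Counter
from typing import Any, Iterable


def summarize_elements(elements: Iterable[Any]) -> Counter:
    """Count partitioned elements by their `category` attribute.

    Hypothetical helper: falls back to the class name when an object
    has no `category` attribute.
    """
    return Counter(getattr(el, "category", type(el).__name__) for el in elements)


def summarize_msg(path: str) -> Counter:
    """Partition an Outlook .msg file and summarize what was extracted."""
    # Imported lazily so this sketch loads even where `unstructured`
    # is not installed; the import path is the library's public one.
    from unstructured.partition.msg import partition_msg

    return summarize_elements(partition_msg(filename=path))
```

Calling `summarize_msg("example.msg")` on a local file (the path is a placeholder) gives a quick view of which element types were extracted, which can help when comparing results before and after the change.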

Additional Context

For reviewability, this PR temporarily places the new partition_msg() implementation in new_msg.py and references it from msg.py. new_msg.py will be renamed to msg.py in a closely following PR. This avoids a very messy interleaving of hunks in the diff between the old and rewritten partition_msg() implementations.

Fixes #2481
Fixes #3006