TeamMsgExtractor / msg-extractor

Extracts emails and attachments saved in Microsoft Outlook's .msg files
GNU General Public License v3.0
729 stars 171 forks

Add Parsing for .olm (Outlook for Mac) Files. #244

Open TheElementalOfDestruction opened 2 years ago

TheElementalOfDestruction commented 2 years ago

Add support for generally parsing and handling .olm files. Need to see if I can track down proper documentation of these, but from what I have observed they seem rather simple. They are a renamed zip file composed of folders and (mostly) xml files. If a directory has emails, it seems to use the following format:
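Independent of the exact per-directory layout, the "renamed zip of folders and XML files" observation can be checked with a few lines of standard-library Python. This is only a rough sketch for exploring an archive: `summarize_olm` is a made-up helper name, not part of this project, and it assumes nothing about the file beyond it being a valid zip.

```python
import zipfile
from collections import Counter
from pathlib import PurePosixPath


def summarize_olm(path):
    """Count archive members per (folder, lowercase extension) pair.

    `path` may be a filename or a binary file-like object. The .olm file
    is opened as an ordinary zip archive, per the observation above.
    """
    summary = Counter()
    with zipfile.ZipFile(path) as archive:
        for info in archive.infolist():
            if info.is_dir():
                continue  # explicit directory entries carry no data
            member = PurePosixPath(info.filename)
            summary[(str(member.parent), member.suffix.lower())] += 1
    return summary
```

Printing `summarize_olm("export.olm").most_common()` (with a placeholder filename) should give a quick picture of which folders hold XML and which hold other file types.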

ReblochonMasque commented 2 years ago

This is great, thank you.

Of note:

TheElementalOfDestruction commented 2 years ago

Should also be noted that despite that filtering, it seems to have created (potentially) full folder structures as if it was going to write that data, but just didn't put the actual files.

ReblochonMasque commented 2 years ago

Oh my! It is leaking info like crazy! Typical Microsoft "betrayal of their duty to users". Let's not use this publicly, if you please. ...That makes building test data, let alone clean test data, more difficult, as it cannot reasonably be extracted from a live Outlook instance.

TheElementalOfDestruction commented 2 years ago

I wasn't intending to add the test file you gave me to the official testing, so no worries there. However, if I can adequately build an instance of Outlook that only exists for tests, where everything it touches is intended to be public anyway, then files from that should be safe to add to testing.

Unfortunately this suddenly became a lot more complicated, as the format is not part of the Microsoft Open Specifications, meaning it does not have official documentation that is public nor is it guaranteed to be consistent. This will make any attempt at a parser much more of a challenge.

ReblochonMasque commented 2 years ago

Thank you! Yes, I was also thinking that a dedicated instance of Outlook would be needed for this purpose. One approach could be to first build a crude/simple prototype parser, and throw a large .olm db at it to see if the number of special cases is reasonable.

TheElementalOfDestruction commented 2 years ago

I wanted to see if I could make something dead simple to parse it, just grabbing each of the tags and making anything found accessible directly, but some of the tags are lists (categories, attachments, addresses), some have properties inside the tag... Would probably need some additional code to find some of the patterns and parse them correctly. I'll probably wait for that testing environment before really getting to work on this, so that I can get a much better idea of how this should look.
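To make the "dead simple" idea a bit more concrete, here is a minimal sketch of a generic converter that handles the two complications mentioned: repeated child tags become lists, and properties inside a tag are kept alongside the content. The `@attributes` and `#text` keys and the function name are arbitrary choices for illustration, and no real .olm tag names are assumed.

```python
import xml.etree.ElementTree as ET


def element_to_data(element):
    """Convert an XML element into plain Python data.

    A leaf element without attributes becomes its stripped text. Anything
    else becomes a dict, and child tags that repeat are gathered in a list.
    """
    children = list(element)
    text = (element.text or "").strip()
    if not children and not element.attrib:
        return text

    data = {}
    if element.attrib:
        data["@attributes"] = dict(element.attrib)
    if text:
        data["#text"] = text
    for child in children:
        value = element_to_data(child)
        if child.tag not in data:
            data[child.tag] = value
        elif isinstance(data[child.tag], list):
            data[child.tag].append(value)
        else:
            # Second occurrence of this tag: promote the entry to a list.
            data[child.tag] = [data[child.tag], value]
    return data
```

One known weakness of this approach: a list-like tag that happens to hold a single entry comes back as a lone value rather than a one-item list, so a real parser would likely still need a table of which tags are lists. That is the sort of pattern the test environment would help pin down.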

ReblochonMasque commented 2 years ago

I've built a "proof of concept", mostly to convince myself, and as a learning exercise. It is clear that some mails are more straightforward to handle than others. It is easy when bodies/contents are plain text marked up with vanilla HTML. What stumped me were those where the body content was what looked like the format of .docx documents. Maybe there are parsers for that already? Maybe there are other formats that are hard to parse too, but I have not encountered those yet. I have not attempted to parse nested structures with forwarded e-mails, attachments, etc.

Have you got an idea how to set up a dedicated instance of Outlook to generate tests?

TheElementalOfDestruction commented 2 years ago

A lot of the time they are built with a form of HTML that has additional tags for formatting in Word and the like, but it renders fine as plain HTML. Not sure if these are what you are talking about, but they also sometimes have branching conditions that actually check for Word to be there before activating, with something to fall through to if it's not available so that it renders correctly in something like a browser.
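The branching conditions described here sound like HTML conditional comments (markers such as `<!--[if gte mso 9]>` ... `<![endif]-->`). That reading is an assumption, not something confirmed in this thread. Under that assumption, a rough sketch of reducing such a body to what a browser would show might look like the following. It uses regular expressions rather than a real HTML parser, does not handle nesting, and the function name is invented for illustration.

```python
import re

# Form 1: <!--[if !mso]><!--> fallback <!--<![endif]-->
# Browsers see two tiny comments and render the fallback between them.
FALLBACK_MARKERS = re.compile(
    r"<!--\[if [^\]]*\]><!-->|<!--<!\[endif\]-->", re.IGNORECASE
)
# Form 2: <!--[if gte mso 9]> Word-only <![endif]-->
# Browsers see one long comment, so the whole block is dropped.
WORD_ONLY_BLOCK = re.compile(
    r"<!--\[if [^\]]*\]>.*?<!\[endif\]-->", re.IGNORECASE | re.DOTALL
)
# Form 3: <![if !mso]> fallback <![endif]>
# Browsers ignore the unknown markers and render the content.
BARE_MARKERS = re.compile(r"<!\[(?:if [^\]]*|endif)\]>", re.IGNORECASE)


def strip_conditional_comments(html):
    """Keep browser-visible content and drop Word-only blocks."""
    html = FALLBACK_MARKERS.sub("", html)  # must run before form 2
    html = WORD_ONLY_BLOCK.sub("", html)
    return BARE_MARKERS.sub("", html)
```

The order matters: form 1 starts with the same characters as form 2, so its markers are removed first to avoid deleting the fallback content along with them.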

The way I would do it is to set up a computer with a fresh install of Outlook designed to not have anything on it aside from information that can be public