Add Parsing for .olm (Outlook for Mac) Files.

TheElementalOfDestruction commented 2 years ago

Add support for generally parsing and handling .olm files. Need to see if I can track down proper documentation of these, but from what I have observed they seem rather simple. They are a renamed zip file composed of folders and (mostly) xml files. If a directory has emails, it seems to use the following format:

Email xml file where the name is __message_attachment__{id}.xml (right now I have only seen the ID as a 6 digit number, unclear if hex or decimal).
Attachments are stored in a subfolder com.microsoft.__Attachments in files using the message id for the name, followed by an underscore and a 4 digit number, presumably the id of the attachment for the specified message.
Email xmls start with the <emails> tag, presumably allowing them to store more than one email (each of which is denoted by an <email> tag). Names of properties within it have, so far, been reliably observed to be in the format OPFMessageCopy{name} (so AttachmentList would become OPFMessageCopyAttachmentList, Body would become OPFMessageCopyBody, etc.).
Attachments seem to also have a full url (position in zip file) for them, however I would advise trying to have a system to first check for attachments like that but then also check for attachments in the previously observed way, only adding them if they are not listed.
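A minimal sketch of the layout described above, in Python since that is what msg-extractor is written in. It indexes an .olm archive by message ID and applies the suggested "listed first, then discovered" attachment merge. The function names, the regular expressions, and the choice to accept either hex or decimal IDs are assumptions for illustration; only the file and folder naming comes from the observations above.

```python
import re
import zipfile
from collections import defaultdict

# Naming patterns as observed so far. Whether the ID is hex or decimal is
# unknown, so both are accepted here (assumption).
MESSAGE_RE = re.compile(r"(?:^|/)__message_attachment__([0-9A-Fa-f]+)\.xml$")
ATTACHMENT_RE = re.compile(
    r"(?:^|/)com\.microsoft\.__Attachments/([0-9A-Fa-f]+)_(\d{4})[^/]*$"
)


def index_olm(source):
    """Map message ID -> {"xml": path, "attachments": [paths]}.

    `source` is a path or a file-like object, since an .olm is a renamed zip.
    """
    index = defaultdict(lambda: {"xml": None, "attachments": []})
    with zipfile.ZipFile(source) as archive:
        for name in archive.namelist():
            match = MESSAGE_RE.search(name)
            if match:
                index[match.group(1)]["xml"] = name
                continue
            match = ATTACHMENT_RE.search(name)
            if match:
                index[match.group(1)]["attachments"].append(name)
    for entry in index.values():
        entry["attachments"].sort()
    return dict(index)


def merge_attachments(listed, discovered):
    """Keep attachments named in the XML, then add any found only by folder scan."""
    seen = set(listed)
    return list(listed) + [path for path in discovered if path not in seen]
```

Here `listed` would be the zip paths taken from the attachment URLs in the email XML, and `discovered` the paths found by `index_olm` for the same message ID.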

ReblochonMasque commented 2 years ago

This is great, thank you.

Of note:

This is the structure observed on an .olm file produced by Microsoft Outlook For Mac 2011, version <placeholder>
The file was built via File - export - email only - filtered over one category only
- i.e. other objects (contacts, tasks, calendar, etc.) were not included.
Additional .olm files will be needed for validation of the data structures and for testing (my first thought is that these will be hard to get, or tedious to produce).

TheElementalOfDestruction commented 2 years ago

Should also be noted that despite that filtering, it seems to have created (potentially) full folder structures as if it was going to write that data, but just didn't put the actual files.

ReblochonMasque commented 2 years ago

Oh my! It is leaking info like krazy! Typical microsoft "betrayal of their duty to users". Let's not use this publicly if you please. ...that makes building test data, let alone clean test data more difficult, as it cannot reasonably be extracted from a live outlook instance.

TheElementalOfDestruction commented 2 years ago

I wasn't intending to add the test file you gave me to the official testing, so no worries there. However, if I can adequately build an instance of outlook that only exists for tests, where everything it touches is intended to be public anyways, then files from that should be able to be safely added to testing.

Unfortunately this suddenly became a lot more complicated, as the format is not part of the Microsoft Open Specifications, meaning it does not have official documentation that is public nor is it guaranteed to be consistent. This will make any attempt at a parser much more of a challenge.

ReblochonMasque commented 2 years ago

Thank you! Yes, I was also thinking that a dedicated instance of outlook would be needed for this purpose. One approach could be to first build a crude/simple prototype parser, and throw a large .olm db at it to see if the number of special cases is reasonable.
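One way to gauge the number of special cases before writing a real parser is to survey the "shape" of every tag in an archive. The sketch below is only an illustration of that idea: `survey_tags` is a hypothetical helper, and it assumes every `.xml` member of the zip is worth inspecting.

```python
import zipfile
import xml.etree.ElementTree as ET
from collections import Counter


def survey_tags(source):
    """Count (tag, has_children, has_attributes) across all XML files in an .olm.

    Returns the counter plus a list of (member name, error) for files that
    did not parse, so malformed members are reported instead of aborting.
    """
    shapes = Counter()
    failures = []
    with zipfile.ZipFile(source) as archive:
        for name in archive.namelist():
            if not name.lower().endswith(".xml"):
                continue
            try:
                root = ET.fromstring(archive.read(name))
            except ET.ParseError as exc:
                failures.append((name, str(exc)))
                continue
            for element in root.iter():
                shapes[(element.tag, len(element) > 0, bool(element.attrib))] += 1
    return shapes, failures
```

A tag that shows up with more than one shape (sometimes with children, sometimes without) is a likely special case.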

TheElementalOfDestruction commented 2 years ago

I wanted to see if I could make something dead simple to parse it, just grabbing each of the tags and making anything found accessible directly, but some of the tags are lists (categories, attachments, addresses), some have properties inside the tag... Would probably need some additional code to find some of the patterns and parse them correctly. I'll probably wait for that testing environment before really getting to work on this, so that I can get a much better idea of how this should look.
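A rough sketch of what such a "dead simple" pass could look like with the standard library, handling the two complications mentioned: tags that hold lists and tags that carry properties as attributes. The function names and the output layout are assumptions, not a proposed API; only the `<emails>`/`<email>` tags and the `OPFMessageCopy` prefix come from the observations earlier in this thread.

```python
import xml.etree.ElementTree as ET

PREFIX = "OPFMessageCopy"  # property prefix observed so far


def element_to_value(element):
    """Leaf tags become their text; tags with children become a list of entries."""
    children = list(element)
    if not children:
        return (element.text or "").strip()
    return [
        {
            "tag": child.tag,
            "attributes": dict(child.attrib),
            "value": element_to_value(child),
        }
        for child in children
    ]


def parse_emails(xml_text):
    """Return one dict per <email>, keyed by property name without the prefix."""
    root = ET.fromstring(xml_text)
    emails = []
    for email in root.findall("email"):  # direct children only
        properties = {}
        for child in email:
            name = child.tag[len(PREFIX):] if child.tag.startswith(PREFIX) else child.tag
            properties[name] = {
                "attributes": dict(child.attrib),
                "value": element_to_value(child),
            }
        emails.append(properties)
    return emails
```

This keeps every attribute and nests lists as-is, so nothing is lost while the real patterns are still being worked out.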

ReblochonMasque commented 2 years ago

I've built a "proof of concept", mostly to convince myself, and as a learning exercise - it is clear that some mails are more straightforward than others to handle. It is easy when bodies/contents are plain text marked up with vanilla html. What stumped me were those where the body content was what looked like the format of .docx documents - maybe there are parsers for that already? Maybe there are other formats that are hard to parse too, but I have not encountered those yet. I have not attempted to parse nested structures with forwarded e-mails, attachments, etc.

Have you got an idea how to set up a dedicated instance of outlook to generate tests?

TheElementalOfDestruction commented 2 years ago

A lot of the time they are built with a form of HTML that has additional tags for formatting in word and stuff, but it renders fine as plain HTML. Not sure if these are what you are talking about, but they also sometimes have branching conditions that actually will check for word to be there to even activate, having something to fall through to if it's not available so that it renders correctly in something like a browser.

The way I would do it is to set up a computer with a fresh install of outlook designed to not have anything on it aside from information that can be public.
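Assuming the "branching conditions" are the conditional comments that Word-generated HTML uses (for example `<!--[if gte mso 9]> ... <![endif]-->`), a body can be reduced to what a browser would render with a few regular expressions. This is a sketch under that assumption, with a hypothetical function name; it does not attempt to evaluate the conditions themselves.

```python
import re

FLAGS = re.DOTALL | re.IGNORECASE

# <!--[if !mso]><!--> shown to browsers <!--<![endif]-->
REVEALED_COMMENT = re.compile(r"<!--\[if [^\]]*\]><!-->(.*?)<!--<!\[endif\]-->", FLAGS)
# <!--[if gte mso 9]> hidden from browsers <![endif]-->
HIDDEN_BLOCK = re.compile(r"<!--\[if [^\]]*\]>.*?<!\[endif\]-->", FLAGS)
# <![if !supportLists]> shown to browsers <![endif]>
REVEALED_MARKER = re.compile(r"<!\[(?:if [^\]]*|endif)\]>", FLAGS)


def strip_conditional_comments(html):
    """Drop Word-only blocks and unwrap the fall-through content."""
    html = REVEALED_COMMENT.sub(r"\1", html)  # keep content, drop the wrapper
    html = HIDDEN_BLOCK.sub("", html)         # drop Word-only content entirely
    html = REVEALED_MARKER.sub("", html)      # drop leftover bare markers
    return html
```

The order matters: the "revealed" comment form has to be unwrapped before the hidden-block pattern runs, since the latter would otherwise swallow it.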

TeamMsgExtractor / msg-extractor

Add Parsing for .olm (Outlook for Mac) Files. #244