Open TheElementalOfDestruction opened 2 years ago
This is great, thank you.
Of note:
- .olm file produced by Microsoft Outlook For Mac 2011, version <placeholder>
- File - export - email only - filtered over one category only
- .olm files will be needed for validation of the data structures and for testing (my first thought is that this will be hard to get, or tedious to produce)
Should also be noted that despite that filtering, it seems to have created (potentially) full folder structures as if it was going to write that data, but just didn't put the actual files.
Oh my! It is leaking info like krazy! Typical Microsoft "betrayal of their duty to users". Let's not use this publicly if you please.
...that makes building test data, let alone clean test data, more difficult, as it cannot reasonably be extracted from a live Outlook instance.
I wasn't intending to add the test file you gave me to the official testing, so no worries there. However, if I can adequately build an instance of Outlook that only exists for tests, where everything it touches is intended to be public anyways, then files from that should be able to be safely added to testing.
Unfortunately this suddenly became a lot more complicated, as the format is not part of the Microsoft Open Specifications, meaning it does not have official documentation that is public nor is it guaranteed to be consistent. This will make any attempt at a parser much more of a challenge.
Thank you!
Yes, I was also thinking that a dedicated instance of Outlook would be needed for this purpose.
One approach could be to first build a crude/simple prototype parser, and throw a large .olm db at it to see if the number of special cases is reasonable.
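A minimal sketch of that kind of survey, assuming the archive is a zip of XML files as observed in this thread. The function name `tally_tags` is hypothetical; it only counts tag names and records files that fail to parse, which gives a rough feel for how many distinct cases a parser would have to cover.

```python
import zipfile
import xml.etree.ElementTree as ET
from collections import Counter


def tally_tags(olm_file):
    """Count how often each XML tag name occurs across an .olm archive.

    Returns (counts, unparsed), where unparsed lists the .xml members
    that were not well-formed XML.
    """
    counts = Counter()
    unparsed = []
    with zipfile.ZipFile(olm_file) as archive:
        for name in archive.namelist():
            if not name.lower().endswith(".xml"):
                continue  # attachments and other binary members
            try:
                root = ET.fromstring(archive.read(name))
            except ET.ParseError:
                unparsed.append(name)
                continue
            # root.iter() walks the root element and all of its descendants.
            counts.update(element.tag for element in root.iter())
    return counts, unparsed
```

Running this over a large export and sorting `counts.most_common()` would show which tags dominate and which only turn up a handful of times.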
I wanted to see if I could make something dead simple to parse it, just grabbing each of the tags and making anything found accessible directly, but some of the tags are lists (categories, attachments, addresses), some have properties inside the tag... Would probably need some additional code to find some of the patterns and parse them correctly. I'll probably wait for that testing environment before really getting to work on this, so that I can get a much better idea of how this should look.
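To make the difficulty concrete, here is a rough sketch of the "dead simple" approach using only the standard library. The function name `element_to_value` and the `@attributes` / `#text` keys are hypothetical choices for illustration, not anything taken from the .olm format.

```python
import xml.etree.ElementTree as ET


def element_to_value(element):
    """Turn an element into plain Python data.

    Leaf elements without attributes become their text. Everything else
    becomes a dict, with repeated child tags collected into lists.
    """
    children = list(element)
    if not children and not element.attrib:
        return element.text
    result = {}
    if element.attrib:
        # "Properties inside the tag" are kept under a reserved key.
        result["@attributes"] = dict(element.attrib)
    if not children:
        result["#text"] = element.text
        return result
    for child in children:
        value = element_to_value(child)
        if child.tag in result:
            # Second or later occurrence of the same tag: promote to a list.
            if not isinstance(result[child.tag], list):
                result[child.tag] = [result[child.tag]]
            result[child.tag].append(value)
        else:
            result[child.tag] = value
    return result
```

One limitation shows up immediately: a list tag that happens to hold a single entry comes back as a scalar rather than a one-item list, so list-like tags would need to be identified by name or by pattern, as noted above.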
I've built a "proof of concept", mostly to convince myself, and as a learning exercise - it is clear that some mails are more straightforward than others to handle. It is easy when bodies/contents are plain text marked up with vanilla HTML. What stumped me were those where the body content was what looked like the format of .docx documents - maybe there are parsers for that already? Maybe there are other formats that are hard to parse too, but I have not encountered those yet. I have not attempted to parse nested structures with forwarded e-mails, attachments, etc.
Have you got an idea how to set up a dedicated instance of Outlook to generate tests?
A lot of the time they are built with a form of HTML that has additional tags for formatting in Word and stuff, but it renders fine as plain HTML. Not sure if these are what you are talking about, but they also sometimes have branching conditions that actually will check for Word to be there to even activate, having something to fall through if it's not available and render correctly in something like a browser.
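If those branching conditions are the usual Office conditional comments (an assumption; the thread does not show the actual markup), a parser that wants only the browser-visible HTML could drop the Word-only blocks and keep the fallback. The sketch below handles just the two most common comment shapes, and the names in it are hypothetical.

```python
import re

# Fallback wrapper, visible to non-Word renderers:
#   <!--[if !mso]><!--> fallback markup <!--<![endif]-->
FALLBACK_MARKERS = re.compile(
    r"<!--\[if[^\]]*\]><!-->|<!--<!\[endif\]-->", re.IGNORECASE
)
# Word-only block, which browsers treat as an ordinary comment:
#   <!--[if gte mso 9]> word-only markup <![endif]-->
WORD_ONLY_BLOCK = re.compile(
    r"<!--\[if[^\]]*\]>.*?<!\[endif\]-->", re.IGNORECASE | re.DOTALL
)


def strip_word_conditionals(html):
    """Remove Word-only conditional blocks and unwrap fallback content."""
    # Unwrap the fallback first so its content is not mistaken for a
    # Word-only block by the second pattern.
    html = FALLBACK_MARKERS.sub("", html)
    return WORD_ONLY_BLOCK.sub("", html)
```

Regular expressions are a blunt tool for HTML, so this is only meant for a quick look at what a body contains once the Word-specific parts are gone.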
The way I would do it is to set up a computer with a fresh install of Outlook designed to not have anything on it aside from information that can be public.
Add support for generally parsing and handling .olm files. Need to see if I can track down proper documentation of these, but from what I have observed they seem rather simple. They are a renamed zip file composed of folders and (mostly) xml files. If a directory has emails, it seems to use the following format:
- Emails appear to be stored in files named __message_attachment__{id}.xml (right now I have only seen the ID as a 6 digit number, unclear if hex or decimal).
- Attachments appear to be stored under com.microsoft.__Attachments in files using the message id for the name, followed by an underscore and a 4 digit number, presumably the id of the attachment for the specified message.
- Email files contain an <emails> tag, presumably allowing them to store more than one email (which are denoted by the <email> tag). Names of properties within it have, so far, been reliably observed to be in the format OPFMessageCopy{name} (so AttachmentList would become OPFMessageCopyAttachmentList, Body would become OPFMessageCopyBody, etc.).
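The observations above can be sketched with the standard library alone. This is only an illustration under the assumptions stated in this issue (zip container, an <emails> root holding <email> children, OPFMessageCopy-prefixed property names); `iter_emails` is a hypothetical name, and the sketch keys on the root tag rather than the file name because the naming scheme is still uncertain.

```python
import zipfile
import xml.etree.ElementTree as ET

PREFIX = "OPFMessageCopy"


def iter_emails(olm_file):
    """Yield (archive member name, properties) for every <email> element.

    properties maps the short name (prefix removed) to the raw XML element,
    since some properties are lists rather than plain text.
    """
    with zipfile.ZipFile(olm_file) as archive:
        for name in archive.namelist():
            if not name.lower().endswith(".xml"):
                continue
            try:
                root = ET.fromstring(archive.read(name))
            except ET.ParseError:
                continue  # not well-formed XML; skipped in this sketch
            if root.tag != "emails":
                continue
            for email in root.iter("email"):
                props = {}
                for child in email:
                    key = child.tag
                    if key.startswith(PREFIX):
                        key = key[len(PREFIX):]
                    props[key] = child
                yield name, props
```

With this, `props["Body"].text` would give the body text of a message, while something like `props["AttachmentList"]` would still need its children walked separately.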