ePADD / epadd

ePADD is a software package developed by Stanford University's Special Collections & University Archives that supports archival processes around the appraisal, ingest, processing, discovery, and delivery of email archives.
https://www.epaddproject.org

Issue with message delimiter when importing Mbox files. #404

Closed jfarwer closed 3 years ago

jfarwer commented 3 years ago

The mbox format delimits messages with a single blank line followed by a line beginning with the string 'From ' (with a trailing space). However, when importing mbox files into ePADD, the blank line is currently not required, because 'relaxed parsing' is set to true for the email store in use (mstor).

Therefore, if a line in the message text begins with 'From ', the message is imported as two emails instead of one: the first is a truncated copy of the original, and the second contains the remainder. The second email will probably be garbled, as all of its header information (encoding, etc.) is missing. Setting 'relaxed parsing' to false will not eliminate the problem: a blank line followed by 'From ' can still occur in the message text, and the parser will wrongly treat it as the start of a new message. Because this is less likely, however, the number of improperly parsed emails should be lower.
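
To illustrate the difference, here is a minimal Python sketch of the two delimiter rules. It is not mstor's actual implementation: the `split_mbox` function, the two regular expressions, and the sample messages are all hypothetical and exist only to reproduce the behaviour described above.

```python
import re

# Relaxed rule (assumed): any line that starts with "From " opens a new message.
RELAXED = re.compile(r"^From ", re.MULTILINE)
# Strict rule (assumed): "From " only counts at the very start of the file
# or directly after a blank line.
STRICT = re.compile(r"(?:\A|(?<=\n\n))From ")


def split_mbox(text, relaxed):
    """Split raw mbox text into message chunks using the chosen delimiter rule."""
    pattern = RELAXED if relaxed else STRICT
    starts = [m.start() for m in pattern.finditer(text)]
    ends = starts[1:] + [len(text)]
    return [text[a:b] for a, b in zip(starts, ends)]


# Two real messages; the first body contains a line starting with "From ".
SAMPLE = (
    "From alice@example.org Mon Jan  1 00:00:00 2021\n"
    "Subject: first\n"
    "\n"
    "Hello,\n"
    "From here on the body continues.\n"
    "\n"
    "From bob@example.org Tue Jan  2 00:00:00 2021\n"
    "Subject: second\n"
    "\n"
    "Bye.\n"
)

# One real message whose body has a blank line followed by "From ".
AMBIGUOUS = (
    "From alice@example.org Mon Jan  1 00:00:00 2021\n"
    "Subject: only\n"
    "\n"
    "Hi,\n"
    "\n"
    "From now on, things change.\n"
)

print(len(split_mbox(SAMPLE, relaxed=True)))      # relaxed rule splits the first body
print(len(split_mbox(SAMPLE, relaxed=False)))     # strict rule keeps it intact
print(len(split_mbox(AMBIGUOUS, relaxed=False)))  # strict rule is still fooled here
```

With the relaxed rule, the first sample message is cut at "From here", and the resulting fragment has no headers of its own. The strict rule avoids that split, but it still breaks the single message in `AMBIGUOUS`, because there the body line is preceded by a blank line.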