apache / incubator-ponymail

Apache Pony Mail (Incubating) - Email for Ponies & People
http://ponymail.incubator.apache.org/

Bug: import-mbox.py fails to unescape >From lines #212

Closed sebbASF closed 7 years ago

sebbASF commented 8 years ago

The mbox format is not standardised, but most variants escape a body line starting with 'From ' by prepending a '>'.

When importing such files, the '>' prefix needs to be stripped off.

Sample import that is wrong:

https://lists.apache.org/thread.html/15f6963a8102a5c3be141b6151f2c22428011c4ed60511ca2da1582b@1451547586@%3Ckerby.directory.apache.org%3E

AFAICT the problem is in the mailbox module of the Python standard library. It looks like this escapes 'From ' on output, but does not unescape it on input.

sebbASF commented 8 years ago

Emails that are processed live by the archiver will not have escaped From lines. This is correct; the emails will be stored as-is in the database.

However the same e-mail imported from an mbox file will have escaped lines, i.e. the database copy will be escaped.

Since the escaping changes the body content, fixing this is likely to change ids and Permalinks.

Taken together with #188, this means it's currently impossible to export and re-import such mails properly.

Humbedooh commented 8 years ago

Are there not two From lines in an mbox export? There's the initial "From foo" line that is required to split emails apart, and then the From: header inside. PM uses the latter for ID generation.

sebbASF commented 8 years ago

No; there is one 'From ' line which precedes each set of e-mail headers. Note the trailing space.

However anything can appear in message bodies, including lines starting with 'From '.

In general it's not possible to distinguish such lines from the message separator, so one technique is to convert 'From ' to '>From ' when writing the mail bodies in mbox files. This needs to be reversed when reading the body, otherwise the content does not correspond to the original message.
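A minimal sketch of the simple technique described above, for illustration only. The function names `escape_from` and `unescape_from` are hypothetical and are not part of Pony Mail or of the Python standard library.

```python
import re

def escape_from(body):
    """Prefix '>' to every body line that starts with 'From ' (writing side)."""
    return re.sub(r"^From ", ">From ", body, flags=re.MULTILINE)

def unescape_from(body):
    """Strip one '>' from every body line that starts with '>From ' (reading side)."""
    return re.sub(r"^>From ", "From ", body, flags=re.MULTILINE)

original = "Hi\nFrom here on it gets tricky\n"
stored = escape_from(original)       # what would be written to the mbox file
restored = unescape_from(stored)     # what the importer should hand on
```

The writer and the reader have to apply matching rules; if only one side does, the stored body differs from the original message.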

PM uses the body + lid + date for id generation in the 'medium' algorithm which is what lists.a.o uses.

So if the message body is corrected by removing the leading '>', the MID will change. If it is not corrected, the message is wrong.
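To illustrate why the id moves, here is a rough sketch of an id built from body + date + lid. The SHA-224 hash, the '@' separators and the function name `sketch_mid` are assumptions based only on the shape of the permalink quoted earlier in this issue; this is not the actual Pony Mail 'medium' generator.

```python
import hashlib

def sketch_mid(body, epoch, list_id):
    """Hypothetical id: hash of the body, then the epoch date, then the list id."""
    digest = hashlib.sha224(body.encode("utf-8")).hexdigest()
    return "%s@%s@%s" % (digest, epoch, list_id)

lid = "<kerby.directory.apache.org>"
escaped_mid = sketch_mid("Hi\n>From here\n", 1451547586, lid)    # body as imported today
corrected_mid = sketch_mid("Hi\nFrom here\n", 1451547586, lid)   # body after unescaping
```

Only the hash part differs between the two ids, but that is enough to break any existing Permalink that embeds it.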

Obviously mbox export and import need to agree, otherwise chaos results.

N.B. Escaping 'From ' is more complicated than just prefixing it with '>'. It has to allow for a body containing: 'From ' and '>From '. But I digress.
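One way to handle both cases is the convention usually called "mboxrd": add a '>' to any line matching zero or more '>' followed by 'From ', and remove exactly one '>' when reading. The sketch below is illustrative; the function names are hypothetical, and nothing here implies that Pony Mail or Python's mailbox module uses this variant.

```python
import re

def mboxrd_escape(body):
    """Add one '>' to lines that start with 'From ', '>From ', '>>From ', ..."""
    return re.sub(r"^(>*From )", r">\1", body, flags=re.MULTILINE)

def mboxrd_unescape(body):
    """Remove exactly one '>' from lines that start with '>From ', '>>From ', ..."""
    return re.sub(r"^>(>*From )", r"\1", body, flags=re.MULTILINE)

body = "From plain\n>From already quoted\n"
stored = mboxrd_escape(body)
restored = mboxrd_unescape(stored)
```

Because every matching line gains exactly one '>' on output and loses exactly one on input, a body that already contained '>From ' survives the round trip unchanged.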

sebbASF commented 7 years ago

It looks as though the Python email package supports From-mangling on output (the mangle_from_ option of its generator)(*), but it seems not to do so on input, which is a bit strange. Probably a bug.

(*) not used by Pony Mail
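For reference, a small sketch of the output-side mangling using `email.generator.Generator` and its `mangle_from_` argument. The helper name `flatten` and the header value are placeholders for illustration.

```python
import io
from email.generator import Generator
from email.message import Message

def flatten(body, mangle):
    """Serialise a one-part message, with or without From-mangling."""
    msg = Message()
    msg["Subject"] = "demo"          # placeholder header
    msg.set_payload(body)
    out = io.StringIO()
    Generator(out, mangle_from_=mangle).flatten(msg)
    return out.getvalue()

mangled = flatten("Hi\nFrom here\n", True)     # body line becomes '>From here'
untouched = flatten("Hi\nFrom here\n", False)  # body line is left alone
```

The parser side of the email package has no matching option, which is the asymmetry noted above.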

sebbASF commented 7 years ago

https://docs.python.org/3/library/mailbox.html#mbox says:

"... any occurrences of “From ” at the beginning of a line in a message body are transformed to “>From ” when storing the message, although occurrences of “>From ” are not transformed to “From ” when reading the message."

This is rather unfortunate. It's likely too late to unmangle the lines once the message has been parsed, as the >From sequence may occur anywhere, including in attachments.
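The documented behaviour can be checked with a short round trip through `mailbox.mbox`. This is a sketch: the helper name `roundtrip_body`, the file name and the header values are placeholders.

```python
import mailbox
import os
import tempfile

def roundtrip_body(body):
    """Store one message in a temporary mbox and return the body read back."""
    msg = mailbox.mboxMessage()
    msg["From"] = "sender@example.org"   # placeholder header values
    msg["Subject"] = "demo"
    msg.set_payload(body)
    with tempfile.TemporaryDirectory() as tmp:
        box = mailbox.mbox(os.path.join(tmp, "demo.mbox"))
        try:
            key = box.add(msg)           # 'From ' is mangled when storing
            box.flush()
            # Reading does not reverse the mangling.
            return box.get_message(key).get_payload()
        finally:
            box.close()

read_back = roundtrip_body("Hello\nFrom the start of a line\n")
```

The body read back still contains the '>From ' line, so a message written and re-read this way no longer matches the original.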

There does not seem to be any way to intercept the file reader used by the parser, but one can provide a message factory to the mailbox class, and it looks as though one can use that to override the file read() methods.
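A rough sketch of the factory idea, assuming it is acceptable to unescape the raw bytes before they reach the parser. `mailbox.mbox` accepts a `factory` callable that receives a binary file-like object for each message; the names `unescaping_factory` and `read_bodies` are hypothetical, and this is not necessarily how the eventual patch works.

```python
import email
import mailbox
import re

# Matches '>From ' at the start of any line in the raw message bytes.
_ESCAPED_FROM = re.compile(rb"^>From ", re.MULTILINE)

def unescaping_factory(fileobj):
    """Read the raw message, strip one '>' from '>From ' lines, then parse."""
    raw = fileobj.read()
    return email.message_from_bytes(_ESCAPED_FROM.sub(b"From ", raw))

def read_bodies(path):
    """Return the payload of every message in an existing mbox file."""
    box = mailbox.mbox(path, factory=unescaping_factory, create=False)
    try:
        return [msg.get_payload() for msg in box]
    finally:
        box.close()
```

Unescaping before parsing avoids hunting for '>From ' inside already-decoded parts. It does assume that every '>From ' line in the file was produced by escaping, which may not hold for a body that originally contained '>From '.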

sebbASF commented 7 years ago

Managed to find a way to patch the current behaviour.