Bug: Archiver#msgbody: poor handling of conversion to string

apache / incubator-ponymail

Apache Pony Mail (Incubating) - Email for Ponies & People

http://ponymail.incubator.apache.org/

Other

80 stars 30 forks source link

Bug: Archiver#msgbody: poor handling of conversion to string #463

Closed sebbASF closed 4 years ago

sebbASF commented 6 years ago

The method msgbody returns a string. This is converted here:

https://github.com/apache/incubator-ponymail/blob/ec0a76eb8b6836a3675800a65829a48fe90156f9/tools/archiver.py#L255

There are some problems with this: 1) it's only needed if type(body) is bytes 2) ~~if multiple charsets are present it will try them all. If the last one fails, this will override any successful earlier decodes~~ 3) pm_charsets() uses a set to hold the result of calling msg.get_charsets(). This does not preserve the order. Does it even make sense to look at all the charsets associated with a message? If so, then surely each part should be treated separately.

sebbASF commented 6 years ago

Note that a successful decode will change type(body) to str, so the first charset will be used if it succeeds. However if the first charset fails, utf-8 will be used, and a subsequent charset (if any) won't be tried.

If there are mulitple charsets, then the result is unpredictable as it depends on the set traversal order.

sebbASF commented 4 years ago

I've just found a couple of messages that trigger this behaviour:

http://mailarchive-vm.apache.org/mod_mbox/httpd-dev/200709.mbox/raw/%3c14ee094e0709041258o68a6f771ma57ff5b4b77fa4d6@mail.gmail.com%3e

http://mailarchive-vm.apache.org/mod_mbox/httpd-dev/200709.mbox/raw/%3c14ee094e0709050909o647cfc98v9f428fc04d6d283d@mail.gmail.com%3e

Messages with characters that parse differently (or fail to parse) in one of the charsets can be affected.

N.B. This bug means that the generated MID will vary (unless the generator does not use the parsed body).

sebbASF commented 4 years ago

Note: the same MID will be generated each time if the same message is processed repeatedly during the same test run. It looks like whatever determines the ordering of Set entries is established during process startup.

The generators that use the message body are: medium, medium_original, cluster, legacy. [The full generator does not, however it is susceptible to changes in the mail library parsing.]

Humbedooh commented 4 years ago

Currently, the decoding tries to use all possible character sets found in the message, not necessarily ones pertaining to the body itself. It also assumes that the default is UTF-8. Both are IMHO wrong, and are handled differently in Foal:

US-ASCII should be assumed default as per RFC822. All other character sets MUST be accompanied with a Content-Type encoding sub-header.
Only the message part or its parent should be looked at for character sets, and parent should only be fallback if the message part does not have a valid encoding sub-headeer.

sebbASF commented 4 years ago

Note that decode is only invoked if type(body) is bytes, and a successful decode will change body to a string. So the decoding uses the first character set only. If that fails, it uses UTF-8.

I agree the current approach is wrong, however changing it will change the output of any generators which depend on the decoded body.

It's vital to ensure that the generators continue to produce the same output for the same input.

I think it should be safe to change the code to use the content-type for the current part, falling back to UTF-8 on error. This should fix the output to the most appropriate of the charsets, and should match at least some of the existing generated MIDs. This is not ideal, but is no worse than the current situation.