Closed sebbASF closed 4 years ago
Note that a successful decode will change type(body) to str, so the first charset will be used if it succeeds. However if the first charset fails, utf-8 will be used, and a subsequent charset (if any) won't be tried.
If there are mulitple charsets, then the result is unpredictable as it depends on the set traversal order.
I've just found a couple of messages that trigger this behaviour:
Messages with characters that parse differently (or fail to parse) in one of the charsets can be affected.
N.B. This bug means that the generated MID will vary (unless the generator does not use the parsed body).
Note: the same MID will be generated each time if the same message is processed repeatedly during the same test run. It looks like whatever determines the ordering of Set entries is established during process startup.
The generators that use the message body are: medium, medium_original, cluster, legacy. [The full generator does not, however it is susceptible to changes in the mail library parsing.]
Currently, the decoding tries to use all possible character sets found in the message, not necessarily ones pertaining to the body itself. It also assumes that the default is UTF-8. Both are IMHO wrong, and are handled differently in Foal:
Note that decode is only invoked if type(body) is bytes, and a successful decode will change body to a string. So the decoding uses the first character set only. If that fails, it uses UTF-8.
I agree the current approach is wrong, however changing it will change the output of any generators which depend on the decoded body.
It's vital to ensure that the generators continue to produce the same output for the same input.
I think it should be safe to change the code to use the content-type for the current part, falling back to UTF-8 on error. This should fix the output to the most appropriate of the charsets, and should match at least some of the existing generated MIDs. This is not ideal, but is no worse than the current situation.
The method msgbody returns a string. This is converted here:
https://github.com/apache/incubator-ponymail/blob/ec0a76eb8b6836a3675800a65829a48fe90156f9/tools/archiver.py#L255
There are some problems with this: 1) it's only needed if type(body) is bytes 2)
if multiple charsets are present it will try them all. If the last one fails, this will override any successful earlier decodes3) pm_charsets() uses a set to hold the result of calling msg.get_charsets(). This does not preserve the order. Does it even make sense to look at all the charsets associated with a message? If so, then surely each part should be treated separately.