Archiving emails in violation of RFC-2821 line-endings may result in multiple emails on a redundant setup

Humbedooh commented 7 years ago

It would seem that when some (older) MTAs send out email, they do not conform to RFC-2821 about newlines. From the RFC, it is stated that:

   In addition, the appearance of "bare" "CR" or "LF" characters in text
   (i.e., either without the other) has a long history of causing
   problems in mail implementations and applications that use the mail
   system as a tool.  SMTP client implementations MUST NOT transmit
   these characters except when they are intended as line terminators
   and then MUST, as indicated above, transmit them only as a <CRLF>
   sequence.

Case in point: qmail sometimes will send an email using only LF instead of CRLF. This is then corrected to CRLF by postfix, but has the disadvantage in clustered setups that one archiver may receive the original input while the next gets the corrected one. The difference there is but a single added newline character, but that is enough to cause two distinct IDs being generated.

Short of fixing all MTAs, the best solution seems to be detecting any STDIN that ends in a double newline and, if found, crop the last one out before archiving.

The fix seems to be as simple as (in archiver.py, line 580-ish):

    if msgstring[-2:] == b'\n\n':
        msgstring = msgstring[:-1]

I'll investigate further and implement a solution when I am satisfied this will resolve the issue.

sebbASF commented 7 years ago

If a clustered setup can generate copies of e-mails that are not identical, then surely that needs to be fixed rather than munging the original e-mail?

Humbedooh commented 7 years ago

It is beyond our projects ability to fix qmail and any other MTA that sends erroneous email.

sebbASF commented 7 years ago

However correcting e-mails changes the hash and therefore the Permalink, which can affect existing databases. If a database is reloaded, it looks like any existing entries with incorrect EOLs will get new Permalinks.

Humbedooh commented 7 years ago

Depends on where it's reloaded from, I'd think. Damned if you do, damned if you don't. I don't think fixing qmail (or changing postfix) is a viable option here. It seems we have to weigh duplicate emails against the risk of potentially losing a permalink on a reload from mbox files. I'd be in favor of making the correction an argument for the archiver, so people can choose to use it or not.

sebbASF commented 7 years ago

I'm not sure I understand the behaviour of the redundant setup.

I assume that if the same message is presented to the main and backup archivers that they will generate the same ID hash, and therefore only one copy of the message will be in the database.

Therefore the message that is presented to the two archivers needs to be the same.

There is only a single input message from the mailing list, so any processing that needs to be done to fix CRLF must be done as part of both paths. Perhaps this is not always happening?

I don't think it's possible to fix this in the archiver. AFAICT, the archiver only sees messages with lines ending in LF. So a message ending in \n\n may have been a valid message with an extra blank line.

apache / incubator-ponymail

Archiving emails in violation of RFC-2821 line-endings may result in multiple emails on a redundant setup #394