Open Humbedooh opened 7 years ago
If a clustered setup can generate copies of e-mails that are not identical, then surely that needs to be fixed rather than munging the original e-mail?
It is beyond our project's ability to fix qmail or any other MTA that sends erroneous email.
However, correcting e-mails changes the hash and therefore the Permalink, which can affect existing databases. If a database is reloaded, it looks like any existing entries with incorrect EOLs will get new Permalinks.
Depends on where it's reloaded from, I'd think. Damned if you do, damned if you don't. I don't think fixing qmail (or changing postfix) is a viable option here. It seems we have to weigh duplicate emails against the risk of potentially losing a permalink on a reload from mbox files. I'd be in favor of making the correction an argument for the archiver, so people can choose to use it or not.
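To make the opt-in idea concrete, here is a minimal sketch of how the correction could be gated behind an archiver argument. The flag name `--crop-trailing-newline` and the helper `prepare_message` are hypothetical and are not taken from archiver.py; the cropping rule is the one proposed in the issue description.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    # Hypothetical parser; the real archiver's options are not shown in this thread.
    parser = argparse.ArgumentParser(description="Archiver sketch (hypothetical)")
    parser.add_argument(
        "--crop-trailing-newline",  # hypothetical flag name
        action="store_true",
        help="Drop one trailing newline when the input ends in a double newline",
    )
    return parser


def prepare_message(raw: bytes, crop: bool) -> bytes:
    # Only touch the message when the operator has opted in.
    if crop and raw.endswith(b"\n\n"):
        return raw[:-1]
    return raw
```

With the flag off (the default in this sketch), messages and their existing Permalinks are left untouched; with it on, both archivers in a cluster would need the same setting to produce matching IDs.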
I'm not sure I understand the behaviour of the redundant setup.
I assume that if the same message is presented to the main and backup archivers, they will generate the same ID hash, and therefore only one copy of the message will be in the database.
Therefore the message that is presented to the two archivers needs to be the same.
There is only a single input message from the mailing list, so any processing that needs to be done to fix CRLF must be done as part of both paths. Perhaps this is not always happening?
I don't think it's possible to fix this in the archiver. AFAICT, the archiver only sees messages with lines ending in LF. So a message ending in \n\n may have been a valid message with an extra blank line.
It would seem that some (older) MTAs do not conform to RFC 2821 on line endings when they send out email. The RFC states:
Case in point: qmail will sometimes send an email using only LF instead of CRLF. This is then corrected to CRLF by postfix, but in clustered setups this has the disadvantage that one archiver may receive the original input while the next gets the corrected one. The difference is only a single added newline character, but that is enough to cause two distinct IDs to be generated.
Short of fixing all MTAs, the best solution seems to be to detect any STDIN that ends in a double newline and, if found, crop the last one before archiving.
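A small sketch of why one extra newline matters. The `message_id` function below is a hypothetical stand-in that simply hashes the raw bytes; the archiver's real ID generator is not shown in this thread and may use different inputs.

```python
import hashlib


def message_id(raw: bytes) -> str:
    # Hypothetical stand-in for the archiver's ID generation: hash the raw bytes.
    return hashlib.sha224(raw).hexdigest()


original = b"Subject: test\n\nHello world\n"
corrected = original + b"\n"  # same message with a single added newline

# The two copies differ by one byte, so any hash over the raw bytes differs too.
ids_differ = message_id(original) != message_id(corrected)
```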
The fix seems to be as simple as (in archiver.py, line 580-ish):
I'll investigate further and implement a solution when I am satisfied this will resolve the issue.
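For illustration, a minimal sketch of the cropping idea described above. It is not the actual change to archiver.py; the function name `crop_trailing_newline` is hypothetical, and it assumes the raw message is available as bytes.

```python
def crop_trailing_newline(raw: bytes) -> bytes:
    # If the input ends in a double newline, drop exactly one of them.
    if raw.endswith(b"\n\n"):
        return raw[:-1]
    return raw


original = b"Subject: test\n\nHello world\n"
corrected = original + b"\n"  # the copy with one added newline

# Both copies are identical after cropping.
same = crop_trailing_newline(original) == crop_trailing_newline(corrected)
```

Note that this also crops a message that legitimately ends in a blank line, which is the concern raised elsewhere in this thread.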