Closed siddhesh closed 1 year ago
cc @codonell
This is odd. According to the docs, email.utils.formataddr
defaults to utf-8
for encoding the name.
>>> import email.utils
>>> email.utils.formataddr(('John Doe', 'john@example.com'))
'John Doe <john@example.com>'
>>> email.utils.formataddr(('John Doe 😅', 'john@example.com'))
'=?utf-8?b?Sm9obiBEb2Ug8J+YhQ==?= <john@example.com>'
However, the email address must be ASCII per RFCs...or so says the code:
https://github.com/python/cpython/blob/v3.11.0/Lib/email/utils.py#L90-L91
So I'm not sure what's going on here. The email address looks like ASCII so how are we getting unicode in here? Any chance we could get the last few lines of the traceback?
@stephenfin The From:
in the email looks incorrect:
From: AlexandraH =?utf-8?b?w6E=?= jkov =?utf-8?b?w6E=?= @sourceware.org
This looks like an invalid RFC 2047 format sequence.
It looks like the submitter was playing with their spelling in their mail client and the client sent invalid RFC 2047 output to the mailing list.
Patchwork could attempt to protect from this because otherwise we get the following:
>>> email.utils.parseaddr('AlexandraH =?utf-8?b?w6E=?= jkov =?utf-8?b?w6E=?= @sourceware.org')
('', 'AlexandraH =?utf-8?b?w6E=?= jkov =?utf-8?b?w6E=?=@sourceware.org')
>>>
The RFC2047 tokens are now in the email and I'd expect we might decode that to UFT-8 via parseheader
at some point and then when converting to ASCII it fails.
Thoughts?
The submitter has submitted with 3 distinct emails on the mailing list: https://sourceware.org/pipermail/gdb-patches/2023-January/thread.html
At least one of the configurations has an incorrect email address encoded in RFC2047: https://sourceware.org/pipermail/gdb-patches/2023-January/195275.html
The most recent UTF-8 name with ascii email looks OK: https://sourceware.org/pipermail/gdb-patches/2023-January/196124.html
And patchwork handles it correctly: https://patchwork.sourceware.org/project/gdb/patch/20230124172121.2425214-1-ahajkova@redhat.com/
@siddhesh Did you see the backtrace while looking at the logs? Do you think the incorrectly encoded email to the mailing list could be the cause? Was it git send-email you think?
@siddhesh Did you see the backtrace while looking at the logs? Do you think the incorrectly encoded email to the mailing list could be the cause?
Yes, in fact I discussed this on the overseers IRC channel and we spotted the invalid From in the incoming email.
Was it git send-email you think?
I doubt it, maybe I'll just ask Alexandra.
So what do we want to do here? I'm reluctant to rewrite the email address to strip invalid (i.e. non-ASCII) characters from it, but I don't know what else we could do once the value is in the system. We could catch this issue at the parser-level (i.e. ensure the email address in the From
header is ASCII and drop the mail if not) but that would mean the patch would never be ingested by Patchwork. It's tricky...
So what do we want to do here? I'm reluctant to rewrite the email address to strip invalid (i.e. non-ASCII) characters from it, but I don't know what else we could do once the value is in the system. We could catch this issue at the parser-level (i.e. ensure the email address in the
From
header is ASCII and drop the mail if not) but that would mean the patch would never be ingested by Patchwork. It's tricky...
Let me make my case here for two outcomes:
I'm strongly inclined to reject such emails. The reason is specific to this case, because this particular API has been the subject of CVEs and attacks against mail processing systems, therefore I think we need to reject invalid emails. There may be other cases where we can be conservative and accept poorly formed input, but in this case it is not valuable and possibly impacts all other subsystems that process that email.
System administrators of patchwork review logs looking for backtraces, and we should aim to reduce the number of backtraces, so making this a diagnostic saying that a given message-id has been rejected for a specific reason, that would be good for me.
Thoughts?
I reached out to Alexandra to understand how that patch was generated but I haven't heard from her yet. That's an orthogonal thing though (that program ought to be fixed too), I'm inclined towards rejecting too, at the cost of patchwork missing those patches.
Here's the example patch showing this breakage:
https://patchwork.sourceware.org/project/gdb/patch/20230104113909.1395263-1-ahajkova@redhat.com/
Here the patch itself is not in unicode but the submitter name is, which seems to result in a crash:
The crash is in the patch mbox and series mbox pages:
https://patchwork.sourceware.org/project/gdb/patch/20230104113909.1395263-1-ahajkova@redhat.com/mbox/