getpatchwork / patchwork

Patchwork is a web-based patch tracking system designed to facilitate the contribution and management of contributions to an open-source project.
http://jk.ozlabs.org/projects/patchwork/
GNU General Public License v2.0
276 stars 82 forks source link

Patchwork submitter name parsing seems broken #512

Closed siddhesh closed 1 year ago

siddhesh commented 1 year ago

Here's the example patch showing this breakage:

https://patchwork.sourceware.org/project/gdb/patch/20230104113909.1395263-1-ahajkova@redhat.com/

Here the patch itself is not in unicode but the submitter name is, which seems to result in a crash:

Traceback (most recent call last):
  File "/path/to/libs/lib/python3/lib/python3.9/site-packages/django/core/handlers/exception.py", line 47, in inner
    response = get_response(request)
  File "/path/to/libs/lib/python3/lib/python3.9/site-packages/django/core/handlers/base.py", line 181, in _get_response
    response = wrapped_callback(request, *callback_args, **callback_kwargs)
  File "/path/to/patchwork/views/series.py", line 17, in series_mbox
    response.write(series_to_mbox(series))
  File "/path/to/patchwork/views/utils.py", line 197, in series_to_mbox
    mbox.append(patch_to_mbox(patch))
  File "/path/to/patchwork/views/utils.py", line 83, in _submission_to_mbox
    mail['X-Patchwork-Submitter'] = email.utils.formataddr(
  File "/usr/lib64/python3.9/email/utils.py", line 91, in formataddr
    address.encode('ascii')

The crash is in the patch mbox and series mbox pages:

https://patchwork.sourceware.org/project/gdb/patch/20230104113909.1395263-1-ahajkova@redhat.com/mbox/

siddhesh commented 1 year ago

cc @codonell

stephenfin commented 1 year ago

This is odd. According to the docs, email.utils.formataddr defaults to utf-8 for encoding the name.

>>> import email.utils
>>> email.utils.formataddr(('John Doe', 'john@example.com'))
'John Doe <john@example.com>'
>>> email.utils.formataddr(('John Doe 😅', 'john@example.com'))
'=?utf-8?b?Sm9obiBEb2Ug8J+YhQ==?= <john@example.com>'

However, the email address must be ASCII per RFCs...or so says the code:

https://github.com/python/cpython/blob/v3.11.0/Lib/email/utils.py#L90-L91

So I'm not sure what's going on here. The email address looks like ASCII so how are we getting unicode in here? Any chance we could get the last few lines of the traceback?

codonell commented 1 year ago

@stephenfin The From: in the email looks incorrect:

From: AlexandraH =?utf-8?b?w6E=?= jkov =?utf-8?b?w6E=?= @sourceware.org

This looks like an invalid RFC 2047 format sequence.

It looks like the submitter was playing with their spelling in their mail client and the client sent invalid RFC 2047 output to the mailing list.

Patchwork could attempt to protect from this because otherwise we get the following:

>>> email.utils.parseaddr('AlexandraH =?utf-8?b?w6E=?= jkov =?utf-8?b?w6E=?= @sourceware.org')
('', 'AlexandraH =?utf-8?b?w6E=?= jkov =?utf-8?b?w6E=?=@sourceware.org')
>>> 

The RFC2047 tokens are now in the email and I'd expect we might decode that to UFT-8 via parseheader at some point and then when converting to ASCII it fails.

Thoughts?

codonell commented 1 year ago

The submitter has submitted with 3 distinct emails on the mailing list: https://sourceware.org/pipermail/gdb-patches/2023-January/thread.html

At least one of the configurations has an incorrect email address encoded in RFC2047: https://sourceware.org/pipermail/gdb-patches/2023-January/195275.html

The most recent UTF-8 name with ascii email looks OK: https://sourceware.org/pipermail/gdb-patches/2023-January/196124.html

And patchwork handles it correctly: https://patchwork.sourceware.org/project/gdb/patch/20230124172121.2425214-1-ahajkova@redhat.com/

@siddhesh Did you see the backtrace while looking at the logs? Do you think the incorrectly encoded email to the mailing list could be the cause? Was it git send-email you think?

siddhesh commented 1 year ago

@siddhesh Did you see the backtrace while looking at the logs? Do you think the incorrectly encoded email to the mailing list could be the cause?

Yes, in fact I discussed this on the overseers IRC channel and we spotted the invalid From in the incoming email.

Was it git send-email you think?

I doubt it, maybe I'll just ask Alexandra.

stephenfin commented 1 year ago

So what do we want to do here? I'm reluctant to rewrite the email address to strip invalid (i.e. non-ASCII) characters from it, but I don't know what else we could do once the value is in the system. We could catch this issue at the parser-level (i.e. ensure the email address in the From header is ASCII and drop the mail if not) but that would mean the patch would never be ingested by Patchwork. It's tricky...

codonell commented 1 year ago

So what do we want to do here? I'm reluctant to rewrite the email address to strip invalid (i.e. non-ASCII) characters from it, but I don't know what else we could do once the value is in the system. We could catch this issue at the parser-level (i.e. ensure the email address in the From header is ASCII and drop the mail if not) but that would mean the patch would never be ingested by Patchwork. It's tricky...

Let me make my case here for two outcomes:

I'm strongly inclined to reject such emails. The reason is specific to this case, because this particular API has been the subject of CVEs and attacks against mail processing systems, therefore I think we need to reject invalid emails. There may be other cases where we can be conservative and accept poorly formed input, but in this case it is not valuable and possibly impacts all other subsystems that process that email.

System administrators of patchwork review logs looking for backtraces, and we should aim to reduce the number of backtraces, so making this a diagnostic saying that a given message-id has been rejected for a specific reason, that would be good for me.

Thoughts?

siddhesh commented 1 year ago

I reached out to Alexandra to understand how that patch was generated but I haven't heard from her yet. That's an orthogonal thing though (that program ought to be fixed too), I'm inclined towards rejecting too, at the cost of patchwork missing those patches.