falconry / falcon

The no-magic web data plane API and microservices framework for Python developers, with a focus on reliability, correctness, and performance at scale.
https://falcon.readthedocs.io/en/stable/
Apache License 2.0
9.51k stars 937 forks source link

Assume ISO-8859-1 (instead of UTF-8) encoding for ASGI HTTP headers #1911

Closed n0w4k closed 3 years ago

n0w4k commented 3 years ago

Currently, headers in Falcon's ASGI package will be decoded using the Python's default UTF-8 decoding. For instance: https://github.com/falconry/falcon/blob/933851183ed9cce2d3411f890dca3e19385480f6/falcon/asgi/request.py#L847 and https://github.com/falconry/falcon/blob/933851183ed9cce2d3411f890dca3e19385480f6/falcon/asgi/request.py#L897

When headers have an encoding different from UTF-8, this leads to UnicodeDecodeError. An option to use different charsets here would be much appreciated.

Even though headers are supposed to be plain ASCII where it doesn't matter, in reality, there are still headers sent which contain characters especially from the ISO-8859-1 charset. Compare also RFC 7230, section 3.2.4:

Historically, HTTP has allowed field content with text in the ISO-8859-1 charset [ISO-8859-1], supporting other charsets only through use of [RFC2047] encoding. In practice, most HTTP header field values use only a subset of the US-ASCII charset [USASCII].

open-collective-bot[bot] commented 3 years ago

Hi :wave:,

Thanks for using Falcon. The large amount of time and effort needed to maintain the project and develop new features is not sustainable without the generous financial support of community members like you.

Please consider helping us secure the future of the Falcon framework with a one-time or recurring donation.

Thank you for your support!

vytas7 commented 3 years ago

Hi @n0w4k ! And thanks for filing this.

I'll schedule this for 3.1 to ensure we make a decision on this.

I've tried digging into this quickly. It seems that, for instance, WSGI (PEP 3333) implies that headers should be treated as latin1 (ISO-8859-1), OTOH it's clearly articulated only for response headers, not request header parsing. The ASGI spec conveniently evades this issue by simply providing byte headers, as discussed in your description.

The popular Gunicorn server also opts to treat headers as latin1 (ISO-8859-1) encoded, see its bytes_to_str() function.

In that light, I'm inclined to think that we probably don't want to support custom header encodings, but OTOH, it is a bug in Falcon's asgi.App implementation that we assume UTF-8, and not ISO-8859-1.

@kgriffs thoughts?

vytas7 commented 3 years ago

@n0w4k Just to confirm, would defaulting to ISO-8859-1 solve your problem at hand, or are you dealing with other incompatible encodings as well?

n0w4k commented 3 years ago

Thanks for the reply, @vytas7. I can confirm that all the bytes mentioned in the UnicodeDecodeError exceptions that we checked (mostly umlauts) could be decoded using ISO-8859-1, so yes, I am pretty optimistic that defaulting to it would solve our problem. Furthermore, we never had this issue using WSGI (using Falcon 2).

vytas7 commented 3 years ago

Right, thanks!

It looks like almost all RFCs, WSGI servers and ASGI frameworks assume latin1 by default, this was simply an oversight on our end. I've rescheduled this to the first bugfix release (we have another edge case incompatibility already slated for the release in question).

kgriffs commented 3 years ago

Released: https://github.com/falconry/falcon/releases/tag/3.0.1