Why use latin-1 encoding in compat.py?

flipmcf commented 3 years ago

My specific 'bug' (using the term loosely) surfaced when a user 'cleverly' used non-ASCII Unicode characters in a filename.

2020-10-02 16:11:59,195 ERROR   [waitress:38][waitress-3] Exception while serving /english/news/audio-nonascii-10022020155924.html/@@stream
Traceback (most recent call last):
  File "/home/mcfaddenm/repos/plone5.2_clean/rfasite/rfasite/eggs/waitress-1.4.4-py3.6.egg/waitress/channel.py", line 350, in service
    task.service()
  File "/home/mcfaddenm/repos/plone5.2_clean/rfasite/rfasite/eggs/waitress-1.4.4-py3.6.egg/waitress/task.py", line 171, in service
    self.execute()
  File "/home/mcfaddenm/repos/plone5.2_clean/rfasite/rfasite/eggs/waitress-1.4.4-py3.6.egg/waitress/task.py", line 479, in execute
    self.write(chunk)
  File "/home/mcfaddenm/repos/plone5.2_clean/rfasite/rfasite/eggs/waitress-1.4.4-py3.6.egg/waitress/task.py", line 311, in write
    rh = self.build_response_header()
  File "/home/mcfaddenm/repos/plone5.2_clean/rfasite/rfasite/eggs/waitress-1.4.4-py3.6.egg/waitress/task.py", line 284, in build_response_header
    return tobytes(res)
  File "/home/mcfaddenm/repos/plone5.2_clean/rfasite/rfasite/eggs/waitress-1.4.4-py3.6.egg/waitress/compat.py", line 69, in tobytes
    return bytes(s, "latin-1")
UnicodeEncodeError: 'latin-1' codec can't encode characters in position 77-80: ordinal not in range(256)

When using Plone to @@stream a file, the debugger shows everything is fine, except that byte-encoding fails because the filename contains characters outside the latin-1 character set: " ＶＩＳＡ.mp3 "

-> return bytes(s, "latin-1")
(Pdb) type(s)
<class 'str'>
(Pdb) !s
'HTTP/1.1 200 OK\r\nAccept-Ranges: bytes\r\nContent-Disposition: inline; filename=ＶＩＳＡ.mp3\r\nContent-Length: 778239\r\nContent-Type: audio/mpeg\r\nDate: Fri, 02 Oct 2020 20:11:59 GMT\r\nServer: waitress\r\nVia: waitress\r\nX-Frame-Options: SAMEORIGIN\r\nX-Powered-By: Zope (www.zope.org), Python (www.python.org)\r\n\r\n'
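The failure can be reproduced outside waitress. This is a minimal sketch that assumes only the Content-Disposition line from the dump above; the variable names are illustrative and not taken from waitress.

```python
# Header line taken from the pdb dump; the filename uses fullwidth letters.
header = "Content-Disposition: inline; filename=ＶＩＳＡ.mp3"

try:
    header.encode("latin-1")
    offending = ""
except UnicodeEncodeError as exc:
    # exc.start and exc.end delimit the run of characters that
    # latin-1 cannot represent (code points above U+00FF).
    offending = header[exc.start:exc.end]

print(repr(offending), [hex(ord(ch)) for ch in offending])
```

Every offending code point is above U+00FF, which is why the codec reports "ordinal not in range(256)".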

Changing that line to return bytes(s, "utf-8") makes it work just fine, but I'm not sure whether it breaks some kind of HTTP rule.

b'HTTP/1.1 200 OK\r\nAccept-Ranges: bytes\r\nContent-Disposition: inline; filename=\xef\xbc\xb6\xef\xbc\xa9\xef\xbc\xb3\xef\xbc\xa1.mp3\r\nContent-Length: 778239\r\nContent-Type: audio/mpeg\r\nDate: Fri, 02 Oct 2020 20:11:59 GMT\r\nServer: waitress\r\nVia: waitress\r\nX-Frame-Options: SAMEORIGIN\r\nX-Powered-By: Zope (www.zope.org), Python (www.python.org)\r\n\r\n'

Is simply changing to utf-8 a good fix?
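One way to see the risk: a recipient that reads header bytes as ISO-8859-1 (the encoding PEP 3333 and RFC 2616 assume) would not see the original filename. The sketch below only simulates such a recipient; whether a given browser behaves this way is an assumption, not something tested here.

```python
filename = "ＶＩＳＡ.mp3"

# What would go on the wire if the whole header were encoded as UTF-8.
wire_bytes = filename.encode("utf-8")

# What a recipient sees if it decodes those bytes as latin-1:
# every byte maps to one character, so nothing fails, but the text is garbled.
seen_by_latin1_reader = wire_bytes.decode("latin-1")

print(len(filename), len(seen_by_latin1_reader))
print(seen_by_latin1_reader == filename)
```

The encode step no longer raises, but the filename silently turns into mojibake for any reader that sticks to latin-1.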

Edit: I may have opened a can of worms. It looks like, at least for my edge case, https://tools.ietf.org/html/rfc2184 says to do something like: filename*=utf-8'zh-cn'\xef\xbc\xb6\xef\xbc\xa9\xef\xbc\xb3\xef\xbc\xa1.mp3
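In the RFC 2184 syntax (and in RFC 5987 / RFC 6266, which cover HTTP), the octets after charset'language' are percent-encoded rather than sent raw, so the resulting header value is plain ASCII. A rough sketch using the standard library follows; `extended_disposition` is a hypothetical helper, not part of waitress or Plone.

```python
from urllib.parse import quote


def extended_disposition(filename, disposition="inline", language=""):
    """Hypothetical helper: build a Content-Disposition value that uses
    the charset'language'percent-encoded form for the filename."""
    # quote() encodes the text as UTF-8 and percent-escapes every byte
    # outside its small unreserved set (letters, digits, "_.-~").
    encoded = quote(filename, safe="")
    return f"{disposition}; filename*=utf-8'{language}'{encoded}"


value = extended_disposition("ＶＩＳＡ.mp3", language="zh-cn")
print(value)

# The value is pure ASCII, so a latin-1 encode no longer fails.
header_bytes = f"Content-Disposition: {value}\r\n".encode("latin-1")
```

The application would have to build the header this way before handing it to the server, which matches the idea that the filename should be encoded before waitress is involved.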

I'd much rather tell my users to use ASCII-only filenames, but I do think we should at least guard against the exception somehow. Encoding the entire HTTP response as utf-8 seems to help more than hurt.

flipmcf commented 3 years ago

In 2.0.0 this would break here: https://github.com/Pylons/waitress/blob/c2980c107a372f635e307ae11d5ac33c2ba57c13/src/waitress/task.py#L279

Why use latin-1? Can we use utf-8?

mmerickel commented 3 years ago

HTTP headers as a general rule only support US-ASCII. PEP 3333 allows latin-1, as quoted below:

Note also that strings passed to start_response() as a status or as response headers must follow RFC 2616 with respect to encoding. That is, they must either be ISO-8859-1 characters, or use RFC 2047 MIME encoding.

What I've done historically for Content-Disposition is use the unidecode package to convert non-ASCII filenames into something that fits. Unfortunately, this isn't really something that waitress is doing incorrectly, AFAIK.
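As a rough illustration of that kind of fallback, here is a sketch that uses the standard library's unicodedata module instead of unidecode. It is a substitute technique, not the same thing: unidecode transliterates, while this version only folds compatibility forms and strips accents, dropping anything else. `ascii_fallback` and its `default` argument are hypothetical names.

```python
import unicodedata


def ascii_fallback(filename, default="download"):
    """Hypothetical helper: best-effort ASCII version of a filename."""
    # NFKD splits accented letters into base letter plus combining mark and
    # maps compatibility characters (such as fullwidth letters) to ASCII.
    decomposed = unicodedata.normalize("NFKD", filename)
    # Drop whatever still cannot be represented in ASCII.
    ascii_name = decomposed.encode("ascii", "ignore").decode("ascii")
    # If nothing survives, use the default name instead of an empty string.
    return ascii_name or default


safe_name = ascii_fallback("ＶＩＳＡ.mp3")
header_value = f"inline; filename={safe_name}"
print(header_value)
```

The result always encodes as latin-1, at the cost of losing characters that have no ASCII equivalent under NFKD.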

We could try to provide better support for RFC 2047-formatted strings, as you noted, but I'm not sure it would be worth it. I have no idea what browser support is like for that, and I suspect it's better to just stick to US-ASCII.

flipmcf commented 3 years ago

Thank you for the reply. I agree that this is not the job of waitress. The filename should be properly encoded before waitress is involved.

Pylons / waitress

Why use latin-1 encoding in compat.py? #318