Pylons / webob

WSGI request and response objects
https://webob.org/
434 stars 187 forks source link

don't try to encode unicode as latin1 #11

Closed evgeni closed 12 years ago

evgeni commented 12 years ago

The fix for http://trac.pythonpaste.org/pythonpaste/ticket/251 introduced the following code:

    if isinstance(value, text_type) and not PY3:
        value = value.encode('latin-1')

As unicode >> latin1 (latin1 does not even contain a simple €) and I don't understand why headers should be latin-1 anyways (my understanding is they have to be ascii), the following code breaks badly:

>>> from webob import Response
>>> resp = Response()
>>> resp.etag = u'€'
Traceback (most recent call last):
  File "<input>", line 1, in <module>
  File "webob/descriptors.py", line 138, in fset
    hset(r, val)
  File "webob/descriptors.py", line 112, in fset
    value = value.encode('latin-1')
UnicodeEncodeError: 'latin-1' codec can't encode character u'\u20ac' in position 1: ordinal not in range(256)

I would suggest value = value.encode('ascii', 'ignore') here (just to be safe, noone should actually pass non-ascii here, but you never know).

mcdonc commented 12 years ago

The particular encoding aside, why should WebOb ignore characters that someone is trying to attach to a header? I'd personally much rather that it raised an exception, like it does now, instead of silently ignoring an error.

evgeni commented 12 years ago

Then it should "just" accept ascii, not latin1 (read: your point is valid ;))

mcdonc commented 12 years ago

I think you're right, by my reading of rfcs 2616/822, e.g.

section 4.2 of RFC 2616:

HTTP header fields, which include general-header (section 4.5),
request-header (section 5.3), response-header (section 6.2), and
entity-header (section 7.1) fields, follow the same generic format as
that given in Section 3.1 of RFC 822 [9].

section 3.1.2 of RFC 822:

    Once a field has been unfolded, it may be viewed as being com-
    posed of a field-name followed by a colon (":"), followed by a
    field-body, and  terminated  by  a  carriage-return/line-feed.
    The  field-name must be composed of printable ASCII characters
    (i.e., characters that  have  values  between  33.  and  126.,
    decimal, except colon).  The field-body may be composed of any
    ASCII characters, except CR or LF.  (While CR and/or LF may be
    present  in the actual text, they are removed by the action of
    unfolding the field.)

But I have to plead ignorance a bit here. I wish there was some canonical source of information for this stuff that I felt like I trusted. ;-)

maluke commented 12 years ago

If we encode etags as ('ascii', 'ignore') we'll end up with etag matches when there were none. Also, "Errors should never pass silently". In the end, if the user sets such a header to non-latin-1 unicode they should either do the encoding you described themselves or get the error so that they know that they are doing it and the headers cannot encode such values.

On the other hand the encoding of headers indeed seems to be ASCII (and only a subset of that is allowed anyway), I'm not sure where the assumption of headers being latin-1 comes, I recall it being generally accepted when the WSGI for py3 was being discussed, but I'm not sure.

I'd suggest asking this on the mailing list to see if anyone knows better.

maluke commented 12 years ago

The spec has this bit:

The TEXT rule is only used for descriptive field contents and values that are not intended to be interpreted by the message parser. Words of *TEXT may contain characters from character sets other than ISO-8859-1 [22] only when encoded according to the rules of RFC 2047 [14]. http://www.w3.org/Protocols/rfc2616/rfc2616-sec2.html#sec2.2

I can't tell if what webob does is a correct interpretation of that rule or not. (We still don't want RFC 2047 anyway, because browsers don't seem to implement it IIRC).

This suggests that current code is correct: http://trac.tools.ietf.org/wg/httpbis/trac/ticket/74

mcdonc commented 12 years ago

I concur with Sergey that the version of RFC2616 referenced above does seem to imply that header values can be latin-1 (iso-8859-1) encoded. It's defensible.

mcdonc commented 12 years ago

I had a discussion with Dan Connolly on IRC (he is a contributor to the spec), and although he regrets naming latin-1 as the TEXT default, he believes we are obligated to decode latin-1-encoded headers sent to us. For this reason, I personally think it's reasonable to also encode headers as latin-1 (as they should/will be decodeable by WebOb). Dan has some cognitive dissonance though, and claims it's likely we should fail when asked to encode anything that falls outside ASCII. I'm apt to just make it simple and always decode from and encode to latin-1, though.

dckc commented 12 years ago

Fair enough. Just don't fail silently.

mcdonc commented 12 years ago

Closing this issue; no fix required, I think. If you want to pass characters that fall outside latin-1 in headers, you'll have to encode them to ascii (or latin-1) at the application layer, and decode them there too.