Closed evgeni closed 12 years ago
The particular encoding aside, why should WebOb ignore characters that someone is trying to attach to a header? I'd personally much rather that it raised an exception, like it does now, instead of silently ignoring an error.
Then it should "just" accept ascii, not latin1 (read: your point is valid ;))
I think you're right, by my reading of rfcs 2616/822, e.g.
section 4.2 of RFC 2616:
HTTP header fields, which include general-header (section 4.5),
request-header (section 5.3), response-header (section 6.2), and
entity-header (section 7.1) fields, follow the same generic format as
that given in Section 3.1 of RFC 822 [9].
section 3.1.2 of RFC 822:
Once a field has been unfolded, it may be viewed as being com-
posed of a field-name followed by a colon (":"), followed by a
field-body, and terminated by a carriage-return/line-feed.
The field-name must be composed of printable ASCII characters
(i.e., characters that have values between 33. and 126.,
decimal, except colon). The field-body may be composed of any
ASCII characters, except CR or LF. (While CR and/or LF may be
present in the actual text, they are removed by the action of
unfolding the field.)
But I have to plead ignorance a bit here. I wish there was some canonical source of information for this stuff that I felt like I trusted. ;-)
If we encode etags as ('ascii', 'ignore') we'll end up with etag matches when there were none. Also, "Errors should never pass silently". In the end, if the user sets such a header to non-latin-1 unicode they should either do the encoding you described themselves or get the error so that they know that they are doing it and the headers cannot encode such values.
On the other hand the encoding of headers indeed seems to be ASCII (and only a subset of that is allowed anyway), I'm not sure where the assumption of headers being latin-1 comes, I recall it being generally accepted when the WSGI for py3 was being discussed, but I'm not sure.
I'd suggest asking this on the mailing list to see if anyone knows better.
The spec has this bit:
The TEXT rule is only used for descriptive field contents and values that are not intended to be interpreted by the message parser. Words of *TEXT may contain characters from character sets other than ISO-8859-1 [22] only when encoded according to the rules of RFC 2047 [14]. http://www.w3.org/Protocols/rfc2616/rfc2616-sec2.html#sec2.2
I can't tell if what webob does is a correct interpretation of that rule or not. (We still don't want RFC 2047 anyway, because browsers don't seem to implement it IIRC).
This suggests that current code is correct: http://trac.tools.ietf.org/wg/httpbis/trac/ticket/74
I concur with Sergey that the version of RFC2616 referenced above does seem to imply that header values can be latin-1 (iso-8859-1) encoded. It's defensible.
I had a discussion with Dan Connolly on IRC (he is a contributor to the spec), and although he regrets naming latin-1 as the TEXT default, he believes we are obligated to decode latin-1-encoded headers sent to us. For this reason, I personally think it's reasonable to also encode headers as latin-1 (as they should/will be decodeable by WebOb). Dan has some cognitive dissonance though, and claims it's likely we should fail when asked to encode anything that falls outside ASCII. I'm apt to just make it simple and always decode from and encode to latin-1, though.
Fair enough. Just don't fail silently.
Closing this issue; no fix required, I think. If you want to pass characters that fall outside latin-1 in headers, you'll have to encode them to ascii (or latin-1) at the application layer, and decode them there too.
The fix for http://trac.pythonpaste.org/pythonpaste/ticket/251 introduced the following code:
As unicode >> latin1 (latin1 does not even contain a simple €) and I don't understand why headers should be latin-1 anyways (my understanding is they have to be ascii), the following code breaks badly:
I would suggest
value = value.encode('ascii', 'ignore')
here (just to be safe, noone should actually pass non-ascii here, but you never know).