Bug report: either an unseen bug or a fundamental misunderstanding of Unicode, latin1, utf-8

bottlepy / bottle

bottle.py is a fast and simple micro-framework for python web-applications.

http://bottlepy.org/

MIT License


Bug report: either an unseen bug or a fundamental misunderstanding of Unicode, latin1, utf-8 #1217

Closed kouritron closed 4 years ago

kouritron commented 4 years ago

my god just look at this: https://github.com/bottlepy/bottle/blob/master/bottle.py#L2211

First it asserts we have a 'unicode' or python3 str, then does this: return s.encode('latin1').decode( ... something which is probably 'utf8' ... )

really ?? turn s into a latin1 bit pattern then try to interpret it as utf-8. this is a contradiction in itself. it's like turning something into base64 then trying to interpret it as a hex encoded string.

latin1 and utf-8 are not compatible, they are both supersets of ASCII, latin1 grabs the upper half of the 1-byte address space and essentially extends the ASCII table by a few more french, spanish characters, ....

UTF-8 is a variable length encoding that supports a potentially unlimited extension to the ASCII table. when UTF-8 turns on the most significant bit, it means look at the next byte (at least).
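To make the difference concrete, here is a small sketch that prints the bytes each encoding produces. The sample characters are arbitrary picks for illustration, not anything taken from the bottle source.

```python
# Encode a few characters with utf8 and look at the resulting bytes.
samples = ["a", "é", "€", "😀"]
utf8_lengths = {ch: len(ch.encode("utf8")) for ch in samples}

for ch in samples:
    # Show each byte in binary: ASCII bytes start with 0, all others with 1.
    bits = " ".join(f"{b:08b}" for b in ch.encode("utf8"))
    print(repr(ch), utf8_lengths[ch], "byte(s):", bits)

# latin1 always uses exactly one byte, but only covers code points 0..255.
latin1_e = "é".encode("latin1")
print(latin1_e)   # b'\xe9'
```

The same character `é` is the single byte `0xE9` in latin1 but two bytes in utf8, which is why the two encodings only agree on the ASCII range.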

------------------------------------------------------
here is how PEP-3333 should've been interpreted

-------------- (rules for everyone).

Limit the bit patterns you write into HTTP headers, to only those expressible by at least latin1 and really you should limit yourself to just ASCII. Seriously what the bleep, do you need french characters in HTTP headers for ? it would most likely break most servers.
feel free to work with unicode in your application side, but turn it into bytes before writing to any socket for the "body" portion of HTTP communication.

-------------- (rules for framework/application).
- stick to ASCII for HTTP headers (should even assert nothing beyond ASCII is written to any socket anywhere before the point of no return.)

tell users (app devs) to feel free to use unicode and whatever language they want in text processing. but either give the framework a bytes object when responding, or expect the framework to turn it into one, using utf-8 encoding.

------------------------------------------------------
Why has it not been a complete show-stopper flaw until now? probably because you already do what I said, in that nothing beyond ASCII is being written to headers.

this line: return s.encode('latin1').decode( ... something which is probably 'utf8' ... )

does not raise an exception if "latin1" is completely unnecessary, in other words: s.encode('latin1') really is equal to s.encode('ASCII') if 's' contains nothing beyond the first 128 symbols of unicode (AKA the ASCII table)

try putting some french characters in s and you get an exception right away.
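A quick way to reproduce both cases, as a minimal sketch. The helper name `recode` is made up here and only mirrors the shape of the line quoted above; the `'utf8'` default is an assumption, since the actual argument is elided in this report.

```python
def recode(s, encoding="utf8"):
    # Hypothetical stand-in for the quoted line; 'utf8' is an assumption.
    return s.encode("latin1").decode(encoding)

# ASCII-only text: latin1 and ASCII produce identical bytes, so nothing breaks.
assert "liberty".encode("latin1") == "liberty".encode("ascii")
ascii_result = recode("liberty")
print(ascii_result)

# An ordinary unicode string with a non-ASCII character fails to decode.
try:
    recode("liberté")
    raised = False
except UnicodeDecodeError as exc:
    raised = True
    print("failed:", exc)
```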

defnull commented 4 years ago

Some headers (e.g. cookies) and the request path do allow non-ASCII characters, but HTTP does not know about unicode and does not define an encoding, and rfc2047 annotated headers are actually pretty rare. All modern browsers will just transmit utf8 encoded bytes instead. (Actually, browsers will use the same encoding the HTML page containing the link or form was encoded with, but that defaults to utf8 nowadays).

PEP-3333 requires all these strings to be passed to the WSGI application as 'str', which is unicode in Python 3. To do that, the WSGI server implementation has to decode the bytes that came through the wire with some encoding. This encoding is latin1 (aka ISO-8859-1) because that is defined for all possible byte values (unlike ASCII, which is a 7bit encoding and cannot encode byte values higher than 127). latin1 is not the correct encoding though. A utf8 encoded byte string decoded with latin1 contains broken glyphs. You probably have seen that already (e.g. ö becomes Ã¶). To get the correct text, you have to re-encode the unicode string with latin1 to get the original byte value back, then decode it with the correct encoding (e.g. utf8). WSGI does not know the correct encoding, so this has to happen at application (or framework) level.
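A minimal sketch of that request-side round trip, assuming the client sent utf8. The variable names are illustrative only and are not Bottle's.

```python
# Bytes as they arrive on the wire, assuming the browser encoded with utf8.
wire = "ö".encode("utf8")            # b'\xc3\xb6'

# What a WSGI server hands to the application: each byte becomes one character.
wsgi_str = wire.decode("latin1")
print(repr(wsgi_str))                # 'Ã¶'

# The framework recovers the original bytes, then applies the real encoding.
fixed = wsgi_str.encode("latin1").decode("utf8")
print(fixed)                         # ö
```

The important detail is that `wsgi_str` is not ordinary text. It is two characters standing in for two bytes, so encoding it with latin1 loses nothing.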

The exact same problem exists on the way back (response headers), and it is solved in the exact same way. Bottle will encode headers with utf8 to get the bytes that should actually be transmitted, and then decodes that with latin1 to please the WSGI spec. The WSGI server implementation will blindly encode these unicode strings again with latin1, which produces the same bytes we had before, thus happily writing utf8 encoded bytes to the socket.

You could also say that WSGI uses latin1 as a transparent encoding to store byte data in unicode strings without losing any information, to allow re-encoding with the correct encoding later. Byte strings in disguise. You cannot use ASCII for that because that is invalid for byte values larger than 127. You also cannot use utf8 because some byte sequences would be invalid in that encoding and produce errors.
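This property is easy to check. The sketch below feeds every possible byte value through the three codecs; `can_decode` is a helper name invented for this illustration.

```python
all_bytes = bytes(range(256))

# latin1 maps every byte value 0..255 to the code point with the same number,
# so decoding never fails and encoding gives the identical bytes back.
as_text = all_bytes.decode("latin1")
assert as_text.encode("latin1") == all_bytes
assert [ord(c) for c in as_text] == list(range(256))

def can_decode(data, encoding):
    # Hypothetical helper: report whether `data` is valid in `encoding`.
    try:
        data.decode(encoding)
        return True
    except UnicodeDecodeError:
        return False

print(can_decode(all_bytes, "latin1"))  # True
print(can_decode(all_bytes, "ascii"))   # False
print(can_decode(all_bytes, "utf8"))    # False
```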

defnull commented 4 years ago

> Seriously what the bleep, do you need french characters in HTTP headers for? It would most likely break most servers.

Header names are limited to ASCII. Header values (or the request path) are not, and no server would complain if you send a French name in a cookie or a query parameter. For the server, everything is bytes. And if your application chokes on a non-English name, in 2020, you did something wrong. The "ASCII text only" mindset is a relic from the past.

defnull commented 4 years ago

Closed. Not a bug. This is how WSGI (pep-3333) works, unfortunately. Mostly for historic reasons (byte strings were very limited in the early days).

kouritron commented 4 years ago

Dont mean to beat this issue excessively so if you want to stop talking about it just say so. I just wanted to bring this to your attention.

I am also sure you know way more about PEP-3333 which is something i started reading a few days ago. Notwithstanding,

unicode_string.encode('latin1').decode( 'utf8' )

makes no sense whatsoever. unless you've got a weird hack that intends to raise an exception here.

it makes no sense to turn something into a latin1 (aka ISO-8859-1) encoded bit pattern and then interpret it as UTF-8, which is what that code does.

a = "liberté"
a.encode("latin1").decode("utf8")

the problem still stands, i dont know if my solution was perfect or not but thats a diff issue. but i think it would do fine. french characters in cookies is a different thing, its just given back to you, as it was sent. Its not a command to a browser or HTTP server, I dont think it matters what bit pattern you put there.

as for "PEP-3333 requires all these strings to be passed to the WSGI application as 'str'"

PEP-3333 requirements are for the contract between WSGI_Server/Gateway and the Application/Framework and not for contracts between framework and web application. PEP-3333 in fact says nothing about the latter.

and since micro frameworks aim to be un-opinionated and allow the devs to decide where possible, i see no reason whatsoever why bottle itself needs anything beyond ASCII.

and beyond that, i dont see the need to filter something through 2 encoding schemes. Theoretically you are given either bytes or unicode. if bytes pass along, if unicode, encode and pass along. dont understand why it has to go through two schemes, and wrongly in this case. Again it makes no sense to write a German essay and then try to interpret it as Chinese or something.

v-python commented 4 years ago

> i dont see the need to filter something through 2 encoding schemes. Theoretically you are given either bytes or unicode.

And that is where you misunderstand. Marcel's answer was fine, although he didn't emphasize the salient point that you are missing. When he said this:

> You could also say that WSGI uses latin1 as a transparent encoding to store byte data in unicode strings without losing any information, to allow re-encoding with the correct encoding later. Byte strings in disguise.

what you missed is that your assumed theory is all wet. You theorize that the data is either bytes, or unicode, but that is simply not true. The data that comes in is in a Python str, but it contains one byte of data in each char. In Python 2, that is exactly the same as bytes. In Python 3, that is not exactly the same as bytes. In Python 2, encoding or decoding using 'latin1' is the identity operation. In Python 3 it is not. The design of WSGI was during the period when both Python 2 and Python 3 were in widespread use. This encode/decode scheme was invented so that both Python 2 and Python 3 could use the str type to manipulate the data, and code could be shared between applications running on either.
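A rough Python 3 sketch of what "one byte of data in each char" means, assuming the client sent utf8. The variable names are made up for the illustration.

```python
text = "liberté"                      # real text: 7 characters
wire = text.encode("utf8")            # 8 bytes, because é takes two bytes in utf8

# What arrives from the WSGI server: a str with one character per wire byte.
from_wsgi = wire.decode("latin1")
print(len(text), len(wire), len(from_wsgi))   # 7 8 8

# Every character of such a string is a byte value in disguise.
assert all(ord(ch) < 256 for ch in from_wsgi)
assert [ord(ch) for ch in from_wsgi] == list(wire)

# Undoing the disguise and decoding with the real encoding restores the text.
restored = from_wsgi.encode("latin1").decode("utf8")
print(restored == text)               # True
```

So the `encode('latin1')` step is not converting text to latin1. It is only unwrapping the bytes that the server wrapped in a str.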

If this isn't a satisfactory explanation, you need to go read the WSGI spec, and its design history until you understand that your assumed theory doesn't match the actual facts, and that this encode/decode is doing exactly what it is supposed to do.

defnull commented 4 years ago

This aspect of WSGI is indeed very strange and unintuitive, but also interesting. I don't mind talking about it. It's a very common misunderstanding for people that try to work with WSGI directly. Thankfully, we have frameworks to abstract away the design warts of earlier days.

Perhaps another approach to explain it. This is what actually happens with a cookie header:

# Apps work with text (unicode). Forcing them to work with raw bytes or needlessly
# limiting them to ASCII is completely unnecessary.
app = "Set-Cookie: name=Bjørn"

# WSGI expects `str` (which is unicode in python3) and promises to encode them
# using `latin1`. But we want utf8, not latin1, so we have to wrap the bytes we actually
# want to have on the wire in a unicode string that will produce these bytes when
# encoded with `latin1` later.
# Thankfully, this is abstracted away by the WSGI framework (e.g. bottle). 
wsgi = app.encode('utf8').decode('latin1')

# The WSGI server will blindly encode all strings with latin1, as promised.
# Since latin1 is a bijective codec, we get back the utf8 encoded byte sequence we want. 
http = wsgi.encode('latin1')

# Browser decodes bytes to text using utf8 in most places, if no other encoding is set.
browser = http.decode('utf8')
print("Browser sees:", browser)

# <-- And now the way back -->

# Browsers work with text (unicode). HTML pages have an encoding, links are text,
# forms are text, everything but file uploads is text.
browser = 'Cookie: name=Bjørn'

# Browsers encode text with utf8 in most places, if no other encoding is set.
http = browser.encode('utf8')

# WSGI does not know the encoding used by the browser, but it has to pass
# 'str' (unicode) to the app. The only choice is to use a reversible (bijective)
# encoding like latin1 to allow the app to fix the mess later. Or to pass bytes
# to the app, but that was decided against in the early days for various reasons.
wsgi = http.decode('latin1')

# The application expects correctly decoded text values, so bottle has to fix it.
# It will encode the WSGI provided text with latin1 to get the original bytes back,
# then apply the correct codec. All without the developer knowing. Neat, isn't it?
app = wsgi.encode('latin1').decode('utf8')
print("App sees:", app)

You see? No UnicodeDecodeError and the result is correct. It also works with Chinese or Russian glyphs that are not covered by latin1. It works with all of unicode. And this is how all WSGI frameworks do it, I promise.
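The same round trip can be checked for text outside latin1. This is only a sketch: `to_wsgi` and `from_wsgi` are hypothetical helper names, not Bottle functions, and utf8 is assumed as the wire encoding.

```python
def to_wsgi(text, encoding="utf8"):
    # Hypothetical helper: real text -> "bytes in disguise" str for the server.
    return text.encode(encoding).decode("latin1")

def from_wsgi(value, encoding="utf8"):
    # Hypothetical helper: "bytes in disguise" str from the server -> real text.
    return value.encode("latin1").decode(encoding)

samples = ["Bjørn", "Привет", "你好", "😀"]
for s in samples:
    disguised = to_wsgi(s)
    # The WSGI server can always encode the disguised string with latin1,
    # and what it writes to the socket is exactly the utf8 encoding of s.
    assert disguised.encode("latin1") == s.encode("utf8")
    assert from_wsgi(disguised) == s

# Handing the raw text to the server instead would fail for most of these.
try:
    "你好".encode("latin1")
    direct_ok = True
except UnicodeEncodeError:
    direct_ok = False
print(direct_ok)   # False
```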

defnull commented 4 years ago

The reason for all this mess is that in the early days, byte strings in Python3 were very limited. They were just byte arrays with little to no convenience methods. The web-sig guys/gals had to fix the WSGI spec to work with Python3, somehow. They had a choice:

- Change WSGI to work with 'bytes' instead of 'str' and require applications/frameworks to explicitly encode/decode everything, everywhere. Since 'bytes' API sucked at that point, this would make writing applications directly against the WSGI spec without a framework a real pain. But who does that, anyways?
- Implement a hack with bytes represented as latin1 unicode strings. This somehow works transparently with most headers and very simple non-international applications, but requires this strange re-encoding dance as soon as you want to actually do something useful.

They decided for the second option. Probably because that required only a small additional section and some minor fixes to the spec, and they could keep the old examples and most of the wording intact.

I personally think this was an error, and they should have targeted the WSGI spec at framework authors, not application developers. WSGI sucks as a direct application API anyway. But here we are now. WSGI is the way it is, and most developers rely on frameworks to hide the details.