Encoding of url queries is specified to be utf-8.

Jolbas commented 3 years ago

It seems the url encoding actually is defined to be utf-8 nowadays and since HTML5 was introduced. https://url.spec.whatwg.org/#application/x-www-form-urlencoded And I think that hardcoded latin1 decodings like this is causing unnecessary trouble with a need to encode and decode: https://github.com/bottlepy/bottle/blob/f9b1849db4dd724e36a93a1032be592193fba581/bottle.py#L123 This line was introduced in https://github.com/bottlepy/bottle/commit/a3dcff4cd676cac37e513c7e044e5be419da175d but I can't reproduce the reported problems even if I remove the mentioned line

Jolbas commented 3 years ago

After some more tests I found that issue https://github.com/bottlepy/bottle/issues/342 is back if I remove line 123. But if moving the catching of UnicodeError from getunicode() to _fix(). Then _fix() kan return the input string on UnicodeError which I think make more sense.

    def _fix(self, s, encoding=None):
        if isinstance(s, unicode) and self.recode_unicode:  # Python 3 WSGI
            try:
                return s.encode('latin1').decode(encoding or self.input_encoding)
            except (UnicodeError):
                return s
        elif isinstance(s, bytes):  # Python 2 WSGI
            return s.decode(encoding or self.input_encoding)
        else:
            return s

and

    def getunicode(self, name, default=None, encoding=None):
        """ Return the value as a unicode string, or the default. """
        try:
            return self._fix(self[name], encoding)
        except (KeyError):
            return default

Jolbas commented 3 years ago

This also makes FormsDict.__getattr__() better when there are non ascii characters in the query key. With for example https://example.com/test?%C3%B6=%C3%B6 this would be true: bottle.request.query.ö == 'ö'

defnull commented 3 years ago

HTTP is not unicode-aware and headers are byte-strings. There were annoying to work with in early days of python 3, so the PEP-3333 (WSGI) authors defined that str is to be used, but bytes are de/encoded with the reversible latin1 codec so applications can re-code them to/from the desired codec. Bottle already defaults to utf-8 in most places and re-encodes correctly.

That latin1 is also used for url-encoded form data in the body of a request is due to a bug in early versions of cgi.FieldStorage. The re-encoding to utf8 happens in the returned FormDict and is actually change-able. HTML5 may default to utf-8 but HTTP client libraries, command line tools and legacy software may actually use a different encoding and being able to change that on application-level at runtime is kind of important.

Also note that getunicode() does not return the un-transcoded value on UnicodeError, but the fallback value. Silently ignoring UnicodeError and just returning the broken latin1 version of the string may result in really annoying and hard to debug behavior. I'm not sure if hat's an improvement.

Jolbas commented 3 years ago

Yes. I understand better now how it works and why. I would have preferred a ready decoded FormsDict from bottle.request.query but I understand your point about other HTTP tools may need other encodings and can manage to do bottle.request.query.decode() instead.

But about catching UnicodeError in getunicode(). Isn't returning the default value also kind of silently ignoring? Just curious about the reason to return an empty string here but throwing the error in decode(). Wouldn't it be better to not catch the UnicodeError?

Jolbas commented 3 years ago

This may be a little off topic now but I think that using errors='replace' in _fix() is better than returning default on UnicodeError. Because as it is now, it behaves like the key doesn't exists.

return s.encode('latin1').decode(encoding or self.input_encoding, errors='replace')

https://github.com/bottlepy/bottle/blob/f9b1849db4dd724e36a93a1032be592193fba581/bottle.py#L2209

defnull commented 3 years ago

Attribute access is documented as returning an empty string on missing (or broken) values. That's just for convenience. Empty strings still evaluate as false and can easily be checked, but won't break the template if used in an <input value=...> attribute. Some common form-handling patterns work nicely with this approach.

Changing this to errors='replace' would allow broken (non-utf-8) values to appear as valid (non-empty) input. It would hide encoding errors and cause applications to store and process broken data. Hiding errors is usually not a good idea.

We could think about also applying _fix to the key during lookup, since FormDict.decode() does that too. But are unicode-keys really that common?

Jolbas commented 3 years ago

I don't know if it's common to use unicode keys. And it is probably hard to find anyone recommending it. I use it sometimes if it can make the code easier to read. It would be helpful if _fix() is applied to the keys during lookup but that would cause an unnecessarily complicated lookup process, wouldn't it?

I think that the current behaviour of __getattr__() with the empty string as default is that kind of problematic error hiding you are talking about, even if it is documented. Actually more hiding than the behavior of errors='replace'. Often an empty string is a valid input from forms, and an encoding error will likely trigger unexpected actions in that case. And the application can currently not distinguish between intended empty string and an error. Shouldn't it rather return None as default like getunicode()?

defnull commented 3 years ago

Often an empty string is a valid input from forms, and an encoding error will likely trigger unexpected actions in that case. And the application can currently not distinguish between intended empty string and an error.

The current implementation returns an empty string for missing, empty, or bad data. This makes it easy to check for the presence of a valid value (if request.forms.key: ...), which is by far the most common check. If empty data is also valid, but missing data is not, then if 'key' in request.forms: ... is the obvious solution. The only case that is not trivial to check for is bad data. True.

But fortunately, unicode decoding errors are the absolute exception. Browsers and HTML5 default to utf-8. Form values with non-utf8-decodable data should never happen, and if they happen, then there is no perfect solution. Either you require the developer to handle this case explicitly (by throwing an error or returning None), or you try to do what most applications would do anyway: Threat bad data as no/missing data. This is what bottle does. Fixing 'some' of the bad data (as error=replace does) and hoping this does not cause issues later, is a gamble. It may work for you, it might not for others.

Yes, this is something one can argue about. But changing one imperfect solution for another, and breaking backwards-compatibility on the way, is not really appealing to me. If you expect bad data, and want to use as much of it as possible, you still have the option to bypass this convenient attribute-access method and just decode the original value yourself. This is part of bottles philosophy: sensible and convenient defaults, but easy ways to bypass the framework them if needed.

I'll accept pull requests for documentation changes that make this behavior cleaner, but I do not think that changing the current behavior is necessary at this point.

bottlepy / bottle

Encoding of url queries is specified to be utf-8. #1334