Closed cjwatson closed 3 years ago
Hmm I'm not sure. parse_qs
does have an encoding
parameter in 2.7.4+ and 3.1+, which defaults to utf8
and is passed down to unquote
. So, if I understand that correctly, passing bytes to that function will assume utf8
for percent-encoded strings and would return wrong results if any other encoding is expected. You also missed percent-encoded keys I think.
Why not pass bytes to parse_qs
and also set the encoding
parameter? For python versions below 2.7 we need a workaround, but for anything remotely recent that should work as expected.
Okay, I should have read #30 first. So, this is a workaround for ancient python versions NOT having the encoding
parameter? Hmm. May work, but keys need attention too.
You're right about Python 3 (which is a separate branch; this branch is py2-only), but there's no encoding
parameter to parse_qs
in 2.7 as far as I can see. Where are you looking?
I git-blamed the line that added the parameter and according to github, this was introduced 11 years ago in 2.7.4rc1 https://github.com/python/cpython/commit/ac71c54b88b916e05e4279e9c4306d227f3a6dfe
I think git blame
is misleading you somehow. https://github.com/python/cpython/blob/v2.7.18/Lib/urlparse.py#L385 doesn't show such a parameter, and https://docs.python.org/2/library/urlparse.html#urlparse.parse_qs doesn't document it.
Keys are a good point, though (I overlooked those since they're rather more likely to be plain ASCII identifiers in most cases). Should be fixed now.
So gitlab is lying to me :D
Yeah, I'm very confused about why it thinks that; maybe Python's complex git history with lots of cherry-picks between branches is confusing it somehow. When I run git describe --contains ac71c54b88b916e05e4279e9c4306d227f3a6dfe
and git describe --tags ac71c54b88b916e05e4279e9c4306d227f3a6dfe
locally, it looks like it was introduced somewhere between 3.2b2 and 3.2rc1.
Thanks!
It's confusing for raw bytes and percent-encoded bytes in URL-encoded form data to be decoded using different character sets; this happened if passing a
charset
parameter other than "latin1" toparse_form_data
.https://url.spec.whatwg.org/#application/x-www-form-urlencoded intentionally doesn't cover non-UTF-8 cases, but it explicitly says that a parser should perform bytewise percent-decoding followed by UTF-8 decoding, so we might infer that we should do something analogous for whatever
charset
is specified. However, Python 2'sunquote
(called byparse_qs
unconditionally decodes as ISO-8859-1 if called with Unicode input that contains percent-encoded sequences. To work around this, make sure to call it with bytes input, and then decode the individual values ourselves.This is similar to #30, with corresponding test cases, but the details are different for Python 2. This was noticed by the Launchpad test suite after applying the changes from https://github.com/zopefoundation/zope.publisher/pull/66 (although it was a problem before that, just masked).