defnull / multipart

A fast multipart/form-data parser for python
https://multipart.readthedocs.io/
MIT License
135 stars 33 forks source link

Consistently decode URL-encoded form data #36

Closed cjwatson closed 3 years ago

cjwatson commented 3 years ago

It's confusing for raw bytes and percent-encoded bytes in URL-encoded form data to be decoded using different character sets; this happened if passing a charset parameter other than "latin1" to parse_form_data.

https://url.spec.whatwg.org/#application/x-www-form-urlencoded intentionally doesn't cover non-UTF-8 cases, but it explicitly says that a parser should perform bytewise percent-decoding followed by UTF-8 decoding, so we might infer that we should do something analogous for whatever charset is specified. However, Python 2's unquote (called by parse_qs unconditionally decodes as ISO-8859-1 if called with Unicode input that contains percent-encoded sequences. To work around this, make sure to call it with bytes input, and then decode the individual values ourselves.

This is similar to #30, with corresponding test cases, but the details are different for Python 2. This was noticed by the Launchpad test suite after applying the changes from https://github.com/zopefoundation/zope.publisher/pull/66 (although it was a problem before that, just masked).

defnull commented 3 years ago

Hmm I'm not sure. parse_qs does have an encoding parameter in 2.7.4+ and 3.1+, which defaults to utf8 and is passed down to unquote. So, if I understand that correctly, passing bytes to that function will assume utf8 for percent-encoded strings and would return wrong results if any other encoding is expected. You also missed percent-encoded keys I think.

Why not pass bytes to parse_qs and also set the encoding parameter? For python versions below 2.7 we need a workaround, but for anything remotely recent that should work as expected.

defnull commented 3 years ago

Okay, I should have read #30 first. So, this is a workaround for ancient python versions NOT having the encoding parameter? Hmm. May work, but keys need attention too.

cjwatson commented 3 years ago

You're right about Python 3 (which is a separate branch; this branch is py2-only), but there's no encoding parameter to parse_qs in 2.7 as far as I can see. Where are you looking?

defnull commented 3 years ago

I git-blamed the line that added the parameter and according to github, this was introduced 11 years ago in 2.7.4rc1 https://github.com/python/cpython/commit/ac71c54b88b916e05e4279e9c4306d227f3a6dfe

cjwatson commented 3 years ago

I think git blame is misleading you somehow. https://github.com/python/cpython/blob/v2.7.18/Lib/urlparse.py#L385 doesn't show such a parameter, and https://docs.python.org/2/library/urlparse.html#urlparse.parse_qs doesn't document it.

cjwatson commented 3 years ago

Keys are a good point, though (I overlooked those since they're rather more likely to be plain ASCII identifiers in most cases). Should be fixed now.

defnull commented 3 years ago

So gitlab is lying to me :D

image image

cjwatson commented 3 years ago

Yeah, I'm very confused about why it thinks that; maybe Python's complex git history with lots of cherry-picks between branches is confusing it somehow. When I run git describe --contains ac71c54b88b916e05e4279e9c4306d227f3a6dfe and git describe --tags ac71c54b88b916e05e4279e9c4306d227f3a6dfe locally, it looks like it was introduced somewhere between 3.2b2 and 3.2rc1.

cjwatson commented 3 years ago

Thanks!