trim21 commented 5 months ago

[ ] Initially raised as discussion #3168

Problem

We should check input value types. The current behavior, which simply converts every input value with str(v), is not ideal.

for example:

import httpx

r = httpx.post(
    "https://httpbin.org/post",
    data={
        "content": bytes.fromhex(
            "4f5f13d16da9f10d013933673a1ef4e7b974dbe3"
        ),  # random non-printable bytes
    },
)
print(r.json()["form"])

will get {'content': "b'O_\\x13\\xd1m\\xa9\\xf1\\r\\x0193g:\\x1e\\xf4\\xe7\\xb9t\\xdb\\xe3'"}

Proposal

Add type checking and more coercion.

for example:

form

convert int/float to str with str(v)
bool (Represent as JSON-style "true"/"false")
None (Represent as empty string)
Enum (Use .value and then as above.)
TypeError for anything else.
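
A minimal sketch of the form coercion rules listed above. The helper name coerce_form_value is hypothetical and not part of httpx; passing str through unchanged is an assumption, since the list does not mention it explicitly.

```python
from enum import Enum
from typing import Any


def coerce_form_value(value: Any) -> str:
    """Hypothetical helper (not an httpx API): coerce a form value to str."""
    # Unwrap enums first, then apply the remaining rules to the underlying value.
    if isinstance(value, Enum):
        return coerce_form_value(value.value)
    # bool must be checked before int, because bool is a subclass of int.
    if isinstance(value, bool):
        return "true" if value else "false"
    if value is None:
        return ""
    # Assumption: plain strings pass through unchanged.
    if isinstance(value, str):
        return value
    if isinstance(value, (int, float)):
        return str(value)
    # Everything else (including bytes) is rejected instead of being str()-coerced.
    raise TypeError(f"Unsupported form value type: {type(value).__name__}")
```

With this, the bytes value from the example in the Problem section would raise TypeError instead of being sent as its repr.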

query

In addition to the above, query should support bytes via percent-encoding and should not raise.
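
For the query case, a sketch of what percent-encoding bytes could look like using only the stdlib. encode_query_value is a hypothetical name, not an httpx function, and safe="" is an assumption about which characters should be escaped.

```python
from typing import Union
from urllib.parse import quote, quote_from_bytes


def encode_query_value(value: Union[str, bytes]) -> str:
    """Hypothetical helper (not an httpx API): percent-encode one query value."""
    if isinstance(value, bytes):
        # Bytes are escaped octet by octet, with no text decoding step.
        return quote_from_bytes(value, safe="")
    # str is encoded as UTF-8 first (the stdlib default), then escaped.
    return quote(value, safe="")


print(encode_query_value(b"\x01\x02\x03\x04"))  # %01%02%03%04
print(encode_query_value("你好"))  # %E4%BD%A0%E5%A5%BD
```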

multipart_data

not sure how to handle this....

This will cause breaking changes, so we should consider doing a minor version bump.

https://github.com/encode/httpx/blob/4b85e6c3898b94e686b427afd83138c87520b479/httpx/_content.py#L136-L149

https://github.com/encode/httpx/blob/4b85e6c3898b94e686b427afd83138c87520b479/httpx/_utils.py#L56-L68

trim21 commented 4 months ago

There is also another issue: query strings should support bytes natively.

for example:

import httpx
import urllib.parse

r = httpx.get(
    "https://httpbin.org/get",
    params={"q": bytes.fromhex("E1EE0E2734986F5419BB6CB6252BD9377183440E")},
)

print(urllib.parse.quote(bytes.fromhex("E1EE0E2734986F5419BB6CB6252BD9377183440E")))

print(r.text)

expected url should be https://httpbin.org/get?q=%E1%EE%0E%274%98oT%19%BBl%B6%25%2B%D97q%83D%0E, instead of ValueError or TypeError, or currently https://httpbin.org/get?q=b\"\\xe1\\xee\\x0e'4\\x98oT\\x19\\xbbl\\xb6%25%2B\\xd97q\\x83D\\x0e\"

tomchristie commented 4 months ago

query string should support bytes natively.

I'm not convinced by that. Why?

trim21 commented 4 months ago

query string should support bytes natively.

I'm not convinced by that. Why?

because percent encoding can take native bytes as input

trim21 commented 4 months ago

When we say it supports "string", we are actually assuming that the percent-encoded bytes are UTF-8 content.

consider this:

from urllib.parse import quote, quote_from_bytes

utf8_s = "你好"

assert quote(utf8_s) == quote_from_bytes(utf8_s.encode())

https://example.com/?q=你好 is actually https://example.com/?q=%E4%BD%A0%E5%A5%BD

That's also what urllib.parse.quote does by default: it encodes the str to bytes (UTF-8 by default) and then uses quote_from_bytes.

tomchristie commented 4 months ago

I think this is confusing a couple of separate issues...

Should str-coercion be performed.
Should bytes be a supported type.

I think there's a good case that we shouldn't str-coerce types outside of the expected range. I don't currently think there's a good case that we should support bytes.

trim21 commented 4 months ago

I would vote for both...

consider this:

httpx.get("https://example.com/?q=%E1%EE%0E%274%98oT%19%BBl%B6%25%2B%D97q%83D%0E", params={"a": 1})

looks fine, right?

But the actual request URL is ?q=%EF%BF%BD%EF%BF%BD%0E%274%EF%BF%BDoT%19%EF%BF%BDl%EF%BF%BD%25%2B%EF%BF%BD7q%EF%BF%BDD%0E&a=1, which is not right.

The correct value of q is bytes.fromhex('e1ee0e2734986f5419bb6cb6252bd9377183440e'), but httpx sends the URL with q as bytes.fromhex('efbfbdefbfbd0e2734efbfbd6f5419efbfbd6cefbfbd252befbfbd3771efbfbd440e').

Sending bytes as a query value is also supported by requests, which handles it correctly.
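
To make the two hex values above concrete, here is a stdlib-only sketch that decodes the same query value once as raw bytes and once through str. It does not call httpx; it only reproduces the decode step with urllib.parse, on the assumption that a str round trip with the default errors="replace" is where the data gets lost.

```python
from urllib.parse import quote_from_bytes, unquote, unquote_to_bytes

encoded = "%E1%EE%0E%274%98oT%19%BBl%B6%25%2B%D97q%83D%0E"

# Decoding straight to bytes keeps every octet intact.
raw = unquote_to_bytes(encoded)
print(raw.hex())

# Decoding to str replaces invalid UTF-8 sequences with U+FFFD
# (the default is errors="replace"), so re-encoding yields different bytes.
lossy = unquote(encoded).encode("utf-8")
print(lossy.hex())

# Re-quoting the raw bytes reproduces the original query value.
print(quote_from_bytes(raw, safe="") == encoded)
```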

cknv commented 4 months ago

Just to chip in a bit here:

I have been using httpx to interact with a legacy system, and as a test client to mimic that legacy system calling back into our Starlette-based API. Some of the query fields can occasionally be bytes. These are either just bytes that only have meaning to systems outside of my control, or UTF-16 encoded strings (which, as an aside, can also mean that I have to produce a mixed-content query string).

In order to get the correct values through httpx and not 'b"value"' I have had to resort to manually building the query myself with the stdlib and appending it on the URL. This does work, but it took some time to arrive at and I wouldn't exactly call it "ergonomic".

But I also realize that my case is not common.

tomchristie commented 4 months ago

consider this (...)

httpx.get("https://example.com/?q=%E1%EE%0E%274%98oT%19%BBl%B6%25%2B%D97q%83D%0E", params={"a": 1})

This is valid, tho simplify it...

>>> r = httpx.get('https://www.example.com?q=%EE', params={'a': '1'})
>>> r.request.url
URL('https://www.example.com?q=%EF%BF%BD&a=1')

Simplify it more...

>>> httpx.QueryParams('q=%EE')
QueryParams('q=%EF%BF%BD')

Eh?

Root cause is...

>>> from urllib.parse import unquote
>>> from urllib.parse import quote
>>> quote(unquote('%EE'))  # stdlib behaving similarly
'%EF%BF%BD'
>>> unquote('%EE')  # The hex code here isn't a valid UTF-8 codepoint, and is being replaced.
'�'
>>> unquote('%EE', errors='strict')  # We'd raise an error if we decoded it with 'strict'
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/Users/tom.christie/.pyenv/versions/3.10.6/lib/python3.10/urllib/parse.py", line 667, in unquote
    append(unquote_to_bytes(bits[i]).decode(encoding, errors))
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xee in position 0: unexpected end of data

trim21 commented 4 months ago

consider this (...)

httpx.get("https://example.com/?q=%E1%EE%0E%274%98oT%19%BBl%B6%25%2B%D97q%83D%0E", params={"a": 1})

This is valid, tho simplify it...

>>> r = httpx.get('https://www.example.com?q=%EE', params={'a': '1'})
>>> r.request.url
URL('https://www.example.com?q=%EF%BF%BD&a=1')

Simplify it more...

>>> httpx.QueryParams('q=%EE')
QueryParams('q=%EF%BF%BD')

Eh?

Root cause is...

>>> from urllib.parse import unquote
>>> from urllib.parse import quote
>>> quote(unquote('%EE'))  # stdlib behaving similarly
'%EF%BF%BD'
>>> unquote('%EE')  # The hex code here isn't a valid UTF-8 codepoint, and is being replaced.
'�'
>>> unquote('%EE', errors='strict')  # We'd raise an error if we decoded it with 'strict'
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/Users/tom.christie/.pyenv/versions/3.10.6/lib/python3.10/urllib/parse.py", line 667, in unquote
    append(unquote_to_bytes(bits[i]).decode(encoding, errors))
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xee in position 0: unexpected end of data

yes, that's a problem, I think.

If we use quote_from_bytes(unquote_to_bytes("%EE")) instead, then everything is fine in this case.
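
A small sketch of that byte-level round trip, next to the str-based one. requote_via_str and requote_via_bytes are hypothetical names for illustration; neither is an httpx function.

```python
from urllib.parse import quote, quote_from_bytes, unquote, unquote_to_bytes


def requote_via_str(component: str) -> str:
    """Decode to str (invalid UTF-8 becomes U+FFFD), then re-encode."""
    return quote(unquote(component))


def requote_via_bytes(component: str) -> str:
    """Decode to raw bytes and re-encode them; no text decoding involved."""
    return quote_from_bytes(unquote_to_bytes(component))


print(requote_via_str("%EE"))    # %EF%BF%BD
print(requote_via_bytes("%EE"))  # %EE

# For valid UTF-8 input the two approaches agree.
print(requote_via_str("%E4%BD%A0%E5%A5%BD") == requote_via_bytes("%E4%BD%A0%E5%A5%BD"))
```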

tomchristie commented 4 months ago

@cknv Could you give a (simplified) example please?

cknv commented 4 months ago

Sure @tomchristie! Get ready for some flashback to the dark ages, before UTF8 won the encoding wars.

from urllib.parse import urlencode
import httpx

client = httpx.Client()

values = {
    "expected-as-utf8": "Hello world",
    "expected-as-int": 1234,
    "expected-as-latin1": "Hello world".encode("latin1"),
    "expected-as-binary": b"\x01\x02\x03\x04",
}

encoded_query = urlencode(values)

# replaced .get with .build_request
# as making requests does not seem important for demonstrating my workaround
correctly_encoded_request = client.build_request(
    method="GET",
    url=f"http://localhost?{encoded_query}",
)
incorrectly_encoded_request = client.build_request(
    method="GET",
    url="http://localhost",
    params=values,
)

# compare and contrast:
print(correctly_encoded_request.url.params)
print(incorrectly_encoded_request.url.params)

When actually rereading the code, it wasn't UTF-16 that was used, but instead actually latin1, so I replaced that in the example, although either of them would need to be encoded into bytes.

Narrowing our focus to just the plain byte value, it should be expected-as-binary=%01%02%03%04, and my workaround does indeed produce that. Without the workaround, just feeding the values dict into the request builder's params, the value is repr'd (via str) and produces expected-as-binary=b%27%5Cx01%5Cx02%5Cx03%5Cx04%27, which is the percent-encoded version of expected-as-binary=b'\x01\x02\x03\x04'.

The latin1 value works without the workaround as long as you stick to the subset that is identical with ASCII, but we do at times get data through that isn't just that subset.
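
As an illustration of that last point (the value "café" is a made-up example, not data from the legacy system): urlencode percent-encodes bytes values as-is, so the same text produces different query strings depending on whether it is passed as latin1 bytes or as a str.

```python
from urllib.parse import urlencode

text = "café"  # made-up value containing one non-ASCII character

# Passed as latin1 bytes: "é" is the single byte 0xE9.
as_latin1 = urlencode({"name": text.encode("latin1")})

# Passed as str: urlencode encodes it as UTF-8, where "é" is 0xC3 0xA9.
as_utf8 = urlencode({"name": text})

print(as_latin1)  # name=caf%E9
print(as_utf8)    # name=caf%C3%A9
```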

To add some context the server receiving the request is written in C, and has a very lax attitude to encodings because it's all just bytes anyway.

If httpx would percent encode bytes, I would probably be able to remove that workaround. However I would not be surprised if there are other ways to send bytes over http either established in an RFC or just by tradition, if so I would not know which one is the most appropriate, but it appears like percent encoding is at least in the RFC.

Finally, I would like to reiterate that I know that this is way outside the mainstream, I have mostly accepted my workaround, but I at least wanted to raise my hand when someone else expressed a similar problem.

Graeme22 commented 2 months ago

+1 for using .value for Enums in param encoding

trim21 commented 1 month ago

Any news on this?

tomchristie commented 1 month ago

Currently I'd suggest we sharpen up on this and raise type errors on non-string values... https://github.com/encode/httpx/issues/3176#issuecomment-2313220498

HTML form encoding only supports strings, let's help our users be explicit here.

trim21 commented 1 month ago

HTML form encoding only supports strings, let's help our users be explicit here.

This is limited by html/browser, not HTTP protocol.

httpx is an HTTP client for Python, not a browser. We should support more low-level operations. For example, we won't do CORS checks like a browser, right?

Also, this is not entirely true. An HTML form is a POST with application/x-www-form-urlencoded or multipart/form-data encoded data as the body, and application/x-www-form-urlencoded is the same encoding we use in the URL query, which supports any binary data.

And in the non-UTF-8 world, browsers allow you to send non-UTF-8 strings as form values. But we don't have GBK strings in Python, so we need to use bytes.
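
For what it's worth, the stdlib can already express this: quote accepts an encoding argument, and the result matches percent-encoding the GBK bytes directly. A short sketch, reusing the "你好" example from earlier in the thread:

```python
from urllib.parse import quote, quote_from_bytes, unquote

text = "你好"

# Let quote() do the GBK encoding...
via_encoding = quote(text, encoding="gbk")
# ...or encode to GBK bytes first and percent-encode those.
via_bytes = quote_from_bytes(text.encode("gbk"))

print(via_encoding)               # %C4%E3%BA%C3
print(via_encoding == via_bytes)  # True

# The receiver has to know the encoding to get the text back.
print(unquote(via_encoding, encoding="gbk"))  # 你好
```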

And browsers also have URLSearchParams API supporting any binary data.

https://developer.mozilla.org/en-US/docs/Web/API/URLSearchParams#preserving_plus_signs

This is not only a question of whether bytes should be a supported type; it's a question of whether we should support non-UTF-8 encoded strings.

Which I think the answer should be YES.

trim21 commented 1 month ago

Currently I'd suggest we sharpen up on this and raise type errors on non-string values... #3176 (comment)

HTML form encoding only supports strings, let's help our users be explicit here.

On second thought, I think HTML has a different definition of string than we have in the Python world.

In HTML, a string can be in any encoding, but that's not true in Python.

tomchristie commented 5 days ago

I'm going to close this off in favor of the broader issue #3176.

encode / httpx

add more type checking in params/form-data #3170
