Closed · sethmlarson closed this issue 3 years ago
I seem to remember this playing into some primitives around streaming bytes vs. text that we never ended up digging into?
A good first pass at this would be to change the decoder interface slightly, so that instead of e.g. yielding a single byte chunk, the decoders yield a list of byte chunks.
On the first refactoring pass, we don't need to actually change the internal implementation much: the decoders can just always yield a list with a single item.
We'd then be able to add a `chunk_size` argument to the decoders, which would return 0, 1, or many properly-sized chunks on each yield.
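As a minimal sketch of that first pass, a decoder could look something like this (`IdentityDecoder` is a hypothetical name for illustration, not one of httpx's actual decoder classes):

```python
from typing import List

class IdentityDecoder:
    """Hypothetical pass-through decoder following the proposed
    interface: decode() returns a *list* of byte chunks."""

    def decode(self, data: bytes) -> List[bytes]:
        # First refactoring pass: always a single-item list.
        return [data]

    def flush(self) -> List[bytes]:
        # No buffered data to emit at end-of-stream.
        return []
```

With this shape, callers can treat every decoder call as producing zero or more chunks, which is what makes chunk sizing possible later.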
Updated the issue title to reflect the current `Response.aiter_*` API :-) (see #610).
How could I help with this issue?
Hi @b0g3r! I think this is still something we’d like to have, and given discussions in https://github.com/python-gitlab/python-gitlab/pull/1036 it seems like some folks would like to see it too. :)
Ways to move forward would be:
Do I understand correctly that we will need to forward chunk_size here? https://github.com/encode/httpx/blob/a82adcc933345c6b8cb1623b031eb85723e7665b/httpx/_dispatch/urllib3.py#L112-L115
@b0g3r Careful that we're in a sort of transition state w.r.t. urllib3 usage due to #804 (we'll soon use our own sync implementation, though keeping urllib3 as an option). Due to this I wouldn't advise relying on any existing urllib3 functionality — also because we'd want to provide chunk sizing on the async layer too, and it'd be odd to have a different implementation on both sides.
I think we want to look at controlling the chunk size directly from `response.iter_bytes()` / `response.aiter_bytes()`, instead…
@b0g3r So, as with comment https://github.com/encode/httpx/issues/394#issuecomment-567899958 - the right place to start with this would be a pull request to https://github.com/encode/httpx/blob/master/httpx/_decoders.py that changes the interface of the decoders, so that they return a list of bytes rather than bytes.
(And correspondingly, changing the places where the response calls the decoder such as https://github.com/encode/httpx/blob/a82adcc933345c6b8cb1623b031eb85723e7665b/httpx/_models.py#L915 to deal with a list of bytes as a return result.)
I'd start with that as a foundational pull request, which will then make the remaining work much easier. (Adding chunk sizes to the decoder interface, and through to the response methods.)
`chunk_size=1`, because `requests.Response.iter_content` has it:

```python
for part in self._raw_stream:
    yield part
```
Let's use a bytestring as the buffer:

```python
buffer = b""
for part in self._raw_stream:
    buffer += part
    while len(buffer) >= chunk_size:
        yield buffer[:chunk_size]
        buffer = buffer[chunk_size:]
if buffer:
    yield buffer
```
`chunk_size=ITER_CHUNK_SIZE` (512), because requests has it 🌚

- `(a)iter_raw`
- `(a)iter_bytes`
- `(a)iter_text`
@tomchristie As far as I can see, `(a)iter_raw` doesn't use any decoder 🤔
It would be good to have a `chunk_size=None` option so that httpx can return chunks at the HTTP chunk boundaries, as the requests library does. This is useful for apps that require timely delivery.
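A self-contained sketch of how `chunk_size=None` could coexist with fixed-size re-chunking (the function name and shape here are assumptions, not httpx's implementation):

```python
def iter_chunks(parts, chunk_size=None):
    # chunk_size=None: pass chunks through at the boundaries the
    # server delivered them, as requests does, for timely delivery.
    if chunk_size is None:
        yield from parts
        return
    # Otherwise buffer and re-slice into fixed-size chunks.
    buffer = b""
    for part in parts:
        buffer += part
        while len(buffer) >= chunk_size:
            yield buffer[:chunk_size]
            buffer = buffer[chunk_size:]
    if buffer:
        yield buffer
```

For example, `list(iter_chunks([b"abcd", b"ef"]))` preserves the original boundaries and returns `[b"abcd", b"ef"]`, while `list(iter_chunks([b"abcd", b"ef"], chunk_size=3))` returns `[b"abc", b"def"]`.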
Requests allowed setting `chunk_size` within `.iter_content()`, which is currently not an option for our alternatives `.stream()` and `.stream_text()`.
For `.stream_text()` we should go the extra step and fix an issue that users sometimes run into when using this feature: use `chunk_size` for measuring the decoded text, not the raw bytes.
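One way to sketch measuring `chunk_size` against decoded characters rather than raw bytes is with the standard library's incremental decoder, so a multi-byte sequence split across network chunks is held until it is complete (`iter_text` here is an illustrative helper, not httpx's method):

```python
import codecs

def iter_text(byte_parts, chunk_size):
    # Decode incrementally, then slice by *character* count rather
    # than byte count. The incremental decoder buffers any trailing
    # partial multi-byte sequence until the next part arrives.
    decoder = codecs.getincrementaldecoder("utf-8")()
    buffer = ""
    for part in byte_parts:
        buffer += decoder.decode(part)
        while len(buffer) >= chunk_size:
            yield buffer[:chunk_size]
            buffer = buffer[chunk_size:]
    buffer += decoder.decode(b"", final=True)
    if buffer:
        yield buffer
```

For example, `b"h\xc3" + b"\xa9y!"` is `"héy!"` in UTF-8, with the two-byte `é` split across the parts; `list(iter_text([b"h\xc3", b"\xa9y!"], 2))` returns `["hé", "y!"]`, chunked by characters even though the byte counts differ.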