Closed: alexey-medvedchikov closed this 1 year ago
Ah, good catch. It seems like 1 is the default: https://requests.readthedocs.io/en/latest/api/#requests.Response.iter_content
The chunk size is the number of bytes it should read into memory.
It's a strange default; I would suggest raising it here at least. 64k seems quite conservative.
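To see why a chunk size of 1 hurts, here is a local illustration (not requests itself, just a raw read loop): a 1 MiB buffer read one byte at a time costs over a million `read()` calls, while 64 KiB chunks need only 16.

```python
import io

data = b"x" * (1 << 20)  # 1 MiB of dummy payload

def count_reads(buf, chunk_size):
    """Drain buf chunk_size bytes at a time; return the number of read() calls."""
    calls = 0
    while buf.read(chunk_size):
        calls += 1
    return calls

print(count_reads(io.BytesIO(data), 1))          # 1048576 calls
print(count_reads(io.BytesIO(data), 64 * 1024))  # 16 calls
```

The per-call overhead (function call, bookkeeping, and for requests the whole urllib3/socket read path) is paid on every iteration, so the call count dominates the runtime.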
Hey,
Thanks for spotting this.
Interestingly, requests uses `iter_content()` internally to implement the various higher-level accessors, but with different values for `chunk_size`:

- `.content`: 10k
- `.text`/`.json()`: 10k (uses `.content`)
- `.iter_lines()`: 512 bytes

At the time these were defined, there was a lot of discussion about these choices, and the method of reading from the underlying socket/file.
The performance benefit of `chunk_size > 1` is obvious.

The only downside I can think of is support for some sort of streamed JSON protocol, where, for example, you start reading a list of "updates" and new updates appear as and when. Since `iter_content()` will block until `chunk_size` bytes have been read, you wouldn't get an update of length < `chunk_size` immediately, but only after other updates have accumulated to fill up the buffer.

I think the right thing to do is to match requests' default 10k `chunk_size`, and allow callers to override? Perhaps with a documented `immediate` flag to set it back to 1?
Hi, @daggaz
No problem at all! Thanks for the quick reply.
How about this: we set `chunk_size` to 10k by default and allow the user to override it via a `chunk_size` parameter that is passed directly through `.load` and `.visit`? This approach makes it obvious what it actually controls (people who use `requests` should be familiar with it) and gives the user full control, instead of just the 10k (default) and 1 (`immediate=True`) options: maybe there are other use cases we're just not aware of.
Also, how do I document this? I'm coming from the Go world, and I'm not really sure where the autogenerated API docs are or how they are generated. Or should I just stick an explanation into README.md?
Cheers.
Yeah, that's reasonable.
Yes, just updating the readme is what to do. This project doesn't have any autodocs. Though it certainly could. In python land you write docstrings (triple quoted strings at the very start of a module/function/class/etc). These have an official grammar for annotating parameters etc too.
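For reference, a docstring with parameter annotations might look like this (Sphinx-style field lists; the signature and wording here are illustrative, not the project's actual docs):

```python
def load(response, chunk_size=10240):
    """Parse a streamed JSON response incrementally.

    :param response: a requests.Response obtained with stream=True
    :param chunk_size: bytes read per iter_content() call; larger values
        amortize per-call overhead, smaller values deliver data sooner
    :return: a lazily-parsed JSON value
    """
```

Tools like Sphinx can render these into API docs later, but even without autodocs the docstring shows up in `help(load)`.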
From: Alexey Medvedchikov. Sent: Tuesday, November 1, 2022. Subject: Re: [daggaz/json-stream] Add chunk_size parameter to json_stream.requests.load and .visit to pass to requests' content reader (PR #26)
@daggaz Updated the code and wrote a note on the matter. Please, let me know if you want to rephrase the note in some way.
When used without parameters, requests' content iterator uses a chunk size of 1. The unit isn't stated in the docs, but from the code I can tell it's bytes. So we're reading 1 byte (!!!) per call. This makes json_stream extremely slow on streaming requests. If you set chunk_size to anything bigger than around 1K, it can easily speed up parsing by a factor of 5.
Or should we just hardcode chunk_size to something like 65K inside `_to_file`? I don't see any noticeable memory consumption increase.
A quick benchmark of streaming requests, getting JSON from https://pypi.org/simple/ and parsing it:
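A network-free way to reproduce the effect is to time draining the same buffer at chunk size 1 versus 64K; this is a stand-in for the real benchmark against pypi.org, not json-stream itself:

```python
import io
import time

payload = b"x" * (1 << 20)  # 1 MiB stand-in for the streamed response body

def drain_time(chunk_size):
    """Time reading the whole payload chunk_size bytes at a time."""
    buf = io.BytesIO(payload)
    start = time.perf_counter()
    while buf.read(chunk_size):
        pass
    return time.perf_counter() - start

t_one = drain_time(1)
t_big = drain_time(64 * 1024)
print(f"chunk_size=1: {t_one:.4f}s, chunk_size=64K: {t_big:.4f}s")
```

Even without any parsing or network latency, the byte-at-a-time loop is orders of magnitude slower, so a real streamed request with chunk_size=1 pays that overhead on top of everything else.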