Add chunk_size parameter to json_stream.requests.load and .visit to pass to requests' content reader.

alexey-medvedchikov commented 2 years ago

When used without parameters requests' content iterator using chunk size of 1. I don't know what the measure of chunk unit is, but from the code I can tell it is bytes. So we're reading 1 byte (!!!) per call. This makes json_stream extremely slow on streaming requests. If you put chunk_size bigger than around 1K it can easily speed up parsing by a factor of 5.

Or should we just hardcode chunk_size to something like 65K inside _to_file?

I don't see any noticeable memory consumption increase.

A quick benchmark of streaming requests, getting JSON from https://pypi.org/simple/ and parsing it:

import requests
import json_stream
import json_stream.requests
import json

url = 'https://pypi.org/simple/'
headers = {"Accept": "application/vnd.pypi.simple.v1+json"}

def nostream_request_vanilla_json():
    with requests.get(url, headers=headers) as resp:
        data = json.loads(resp.content)
        for project in data['projects']:
            print(project['name'])

def stream_request_json_stream():
    with requests.get(url, headers=headers, stream=True) as resp:
        data = json_stream.requests.load(resp)
        for project in data['projects']:
            print(project['name'])

no-streaming request + vanilla json: 2.1 sec
streaming request + json_stream with chunk_size 1: 28.1 sec
streaming request + json_stream with chunk_size 65536: 5.1 sec

jonasesnyk commented 2 years ago

Ah, good catch. It seems like 1 is the default: https://requests.readthedocs.io/en/latest/api/#requests.Response.iter_content

The chunk size is the number of bytes it should read into memory.

It's a strange default, I would suggest raising the default here at least. 64k seems quite conservative.

daggaz commented 1 year ago

Hey,

Thanks for spotting this.

Interestingly, requests uses iter_content() internally to implement the various higher-level functions, but with different values for chunk_size:

.content: 10k
.text/.json(): 10k (uses .content)
.iter_lines(): 512 bytes

At the time these we defined, there was a lot of discussion about these choices, and the method of reading from the underlying socket/file.

The performance benefit of chunk_size > 1 is obvious.

The only downside I can think of is support for some sort of streamed JSON protocol, where for example, you start reading a list of "updates" and new updates appear as and when. Since iter_content() will block until chunk_size has been read, you wouldn't get an update of length < chunk_size immediately, but only after other updates have accumulated to fill up the buffer.

I think the right thing to do is to match requests' default 10k chunk_size, and allow callers to override? Perhaps with an documented immediate flag to set it back to 1?

alexey-medvedchikov commented 1 year ago

Hi, @daggaz

No problem at all! Thanks for the quick reply.

How about that - we set chunk_size to 10k by default and allow user to set it as chunk_size that is passed directly through .load and .visit? This approach makes it obvious what it actually controls (people who use requests should be familiar with it) and gives user full control over this instead of 10k (default) and 1 (immediate=True) options: maybe there're some other use cases we just not aware about.

Also, how do I document this? I'm coming from Go world, and I'm not really sure if I understand where is the autogenerated API docs are and how they are generated. Or should I just stick explanation into README.md?

Cheers.

daggaz commented 1 year ago

Yeah, that's reasonable.

Yes, just updating the readme is what to do. This project doesn't have any autodocs. Though it certainly could. In python land you write docstrings (triple quoted strings at the very start of a module/function/class/etc). These have an official grammar for annotating parameters etc too.

Get Outlook for Androidhttps://aka.ms/AAb9ysg

From: Alexey Medvedchikov @.> Sent: Tuesday, November 1, 2022 12:35:40 AM To: daggaz/json-stream @.> Cc: Jamie Cockburn @.>; Mention @.> Subject: Re: [daggaz/json-stream] Add chunk_size parameter to json_stream.requests.load and .visit to pass to requests' content reader. (PR #26)

Hi, @daggazhttps://github.com/daggaz

No problem at all! Thanks for the quick reply.

How about that - we set chunk_size to 10k by default and allow user to set it as chunk_size that is passed directly through .load and .visit? This approach makes it obvious what it actually controls (people who use requests should be familiar with it) and gives user full control over this instead of 10k (default) and 1 (immediate=True) options: maybe there're some other use cases we just not aware about.

Also, how do I document this? I'm coming from Go world, and I'm not really sure if I understand where is the autogenerated API docs are and how they are generated. Or should I just stick explanation into README.md?

Cheers.

— Reply to this email directly, view it on GitHubhttps://github.com/daggaz/json-stream/pull/26#issuecomment-1297856613, or unsubscribehttps://github.com/notifications/unsubscribe-auth/AAGZET3G7PLXVTJ7OHTAGO3WGBQVZANCNFSM6AAAAAARQ5EJYM. You are receiving this because you were mentioned.Message ID: @.***>

alexey-medvedchikov commented 1 year ago

@daggaz Updated the code and wrote a note on the matter. Please, let me know if you want to rephrase the note in some way.

daggaz / json-stream

Add chunk_size parameter to json_stream.requests.load and .visit to pass to requests' content reader. #26