daggaz / json-stream

Simple streaming JSON parser and encoder.
MIT License

how do I get a stream from a JSON list #58

Closed by gilesheron 2 months ago

gilesheron commented 2 months ago

Hi,

I'm calling an API that returns a JSON list, so the object I get back is a TransientStreamingJSONList rather than a TransientStreamingJSONObject, which means I can't call .items() on it. Is there something I'm missing?

G.

gilesheron commented 2 months ago

so this seemed to work:

with requests.get(url, stream=True) as response:
    for item in json_stream.requests.load(response):
        yield json_stream.to_standard_types(item)

is that the best approach?

daggaz commented 2 months ago

Hi Giles,

Yes, assuming your JSON is in the form:

[
  {"some": "object", "with": "properties"},
  ...many more objects
]

I assume that you're streaming due to having a very long top-level list?

You don't actually need to use to_standard_types() unless the calling code is expecting dict instances.

If so, then iterating it like that seems sensible.

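If that is not the case, you can skip the conversion and yield the items from load() directly:

import json_stream.requests
import requests

def make_request(url):
    with requests.get(url, stream=True) as response:
        # yield from keeps the response open while the caller iterates
        yield from json_stream.requests.load(response)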

def use_data():
    for item in make_request("https://example.com/json"):
        print(item['some_key'])

The above requires that you access the keys of item in "stream order". If you cannot access keys in order, then you can use "mixed-mode" to get random access objects from a transient streaming list:

def make_request(url):
    with requests.get(url, stream=True) as response:
        # yield from keeps the response open while the caller iterates
        yield from json_stream.requests.load(response).persistent()

gilesheron commented 2 months ago

Yes - my list has a few hundred thousand items in it.

each item is a dict with 3 keys (one of the 3 values is itself another dict - which contains yet another dict within it and so on...)

so the issue is my code needs to get the key from one of the inner dictionaries, plus the value from a dictionary entity within that dictionary. Once I get that key I can no longer access the dictionary. But I can't see a way to get the dictionary instead of just the key (other than calling "to_standard_types").

not sure my explanation makes sense - but I think it tallies with what you said re "unless the calling code is expecting dict instances"?

daggaz commented 2 months ago

> not sure my explanation makes sense - but I think it tallies with what you said re "unless the calling code is expecting dict instances"?

By that I mean cases where the objects returned must be standard Python dicts (or a subclass), i.e. isinstance(obj, dict) == True. The standard library json module, for example, requires that you pass it a strict set of types (or override parts of it to do your own serialization).

The objects returned by json_stream are not standard Python dicts, so in the case of json.dumps(some_object_you_got_from_json_stream), you would get a TypeError unless you first pass the object through to_standard_types() or use JSONStreamEncoder.
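
To illustrate the point about json.dumps(), here is a minimal sketch that uses only the standard library. DictLike is a hypothetical stand-in for any dict-like object that is not a dict subclass; it is not json_stream's actual class, and the plain dict() call below is only a shallow substitute for what to_standard_types() does.

```python
import json
from collections.abc import Mapping


class DictLike(Mapping):
    """Hypothetical dict-like object that is NOT a dict subclass."""

    def __init__(self, data):
        self._data = dict(data)

    def __getitem__(self, key):
        return self._data[key]

    def __iter__(self):
        return iter(self._data)

    def __len__(self):
        return len(self._data)


def try_dumps(obj):
    """Return the JSON text, or None if json.dumps rejects the type."""
    try:
        return json.dumps(obj)
    except TypeError:
        return None


wrapped = DictLike({"some": "object"})
print(isinstance(wrapped, dict))  # False
print(try_dumps(wrapped))         # None: json.dumps raised TypeError
print(try_dumps(dict(wrapped)))   # {"some": "object"}
```

The object behaves like a mapping everywhere else, but json.dumps() only accepts real dict instances, which is why a conversion step is needed before serializing.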

For your specific use case, there's nothing wrong with using to_standard_types(), but the expected approach is to call .persistent() on the list. That lets you iterate the top-level list "transiently" (i.e. each item is thrown away once you've moved past it) while the items in the list are "persistent" (i.e. they act just like normal dicts and let you access keys in a different order than they appear in the stream).

gilesheron commented 2 months ago

thanks - that all works now using the persistent list.

and thanks again for making the library available - most useful!