agateblue / lifter

A generic query engine, inspired by Django ORM
ISC License

JSON list as Python generator? #24

Open Mec-iS opened 8 years ago

Mec-iS commented 8 years ago

I am collecting information about the possibility of using a generator instead of loading the full JSON into memory when the manager is called:

Possible algorithm:

I couldn't find any memory/CPU-friendly method in the standard library to clone or deep-copy a generator; the only candidate is tee(), but it seems to have downsides for our use case:
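For illustration, one such downside: itertools.tee() keeps an internal buffer of every item that one copy has seen but the other has not, so if one copy is fully consumed before the other, the whole stream ends up buffered in memory anyway. A minimal sketch with a stand-in data stream:

```python
from itertools import tee

# Stand-in for a stream of parsed JSON objects.
def json_stream():
    for i in range(5):
        yield {"id": i}

a, b = tee(json_stream())

# Fully consuming one copy first forces tee() to buffer every item
# internally until the other copy catches up, so for large inputs the
# memory saving over a plain list disappears.
first_pass = list(a)
second_pass = list(b)
```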

Does it sound like a good idea?

agateblue commented 8 years ago

I just pushed a release (0.2) an hour ago that implements lazy querysets (they will probably get some improvements soon). You can also pass generators to the manager, and they will only be iterated when the queryset data is accessed (note that the resulting data will still be held in memory, though). I think it partially addresses your issue, at least the part regarding memory usage.

However, once the generator is consumed, lifter won't be able to consume it again.
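The one-shot nature of generators is easy to demonstrate with plain Python (illustrative only, no lifter API involved):

```python
def create_generator(json_list):
    for obj in json_list:
        yield obj

gen = create_generator([{"id": 1}, {"id": 2}])

# The first pass consumes the generator...
first_pass = list(gen)

# ...after which it is exhausted: a second pass yields nothing.
second_pass = list(gen)
```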

The solution that comes to mind is to allow passing a callable to load(). When the time comes to filter the values, the callable will return a generator. Example:

def return_json_generator():
    return generator

manager = lifter.load(return_json_generator)

This seems easier to implement than the blueprint you suggested.
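A minimal sketch of the callable idea; the `Manager` class and its `filter` method below are hypothetical stand-ins for illustration, not lifter's actual API:

```python
def load_json_objects():
    # Each call returns a brand-new generator, so the data source can be
    # re-iterated as many times as needed.
    for i in range(3):
        yield {"id": i, "even": i % 2 == 0}

class Manager:
    """Hypothetical manager: stores the callable, not the data."""

    def __init__(self, get_iterable):
        self._get_iterable = get_iterable

    def filter(self, predicate):
        # A fresh generator is obtained for every query, so queries can
        # be run repeatedly without copying anything.
        return [obj for obj in self._get_iterable() if predicate(obj)]

manager = Manager(load_json_objects)
evens = manager.filter(lambda o: o["even"])
odds = manager.filter(lambda o: not o["even"])  # works again; no copy needed
```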

When #25 is fixed, it will also improve performance (the generator will only be looped over once, regardless of the number of filters/excludes applied).

I'm not really fond of the index idea, at least currently: the package is still in alpha and I'd rather not reinvent a whole database system at this point. Also, in your present situation, I think any effort you put into reducing the memory footprint of your queries will be wasted if you need to maintain an index of your whole data in memory.

Mec-iS commented 8 years ago

I was thinking of something like:

#
# pseudocode
#

import copy

def create_generator(json_list):
    for obj in json_list:
        yield obj

generator = create_generator(JSON)

# take a copy so the original generator is preserved for later queries
generator_copy = copy.deepcopy(generator)

# consume the copy, applying the filter operation to each object
while True:
    result = filter(next(generator_copy))

This way you can save memory by using a generator for all the filtering operations you apply. The manager creates the generator; each time a filter operation is required, a deep copy of the generator is made and consumed.
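(As an aside, in CPython the deep-copy step would actually fail as written: copy.deepcopy falls back to pickling for generator objects, and generators cannot be pickled, so the call raises TypeError. A quick check:)

```python
import copy

def create_generator(json_list):
    for obj in json_list:
        yield obj

gen = create_generator([{"id": 1}])

try:
    copy.deepcopy(gen)
except TypeError as exc:
    # CPython reports: cannot pickle 'generator' object
    print("deepcopy failed:", exc)
```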

agateblue commented 8 years ago

Yes, this is exactly that. The only difference is that you won't even need to deep-copy the generator; instead, you pass a callable that returns a generator to the manager, and the manager will call this function to get a fresh, ready-to-loop generator or iterable.

The main advantage over your proposal is that you can call the manager a thousand times if you want, without providing a different copy each time, and it will still work. With your example, after you run result = filter(next(generator_copy)), you would have to feed your manager another copy, which is not really convenient.

agateblue commented 8 years ago

I'll leave this open since I still need to implement the callable feature ;)