agateblue / lifter

A generic query engine, inspired by Django ORM
ISC License

JSON list as Python generator? #24

Open Mec-iS opened 8 years ago

Mec-iS commented 8 years ago

I am collecting information about the possibility of using a generator instead of loading the full JSON into memory when the manager is called:

Possible algorithm:

I couldn't find any memory/CPU-friendly method in the standard library to clone or deep-copy a generator; the only candidate is tee(), but it seems to have downsides for our use case:
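For illustration, one such downside: itertools.tee() keeps an internal buffer of every item that one copy has seen but the other has not, so if one copy is fully consumed before the other, the whole stream ends up buffered in memory anyway. A minimal sketch with a stand-in data stream:

```python
from itertools import tee

# Stand-in for a stream of parsed JSON objects.
def json_stream():
    for i in range(5):
        yield {"id": i}

a, b = tee(json_stream())

# Fully consuming one copy first forces tee() to buffer every item
# internally until the other copy catches up, so for large inputs the
# memory saving over a plain list disappears.
first_pass = list(a)
second_pass = list(b)
```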

Does it sound like a good idea?

agateblue commented 8 years ago

I just pushed a release (0.2) an hour ago that implements lazy querysets (they will probably get some improvements soon). You can also pass generators to the manager, and they will only be iterated when the queryset data is accessed (note that the resulting data will still be held in memory, though). I think it partially addresses your issue, at least the part regarding memory usage.

However, once the generator is consumed, lifter won't be able to consume it again.
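The one-shot nature of generators is easy to demonstrate with plain Python (illustrative only, no lifter API involved):

```python
def create_generator(json_list):
    for obj in json_list:
        yield obj

gen = create_generator([{"id": 1}, {"id": 2}])

# The first pass consumes the generator...
first_pass = list(gen)

# ...after which it is exhausted: a second pass yields nothing.
second_pass = list(gen)
```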

The solution that comes to mind is to allow passing a callable to load(). When the time comes to filter the values, the callable will return a generator. Example:

def return_json_generator():
    return generator

manager = lifter.load(return_json_generator)

This seems easier to implement than the blueprint you suggested.
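A minimal sketch of the callable idea; the `Manager` class and its `filter` method below are hypothetical stand-ins for illustration, not lifter's actual API:

```python
def load_json_objects():
    # Each call returns a brand-new generator, so the data source can be
    # re-iterated as many times as needed.
    for i in range(3):
        yield {"id": i, "even": i % 2 == 0}

class Manager:
    """Hypothetical manager: stores the callable, not the data."""

    def __init__(self, get_iterable):
        self._get_iterable = get_iterable

    def filter(self, predicate):
        # A fresh generator is obtained for every query, so queries can
        # be run repeatedly without copying anything.
        return [obj for obj in self._get_iterable() if predicate(obj)]

manager = Manager(load_json_objects)
evens = manager.filter(lambda o: o["even"])
odds = manager.filter(lambda o: not o["even"])  # works again; no copy needed
```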

When #25 is fixed, it will also improve performance (the generator will only be looped over once, regardless of the number of filters/excludes applied).

I'm not really fond of the index idea, at least currently: the package is still in alpha and I'd rather not reinvent a whole database system at this point. Also, in your present situation, I think any effort you put into reducing the memory footprint of your queries will be wasted if you need to maintain an index of your whole data in memory.

Mec-iS commented 8 years ago

I was thinking of something like:

#
# pseudocode
#

import copy

def create_generator(json_list):
    for obj in json_list:
        yield obj

generator = create_generator(JSON)

# take a copy so the original generator is preserved for later queries
generator_copy = copy.deepcopy(generator)

# consume the copy, applying the filter operation to each object
while True:
    result = filter(next(generator_copy))

This way you can save memory by using a generator for all the filtering operations you apply. The manager creates the generator; each time a filter operation is required, a deep copy of the generator is made and consumed.
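(As an aside, in CPython the deep-copy step would actually fail as written: copy.deepcopy falls back to pickling for generator objects, and generators cannot be pickled, so the call raises TypeError. A quick check:)

```python
import copy

def create_generator(json_list):
    for obj in json_list:
        yield obj

gen = create_generator([{"id": 1}])

try:
    copy.deepcopy(gen)
except TypeError as exc:
    # CPython reports: cannot pickle 'generator' object
    print("deepcopy failed:", exc)
```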

agateblue commented 8 years ago

Yes, this is exactly that. The only difference is that you won't even need to deep-copy the generator; instead, you pass a callable that returns a generator to the manager, and the manager will call this function to get a fresh, ready-to-loop generator or iterable.

The main advantage over your proposal is that you can call the manager a thousand times if you want, without providing a different copy each time, and it will still work. With your example, after you run result = filter(next(generator_copy)), you would have to feed your manager another copy, which is not really convenient.

agateblue commented 8 years ago

I'll leave this open since I still need to implement the callable feature ;)