clokep / django-querysetsequence

Chain multiple (disparate) QuerySets in Django
https://django-querysetsequence.readthedocs.io/
ISC License

Slice subquerysets before merging them #52

Closed leotrubach closed 4 years ago

leotrubach commented 5 years ago

Since iter(qs) loads the whole query result into memory before it is paginated, this leads to high memory usage, and if there are a lot of rows in the queryset we will get an out-of-memory error.

Things to consider (not part of this PR, but let's discuss): using the iterator() method solves this issue. On Django 2.0 or higher this could be optimized even further by passing an appropriate chunk_size based on the values of _low_mark and _high_mark. Another idea: slice self._querysets before iterating over them, if _high_mark is not None.
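
For illustration, here is the difference in plain Django (standard QuerySet behavior with a hypothetical Article model, not specific to this library):

    # iter() path: _fetch_all() pulls every row into memory first.
    for article in Article.objects.order_by('id'):
        pass

    # iterator() path: rows are streamed from the database cursor in
    # chunks; the chunk_size argument exists since Django 2.0.
    for article in Article.objects.order_by('id').iterator(chunk_size=100):
        pass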

clokep commented 5 years ago

I don't believe this change is necessary: if you have a QuerySetSequence and call iterator() on it, then it should not cache, just like Django. Is this not working when using the ordered iterator for some reason?

leotrubach commented 5 years ago

@clokep Yes, I also thought so :) Here is the source of the __iter__ method for Django's QuerySet:

    def __iter__(self):
        """
        The queryset iterator protocol uses three nested iterators in the
        default case:
            1. sql.compiler.execute_sql()
               - Returns 100 rows at time (constants.GET_ITERATOR_CHUNK_SIZE)
                 using cursor.fetchmany(). This part is responsible for
                 doing some column masking, and returning the rows in chunks.
            2. sql.compiler.results_iter()
               - Returns one row at time. At this point the rows are still just
                 tuples. In some cases the return values are converted to
                 Python values at this location.
            3. self.iterator()
               - Responsible for turning the rows into model objects.
        """
        self._fetch_all()
        return iter(self._result_cache)

So when you iterate over a queryset (i.e. call iter(qs)), it will first fetch all of its rows into memory and then iterate over that list. Yes, this is counterintuitive. When using QuerySetSequence with pagination, the pagination occurs only after all rows of the subquerysets have been fetched, and an out-of-memory error occurs. We ran into that on our project.
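
The caching is easy to observe in plain Django; peeking at the private _result_cache attribute here is purely for illustration:

    qs = Article.objects.all()
    next(iter(qs))                        # triggers _fetch_all()
    assert qs._result_cache is not None   # every row is now in memory

    qs = Article.objects.all()
    next(qs.iterator())                   # streams rows from the cursor
    assert qs._result_cache is None       # nothing was cached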

leotrubach commented 5 years ago

Please keep in mind that using iterator() affects prefetch_related optimizations, so maybe it would be sufficient to slice the subquerysets instead of using the iterator() method.
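
For reference, the trade-off looks like this in plain Django of that era (authors is a hypothetical many-to-many relation on Book):

    books = Book.objects.prefetch_related('authors')

    for book in books:              # one extra query prefetches all authors
        list(book.authors.all())    # served from the prefetch cache

    for book in books.iterator():   # prefetch_related is ignored here, so
        list(book.authors.all())    # this issues one query per book (N+1)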

clokep commented 5 years ago

I do not believe that __iter__ method is used when iterator() is called; it should use the __iter__ of one of the iterable classes, e.g. https://github.com/django/django/blob/1.11/django/db/models/query.py#L47-L85

There are a bunch of tests in TestIterator to make sure this works, but I actually don't think they cover the code of the ordered iterator. I'm looking into that now.

clokep commented 5 years ago

I looked a bit more at this and I don't see any bugs: whenever I use iterator() it re-evaluates the queries and does not set the _result_cache of the underlying QuerySets to anything.

Can you show an example of code where this isn't working? I'm using things like this:

    qs = QuerySetSequence(Article.objects.all(), Book.objects.all())
    for m in qs.iterator():
        print(m)

clokep commented 5 years ago

So I looked a bit more at this and I can't find anything wrong with when Django caches a query vs. when it doesn't. Regardless, I'm fairly certain the fix here isn't correct, since it never allows the query to be cached, which you actually want sometimes.

Please let me know a bit more about how / when you're using this and we can investigate more.

leotrubach commented 5 years ago

Imagine we have a lot of Article and Book objects (~10,000 rows each). How would your code behave in the following case?

    qs = QuerySetSequence(Article.objects.all(), Book.objects.all()).order_by(...)
    print(list(qs[:5]))

Will it fetch all 10,000 articles and 10,000 books and then return only 5 records?

leotrubach commented 5 years ago

Maybe, to avoid fetching whole subquerysets in _ordered_iterator(), it would be better to first do something like this:

    self._querysets = [qs[:self._high_mark - self._low_mark] for qs in self._querysets]

or

    def _ordered_iterator(self):
        # Cap each subqueryset at the slice size so that the database
        # applies a LIMIT (assumes _low_mark and _high_mark are set).
        querysets = [qs[:self._high_mark - self._low_mark] for qs in self._querysets]
        iterables = []
        for i, qs in zip(self._queryset_idxs, querysets):
            it = iter(qs)

and then merge?
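
For reference, the merge step itself can be expressed with heapq.merge, since each subqueryset is already ordered. This is only a sketch of the idea (the merged helper and the title field are made up for illustration), not the library's implementation:

    import heapq
    from operator import attrgetter

    def merged(querysets, key, limit=None):
        # Each queryset is already ordered by `key`, so a k-way merge
        # yields a globally ordered stream; slicing each input first
        # makes the database apply a LIMIT and caps memory usage.
        if limit is not None:
            querysets = [qs[:limit] for qs in querysets]
        return heapq.merge(*querysets, key=attrgetter(key))

    # Usage: merged([Article.objects.order_by('title'),
    #                Book.objects.order_by('title')], key='title', limit=5)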

leotrubach commented 5 years ago

Or something like this:

    def _ordered_iterator(self):
        """
        Interleave the values of each QuerySet in order to handle the requested
        ordering. Also adds the '#' property to each returned item.
        """
        # A list of tuples, each with:
        #   * The iterable
        #   * The QuerySet number
        #   * The next value
        #
        # (Remember that each QuerySet is already sorted.)
        iterables = []
        if self._high_mark:
            if self._low_mark:
                qslice = slice(None, self._high_mark - self._low_mark)
            else:
                qslice = slice(None, self._high_mark)
        else:
            qslice = slice(None)

I've pushed this change to the PR.
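
The quoted snippet is truncated; presumably qslice is then applied to each subqueryset before the merge, roughly along these lines (a sketch, not the exact PR code):

    for i, qs in zip(self._queryset_idxs, self._querysets):
        it = iter(qs[qslice])   # LIMIT pushed down to each subqueryset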

clokep commented 5 years ago

I'm not sure this makes sense -- isn't this now using the slice of the items to slice the list of QuerySets? I just tweaked the Travis config so that it should run on PRs now; please re-push this so we can see whether the tests pass.

leotrubach commented 5 years ago

Slicing a queryset before it is evaluated applies LIMIT/OFFSET to the query at the database level. I believe it optimizes the query.
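
This is easy to verify in plain Django by printing the compiled SQL (the exact table and column names depend on the model):

    qs = Article.objects.order_by('title')[:5]
    print(qs.query)   # ... ORDER BY "app_article"."title" ASC LIMIT 5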

clokep commented 5 years ago

Ah, I misread what the code is doing: it is essentially limiting each QuerySet to the maximum number of rows that could possibly appear in the slice. So, with even distributions, you would only pull 2x the number of objects in your example.

This would need tests added before merging it.

leotrubach commented 5 years ago

Yes, at worst N * slice_size rows for N subquerysets, because there is a possibility that all of the records in the final slice come from a single subqueryset.

leotrubach commented 5 years ago

So I'm going to write a test that creates 10 items for each model. First I query everything with .all(), wrap it in a list, and slice the resulting list. After that I create another QuerySetSequence, do .all()[:5], and compare it with the previous list. The two lists must be the same.
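
A sketch of that test (model and field names are placeholders matching the earlier examples in this thread):

    def test_slicing_matches_list_slice(self):
        qss = QuerySetSequence(Article.objects.all(), Book.objects.all())
        expected = list(qss.order_by('title'))[:5]

        qss = QuerySetSequence(Article.objects.all(), Book.objects.all())
        result = list(qss.order_by('title')[:5])

        self.assertEqual(expected, result)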

leotrubach commented 5 years ago

@clokep Does anything else need to be tested? What can I do to get this PR merged?

clokep commented 5 years ago

@leotrubach I need to sit down and reason about some edge-cases before merging this. I'll try to take a look at it soon!

clokep commented 4 years ago

Closing due to lack of response.