I don't believe this change is necessary. If you have a `QuerySetSequence` and call `iterator()` on it, then it should not cache, just like Django does. Is this not working when using the ordered iterator for some reason?
@clokep Yes I also thought so :)
Here is the source of the `__iter__` method of Django's `QuerySet`:
```python
def __iter__(self):
    """
    The queryset iterator protocol uses three nested iterators in the
    default case:
        1. sql.compiler.execute_sql()
           - Returns 100 rows at time (constants.GET_ITERATOR_CHUNK_SIZE)
             using cursor.fetchmany(). This part is responsible for
             doing some column masking, and returning the rows in chunks.
        2. sql.compiler.results_iter()
           - Returns one row at time. At this point the rows are still just
             tuples. In some cases the return values are converted to
             Python values at this location.
        3. self.iterator()
           - Responsible for turning the rows into model objects.
    """
    self._fetch_all()
    return iter(self._result_cache)
```
So when you iterate over a queryset (i.e. call `iter(qs)`), it first fetches all rows into memory and only then iterates over that list. Yes, this is counterintuitive. When using `QuerySetSequence` with pagination, the pagination happens only after all rows of the sub-querysets have been fetched, and an OutOfMemory error occurs. We ran into that on our project.
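For illustration, a minimal hedged sketch of the difference (`Article` is a placeholder model, and `_result_cache` is a private Django attribute, shown only to make the caching visible):

```python
articles = Article.objects.all()
for a in articles:              # iter() triggers _fetch_all(): every row
    pass                        # is loaded into memory and cached
assert articles._result_cache is not None

articles = Article.objects.all()
for a in articles.iterator():   # streams rows in chunks from the cursor
    pass
assert articles._result_cache is None   # nothing is cached
```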
Please keep in mind that using `iterator()` affects `prefetch_related()` optimizations. So maybe it would be sufficient to slice the sub-querysets instead of using the `iterator()` method.
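A hedged illustration of that interaction (`Article`/`authors` are placeholder names; on the Django releases current for this discussion, `iterator()` causes previous `prefetch_related()` calls to be ignored):

```python
qs = Article.objects.prefetch_related('authors')

list(qs)                         # 2 queries: the articles plus one batched prefetch

for article in qs.iterator():    # the prefetch is skipped here,
    list(article.authors.all())  # so this costs one query per article (N+1)
```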
I do not believe the `__iter__` method is used when `iterator()` is called; it should use the `__iter__` of one of the iterable classes, e.g. https://github.com/django/django/blob/1.11/django/db/models/query.py#L47-L85
There's a bunch of tests to make sure this works in `TestIterator`, but I actually don't think it covers the code of the ordered iterator. I'm looking into that now.
I looked a bit more at this and I don't see any bugs: whenever I use `iterator()` it re-evaluates the queries and does not set the `_result_cache` of the underlying `QuerySet`s to anything.
Can you show an example of code that this isn't working for? I'm using things like this:

```python
qs = QuerySetSequence(Article.objects.all(), Book.objects.all())
for m in qs.iterator():
    print(m)
```
So I looked a bit more at this and can't find anything wrong with when Django caches a query vs. when it does not. Regardless, I'm fairly certain the fix here isn't correct, since it never allows the query to be cached, which you actually want sometimes.
Please let me know a bit more about how / when you're using this and we can investigate more.
Imagine we have a lot of `Article` and `Book` objects (~10000 rows each). How would your code behave in this case:

```python
qs = QuerySetSequence(Article.objects.all(), Book.objects.all()).order_by(...)
print(list(qs[:5]))
```

Will it fetch all 10000 articles and 10000 books and then return only 5 records?
Maybe, to avoid fetching whole sub-querysets in `_ordered_iterator()`, it's better to first do something like this:

```python
self._querysets = [qs[:self._high_mark - self._low_mark] for qs in self._querysets]
```

or

```python
def _ordered_iterator(self):
    querysets = [qs[:self._high_mark - self._low_mark] for qs in self._querysets]
    iterables = []
    for i, qs in zip(self._queryset_idxs, querysets):
        it = iter(qs)
```

and then merge?
Or something like this:

```python
def _ordered_iterator(self):
    """
    Interleave the values of each QuerySet in order to handle the requested
    ordering. Also adds the '#' property to each returned item.
    """
    # A list of tuples, each with:
    #   * The iterable
    #   * The QuerySet number
    #   * The next value
    #
    # (Remember that each QuerySet is already sorted.)
    iterables = []

    if self._high_mark:
        if self._low_mark:
            qslice = slice(None, self._high_mark - self._low_mark)
        else:
            qslice = slice(None, self._high_mark)
    else:
        qslice = slice(None)
```
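To make the slice-then-merge idea above concrete, here is a rough standalone sketch. It is not the library's actual `_ordered_iterator`; the function, its parameters, and the single ascending ordering field are assumptions for illustration, and `high_mark` is used as a conservative per-queryset bound:

```python
import heapq
from operator import attrgetter

def sliced_ordered_merge(querysets, low_mark, high_mark, field):
    # Limit each sub-queryset at the database level. In the worst case every
    # row in the requested window comes from a single sub-queryset, so each
    # one is capped at high_mark rows.
    if high_mark is not None:
        querysets = [qs[:high_mark] for qs in querysets]

    # Each sub-queryset is already ordered by `field`, so they can be merged
    # lazily without materialising the combined result.
    merged = heapq.merge(*querysets, key=attrgetter(field))

    # Apply the requested [low_mark:high_mark] window to the merged stream.
    for i, obj in enumerate(merged):
        if high_mark is not None and i >= high_mark:
            return
        if i >= (low_mark or 0):
            yield obj
```

With `low_mark=0`, `high_mark=5`, and two sub-querysets, this pulls at most 10 rows instead of everything.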
Pushed this change to the PR.
I'm not sure this makes sense -- isn't this now using the slice of the items to slice the list of `QuerySet`s? I just tweaked the Travis config so that it should run on PRs now, so please re-push this so we can see if the tests pass.
Slicing a queryset before it is evaluated applies LIMIT/OFFSET at the database level. I believe it optimizes the query.
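For example, a quick way to verify this (`Article` is a placeholder model):

```python
qs = Article.objects.order_by('pk')[10:15]
print(qs.query)   # the generated SELECT ends with "LIMIT 5 OFFSET 10"
```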
Ah, I mis-read what the code is doing: it is essentially limiting each `QuerySet` to the maximum number of rows that could possibly appear in the slice. So, if you have an even distribution, you would only pull 2x the number of objects in your example.

This would need tests added before merging it.
Yes, N * slice_size for N sub-querysets, because there is a possibility that all records in the slice come from a single sub-queryset (e.g. with 2 sub-querysets and a slice of 5, at most 10 rows are fetched).
So I'm going to write a test that creates 10 items for each model. First I query it with `.all()`, wrap the result in a list, and slice the resulting list. After that I create another queryset, do `.all()[:5]`, and compare it with the previous list. The lists must be the same.
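A hedged sketch of what that test could look like (the model names, fields, and test class are assumptions, not the project's actual test code):

```python
from django.test import TestCase
from queryset_sequence import QuerySetSequence

from .models import Article, Book  # placeholder test models


class SliceTests(TestCase):
    def test_slice_matches_sliced_list(self):
        for i in range(10):
            Article.objects.create(title='a%02d' % i)
            Book.objects.create(title='b%02d' % i)

        everything = QuerySetSequence(
            Article.objects.all(), Book.objects.all()).order_by('title')
        expected = list(everything)[:5]

        sliced = QuerySetSequence(
            Article.objects.all(), Book.objects.all()).order_by('title')
        self.assertEqual(list(sliced[:5]), expected)
```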
@clokep Is there anything else that needs to be tested? What can I do to get this PR merged?
@leotrubach I need to sit down and reason about some edge-cases before merging this. I'll try to take a look at it soon!
Closing due to lack of response.
Since `iter(qs)` loads the whole query into memory before it is paginated, this leads to high memory usage, and if there are a lot of rows in the queryset we will get an OutOfMemory error. Using the `iterator()` method solves this issue.

Things to consider (not part of this PR, but let's discuss):

- On Django 2.0+ this could be optimized even further by passing an appropriate `chunk_size`, depending on the values of `_low_mark` and `_high_mark`.
- Another idea: slice `self._querysets` if `_high_mark` is not None before iterating over them.
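A rough sketch of the `chunk_size` idea under those assumptions (the helper name is made up; `_low_mark`/`_high_mark` are the internal attributes discussed above, and `iterator(chunk_size=...)` requires Django 2.0+):

```python
def _iterator_with_window(self, queryset):
    # Never fetch more rows per database round-trip than the requested
    # window can possibly need (2000 is Django's default chunk size).
    if self._high_mark is not None:
        window = self._high_mark - (self._low_mark or 0)
        return queryset.iterator(chunk_size=max(1, min(2000, window)))
    return queryset.iterator()
```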