radzhome opened this issue 5 years ago
Enable performance logging in Mongo and try to figure out what query is being run: https://docs.mongodb.com/manual/reference/method/db.setLogLevel/
Will update this shortly; just trying to get the output from db.currentOp() while the query is running.
Sorry, a bit behind with this. I'm going to create a small project that reproduces the issue.
So here are the queries I see:
DEBUG 2019-02-01 15:41:30,457 query 26660 123145427501056 DEBUG
sql_command: SELECT COUNT(*) AS "__count" FROM "content"
DEBUG 2019-02-01 15:41:32,509 query 26660 123145427501056 DEBUG Aggregation query: [{'$count': '_count'}]
DEBUG 2019-02-01 15:41:32,509 query 26660 123145427501056 DEBUG params: ()
DEBUG 2019-02-01 15:41:32,510 query 26660 123145427501056 DEBUG
sql_command: SELECT COUNT(*) AS "__count" FROM "content"
DEBUG 2019-02-01 15:41:34,454 query 26660 123145427501056 DEBUG Aggregation query: [{'$count': '_count'}]
DEBUG 2019-02-01 15:41:34,464 query 26660 123145427501056 DEBUG params: ()
DEBUG 2019-02-01 15:41:34,464 query 26660 123145427501056 DEBUG
sql_command: SELECT "content"."_id", "content"."titles", "content"."credits", "content"."taxonomies", "content"."content_elements", "content"."metadata", "content"."featured_media", "content"."client_id", "content"."origin_id", "content"."origin_url", "content"."origin_url_path", "content"."origin_cms", "content"."origin_slug", "content"."type", "content"."status", "content"."version", "content"."order", "content"."excerpt", "content"."global_slug", "content"."nlp", "content"."created_on", "content"."modified_on", "content"."published_on", "content"."imported_on" FROM "content" ORDER BY "content"."_id" DESC LIMIT 100
DEBUG 2019-02-01 15:41:34,471 query 26660 123145427501056 DEBUG Find query: {'projection': ['_id', 'titles', 'credits', 'taxonomies', 'content_elements', 'metadata', 'featured_media', 'client_id', 'origin_id', 'origin_url', 'origin_url_path', 'origin_cms', 'origin_slug', 'type', 'status', 'version', 'order', 'excerpt', 'global_slug', 'nlp', 'created_on', 'modified_on', 'published_on', 'imported_on'], 'limit': 100, 'sort': [('_id', -1)]}
And here is the project that will help reproduce it:
https://github.com/radzhome/djongo-count-bug
In this project, once you generate more data, the page load slows from an initial ~800 ms down to about 4-5 s at 2.5M documents.
In the real data each doc is much bigger, so I will try with bigger docs next. Those queries take 30+ seconds in our dev environment, where the data is bigger.
So the issue is with this query:
db.getCollection('content').aggregate([{'$count': '_count'}])
It is very slow compared to just doing a count().
Run db.getCollection('content').aggregate([{'$count': '_count'}], {explain: true}) and you'll see a
"winningPlan" : {
    "stage" : "COLLSCAN",
    "direction" : "forward"
},
which means that the count will always be slow. You could probably extend the Manager for that Model and override count(). You could also open a ticket with MongoDB and ask them why this simple, indexable query isn't using an index.
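For reference, the same plan can be fetched from pymongo by sending a raw explain command. A minimal sketch, where the build_explain_command helper is illustrative (not part of djongo or pymongo) and actually sending the command assumes a reachable mongod:

```python
def build_explain_command(collection, pipeline, verbosity="queryPlanner"):
    """Wrap an aggregate command in MongoDB's explain command document."""
    return {
        "explain": {"aggregate": collection, "pipeline": pipeline, "cursor": {}},
        "verbosity": verbosity,
    }

cmd = build_explain_command("content", [{"$count": "_count"}])

# Sending it needs a live server, e.g.:
# from pymongo import MongoClient
# plan = MongoClient().mydb.command(cmd)
# print(plan["queryPlanner"]["winningPlan"]["stage"])
```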
If there were a way in djongo to recognize that the query is just a count, i.e. if pipeline == [{'$count': '_count'}]: in _get_cursor, then you could just return self.db_ref[self.left_table].count(). But of course that is no longer a cursor but an int, which causes issues. Do you think there is a clean way to do it in djongo?
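One way around the cursor-vs-int mismatch is to wrap the count in a one-document iterable shaped like the aggregation result. A sketch of the idea (count_as_cursor is hypothetical, not djongo code):

```python
def count_as_cursor(count, alias="_count"):
    """Yield a single document shaped like the output of [{'$count': alias}],
    so callers that iterate an aggregation cursor keep working."""
    yield {alias: count}

# A caller expecting an aggregate() cursor sees the same one-row shape:
rows = list(count_as_cursor(2492309))
```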
I wonder if there is a way to force it to use an index: https://docs.mongodb.com/manual/reference/command/count/
When you run a plain count() on the collection, which is the case here, the query is fast.
Or, if I were to write my own Manager, what query would I want to define for count, given that it's doing SELECT COUNT(*) AS "__count" FROM "content", in order for djongo to use count() on the collection rather than the aggregation?
This fixes it. Do you want me to create a PR?
def _get_cursor(self):
    if self._needs_aggregation():
        pipeline = self._make_pipeline()
        if pipeline == [{'$count': '_count'}]:
            # The only thing we are doing is a count -- speed it up by
            # counting directly, then shape the result like the
            # aggregation output so callers still get a cursor.
            cnt = self.db_ref[self.left_table].count()
            cur = self.db_ref[self.left_table].aggregate([
                {'$addFields': {'_count': cnt}},
                {'$limit': 1},
                {'$project': {'_id': 0, '_count': 1}},
            ])
        else:
            import time
            start = time.time()
            logger.debug(f'Aggregation query start: {pipeline}')
            logger.debug(f'table is {self.left_table}')
            cur = self.db_ref[self.left_table].aggregate(pipeline)
            logger.debug(f'Aggregation query: {pipeline} in time {time.time() - start}')
    else:
        kwargs = {}
        if self.where:
            kwargs.update(self.where.to_mongo())
        if self.selected_columns:
            kwargs.update(self.selected_columns.to_mongo())
        if self.limit:
            kwargs.update(self.limit.to_mongo())
        if self.order:
            kwargs.update(self.order.to_mongo())
        if self.offset:
            kwargs.update(self.offset.to_mongo())
        import time
        start = time.time()
        logger.debug(f'Find query: {kwargs}')
        cur = self.db_ref[self.left_table].find(**kwargs)
        logger.debug(f'Find query: {kwargs} in time {time.time() - start}')
    return cur
I added it here to my fork - https://github.com/radzhome/djongo/blob/master/djongo/sql2mongo/query.py#L251
This is more of a bug in MongoDB. It does not use indexing while using aggregation.
Perhaps, but there is a way around it now.
MongoDB does use indexes in aggregations when the pipeline begins with a stage that supports them (e.g. $match). The problem here is that the aggregation stage { $count: '_count' } is actually a helper that expands to
[
    {
        $group: {
            _id: null,
            _count: { $sum: 1 }
        }
    },
    { $project: { _id: 0 } }
]
(https://docs.mongodb.com/manual/reference/operator/aggregation/count/)
which will run through all documents in the collection if you don't filter the results first. That stage is meant for counting the results of an aggregation, not a full collection. self.db_ref[self.left_table].count(), by contrast, queries the collection metadata to get the current count of documents.
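To see why the expanded form touches every document, here is the $group / $sum: 1 logic simulated in plain Python over an in-memory list (illustrative only, not how the server is implemented):

```python
def count_via_group(docs):
    """Simulate [{'$group': {'_id': None, '_count': {'$sum': 1}}},
    {'$project': {'_id': 0}}]: every document is visited once."""
    visited = 0
    total = 0
    for _doc in docs:   # $group must consume the full input stream
        visited += 1
        total += 1      # $sum: 1 adds one per document
    return {'_count': total}, visited

result, visited = count_via_group([{} for _ in range(1000)])
```

With no leading $match to narrow the input, the work is linear in the collection size, which matches the 2.5M-document slowdown reported above.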
This is how I fixed this:
from django.db import connections
from django.db.models.query import QuerySet


class CountQuerySet(QuerySet):

    def count(self):
        is_mongo = 'djongo' in connections[self.db].Database.__name__
        query = self.query
        if (is_mongo and not query.where and query.high_mark is None and
                query.low_mark == 0 and not query.select and
                not query.group_by and not query.distinct):
            # Unfiltered count over the whole collection: use the fast
            # metadata-based count instead of the aggregation pipeline.
            return self.model.objects.mongo_estimated_document_count()
        else:
            return super(CountQuerySet, self).count()

# [...]

@admin.register(Books)
class BooksAdmin(admin.ModelAdmin):
    # [...]

    def get_queryset(self, request):
        return CountQuerySet(model=self.model)
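The guard above boils down to "is this an unfiltered, unsliced, ungrouped count?". A framework-free sketch of that predicate, where QueryShape is a stub standing in for Django's internal query object (it only mirrors the attribute names the guard inspects):

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class QueryShape:
    """Stub with the attributes CountQuerySet.count() inspects."""
    where: bool = False
    high_mark: Optional[int] = None
    low_mark: int = 0
    select: tuple = ()
    group_by: Optional[tuple] = None
    distinct: bool = False


def is_plain_count(q):
    """True only when the queryset covers the whole, unfiltered collection,
    so a metadata-based estimated count is safe to substitute."""
    return (not q.where and q.high_mark is None and q.low_mark == 0
            and not q.select and not q.group_by and not q.distinct)
```

Any filter, slice, or grouping makes the predicate fall back to the exact (slow) count, since the estimated count would then be wrong.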
I will produce a mini project. When I go to my content model in the admin, which has 2.6M docs, it hangs and never returns. If I trim the collection down to, say, 100k records, it works again.
This operation via the mongo client returns quickly:
db.getCollection('content').count()
I get 2492309, so I suspect the query conversion is doing something funky and reading every single doc in the collection to get the count.
In the logs I only see sql_command: SELECT COUNT(*) AS "__count" FROM "content" and it hangs and never returns; I have about 2.5M docs in the content collection. I'll see if I can reproduce it with a mini project.