doableware / djongo

Django and MongoDB database connector
https://www.djongomapper.com
GNU Affero General Public License v3.0
1.88k stars 353 forks source link

count sql is very slow, using mongo 3.6, with 2.5m records #233

Open radzhome opened 5 years ago

radzhome commented 5 years ago

One line description of the issue

Python script

I will produce a mini project. When I go to my content model in admin which has 2.6m docs, it hangs and never returns. If I trunk the collection down to say 100k records, it works again.

This operation via mongo client seems to return quickly:

db.getCollection('content').count() I get 2492309. so I suspect that the query conversion is doing something funky and reading every single doc in the collection to get len/count.

Traceback

I only see sql_command: SELECT COUNT(*) AS "__count" FROM "content"

and it hangs, will never return, I have about 2.5m docs in content collection. I'll see if I can reproduce using a mini project.

thestick613 commented 5 years ago

Enable performance logging in mongo and try to figure out what query is being run? https://docs.mongodb.com/manual/reference/method/db.setLogLevel/

radzhome commented 5 years ago

You will update that shortly, just trying to get output from db.currentOp() when the query is run.

radzhome commented 5 years ago

Sorry bit behind with this. I'm going to create a small project that reproduces the issue.

radzhome commented 5 years ago

So here is the queries I see

DEBUG 2019-02-01 15:41:30,457 query 26660 123145427501056 DEBUG
 sql_command: SELECT COUNT(*) AS "__count" FROM "content"
DEBUG 2019-02-01 15:41:32,509 query 26660 123145427501056 DEBUG Aggregation query: [{'$count': '_count'}]
DEBUG 2019-02-01 15:41:32,509 query 26660 123145427501056 DEBUG params: ()
DEBUG 2019-02-01 15:41:32,510 query 26660 123145427501056 DEBUG
 sql_command: SELECT COUNT(*) AS "__count" FROM "content"
DEBUG 2019-02-01 15:41:34,454 query 26660 123145427501056 DEBUG Aggregation query: [{'$count': '_count'}]
DEBUG 2019-02-01 15:41:34,464 query 26660 123145427501056 DEBUG params: ()
DEBUG 2019-02-01 15:41:34,464 query 26660 123145427501056 DEBUG
 sql_command: SELECT "content"."_id", "content"."titles", "content"."credits", "content"."taxonomies", "content"."content_elements", "content"."metadata", "content"."featured_media", "content"."client_id", "content"."origin_id", "content"."origin_url", "content"."origin_url_path", "content"."origin_cms", "content"."origin_slug", "content"."type", "content"."status", "content"."version", "content"."order", "content"."excerpt", "content"."global_slug", "content"."nlp", "content"."created_on", "content"."modified_on", "content"."published_on", "content"."imported_on" FROM "content" ORDER BY "content"."_id" DESC  LIMIT 100
DEBUG 2019-02-01 15:41:34,471 query 26660 123145427501056 DEBUG Find query: {'projection': ['_id', 'titles', 'credits', 'taxonomies', 'content_elements', 'metadata', 'featured_media', 'client_id', 'origin_id', 'origin_url', 'origin_url_path', 'origin_cms', 'origin_slug', 'type', 'status', 'version', 'order', 'excerpt', 'global_slug', 'nlp', 'created_on', 'modified_on', 'published_on', 'imported_on'], 'limit': 100, 'sort': [('_id', -1)]}

And here is the project that will help reproduce it

https://github.com/radzhome/djongo-count-bug

this project once you generate more data, the page starts to slow down from initially 800ms load down to about 4-5s when its at 2.5m

The real data, each doc is much bigger so will try with bigger docs next. It takes 30+ seconds to do those queries when the data is bigger in our dev env.

radzhome commented 5 years ago

So the issue is with this query db.getCollection('content').aggregate( [{'$count': '_count'}] ) .. very slow vs just doing a count()

thestick613 commented 5 years ago

Do a db.getCollection('content').aggregate( [{'$count': '_count'}], {explain: true}), you'll see a

"winningPlan" : {
            "stage" : "COLLSCAN",
            "direction" : "forward"
},

Which means that the count will always be slow. You could probably extend the Manager for that Model and override the count. You could also open a ticket with mongodb and ask them why this simple indexable query isn't using an index.

radzhome commented 5 years ago

If there was a way to just realize the query is a count though in djongo i.e. if pipeline == [{'$count': '_count'}]: in _get_cursor then you could just return self.db_ref[self.left_table].count() but ofcourse this is no longer a cursor but an int which causes issues. You think there is a clean way to do it in djongo?

radzhome commented 5 years ago

I wonder if there is a way to force it to use an index, https://docs.mongodb.com/manual/reference/command/count/

When you run just count() on a collection which is the case here, the query is fast

radzhome commented 5 years ago

Or if I were to write my own Manager, what query would I want to define for count, given that its doing SELECT COUNT(*) AS "__count" FROM "content" in order for djongo to use count() on the collection rather than doing the aggregation?

radzhome commented 5 years ago

This fixes it, you want me to create a pr?

    def _get_cursor(self):
        if self._needs_aggregation():
            pipeline = self._make_pipeline()
            if pipeline == [{'$count': '_count'}]:  # Only thing we are doing is count .. lets speed it up
                cnt = self.db_ref[self.left_table].count()
                cur = self.db_ref[self.left_table].aggregate([{"$addFields": {"_count": cnt}}, {"$limit": 1},
                                                              {'$project': {'_id': 0,  '_count': 1}}])
            else:
                logger.debug(f'Aggregation query start: {pipeline}')
                logger.debug("table is {}".format(self.left_table))
                import time; s=time.time()
                cur = self.db_ref[self.left_table].aggregate(pipeline)
                logger.debug('Aggregation query: {} in time {}'.format(pipeline, time.time() -s))
        else:
            kwargs = {}
            if self.where:
                kwargs.update(self.where.to_mongo())

            if self.selected_columns:
                kwargs.update(self.selected_columns.to_mongo())

            if self.limit:
                kwargs.update(self.limit.to_mongo())

            if self.order:
                kwargs.update(self.order.to_mongo())

            if self.offset:
                kwargs.update(self.offset.to_mongo())

            import time; s = time.time()
            logger.debug(f'Find query: {kwargs}')
            cur = self.db_ref[self.left_table].find(**kwargs)
            logger.debug('Find query: {} in time {}'.format(kwargs, time.time() - s))

        return cur
radzhome commented 5 years ago

I added it here to my fork - https://github.com/radzhome/djongo/blob/master/djongo/sql2mongo/query.py#L251

nesdis commented 5 years ago

This is more of a bug in MongoDB. It does not use indexing while using aggregation.

radzhome commented 5 years ago

Perhaps, but there is a way around it now.

ryanm-pm commented 4 years ago

MongoDB does use indexing on aggregations when you use an aggregation step that supports indexing at the beginning of the pipeline (e.g. $match). The problem here is that the aggregation step { $count: '_count' } is actually a helper method that expands to

[
    {
        $group: {
            _id: null,
            _count: { $sum: 1 }
        }
    },
    { $project: { _id: 0 } }
]

(https://docs.mongodb.com/manual/reference/operator/aggregation/count/)

Which will run through all documents in the collection if you don't first filter the results. The use of this step is for counting results of an aggregation, not a full collection. Using self.db_ref[self.left_table].count() actually queries the collection metadata to get the current count of documents.

thestick613 commented 4 years ago

This is how i fixed this:

from django.db.models.query import QuerySet
from django.db import connections

class CountQuerySet(QuerySet):
    def count(self):
        is_mongo = 'djongo' in connections[self.db].Database.__name__
        query = self.query

        if (is_mongo and not query.where and query.high_mark is None and query.low_mark == 0 and
                not query.select and
                not query.group_by and
                not query.distinct):
            return self.model.objects.mongo_estimated_document_count()

        else:
            return super(CountQuerySet, self).count()

# [...]

@admin.register(Books)
class BooksAdmin(admin.ModelAdmin):
    # [...]
    def get_queryset(self, request):
        return CountQuerySet(model=self.model)