dagolden / MongoDBx-Queue

Work queue using MongoDB
https://metacpan.org/author/DAGOLDEN

High CPU usage on MongoDB due to a slow request when using a huge queue #4

Open · Zer0-Tolerance opened this issue 5 years ago

Zer0-Tolerance commented 5 years ago

Hi David,

Great module, very useful for handling a very large queue, but it seems to have an issue with collections of more than a million documents. Here is the result of checking for slow queries (i.e. running > 3 secs):

    db.currentOp({"secs_running": {$gte: 3}})

/* 1 */
{
    "inprog" : [
        {
            "host" : "recon.local:27017",
            "desc" : "conn1096",
            "connectionId" : 1096,
            "client" : "127.0.0.1:51081",
            "clientMetadata" : {
                "driver" : {
                    "name" : "MongoDB Perl Driver",
                    "version" : "2.0.3"
                },
                "os" : {
                    "type" : "darwin"
                },
                "platform" : "Perl v5.24.4 darwin-thread-multi-2level"
            },
            "active" : true,
            "currentOpTime" : "2019-06-10T17:43:56.835+0200",
            "opid" : 18276052,
            "secs_running" : NumberLong(5),
            "microsecs_running" : NumberLong(5256346),
            "op" : "command",
            "ns" : "NScan.hostq",
            "command" : {
                "aggregate" : "hostq",
                "pipeline" : [
                    { "$match" : { "_r" : { "$exists" : false } } },
                    { "$group" : { "n" : { "$sum" : 1 }, "_id" : null } }
                ],
                "cursor" : {},
                "$readPreference" : { "mode" : "secondaryPreferred" },
                "$db" : "NScan"
            },
            "planSummary" : "COLLSCAN",
            "numYields" : 37202,
            "locks" : {},
            "waitingForLock" : false,
            "lockStats" : {
                "Global" : { "acquireCount" : { "r" : NumberLong(74408) } },
                "Database" : { "acquireCount" : { "r" : NumberLong(37204) } },
                "Collection" : { "acquireCount" : { "r" : NumberLong(37204) } }
            }
        },
        ...
    ],
    "ok" : 1.0
}

After some testing, it's the $group stage in the aggregation that is very slow:

    db.getCollection('hostq').aggregate([
        { $match: { "_r" : { "$exists" : false } } }
        // { "$group" : { "n" : { "$sum" : 1 }, "_id" : null } }
    ])
    -> 0.037 secs

    db.getCollection('hostq').aggregate([
        { $match: { "_r" : { "$exists" : false } } },
        { "$group" : { "n" : { "$sum" : 1 }, "_id" : null } }
    ])
    -> 13 secs

Any chance you can fix this and update the module?

xdg commented 5 years ago

Hi, that should be fixable with an appropriate index. If you add an index {"_r":1, "_p":1}, does that get rid of the slow query? (Also, can you confirm whether you have an existing index on "_p"?)

Zer0-Tolerance commented 5 years ago

I already checked that and added an index on _r, but it's still very slow ...

    "indexSizes" : { "_id_" : 226652160, "_p_1" : 46321664, "_r_1" : 21610496 },

Any other idea? I'm using Mongo 3.6.

xdg commented 5 years ago

I was suggesting a joint index on _r and _p, with _r first, following the rule of "equality, sort, range" for indexes. That should do better than an individual index as it should avoid the collection scan (COLLSCAN).
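
For reference, creating that compound index with the Perl driver would look something like this (a sketch; the connection string and names are placeholders for your setup):

    use MongoDB;

    # Placeholder connection details -- adjust to your deployment.
    my $client = MongoDB->connect("mongodb://localhost:27017");
    my $coll   = $client->db("NScan")->coll("hostq");

    # Compound index: equality field (_r) first, then the sort field (_p),
    # following the "equality, sort, range" rule.
    $coll->indexes->create_one( [ _r => 1, _p => 1 ] );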

Zer0-Tolerance commented 5 years ago

Just tried the compound index, but same result: still 3 secs with 2.6 million docs.

xdg commented 5 years ago

Ok, I'll have to run some experiments locally, but I'm slammed at work and then on vacation next week (which is why I'm slammed this week). So please be patient, but I will look into it. This should be possible without a collection scan.

xdg commented 5 years ago

Looking again at your report, nothing in MongoDBx::Queue is issuing an aggregate or a $group pipeline stage -- so I think this must be some other query you're running outside MongoDBx::Queue.

I also found SERVER-29444 for improving performance of $group based on indexes. You might find some insight there and in linked tickets. For example, it's possible that indexing on the group field and sorting it first could help, but that's getting outside my area of expertise.

Zer0-Tolerance commented 5 years ago

I've just rechecked, but I'm not doing any query on the collection used by MongoDBx::Queue, so maybe it's coming from MongoDBx::Queue itself?

xdg commented 5 years ago

Oh, I see now. That's the new underlying query generated by count_documents. Great! Now I know what to investigate.

xdg commented 5 years ago

I take it you're calling waiting -- that would be the origin of the slow query. In the meantime, if you can avoid that call, you'll see performance improve.
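
In code, that's just a matter of swapping the method call -- a sketch, with constructor attributes as documented for the module and names taken from the report above:

    use MongoDBx::Queue;

    my $queue = MongoDBx::Queue->new(
        database_name   => "NScan",
        collection_name => "hostq",
    );

    # waiting() counts only unreserved tasks -- the slow $match/$group
    # query above -- while size() counts all tasks and is cheaper here.
    my $n = $queue->size;    # instead of $queue->waiting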

Zer0-Tolerance commented 5 years ago

True, I've changed to size and it's better, but its underlying queries use a COLLSCAN, which is surprising given that it gets better results than waiting, which uses an IXSCAN (index). Nevertheless, I'm still getting 500% CPU usage, but it's not kicking on the CPU fan anymore ...

xdg commented 5 years ago

There's a design problem in that the "_r" field is created and removed, and that makes it hard to index -- even a sparse index doesn't always get picked up properly by the query planner. The fix would be to change the internals to always create "_r" with some value like 0 meaning "not reserved"; then indexing would work correctly. The problem is that's a breaking change, so it's a major version bump, and I'd need to provide some way for people to upgrade their old data, too. That's what I'm currently thinking through.
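
A minimal sketch of that idea (illustrative only, not the released implementation; the _x flag name matches what later appears in the v2 plan summary below):

    use MongoDB;
    use boolean;

    # Placeholder connection; adjust to your deployment.
    my $coll = MongoDB->connect("mongodb://localhost:27017")
                      ->db("NScan")->coll("hostq_v2");

    # Every task document always carries a reserved flag, so counting
    # waiting tasks becomes an indexed equality match instead of $exists.
    $coll->indexes->create_one( [ _x => 1, _p => 1 ] );
    my $waiting = $coll->count_documents( { _x => false } );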

Zer0-Tolerance commented 5 years ago

I fully understand your problem, but it would be good to get better performance :) Let me know if you need some beta testing.

Zer0-Tolerance commented 5 years ago

Any progress?

xdg commented 5 years ago

A bit. Conceptually, I'm going to rewrite MongoDBx::Queue as a facade in front of two implementation classes. The default will be the legacy behavior, but users can opt into v2. Migration is then just a matter of reserving tasks from the old queue and inserting them into a new v2 queue (a separate collection). That should avoid surprising users on update and give users control over migration -- whether to shut down all old work processing and migrate, or to keep the old queue running while migrating to a new one.
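
As a rough illustration of that facade shape (class and method names here are hypothetical, not the module's actual internals):

    #!/usr/bin/env perl
    use strict;
    use warnings;

    package MyQueue::V1;
    sub new      { bless {}, shift }
    sub add_task { print "v1 add_task\n" }

    package MyQueue::V2;
    sub new      { bless {}, shift }
    sub add_task { print "v2 add_task\n" }

    package MyQueue;    # the facade: picks an implementation by version

    sub new {
        my ( $class, %args ) = @_;
        my $impl = ( $args{version} // 1 ) >= 2 ? 'MyQueue::V2' : 'MyQueue::V1';
        return bless { impl => $impl->new }, $class;
    }

    # Public methods delegate to whichever implementation was selected.
    sub add_task { my $self = shift; $self->{impl}->add_task(@_) }

    package main;
    MyQueue->new( version => 2 )->add_task;    # prints "v2 add_task"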

I'm not getting a lot of time to hack on it around family obligations, but at least I now have a plan and will keep chipping away. I'm working on migrating the legacy code to the facade model, and when that passes testing, I'll work on v2. Worst case, I'm back from vacation next week and can bang something out then.

xdg commented 5 years ago

Here's a test tarball: https://www.dropbox.com/s/ypl3ia5v1ze92e2/MongoDBx-Queue-2.001-TRIAL.tar.gz?dl=1

I did some local testing (at 100k documents) and a query comparable to waiting uses an index rather than a collection scan.

You'll need to migrate/copy your jobs from the v1 queue to another collection/database name for the v2 queue (using version => 2 in the queue options) and test there. That's not well documented yet.
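
A migration along those lines might look like this -- a sketch, not a documented procedure; the v2 collection name is a placeholder, and the underscore filter reflects that queue-internal fields start with "_":

    use MongoDBx::Queue;

    my $old = MongoDBx::Queue->new(
        database_name   => "NScan",
        collection_name => "hostq",
    );
    my $new = MongoDBx::Queue->new(
        database_name   => "NScan",
        collection_name => "hostq_v2",    # placeholder name
        version         => 2,
    );

    # Drain the old queue: reserve each task, re-insert its user data
    # into the v2 queue, then remove it from the old one.
    while ( my $task = $old->reserve_task ) {
        my %data = map { $_ => $task->{$_} } grep { !/^_/ } keys %$task;
        $new->add_task( \%data );
        $old->remove_task($task);
    }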

Zer0-Tolerance commented 5 years ago

Great, thanks! I'll do some tests this weekend and let you know.

Zer0-Tolerance commented 5 years ago

Hi David, I still have some slow queries like this one with a very big queue:

inprog" : [ 
        {
            "host" : "recon.home:27017",
            "desc" : "conn14",
            "connectionId" : 14,
            "client" : "127.0.0.1:49381",
            "clientMetadata" : {
                "driver" : {
                    "name" : "MongoDB Perl Driver",
                    "version" : "2.0.3"
                },
                "os" : {
                    "type" : "darwin"
                },
                "platform" : "Perl v5.24.4 darwin-thread-multi-2level"
            },
            "active" : true,
            "currentOpTime" : "2019-06-25T07:55:50.672+0200",
            "opid" : 87194,
            "secs_running" : NumberLong(2),
            "microsecs_running" : NumberLong(2856433),
            "op" : "command",
            "ns" : "NScan.hostq",
            "command" : {
                "aggregate" : "hostq",
                "pipeline" : [ 
                    {
                        "$match" : {
                            "_x" : false
                        }
                    }, 
                    {
                        "$group" : {
                            "_id" : null,
                            "n" : {
                                "$sum" : 1
                            }
                        }
                    }
                ],
                "cursor" : {},
                "$readPreference" : {
                    "mode" : "secondaryPreferred"
                },
                "$db" : "NScan"
            },
            "planSummary" : "COUNT_SCAN { _x: 1, _p: 1 }",
            "numYields" : 25632,
            "locks" : {},
            "waitingForLock" : false,
            "lockStats" : {
                "Global" : {
                    "acquireCount" : {
                        "r" : NumberLong(51268)
                    }
                },
                "Database" : {
                    "acquireCount" : {
                        "r" : NumberLong(25634)
                    }
                },
                "Collection" : {
                    "acquireCount" : {
                        "r" : NumberLong(25634)
                    }
                }
            }
        }
    ],
    "ok" : 1.0
xdg commented 5 years ago

I've discussed this with the query team. The COUNT_SCAN plan is the best the database is going to do -- it's walking the index and incrementing a counter without touching any underlying data and you've just got a lot of things to count. I suspect the only way to make it faster is to scale up the hardware the database is running on.