craftcms / cms

Build bespoke content experiences with Craft.
https://craftcms.com

Queue job loops indefinitely: "Could not acquire a mutex lock for the queue" #10189

Closed: benjaminkohl closed this issue 2 years ago

benjaminkohl commented 2 years ago

Description

We are experiencing a semi-urgent issue with Craft 3.7.20: mutex lock errors from the queue. The error results in a Craft Commerce order email notification being sent over and over, indefinitely, until someone notices that the site has gone down and we have to restart MySQL.

This particular site was updated from 3.7.10 to 3.7.20 on November 11th, and we never experienced this issue prior to the update. We noticed there were some changes regarding mutex locking in Craft 3.7.15. We also saw this existing issue, but for us it doesn't seem to be related to a large number of queue jobs being queued up at once.

Our queue.log file(s) have these two lines repeated 34,000+ times over the span of 18 minutes. These are standard Craft Commerce email notifications.

2021-11-30 10:28:44 [-][160236][2pms2h72hic94sgfn88unfijgn][info][craft\queue\QueueLogBehavior::beforeExec]  [76032] Sending email for order #XXXX (attempt: 1, pid: 3810572) - Started
2021-11-30 10:28:44 [-][160236][2pms2h72hic94sgfn88unfijgn][info][craft\queue\QueueLogBehavior::afterExec]  [76032] Sending email for order #XXXX (attempt: 1, pid: 3810572) - Done (time: 0.058s)

Here is the stack trace for the mutex lock Exception that gets logged when someone tries to access the control panel (or maybe the site in general) while it is down:

2021-11-30 10:28:54 [-][124][dccgjhhrosmo4vvicht52m5esr][error][yii\base\Exception] yii\base\Exception: Could not acquire a mutex lock for the queue (queue). in /home/account/site-prod/vendor/craftcms/cms/src/queue/Queue.php:780
Stack trace:
#0 /home/account/site-prod/vendor/craftcms/cms/src/queue/Queue.php(684): craft\queue\Queue->_lock()
#1 /home/account/site-prod/vendor/craftcms/cms/src/queue/Queue.php(437): craft\queue\Queue->_moveExpired()
#2 /home/account/site-prod/vendor/twig/twig/src/Extension/CoreExtension.php(1511): craft\queue\Queue->getJobInfo()
#3 /home/account/site-prod/vendor/craftcms/cms/src/helpers/Template.php(106): twig_get_attribute()
#4 /home/account/site-prod/storage/runtime/compiled_templates/f9/f947432bf0169eafca0e04c60fc45ba9b66227c7d2ad7bbafac0a0cbb6e6de05.php(60): craft\helpers\Template::attribute()
#5 /home/account/site-prod/vendor/twig/twig/src/Template.php(405): __TwigTemplate_ae102c391269b7a8c64cc6fd2e478b83e27770d3e10df2b6d12eb4fd5d0e99dc->doDisplay()
#6 /home/account/site-prod/vendor/twig/twig/src/Template.php(378): Twig\Template->displayWithErrorHandling()
#7 /home/account/site-prod/storage/runtime/compiled_templates/5f/5fb3fceec3c04bc7054bc6f9fa4422cd731d07112af974760fa8154f13e05e0f.php(47): Twig\Template->display()
#8 /home/account/site-prod/vendor/twig/twig/src/Template.php(405): __TwigTemplate_4186714b0bdcd209bf9d9bc12c258c174ba9e6cda28cfa53f414f3c727f1a74a->doDisplay()
#9 /home/account/site-prod/vendor/twig/twig/src/Template.php(378): Twig\Template->displayWithErrorHandling()
#10 /home/account/site-prod/vendor/twig/twig/src/Template.php(390): Twig\Template->display()
#11 /home/account/site-prod/vendor/twig/twig/src/TemplateWrapper.php(45): Twig\Template->render()
#12 /home/account/site-prod/vendor/twig/twig/src/Environment.php(318): Twig\TemplateWrapper->render()
#13 /home/account/site-prod/vendor/craftcms/cms/src/web/View.php(408): Twig\Environment->render()
#14 /home/account/site-prod/vendor/craftcms/cms/src/web/View.php(461): craft\web\View->renderTemplate()
#15 /home/account/site-prod/vendor/craftcms/cms/src/web/Controller.php(201): craft\web\View->renderPageTemplate()
#16 /home/account/site-prod/vendor/craftcms/cms/src/controllers/DashboardController.php(142): craft\web\Controller->renderTemplate()
#17 [internal function]: craft\controllers\DashboardController->actionIndex()
#18 /home/account/site-prod/vendor/yiisoft/yii2/base/InlineAction.php(57): call_user_func_array()
#19 /home/account/site-prod/vendor/yiisoft/yii2/base/Controller.php(181): yii\base\InlineAction->runWithParams()
#20 /home/account/site-prod/vendor/yiisoft/yii2/base/Module.php(534): yii\base\Controller->runAction()
#21 /home/account/site-prod/vendor/craftcms/cms/src/web/Application.php(287): yii\base\Module->runAction()
#22 /home/account/site-prod/vendor/yiisoft/yii2/web/Application.php(104): craft\web\Application->runAction()
#23 /home/account/site-prod/vendor/craftcms/cms/src/web/Application.php(272): yii\web\Application->handleRequest()
#24 /home/account/site-prod/vendor/yiisoft/yii2/base/Application.php(392): craft\web\Application->handleRequest()
#25 /home/account/site-prod/public/index.php(24): yii\base\Application->run()
#26 {main}

Steps to reproduce

  1. We haven't figured out a rhyme or reason for when this happens, other than that it has occurred on occasional Craft Commerce order notifications (twice in the last week).

Additional info

Amazon S3 1.2.15
Cache Flag 1.2.4
CP Field Inspect 1.2.5
CP Filters 1.1.2
Craft Commerce 3.4.7
Embedded Assets 2.8.0
Google Maps 4.0.11
Guest Entries 2.4.0
Imager X v3.5.2
Inventory 2.1.1
Lab Reports 1.0.0
Link Vault 3.1.7
Minify 1.2.10
PayPal Checkout for Craft Commerce 1.2.2.2
Preparse Field 1.2.5
Reasons 2.2.5
reCAPTCHA 1.5.2
recaptchaguest 1.0.5
Redactor 2.8.8
Retour 3.1.60
SEOmatic 3.4.1
Stripe for Craft Commerce 2.4.0
Super Table 2.6.8

angrybrad commented 2 years ago

Do you have multiple queue runners for this site, or just one?

Are you using Craft's native web-based queue runner or have you customized it in config/app.php?

benjaminkohl commented 2 years ago

The site was only using the standard web-based queue runner but when I was looking into the issue, I disabled runQueueAutomatically in the config and set up the queue to run every minute via a cron job to see if that might help. That's generally how we configure sites anyway.
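
For reference, the change amounts to roughly the following (a minimal sketch; the cron cadence and paths below are placeholders, not our actual setup):

    // config/general.php -- stop web requests from kicking off the queue runner
    return [
        'runQueueAutomatically' => false,
    ];

    // Companion cron entry (placeholder paths), so the queue still runs every minute:
    // * * * * * /usr/bin/php /var/www/site/craft queue/run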

auralon commented 2 years ago

We are experiencing this issue too. We use the default queue runner.

The issue is apparent for us (and the client) when duplicating a Commerce product. At that point SEOmatic attempts to update the sitemap for the product type group, which triggers a queue run, which then hits the "Could not acquire a mutex lock for the queue (queue)." exception.

brandonkelly commented 2 years ago

@auralon Recent versions of SEOmatic no longer run the queue directly, except in one case (front-end sitemap request, when the sitemap needs to be generated). Can you make sure you are up-to-date?

brandonkelly commented 2 years ago

@benjaminkohl Same to you. The SEOmatic change was made in 3.4.18.

khalwat commented 2 years ago

Update to the latest version of SEOmatic. In 3.4.18, a change was made so that queues are never run manually in the circumstance you're describing: https://github.com/nystudio107/craft-seomatic/blob/develop/CHANGELOG.md#3418---20211111

Let me know if you still experience issues.

khalwat commented 2 years ago

Also in general, I'd suggest running Craft queues via a queue runner instead of via web request, as per: https://nystudio107.com/blog/robust-queue-job-handling-in-craft-cms

benjaminkohl commented 2 years ago

Does this mean that our issue was somehow related to the version of Seomatic we currently have installed on the site?

brandonkelly commented 2 years ago

@benjaminkohl That’s the theory.

khalwat commented 2 years ago

@benjaminkohl I haven't looked into it in depth, but I think it's a combination of SEOmatic trying to run the queue manually and changes in Craft that attempt to guard against queues being run simultaneously (thus the mutex lock error).

This issue should only happen if queues are run via the default web request method.

So it's something I probably shouldn't have been doing, but it wasn't an issue in terms of throwing a mutex lock error until some (good) changes were made in Craft.
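
For anyone wondering what "running the queue manually" looks like in plugin code, the difference is roughly this (an illustrative sketch, not SEOmatic's actual code; MyJob is a made-up stand-in for any craft\queue\BaseJob subclass):

    use Craft;

    // Push a job and let Craft's queue runner pick it up later (the safe pattern):
    Craft::$app->getQueue()->push(new MyJob());

    // Push a job and then force the queue to run inside the current web request --
    // the pattern that can now collide with Craft's queue mutex guard:
    Craft::$app->getQueue()->push(new MyJob());
    Craft::$app->getQueue()->run();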

The version of SEOmatic running on the site is from August 2021, so probably updating it is a good idea.

After talking with Brandon, I've further removed any queue running logic except via console command in: https://github.com/nystudio107/craft-seomatic/commit/eaac9ee6c0c9ed0cb2511ad187545d9fd7ebb0fc

You can try it now by setting the version constraint in your composer.json to look like this:

    "nystudio107/craft-seomatic": "dev-develop as 3.4.21”,

Then run composer clear-cache && composer update

benjaminkohl commented 2 years ago

Okay, thank you. I will look into trying that release with this site. I am sure we have a number of other sites running an even older version. It's hard to stay caught up on everything with so many sites.

khalwat commented 2 years ago

@benjaminkohl if it helps -> https://nystudio107.com/blog/updating-craft-cms-without-headaches

khalwat commented 2 years ago

To be clear, the current release version of SEOmatic should also fix the issue for you (assuming this is the issue), in case you don't want to update to the in-progress develop version.

angrybrad commented 2 years ago

@benjaminkohl going to go ahead and close this assuming that's the issue, but if you still hit it, comment back and we'll dig in further.

adrienne commented 2 years ago

We're also having this issue on a site that just went live, and we are on 3.4.20 of SEOmatic. I'll see if I can nail down more info.

khalwat commented 2 years ago

If you're still seeing this issue with the develop version of SEOmatic as discussed above, I think the issue may not be related to SEOmatic. That version never attempts to run the queue manually.

The only constant I'm seeing is that everyone is also running Commerce. Is that correct?

adrienne commented 2 years ago

@khalwat we are not running Commerce, and we are definitely still seeing the bug as of this morning. A complete system report for us is below just for everyone's reference:


khalwat commented 2 years ago

If you're using the develop version of SEOmatic, I can say with some confidence that it isn't SEOmatic-related.

adrienne commented 2 years ago

@khalwat @angrybrad Yeah, I am pretty sure it's not SEOmatic-related, but it is definitely an issue that someone needs to fix!

brandonkelly commented 2 years ago

@adrienne Probably best if you get in touch with us directly at support@craftcms.com. We can help look into it from there.

adrienne commented 2 years ago

@brandonkelly - happy to, but there are at least two other folks on the Discord also having the bug, FYI!

peteralewis commented 2 years ago

Any updates on this issue? The latest I can find on Discord coincides with what Andrew has stated above: it's doubtful that it's SEOmatic.

angrybrad commented 2 years ago

@peteralewis Any chance we can get SSH/Craft control panel access to the box you're seeing this on, along with some steps to reproduce? If so, can you contact us at support@craftcms.com?

peteralewis commented 2 years ago

The error is preventing any access to the CP for any of the users. I was about to SSH to the box and clear the queue via the command line to see if that helps. It cleared itself earlier and we got access, but it has tripped over again. I'll continue on email support.

howells commented 2 years ago

I've experienced this today with Craft 3.7.26: initially when editing something in SEOmatic 3.4.21, but then also when I edited entries as normal. I was only able to resolve it by restarting the server. If it happens again I'll contact support@, but it seems like something is up.

khalwat commented 2 years ago

@howells that version of SEOmatic never attempts to run the queue by itself; that logic has all been ripped out.

So it's unlikely related to the plugin, if that helps.

heymarkreeves commented 2 years ago

FWIW, this is in my Craft logs today on a site running Craft CMS 3.6.17, without SEOmatic installed. It looks like it fired on login and prevented admins from logging in to the control panel. This site was running the queue using systemd; it's possible something broke that and there was a big backlog to run.

timkelty commented 2 years ago

MySQL had bugs related to mutex locks that were not fixed until 5.7. It looks like both @adrienne and @benjaminkohl are on 5.5.5.

Can anyone experiencing this (@heymarkreeves @howells @peteralewis) post their MySQL (or Postgres) version?

peteralewis commented 2 years ago

MySQL 8.0.23. Hasn't repeated since the incident on the 10th.

heymarkreeves commented 2 years ago

I think this was on a server that shows it's running MySQL 8.0.27. (Arcustech Craft-optimized)

I had some other issues going on with a lot of queued up jobs cleaning up revisions.

brandonkelly commented 2 years ago

We have identified at least one culprit for these mutex errors: MySQL’s erratic locking behavior when coupled with DB transactions.

So we have dropped MysqlMutex as the default mutex driver in Craft 3.7.30. There’s no reliable multi-server alternative available in core, so we’re just back to FileMutex by default with a recommendation to replace it with something like yii\redis\Mutex on load-balanced environments. (See #10385 for a full explanation of the changes.)
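
If you're load-balanced and have the yiisoft/yii2-redis package installed, the override would look roughly like this in config/app.php (a sketch; the connection details are placeholders for your environment):

    // config/app.php
    return [
        'components' => [
            'mutex' => [
                // Swap the nested mutex driver for Redis
                'mutex' => [
                    'class' => yii\redis\Mutex::class,
                    'redis' => [
                        'hostname' => 'localhost', // placeholder
                        'port' => 6379,
                        'database' => 0,
                    ],
                ],
            ],
        ],
    ];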

jamesmacwhite commented 2 years ago

Just in case someone else comes across this, I want to flag a potential caveat/edge case with the file mutex now being the default. We had a Craft CMS site using memcache on our staging and live environments, but Redis locally on dev, which is a bit of a mix, simply because Nitro provides that capability effortlessly. The mutex component had not been modified in app.php since the 3.7.20 change.

On staging and live we saw constant mutex locking errors with the queue, but never on dev, no matter how much we tried to reproduce it. After lengthy tracing, it was determined that certain Craft::$app->getCache() and setCache() calls in a module (a custom FeedMe feed type) were falling back to the FileMutex, because the cached data exceeded the 1 MB limit of a single memcache key and therefore was not being cached in memcache on staging/live. This is very problematic on a load-balanced environment, and it was further compounded by our environments being on ephemeral storage, so there is zero assurance that file-based caching and locking can operate properly. As a result, the queue constantly triggered mutex errors whenever feeds using those calls were in use, and the Control Panel randomly displayed failure flash messages because of the mutex error being raised.

After deploying Redis across all environments and setting the mutex component to Redis globally, the issues were resolved. Furthermore, this allows data larger than 1 MB to be cached in Redis, which otherwise cannot be stored in memcache.
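
Concretely, the setup across all environments now looks roughly like this in config/app.php (a sketch; connection details are placeholders, and the nested mutex override is the same pattern mentioned earlier in the thread):

    return [
        'components' => [
            // Shared Redis connection used by both the cache and the mutex
            'redis' => [
                'class' => yii\redis\Connection::class,
                'hostname' => 'localhost', // placeholder
                'port' => 6379,
                'database' => 0,
            ],
            'cache' => [
                'class' => yii\redis\Cache::class,
                'defaultDuration' => 86400,
            ],
            'mutex' => [
                'mutex' => yii\redis\Mutex::class,
            ],
        ],
    ];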

So just be aware of memcache and its 1 MB limit, and of the fallback to the FileMutex happening without any real indication. This is possibly a bit of an edge case: probably not many people are calling getCache and setCache in a module for a custom FeedMe feed type while also holding data larger than 1 MB (e.g. a Guzzle HTTP response).

I appreciate that it was basically spelled out that Redis is recommended for load-balanced environments, although it isn't strictly required. Compared to memcache, though, Redis is probably the better caching solution overall: not least because you can set a reliable mutex, but also because it probably provides better caching performance in most cases (with the exception that memcache is multi-threaded, so in some scenarios it may not). Given that most of the benefits of Redis outweigh memcache, it looks like the better solution these days.

It also means you have a consistent environment from dev up to live, with the same caching mechanism in play, although we don't usually enable caching outside of production.

I'd imagine that if I set the mutex back to the MySQL one it might have worked as well, although I gather the whole intention behind the change was that MysqlMutex is itself a problem and could still trigger mutex locking errors anyway, perhaps just not as excessively as what we experienced.