craftcms / cms

Build bespoke content experiences with Craft.
https://craftcms.com

[3.x]: Timeout releasing Mutex Lock in a load-balanced environment #11658

Closed thkus closed 6 months ago

thkus commented 2 years ago

What happened?

Description

We are running Craft CMS in an Azure Kubernetes cluster. To maintain sessions we use an instance of Azure's Redis. Whenever we trigger a new deployment via our pipeline, a companion Pod starts as a Job and runs ./craft up.

At first everything was working fine, but after some time running the application in production without any issues, we noticed that ./craft up runs for a very long time and then throws an exception (see below). This suggests it might also be connected to the amount of data. It seems that the lock for craft-up is not being released. The Redis metrics and logs show no sign of issues.

As another consequence, the failing task left our volume corrupted, which caused a compiled template to be created over and over again because the file could no longer be deleted due to a stale delete flag. The only way to fix this was to create a new mount.

Do you have any idea what might cause this? For now we have switched from running ./craft up to executing each step individually, which seems to work fine. Our Redis configuration follows the article on how to run Craft CMS in load-balanced environments.
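For reference, the Redis-backed mutex/cache/session setup from that article looks roughly like the sketch below. Treat it as a sketch, not our literal production config: the hostname, env var names, SSL flag, and the 900-second mutex expiry are placeholder assumptions.

<?php
// config/app.php — rough sketch of the Redis-backed components described in
// the load-balanced-environments article. Values and env var names here are
// placeholders, not literal production settings.

use craft\helpers\App;

return [
    'components' => [
        // Shared Redis connection (Azure Cache for Redis uses SSL on port 6380)
        'redis' => [
            'class' => yii\redis\Connection::class,
            'hostname' => App::env('REDIS_HOSTNAME'),
            'port' => 6380,
            'password' => App::env('REDIS_PASSWORD'),
            'useSSL' => true,
        ],
        // Cache backed by Redis
        'cache' => [
            'class' => yii\redis\Cache::class,
            'defaultDuration' => 86400,
        ],
        // Craft's mutex component wraps an inner Redis mutex;
        // 'expire' (seconds) is an assumed value, not a recommendation.
        'mutex' => [
            'mutex' => [
                'class' => yii\redis\Mutex::class,
                'expire' => 900,
            ],
        ],
        // Sessions stored in Redis
        'session' => [
            'class' => yii\redis\Session::class,
            'as session' => craft\behaviors\SessionBehavior::class,
        ],
    ],
];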

➜  git:(main) ✗ kubectl logs -n craftcms jobs.batch/craftcms-migrate-e1e86842eba4a3bf6d0fb0523804045e9646249f
Found 2 pods, using pod/craftcms-migrate-e1e86842eba4a3bf6d0fb0523804045e9646249f-czbcq
🔒 Acquiring lock ... done

Checking for pending Craft and plugin migrations ...
No new migrations found. Your system is up to date.

Applying changes from your project config files ...
- updating entryTypes.ff2b1278-c26d-4225-a123-d80c2064b102.fieldLayouts.2d0fe877-06a2-4a62-85b2-2dfa274ebaa5.tabs.0.elements.1 ...
  - updating entryTypes.ff2b1278-c26d-4225-a123-d80c2064b102 ... done
  done
- updating entryTypes.ff2b1278-c26d-4225-a123-d80c2064b102.fieldLayouts.2d0fe877-06a2-4a62-85b2-2dfa274ebaa5.tabs.0.elements.2 ... done
- updating entryTypes.ff2b1278-c26d-4225-a123-d80c2064b102.fieldLayouts.2d0fe877-06a2-4a62-85b2-2dfa274ebaa5.tabs.0.elements.3 ... done
- updating users.groups.1fef851a-bd46-4777-ae00-dae7421c481d.permissions ...
  - updating users.groups.1fef851a-bd46-4777-ae00-dae7421c481d ... done
  - updating users.groups.ff6ebfca-f4b6-4d1b-aac9-82fac548bbe3 ... done
  done
- updating users.groups.ff6ebfca-f4b6-4d1b-aac9-82fac548bbe3.permissions ... done
- updating dateModified ... done
- adding globalSets.c3b071a0-1d9a-4445-bf53-0b51ccd28fdb.fieldLayouts.844e1376-e27d-4c9e-8931-7aa5160f2964.tabs.0.elements.0 ...
  - adding globalSets.c3b071a0-1d9a-4445-bf53-0b51ccd28fdb ... done
  done
- adding globalSets.c3b071a0-1d9a-4445-bf53-0b51ccd28fdb.fieldLayouts.844e1376-e27d-4c9e-8931-7aa5160f2964.tabs.0 ... done
- updating meta.__names__ ... done
Finished applying changes

Checking for pending content migrations ...
No new migrations found. Your system is up-to-date.

Error: Failed to read from socket.
Redis command was: EVAL if redis.call("GET",KEYS[1])==ARGV[1] then
    return redis.call("DEL",KEYS[1])
else
    return 0
end 1 39e163799360c1e938270ea668ec756fef1cb 3gUUw52QXxS0QE_4vri1
PHP Fatal error:  Uncaught yii\redis\SocketException: Failed to write to socket. 0 of 201 bytes written.
Redis command was: *5
$4
EVAL
$101
if redis.call("GET",KEYS[1])==ARGV[1] then
    return redis.call("DEL",KEYS[1])
else
    return 0
end
$1
1
$37
39e163799360c1e938270ea668ec756fef1cb
$20
3gUUw52QXxS0QE_4vri1
 in /app/vendor/yiisoft/yii2-redis/src/Connection.php:827
Stack trace:
#0 /app/vendor/yiisoft/yii2-redis/src/Connection.php(796): yii\redis\Connection->sendRawCommand('*5\r\n$4\r\nEVAL\r\n$...', Array)
#1 /app/vendor/yiisoft/yii2-redis/src/Mutex.php(146): yii\redis\Connection->executeCommand('EVAL', Array)
#2 /app/vendor/yiisoft/yii2/mutex/Mutex.php(88): yii\redis\Mutex->releaseLock('craft-up')
#3 /app/vendor/craftcms/cms/src/mutex/Mutex.php(75): yii\mutex\Mutex->release('craft-up')
#4 /app/vendor/yiisoft/yii2/mutex/Mutex.php(88): craft\mutex\Mutex->releaseLock('craft-up')
#5 /app/vendor/craftcms/cms/src/mutex/MutexTrait.php(82): yii\mutex\Mutex->release('craft-up')
#6 /app/vendor/yiisoft/yii2/mutex/Mutex.php(57): craft\mutex\Mutex->release('craft-up')
#7 [internal function]: yii\mutex\Mutex->yii\mutex\{closure}()
#8 {main}
Additional Information:
Array
(
)

  thrown in /app/vendor/yiisoft/yii2-redis/src/Connection.php on line 827
PHP Warning:  Unknown: SSL: Connection reset by peer in Unknown on line 0
🔓 Releasing lock ... %  
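For context, the EVAL in the trace is the compare-and-delete script yii\redis\Mutex runs when it releases a lock: the key is only deleted if it still holds the token written when the lock was acquired. The sketch below just restates that exchange using the redis component directly; the key and token values are placeholders (the mutex derives the real ones internally from the lock name). The failure above happens one step earlier, because the TCP/TLS connection was already gone by the time this command was written to the socket.

<?php
// Illustration only (not code from Craft or yii2-redis): what the failing
// release amounts to. $lockKey and $token are hypothetical placeholders.
/** @var yii\redis\Connection $redis */
$redis = Craft::$app->get('redis');

$lockKey = 'craft-up';             // placeholder; the mutex derives the real key from the lock name
$token   = '3gUUw52QXxS0QE_4vri1'; // placeholder token set at acquire time

$script = <<<'LUA'
if redis.call("GET",KEYS[1])==ARGV[1] then
    return redis.call("DEL",KEYS[1])
else
    return 0
end
LUA;

// EVAL <script> <numkeys> <key> <token> — deletes the key only if the stored
// token still matches ours, so another process's lock is never freed.
$released = (bool) $redis->executeCommand('EVAL', [$script, 1, $lockKey, $token]);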

Steps to reproduce

Unfortunately, we don't really know how to reproduce this.

Expected behavior

./craft up executes, releases the lock, and does not cause downtime.

Actual behavior

./craft up blocks and causes downtime on every redeploy.

Craft CMS version

3.7.48

PHP version

8.0.20

Operating system and version

No response

Database type and version

MariaDB 10.3

Image driver and version

No response

Installed plugins and versions

angrybrad commented 2 years ago

It almost sounds like it's bumping into a Redis connection time limit, then Redis kills the connection and you get the "Error: Failed to read from socket." error.

I'm not that familiar with Azure's Redis implementation, but do you see any settings that would let you increase that?

https://redis.io/docs/reference/clients/#client-timeouts

https://redis.io/docs/reference/clients/#client-eviction

thkus commented 2 years ago

I checked the settings and the docs; the default timeout is 10 minutes, which correlates with what we are seeing: usually after 10 minutes the exception is thrown and the Job finishes. I assume 10 minutes should be sufficient in general, right?

https://techcommunity.microsoft.com/t5/azure-paas-blog/azure-redis-timeouts-client-side-issues/ba-p/2022203#:~:text=Azure%20Cache%20for%20Redis%20service,side%20and%20cannot%20be%20changed.
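Since the server-side idle timeout apparently can't be changed, one thing we might try on the client side (untested, and assuming our yii2-redis version supports the setting) is letting the connection retry a command after a socket failure; from reading the yii2-redis Connection code, it appears to close and reopen the socket before retrying, which would cover a connection dropped while idle:

<?php
// config/app.php (sketch, untested) — retry a Redis command once if the
// socket was dropped while idle, instead of failing immediately.
return [
    'components' => [
        'redis' => [
            'class' => yii\redis\Connection::class,
            'hostname' => getenv('REDIS_HOSTNAME'),
            'port' => 6380,
            'password' => getenv('REDIS_PASSWORD'),
            'useSSL' => true,
            // Number of retries after a socket failure; newer yii2-redis
            // versions also expose a retryInterval setting.
            'retries' => 1,
        ],
    ],
];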

angrybrad commented 2 years ago

Hrm... are there any sort of Redis logs exposed to you that might hold some clues when it happens?

thkus commented 2 years ago

Unfortunately there is no sign of any issue in the logs. But it somehow seems to be related to the part of ./craft up where the cache is cleared. We also opened a support ticket with Microsoft, but they didn't find any issue either.

angrybrad commented 2 years ago

It might help to remove some variables from the equation to narrow down the issue, i.e. take the Azure Kubernetes cluster out of the picture and get the site running on a plain VM (I'm not sure what Azure calls them) with the same database and Redis setup, and see if you get the same behavior?

thkus commented 2 years ago

We are running this kind of setup in different environments (Docker, Docker Swarm, Kubernetes with Rancher). The only difference here is that Azure's Redis is involved, so it probably boils down to that in the end, although we don't really have an explanation why. The only idea I can come up with is that there might be some sort of race condition.

So if you also have no idea what might be causing this, I guess there is nothing we can really do right now.

angrybrad commented 6 months ago

Going to go ahead and close this as stale, but comment back, OP, if anything came of it, and we can re-open as necessary.