Closed thkus closed 6 months ago
It almost sounds like it's bumping into a Redis connection time limit, then Redis kills the connection and you get the "Error: Failed to read from socket." error.
Not that familiar with Azure's Redis implementation, but do you see any settings around being able to increase that?
I checked the settings and the docs; the default timeout is 10 minutes, which correlates with what we are seeing: usually after 10 minutes the exception is thrown and the Job finishes. I assume 10 minutes should be sufficient in general, right?
Hrm... are there any sort of Redis logs exposed to you that might hold some clues when it happens?
Unfortunately there is no sign of any issue in the logs. But it somehow seems to be related to the part where the cache is cleared when running `./craft up`. We also opened a support ticket with Microsoft, but they didn't find any issue either.
Might help to remove some variables from the equation to try and narrow down the issue? i.e. remove the Azure Kubernetes cluster and get the site running on a straight VM (I'm not sure what Azure calls them) + the same database + Redis setup, and see if you get the same behavior?
We are running this kind of setup in different environments (Docker, Docker Swarm, Kubernetes with Rancher). The only difference here is that Azure's Redis is involved, so it probably boils down to that in the end, although we don't really have an explanation why. The only idea I can come up with is that there might be some sort of race condition.
So if you also have no idea what might be causing this, I guess there is nothing we can really do right now.
Going to go ahead and close this as stale, but comment back OP if anything came from it, and we can re-open as necessary.
What happened?
Description
We are running Craft CMS in an Azure Kubernetes cluster. To maintain sessions we use an instance of Azure's Redis. When we trigger a new deployment via our pipeline, a companion Pod starts as a Job to run `./craft up`.

At first everything was working fine, but after running the application in production for some time without any issues we noticed that `./craft up` runs for a very long time and then throws an exception (see below). This indicates that it might also be connected to the amount of data. It seems that the lock for `craft-up` is not being released. The Redis metrics and logs show no sign of issues.

As another consequence, the failing task left our volume corrupted, which caused a compiled template to be created over and over again because it was no longer possible to delete the file due to a stale delete flag. The only way to fix this was to create a new mount.

Do you have any idea what might cause this? As a workaround we have now switched from using `./craft up` to executing each step individually, which seems to work fine. Our Redis configuration is as described in the article on how to run Craft CMS in load-balanced environments.

Steps to reproduce

Unfortunately we don't really know how to make this reproducible.
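For context, the workaround mentioned in the description (running the steps of `./craft up` individually) might look roughly like the sketch below. In Craft 3, `craft up` essentially wraps `migrate/all` and `project-config/apply`; the exact commands and flags should be verified against your Craft version.

```shell
#!/bin/sh
# Sketch: run the steps of `./craft up` as separate commands, so each
# acquires and releases its own mutex lock instead of one long-lived
# lock for the whole `up` run. (Assumes Craft 3 console commands.)
set -e

# Apply all pending Craft, plugin, and content migrations
./craft migrate/all --interactive=0

# Apply any pending project config changes
./craft project-config/apply --interactive=0
```

Running these as separate Job steps also makes it easier to see which phase is the one that hangs when the Redis connection gets dropped.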
Expected behavior
`./craft up` executes and releases the lock, not causing a downtime.

Actual behavior

`./craft up` blocks and causes a downtime on every re-deploy.

Craft CMS version
3.7.48
PHP version
8.0.20
Operating system and version
No response
Database type and version
MariaDB 10.3
Image driver and version
No response
Installed plugins and versions