Closed by vmpjdc 2 days ago.
Based on the log, it appears that the resource-centre
plugin is causing a deadlock in the database. Could you please forward this information to the maintainer of resource-centre
for their input as well?
I believe the plugin is owned by the web and design team, and I've started a thread here: https://chat.canonical.com/canonical/pl/k7quc4dqipnhjgstu5xkz1mmty
I think I've taken this as far as I reasonably can: IS doesn't own the deployment, or the site being deployed, and the cloud and the k8s cluster, which we do own, appear to be working correctly.
I've silenced this for a week in AlertManager so there's no rush to progress this from my perspective.
Thanks, we will follow up with the web team if this happens again.
FYI, we have this alert constantly.
We limited the alert so that it only fires when the container restarts 3 times within 10 minutes (which means the liveness probe failed often enough to trigger 3 restarts); the condition is roughly sketched below.
It happened twice today, with logs similar to what was described in the original bug.
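For context, a restart-rate alert of that kind is roughly of the following shape. This is a sketch only: the rule name, labels, and the use of the kube-state-metrics restart counter are assumptions, not our actual rule.

groups:
  - name: wordpress-k8s-restarts            # group name assumed
    rules:
      - alert: WordpressContainerRestarting
        # Fires when a wordpress-k8s container has restarted 3 or more times
        # within the last 10 minutes.
        expr: increase(kube_pod_container_status_restarts_total{namespace="prod-admin-insights-ubuntu-k8s", pod=~"wordpress-k8s-.*"}[10m]) >= 3
        labels:
          severity: warning
        annotations:
          summary: wordpress-k8s container restarted 3+ times within 10 minutes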
I can see the liveness check is rather "aggressive":
prod-is-external-kubernetes@is-bastion-ps5:~$ kubectl -n prod-admin-insights-ubuntu-k8s describe pods wordpress-k8s-1 | grep Liveness
Liveness: http-get http://:38812/v1/health%3Flevel=alive delay=30s timeout=1s period=5s #success=1 #failure=1
Liveness: http-get http://:38813/v1/health%3Flevel=alive delay=30s timeout=1s period=5s #success=1 #failure=1
Indeed, check 0 of the charm's Pebble layer results in this:
prod-is-external-kubernetes@is-bastion-ps5:~$ kubectl -n prod-admin-insights-ubuntu-k8s get statefulset wordpress-k8s -o json | jq '.spec | .template| .spec | .containers | .[].livenessProbe'
{
  "failureThreshold": 1,
  "httpGet": {
    "path": "/v1/health?level=alive",
    "port": 38812,
    "scheme": "HTTP"
  },
  "initialDelaySeconds": 30,
  "periodSeconds": 5,
  "successThreshold": 1,
  "timeoutSeconds": 1
}
{
  "failureThreshold": 1,
  "httpGet": {
    "path": "/v1/health?level=alive",
    "port": 38813,
    "scheme": "HTTP"
  },
  "initialDelaySeconds": 30,
  "periodSeconds": 5,
  "successThreshold": 1,
  "timeoutSeconds": 1
}
This is probably a bit too aggressive. timeoutSeconds would be better at 3 or 5 seconds, and increasing periodSeconds accordingly seems like a good idea. So I would probably set "timeout: 3" or "timeout: 5" and "period: 10", or make this configurable.
Also, the "threshold" setting is misleading: it apparently only affects the "successThreshold" and not the "failureThreshold" (undefined above), which defaults to 3. All in all, I feel like the aggressiveness of the check is probably what is causing the failures.
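For illustration, in Pebble's layer syntax the relaxation proposed above would look roughly like this. This is a sketch only: the check name and the health-check URL are placeholders, not taken from the charm's actual layer.

checks:
  wordpress-alive:                 # check name assumed for illustration
    override: replace
    level: alive
    period: 10s                    # proposed: run the check every 10 seconds
    timeout: 3s                    # proposed: allow 3 (or 5) seconds per attempt
    threshold: 3                   # consecutive failures before Pebble marks the check as down
    http:
      url: http://localhost:80/    # placeholder; the charm's real health-check URL may differ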
I'm going to silence this till Friday so you have time to look at it.
The health checks in the k8s charms are controlled by Pebble, and the check parameters on the Kubernetes side are actually for the Pebble server. Therefore, the small failed threshold and timeout seconds are meant for Pebble health API requests, instead of WordPress health check requests. The actual health check parameters for WordPress are defined as you mentioned here, with a timeout of 5 seconds and a failure threshold of 3 (default).
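To make the layering concrete: the Kubernetes livenessProbe shown earlier only tells the kubelet to poll Pebble's own /v1/health?level=alive endpoint, while the WordPress-level parameters live in the charm's Pebble layer check, roughly of this form (a sketch based on Pebble's documented check fields; the check name and URL below are assumptions, not the charm's source):

checks:
  wordpress:                       # name assumed
    override: replace
    level: alive
    timeout: 5s                    # the 5-second timeout mentioned above
    threshold: 3                   # Pebble's default failure threshold
    http:
      url: http://localhost:80/    # placeholder for the actual WordPress health-check URL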
Do you have any monitoring information that you can share with us? For example, there's a request duration Prometheus metric which can indicate if the WordPress server is running slowly, and perhaps any WordPress Apache logs related to the failure in Loki?
Here is an extract of a failure that happened today and the relevant Apache logs around it:
@weiiwang01 can you follow-up on this and/or close the issue please?
I believe this has already been addressed in higher revisions of the wordpress-k8s
charm by this pull request, which adds configurable timeout values: https://github.com/canonical/wordpress-k8s-operator/pull/239.
I will close this for now; please reopen the issue if there are other problems after the upgrade.
Bug Description
We get frequent alerts due to the wordpress container restarting. This was addressed by https://github.com/canonical/wordpress-k8s-operator/pull/135 and an upgrade to r46 of the charm, but either this didn't solve the problem or a new one has arisen.
To Reproduce
Deploy the charm.
Environment
prod-is-external-kubernetes@is-bastion-ps5
Relevant log output
Additional context
No response