canonical / wordpress-k8s-operator

wordpress-k8s-operator - charm repository.
Apache License 2.0
13 stars 7 forks source link

wordpress container restarts #138

Closed vmpjdc closed 2 days ago

vmpjdc commented 1 year ago

Bug Description

We get frequent alerts due to the wordpress container restarting. This was addressed by https://github.com/canonical/wordpress-k8s-operator/pull/135 and an upgrade to r46 of the charm, but either this didn't solve the problem or a new one has arisen.

To Reproduce

Deploy the charm.

Environment

prod-is-external-kubernetes@is-bastion-ps5

Relevant log output

prod-is-external-kubernetes@is-bastion-ps5:~$ kubectl logs -n prod-admin-insights-ubuntu-k8s wordpress-k8s-0 wordpress -p | tail -n 20
2023-09-27T01:21:43.634Z [wordpress] 10.86.80.0 - - [27/Sep/2023:01:21:42 +0000] "GET / HTTP/1.1" 200 23722 "-" "haproxy/httpchk"
2023-09-27T01:21:43.916Z [wordpress] 10.86.11.0 - - [27/Sep/2023:01:21:42 +0000] "GET / HTTP/1.1" 200 23722 "-" "haproxy/httpchk"
2023-09-27T01:21:44.054Z [wordpress] 10.86.64.1 - - [27/Sep/2023:01:21:42 +0000] "GET / HTTP/1.1" 200 23722 "-" "haproxy/httpchk"
2023-09-27T01:21:44.915Z [wordpress] 10.86.93.0 - - [27/Sep/2023:01:21:43 +0000] "GET / HTTP/1.1" 200 23722 "-" "haproxy/httpchk"
2023-09-27T01:21:45.652Z [wordpress] 10.86.64.1 - - [27/Sep/2023:01:21:43 +0000] "GET / HTTP/1.1" 200 23722 "-" "haproxy/httpchk"
2023-09-27T01:21:45.828Z [wordpress] 10.86.11.0 - - [27/Sep/2023:01:21:43 +0000] "GET / HTTP/1.1" 200 23722 "-" "haproxy/httpchk"
2023-09-27T01:21:45.953Z [wordpress] 10.86.11.0 - - [27/Sep/2023:01:21:44 +0000] "GET / HTTP/1.1" 200 23722 "-" "haproxy/httpchk"
2023-09-27T01:21:46.165Z [wordpress] 10.86.11.0 - - [27/Sep/2023:01:21:44 +0000] "GET / HTTP/1.1" 200 23722 "-" "haproxy/httpchk"
2023-09-27T01:21:47.107Z [wordpress] 127.0.0.1 - - [27/Sep/2023:01:21:47 +0000] "OPTIONS * HTTP/1.0" 200 126 "-" "Apache/2.4.41 (Ubuntu) OpenSSL/1.1.1f (internal dummy connection)"
2023-09-27T01:21:47.297Z [wordpress] 10.86.73.0 - - [27/Sep/2023:01:21:45 +0000] "GET / HTTP/1.1" 200 23722 "-" "haproxy/httpchk"
2023-09-27T01:21:50.450Z [wordpress] 127.0.0.1 - - [27/Sep/2023:01:21:50 +0000] "GET /index.php HTTP/1.1" 301 217 "-" "Go-http-client/1.1"
2023-09-27T01:21:51.625Z [wordpress] localhost - - [27/Sep/2023:01:21:51 +0000] "GET /server-status/?auto HTTP/1.1" 200 753 "-" "Go-http-client/1.1"
2023-09-27T01:21:53.037Z [pebble] Check "wordpress-ready" failure 3 (threshold 3): Get "http://localhost/": context deadline exceeded
2023-09-27T01:21:53.037Z [pebble] Check "wordpress-ready" failure threshold 3 hit, triggering action
2023-09-27T01:21:55.757Z [wordpress] [Wed Sep 27 01:21:55.757389 2023] [php7:notice] [pid 333] [client 10.86.73.0:34576] WordPress database error Deadlock found when trying to get lock; try restarting transaction for query INSERT INTO `wp_options` (`option_name`, `option_value`, `autoload`) VALUES ('_transient_timeout_global_styles_resource-centre', '1695777770', 'no') ON DUPLICATE KEY UPDATE `option_name` = VALUES(`option_name`), `option_value` = VALUES(`option_value`), `autoload` = VALUES(`autoload`) made by require('wp-blog-header.php'), require_once('wp-includes/template-loader.php'), include('/themes/resource-centre/index.php'), get_header, locate_template, load_template, require_once('/themes/resource-centre/header.php'), wp_head, do_action('wp_head'), WP_Hook->do_action, WP_Hook->apply_filters, wp_enqueue_scripts, do_action('wp_enqueue_scripts'), WP_Hook->do_action, WP_Hook->apply_filters, wp_enqueue_global_styles, wp_get_global_stylesheet, set_transient, add_option
2023-09-27T01:21:55.757Z [wordpress] [Wed Sep 27 01:21:55.757649 2023] [php7:notice] [pid 339] [client 10.86.80.0:56874] WordPress database error Deadlock found when trying to get lock; try restarting transaction for query INSERT INTO `wp_options` (`option_name`, `option_value`, `autoload`) VALUES ('_transient_timeout_global_styles_resource-centre', '1695777771', 'no') ON DUPLICATE KEY UPDATE `option_name` = VALUES(`option_name`), `option_value` = VALUES(`option_value`), `autoload` = VALUES(`autoload`) made by require('wp-blog-header.php'), require_once('wp-includes/template-loader.php'), include('/themes/resource-centre/index.php'), get_header, locate_template, load_template, require_once('/themes/resource-centre/header.php'), wp_head, do_action('wp_head'), WP_Hook->do_action, WP_Hook->apply_filters, wp_enqueue_scripts, do_action('wp_enqueue_scripts'), WP_Hook->do_action, WP_Hook->apply_filters, wp_enqueue_global_styles, wp_get_global_stylesheet, set_transient, add_option
2023-09-27T01:21:55.762Z [wordpress] [Wed Sep 27 01:21:55.761759 2023] [php7:notice] [pid 340] [client 10.86.80.0:56882] WordPress database error Deadlock found when trying to get lock; try restarting transaction for query INSERT INTO `wp_options` (`option_name`, `option_value`, `autoload`) VALUES ('_transient_timeout_global_styles_resource-centre', '1695777771', 'no') ON DUPLICATE KEY UPDATE `option_name` = VALUES(`option_name`), `option_value` = VALUES(`option_value`), `autoload` = VALUES(`autoload`) made by require('wp-blog-header.php'), require_once('wp-includes/template-loader.php'), include('/themes/resource-centre/index.php'), get_header, locate_template, load_template, require_once('/themes/resource-centre/header.php'), wp_head, do_action('wp_head'), WP_Hook->do_action, WP_Hook->apply_filters, wp_enqueue_scripts, do_action('wp_enqueue_scripts'), WP_Hook->do_action, WP_Hook->apply_filters, wp_enqueue_global_styles, wp_get_global_stylesheet, set_transient, add_option
2023-09-27T01:21:57.395Z [pebble] Exiting on terminated signal.
2023-09-27T01:21:57.399Z [pebble] Stopping all running services.
2023-09-27T01:21:57.525Z [pebble] Service "wordpress" stopped
prod-is-external-kubernetes@is-bastion-ps5:~$ kubectl logs -n prod-admin-insights-ubuntu-k8s wordpress-k8s-1 wordpress -p | tail -n 20
2023-09-27T00:56:09.429Z [wordpress] 10.86.93.0 - - [27/Sep/2023:00:56:03 +0000] "GET / HTTP/1.1" 200 15465 "-" "haproxy/httpchk"
2023-09-27T00:56:09.561Z [wordpress] 10.86.93.0 - - [27/Sep/2023:00:55:49 +0000] "GET / HTTP/1.1" 200 23722 "-" "haproxy/httpchk"
2023-09-27T00:56:09.597Z [wordpress] localhost - - [27/Sep/2023:00:56:09 +0000] "GET /server-status/?auto HTTP/1.1" 200 765 "-" "Go-http-client/1.1"
2023-09-27T00:56:09.636Z [wordpress] 10.86.64.0 - - [27/Sep/2023:00:56:03 +0000] "GET / HTTP/1.1" 200 23722 "-" "haproxy/httpchk"
2023-09-27T00:56:09.638Z [wordpress] 10.86.73.0 - - [27/Sep/2023:00:56:03 +0000] "GET / HTTP/1.1" 200 23722 "-" "haproxy/httpchk"
2023-09-27T00:56:09.749Z [wordpress] 10.86.11.1 - - [27/Sep/2023:00:56:04 +0000] "GET / HTTP/1.1" 200 23722 "-" "haproxy/httpchk"
2023-09-27T00:56:09.954Z [wordpress] 10.86.80.0 - - [27/Sep/2023:00:56:04 +0000] "GET / HTTP/1.1" 200 23722 "-" "haproxy/httpchk"
2023-09-27T00:56:10.081Z [wordpress] 127.0.0.1 - - [27/Sep/2023:00:56:10 +0000] "OPTIONS * HTTP/1.0" 200 126 "-" "Apache/2.4.41 (Ubuntu) OpenSSL/1.1.1f (internal dummy connection)"
2023-09-27T00:56:10.175Z [wordpress] 10.86.64.0 - - [27/Sep/2023:00:56:05 +0000] "GET / HTTP/1.1" 200 23722 "-" "haproxy/httpchk"
2023-09-27T00:56:11.090Z [wordpress] 127.0.0.1 - - [27/Sep/2023:00:56:11 +0000] "OPTIONS * HTTP/1.0" 200 126 "-" "Apache/2.4.41 (Ubuntu) OpenSSL/1.1.1f (internal dummy connection)"
2023-09-27T00:56:11.327Z [wordpress] 10.86.64.0 - - [27/Sep/2023:00:56:07 +0000] "GET / HTTP/1.1" 200 23722 "-" "haproxy/httpchk"
2023-09-27T00:56:11.499Z [wordpress] 127.0.0.1 - - [27/Sep/2023:00:56:10 +0000] "GET /index.php HTTP/1.1" 301 217 "-" "Go-http-client/1.1"
2023-09-27T00:56:11.573Z [wordpress] 10.86.73.0 - - [27/Sep/2023:00:56:06 +0000] "GET / HTTP/1.1" 200 23722 "-" "haproxy/httpchk"
2023-09-27T00:56:12.160Z [wordpress] 10.86.11.1 - - [27/Sep/2023:00:56:07 +0000] "GET / HTTP/1.1" 200 23722 "-" "haproxy/httpchk"
2023-09-27T00:56:12.507Z [wordpress] 10.86.93.0 - - [27/Sep/2023:00:56:07 +0000] "GET / HTTP/1.1" 200 23722 "-" "haproxy/httpchk"
2023-09-27T00:56:13.214Z [pebble] Check "wordpress-ready" failure 3 (threshold 3): Get "http://localhost/": context deadline exceeded
2023-09-27T00:56:13.214Z [pebble] Check "wordpress-ready" failure threshold 3 hit, triggering action
2023-09-27T00:56:13.513Z [pebble] Exiting on terminated signal.
2023-09-27T00:56:13.518Z [pebble] Stopping all running services.
2023-09-27T00:56:13.678Z [pebble] Service "wordpress" stopped
prod-is-external-kubernetes@is-bastion-ps5:~$

Additional context

No response

weiiwang01 commented 1 year ago

Based on the log, it appears that the resource-centre plugin is causing a deadlock in the database. Could you please forward this information to the maintainer of resource-centre for their input as well?

vmpjdc commented 1 year ago

I believe the plugin is owned by the web and design team, and I've started a thread here: https://chat.canonical.com/canonical/pl/k7quc4dqipnhjgstu5xkz1mmty

I think I've taken this as far as I reasonably can: IS doesn't own the deployment, or the site being deployed, and the cloud and the k8s cluster, which we do own, appear to be working correctly.

I've silenced this for a week in AlertManager so there's no rush to progress this from my perspective.

weiiwang01 commented 1 year ago

Thanks, we will follow up with the web team if this happens again.

ben-ballot commented 2 months ago

FYI, we have this alert constantly.

We limited the alert to be fired only when a restart happens 3 times under 10 minutes (which means that the liveness probe failed enough to trigger 3 restarts).

It happened twice today, with similar log to what was described in the original bug.

I can see the liveness check is rather "aggressive":

prod-is-external-kubernetes@is-bastion-ps5:~$ kubectl -n prod-admin-insights-ubuntu-k8s describe pods wordpress-k8s-1 | grep Liveness
    Liveness:       http-get http://:38812/v1/health%3Flevel=alive delay=30s timeout=1s period=5s #success=1 #failure=1
    Liveness:       http-get http://:38813/v1/health%3Flevel=alive delay=30s timeout=1s period=5s #success=1 #failure=1

Indeed, the charm pebble layer check 0 results in this

prod-is-external-kubernetes@is-bastion-ps5:~$ kubectl -n prod-admin-insights-ubuntu-k8s get statefulset wordpress-k8s -o json | jq '.spec | .template| .spec | .containers | .[].livenessProbe'
{
  "failureThreshold": 1,
  "httpGet": {
    "path": "/v1/health?level=alive",
    "port": 38812,
    "scheme": "HTTP"
  },
  "initialDelaySeconds": 30,
  "periodSeconds": 5,
  "successThreshold": 1,
  "timeoutSeconds": 1
}
{
  "failureThreshold": 1,
  "httpGet": {
    "path": "/v1/health?level=alive",
    "port": 38813,
    "scheme": "HTTP"
  },
  "initialDelaySeconds": 30,
  "periodSeconds": 5,
  "successThreshold": 1,
  "timeoutSeconds": 1
}

This is probably a bit too aggressive. timeoutSeconds is probably better to 3 or 5 secondes, and increasing the periodSeconds accordingly seems a goot thing. So I would probably add a "timeout: 3" or "timeout: 5" and "period: 10". Or make this configurable.

Also, the "threshold" setting is misleading, it's apparently only affecting the "successThreshold" and not the "failureThreshold" (undefined above) that defaults to 3. All in all, I feel like the agressiveness of the check is probably what is causing the failures.

I'm going to silence till Friday so you have time to look at it.

weiiwang01 commented 2 months ago

FYI, we have this alert constantly.

We limited the alert to be fired only when a restart happens 3 times under 10 minutes (which means that the liveness probe failed enough to trigger 3 restarts).

It happened twice today, with similar log to what was described in the original bug.

I can see the liveness check is rather "aggressive":

prod-is-external-kubernetes@is-bastion-ps5:~$ kubectl -n prod-admin-insights-ubuntu-k8s describe pods wordpress-k8s-1 | grep Liveness
    Liveness:       http-get http://:38812/v1/health%3Flevel=alive delay=30s timeout=1s period=5s #success=1 #failure=1
    Liveness:       http-get http://:38813/v1/health%3Flevel=alive delay=30s timeout=1s period=5s #success=1 #failure=1

Indeed, the charm pebble layer check 0 results in this

prod-is-external-kubernetes@is-bastion-ps5:~$ kubectl -n prod-admin-insights-ubuntu-k8s get statefulset wordpress-k8s -o json | jq '.spec | .template| .spec | .containers | .[].livenessProbe'
{
  "failureThreshold": 1,
  "httpGet": {
    "path": "/v1/health?level=alive",
    "port": 38812,
    "scheme": "HTTP"
  },
  "initialDelaySeconds": 30,
  "periodSeconds": 5,
  "successThreshold": 1,
  "timeoutSeconds": 1
}
{
  "failureThreshold": 1,
  "httpGet": {
    "path": "/v1/health?level=alive",
    "port": 38813,
    "scheme": "HTTP"
  },
  "initialDelaySeconds": 30,
  "periodSeconds": 5,
  "successThreshold": 1,
  "timeoutSeconds": 1
}

This is probably a bit too aggressive. timeoutSeconds is probably better to 3 or 5 secondes, and increasing the periodSeconds accordingly seems a goot thing. So I would probably add a "timeout: 3" or "timeout: 5" and "period: 10". Or make this configurable.

Also, the "threshold" setting is misleading, it's apparently only affecting the "successThreshold" and not the "failureThreshold" (undefined above) that defaults to 3. All in all, I feel like the agressiveness of the check is probably what is causing the failures.

I'm going to silence till Friday so you have time to look at it.

The health checks in the k8s charms are controlled by Pebble, and the check parameters on the Kubernetes side are actually for the Pebble server. Therefore, the small failed threshold and timeout seconds are meant for Pebble health API requests, instead of WordPress health check requests. The actual health check parameters for WordPress are defined as you mentioned here, with a timeout of 5 seconds and a failure threshold of 3 (default).

Do you have any monitoring information that you can share with us? For example, there's a request duration Prometheus metric which can indicate if the WordPress server is running slowly, and perhaps any WordPress Apache logs related to the failure in Loki?

ben-ballot commented 2 months ago

Here is an extract of a failure that happened today and the relevant apache logs around it:

cbartz commented 5 days ago

@weiiwang01 can you follow-up on this and/or close the issue please?

weiiwang01 commented 2 days ago

i believe this has already been addressed in higher revisions of the wordpress-k8s charm by this pull request, which adds configurable timeout values: https://github.com/canonical/wordpress-k8s-operator/pull/239.

i will close this for now; please reopen the issue if there are other problems after the upgrade.