canonical / mysql-k8s-operator

A Charmed Operator for running MySQL on Kubernetes
https://charmhub.io/mysql-k8s
Apache License 2.0

mysql-k8s pod gets recreated on `update-status` (under heavy load): juju.worker.caasunitterminationworker terminating due to SIGTERM #254

Closed: taurus-forever closed this issue 1 year ago

taurus-forever commented 1 year ago

Steps to reproduce

1) Deploy mysql-k8s + mysql-router-k8s on GKE using https://discourse.charmhub.io/t/charmed-mysql-k8s-how-to-deploy-on-gke/10875
2) Run performance testing using https://discourse.charmhub.io/t/charmed-mysql-k8s-how-to-performance-test/11073

Expected behavior

The performance test must finish successfully.

Actual behavior

The test fails with an error:

FATAL: mysql_drv_query() returned error 2013 (Lost connection to MySQL server during query) for query 'INSERT INTO ..."

In effect, the MySQL K8s primary has gone away. K8s shows that the pod has been recreated (not just restarted):

> kubectl get pods -n mysql-k8s
NAME                             READY   STATUS    RESTARTS   AGE   IP 
mysql-k8s-0                      2/2     Running   0          11h   10.104.2.10
mysql-k8s-1                      2/2     Running   0          28m   10.104.1.12 <<<<<<<<<<<< 28m, all units deployed simultaneously!
mysql-k8s-2                      2/2     Running   0          11h   10.104.0.4

> kubectl describe pod mysql-k8s-1 -n mysql-k8s
...
Containers:
...
  mysql:
...
    State:          Running
      Started:      Mon, 10 Jul 2023 23:44:45 +0200
    Ready:          True
    Restart Count:  0  <<<<<<<<<<<<<<<<<<

The Juju debug log shows an attempt to run update-status on the MySQL primary node.

Versions

Operating system: Ubuntu 22.04
Juju CLI: 2.9.44
Juju agent: 2.9.44
Charm revision: 8.0/edge
Kubernetes: tested on GKE and microk8s v1.27.2 (rev 5372, 1.27/stable)

Log output

Juju debug log:

[...skipped non-mysql-k8s-0 log lines...]
unit-mysql-k8s-0: 12:39:14 DEBUG unit.mysql-k8s/0.juju-log ops 2.4.1 up and running.
unit-mysql-k8s-0: 12:39:14 DEBUG unit.mysql-k8s/0.juju-log load_ssl_context verify='/var/run/secrets/kubernetes.io/serviceaccount/ca.crt' cert=None trust_env=True http2=False
unit-mysql-k8s-0: 12:39:15 DEBUG unit.mysql-k8s/0.juju-log load_verify_locations cafile='/var/run/secrets/kubernetes.io/serviceaccount/ca.crt'
unit-mysql-k8s-0: 12:39:15 DEBUG unit.mysql-k8s/0.juju-log Invalid Loki alert rules folder at /var/lib/juju/agents/unit-mysql-k8s-0/charm/src/loki_alert_rules: directory does not exist
unit-mysql-k8s-0: 12:39:15 DEBUG unit.mysql-k8s/0.juju-log Emitting Juju event update_status.
unit-mysql-k8s-0: 12:39:15 INFO unit.mysql-k8s/0.juju-log Unit workload member-state is online with member-role primary
unit-mysql-k8s-0: 12:39:17 DEBUG unit.mysql-k8s/0.juju-log Getting cluster primary member's address from mysql-k8s-0.mysql-k8s-endpoints
unit-mysql-k8s-0: 12:39:19 DEBUG unit.mysql-k8s/0.juju-log Updating pod endpoint for mysql-k8s/0
[...skipped non-mysql-k8s-0 log lines...]
unit-mysql-k8s-0: 12:39:24 DEBUG unit.mysql-k8s/0.juju-log Labeling pod mysql-k8s-0 with label primary
unit-mysql-k8s-0: 12:39:24 DEBUG unit.mysql-k8s/0.juju-log connect_tcp.started host='10.108.0.1' port=443 local_address=None timeout=None socket_options=None
unit-mysql-k8s-0: 12:39:24 DEBUG unit.mysql-k8s/0.juju-log connect_tcp.complete return_value=<httpcore._backends.sync.SyncStream object at 0x7f67d0761f30>
unit-mysql-k8s-0: 12:39:24 DEBUG unit.mysql-k8s/0.juju-log start_tls.started ssl_context=<ssl.SSLContext object at 0x7f67d06fed40> server_hostname='10.108.0.1' timeout=None
unit-mysql-k8s-0: 12:39:24 DEBUG unit.mysql-k8s/0.juju-log start_tls.complete return_value=<httpcore._backends.sync.SyncStream object at 0x7f67d0762080>
unit-mysql-k8s-0: 12:39:24 DEBUG unit.mysql-k8s/0.juju-log send_request_headers.started request=<Request [b'GET']>
unit-mysql-k8s-0: 12:39:24 DEBUG unit.mysql-k8s/0.juju-log send_request_headers.complete
unit-mysql-k8s-0: 12:39:24 DEBUG unit.mysql-k8s/0.juju-log send_request_body.started request=<Request [b'GET']>
unit-mysql-k8s-0: 12:39:24 DEBUG unit.mysql-k8s/0.juju-log send_request_body.complete
unit-mysql-k8s-0: 12:39:24 DEBUG unit.mysql-k8s/0.juju-log receive_response_headers.started request=<Request [b'GET']>
unit-mysql-k8s-0: 12:39:24 DEBUG unit.mysql-k8s/0.juju-log receive_response_headers.complete return_value=(b'HTTP/1.1', 200, b'OK', [(b'Audit-Id', b'b4525a66-e88b-4136-8a34-9c8e8d0c86ad'), (b'Cache-Control', b'no-cache, private'), (b'Content-Type', b'application/json'), (b'X-Kubernetes-Pf-Flowschema-Uid', b'cadc6ce1-e5ad-4e8f-b9e7-62ca7fa65d1d'), (b'X-Kubernetes-Pf-Prioritylevel-Uid', b'74cb042e-3c8a-4780-9279-2a727532262b'), (b'Date', b'Mon, 10 Jul 2023 10:39:24 GMT'), (b'Transfer-Encoding', b'chunked')])
unit-mysql-k8s-0: 12:39:24 INFO unit.mysql-k8s/0.juju-log HTTP Request: GET https://10.108.0.1/api/v1/namespaces/mysql-k8s/pods/mysql-k8s-0 "HTTP/1.1 200 OK"
unit-mysql-k8s-0: 12:39:24 DEBUG unit.mysql-k8s/0.juju-log receive_response_body.started request=<Request [b'GET']>
unit-mysql-k8s-0: 12:39:24 DEBUG unit.mysql-k8s/0.juju-log receive_response_body.complete
unit-mysql-k8s-0: 12:39:24 DEBUG unit.mysql-k8s/0.juju-log response_closed.started
unit-mysql-k8s-0: 12:39:24 DEBUG unit.mysql-k8s/0.juju-log response_closed.complete
unit-mysql-k8s-0: 12:39:24 WARNING unit.mysql-k8s/0.juju-log No relation: certificates
unit-mysql-k8s-0: 12:39:25 INFO juju.worker.uniter.operation ran "update-status" hook (via hook dispatching script: dispatch)
[...skipped non-mysql-k8s-0 log lines...]
unit-mysql-k8s-0: 12:40:21 INFO juju.worker.caasunitterminationworker terminating due to SIGTERM <<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<
unit-mysql-k8s-0: 12:40:25 DEBUG unit.mysql-k8s/0.juju-log ops 2.4.1 up and running.
unit-mysql-k8s-0: 12:40:25 DEBUG unit.mysql-k8s/0.juju-log load_ssl_context verify='/var/run/secrets/kubernetes.io/serviceaccount/ca.crt' cert=None trust_env=True http2=False
unit-mysql-k8s-0: 12:40:25 DEBUG unit.mysql-k8s/0.juju-log load_verify_locations cafile='/var/run/secrets/kubernetes.io/serviceaccount/ca.crt'
unit-mysql-k8s-0: 12:40:25 DEBUG unit.mysql-k8s/0.juju-log Invalid Loki alert rules folder at /var/lib/juju/agents/unit-mysql-k8s-0/charm/src/loki_alert_rules: directory does not exist
unit-mysql-k8s-0: 12:40:25 DEBUG unit.mysql-k8s/0.juju-log Emitting Juju event stop.
unit-mysql-k8s-0: 12:40:26 INFO juju.worker.uniter.operation ran "stop" hook (via hook dispatching script: dispatch)
unit-mysql-k8s-0: 12:40:26 INFO juju.worker.uniter unit "mysql-k8s/0" shutting down: agent should be terminated

As you can see above, the caasunitterminationworker reports a SIGTERM about 1 minute after "update-status" ran. It looks like there is some internal timer/timeout involved.

For comparison, the normal update-status scenario (without the performance test):

unit-mysql-k8s-0: 23:46:32 INFO juju.worker.uniter.operation ran "update-status" hook (via hook dispatching script: dispatch)
...
unit-mysql-k8s-0: 23:46:36 DEBUG unit.mysql-k8s/0.juju-log ops 2.4.1 up and running.
unit-mysql-k8s-0: 23:46:36 DEBUG unit.mysql-k8s/0.juju-log load_ssl_context verify='/var/run/secrets/kubernetes.io/serviceaccount/ca.crt' cert=None trust_env=True http2=False
unit-mysql-k8s-0: 23:46:36 DEBUG unit.mysql-k8s/0.juju-log load_verify_locations cafile='/var/run/secrets/kubernetes.io/serviceaccount/ca.crt'
unit-mysql-k8s-0: 23:46:36 DEBUG unit.mysql-k8s/0.juju-log Invalid Loki alert rules folder at /var/lib/juju/agents/unit-mysql-k8s-0/charm/src/loki_alert_rules: directory does not exist
...
unit-mysql-k8s-0: 23:46:36 DEBUG unit.mysql-k8s/0.juju-log Emitting Juju event update_status.
unit-mysql-k8s-0: 23:46:37 INFO unit.mysql-k8s/0.juju-log Alex Lutay: return doing nothing

As you can see, it took 4 seconds to do nothing (I have muted the update_status charm block). The line "ops 2.4.1 up and running" is missing in the logs above. Did ops fail to launch due to the high unit/k8s/mysql load? 0_o

Additional context

It looks very similar to https://github.com/canonical/prometheus-k8s-operator/issues/426, which went nowhere... Note that we are using 2.9.44 with the stop-handling fixes, so https://bugs.launchpad.net/juju/+bug/1951415 is no longer a problem here!

The entire debug log with Juju 2.9 is here, see line 695 (SIGTERM time). The entire debug log with Juju 3.2 is here, see line 493 (SIGTERM time). Good ideas are welcome!

github-actions[bot] commented 1 year ago

https://warthogs.atlassian.net/browse/DPE-2225

taurus-forever commented 1 year ago

A valuable hint came from John in MM:

So the "terminating due to SIGTERM" is because we got a SIGTERM (I believe) not because juju is initiating that request.
I believe that means that K8s is telling us we need to be torn down (either because it is rescheduling, or a health check failed, or something else).

The last seconds of the MySQL error log:

2023-07-12T11:09:58.830318Z 0 [System] [MY-013172] [Server] Received SHUTDOWN from user <via user signal>. Shutting down mysqld (Version: 8.0.32-0ubuntu0.22.04.2).
2023-07-12T11:10:02.387287Z 0 [System] [MY-011504] [Repl] Plugin group_replication reported: 'Group membership changed: This member has left the group.'
ERROR "command terminated with exit code 137"

The K8s events show that the pod got killed due to lack of RAM:

> kubectl get events -w
LAST SEEN   TYPE      REASON                      OBJECT                                             MESSAGE
14s         Warning   EvictionThresholdMet        node/gke-taurus-20152-default-pool-dd74c68c-7cq0   Attempting to reclaim memory
12s         Normal    NodeHasInsufficientMemory   node/gke-taurus-20152-default-pool-dd74c68c-7cq0   Node gke-taurus-20152-default-pool-dd74c68c-7cq0 status is now: NodeHasInsufficientMemory

The node has 16 GB of RAM, but only ~12 GB of it is allocatable to pods:

> kubectl describe node gke-taurus-20152-default-pool-dd74c68c-7cq0
...
Capacity:
...
  memory:             15358168Ki
Allocatable:
...
  memory:             12658904Ki

Our default charm code sets innodb_buffer_pool_size to 75% of RAM (Capacity) = 11 GB, which is close to the 12 GB the node allows (Allocatable), and the node runs other services as well => BOOM:

juju ssh --container mysql mysql-k8s/0 bash
root@mysql-k8s-0:/# cat /etc/mysql/mysql.conf.d/z-custom.cnf 
[mysqld]
...
innodb_buffer_pool_size = 11811160064
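
A quick cross-check of those numbers (a rough sketch in Python, using only the values copied from the outputs above):

# Values taken from the kubectl/config outputs in this comment.
KIB = 1024
capacity = 15358168 * KIB      # Capacity.memory from `kubectl describe node`
allocatable = 12658904 * KIB   # Allocatable.memory from `kubectl describe node`
buffer_pool = 11811160064      # innodb_buffer_pool_size from z-custom.cnf (11 GiB)

print(capacity * 0.75 / 1024**3)            # ~10.99 GiB -> matches the configured 11 GiB
print(round(buffer_pool / allocatable, 2))  # ~0.91 -> the buffer pool alone is 91% of Allocatable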

As a result, MySQL grew above the limit and got killed:

root@mysql-k8s-0:/# top

top - 11:10:03 up 18:47,  0 users,  load average: 5.22, 6.06, 4.43
Tasks:  10 total,   1 running,   9 sleeping,   0 stopped,   0 zombie
%Cpu0  :   0.0/0.0     0[                                                                                                    ]
%Cpu1  :   7.1/0.0     7[|||||||                                                                                             ]
%Cpu2  :   0.0/0.0     0[                                                                                                    ]
%Cpu3  :   0.0/7.1     7[|||||||                                                                                             ]
MiB Mem :  14998.2 total,    361.0 free,  12576.1 used,   2061.0 buff/cache
MiB Swap:      0.0 total,      0.0 free,      0.0 used.   2049.0 avail Mem 

    PID USER      PR  NI    VIRT    RES    SHR S  %CPU  %MEM     TIME+ COMMAND                                                                                                                                                                         
    234 mysql     20   0   15.9g  11.5g  29560 S   7.1  78.3  34:13.28 /usr/sbin/mysqld --basedir=/usr --datadir=/var/lib/mysql --plugin-dir=/usr/lib/mysql/plugin --log-error=/var/log/mysql/error.log --pid-file=mysql-k8s-0.pid                     
   1584 root      20   0    7384   3284   2680 R   7.1   0.0   0:00.36 top    

TODO: calculate innodb_buffer_pool_size as 75% of Allocatable RAM on K8s.
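
A minimal sketch of what that TODO could look like, assuming the charm may query the K8s API via lightkube (the helper names are illustrative, not the charm's actual code):

from lightkube import Client
from lightkube.resources.core_v1 import Node

# Binary and decimal suffixes used by K8s resource quantities.
_SUFFIXES = {"Ki": 1024, "Mi": 1024**2, "Gi": 1024**3,
             "k": 10**3, "M": 10**6, "G": 10**9}

def quantity_to_bytes(quantity: str) -> int:
    """Convert a K8s memory quantity such as '12658904Ki' to bytes."""
    for suffix, factor in _SUFFIXES.items():
        if quantity.endswith(suffix):
            return int(quantity[: -len(suffix)]) * factor
    return int(quantity)

def allocatable_memory(node_name: str) -> int:
    """Allocatable memory (in bytes) of the node the unit is scheduled on."""
    node = Client().get(Node, name=node_name)
    return quantity_to_bytes(node.status.allocatable["memory"])

# e.g. innodb_buffer_pool_size = int(allocatable_memory(node_name) * 0.75)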

taurus-forever commented 1 year ago

TL;DR: @paulomach, please commit group_replication_message_cache_size=128MB for profile=testing.

As for profile=production, calculate innodb_buffer_pool_size as 75% of Allocatable RAM, then decrease it by the amount of the current group_replication_message_cache_size (1 GB).

Long version: MySQL InnoDB Group Replication has an internal XCom cache; read more here: https://dev.mysql.com/doc/refman/8.0/en/group-replication-performance-xcom-cache-reduce.html

What we saw in the standard deployment: the K8s node has 16 GB, Allocatable = 12 GB, the InnoDB buffer pool got 11 GB (75% of 16 GB), the XCom cache reached 1 GB, and the node killed the pod because it grew above the allocatable RAM.

In the constrained deployment: the pod got a 5 GB limit, the InnoDB buffer pool got 4 GB (75% of RAM), the XCom cache reached 1 GB, and mysqld was killed by the OOM killer:

4m4s        Warning   OOMKilling         node/gke-taurus-20152-default-pool-dd74c68c-wp6s   Memory cgroup out of memory: Killed process 68365 (mysqld) total-vm:8573716kB, anon-rss:5206312kB, file-rss:41072kB, shmem-rss:0kB, UID:584788 pgtables:10880kB oom_score_adj:659
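
A quick check of the OOM numbers (a sketch, assuming a 5 GiB pod limit): the killed mysqld's resident set sits right at that limit.

anon_rss_kib = 5206312   # anon-rss from the OOMKilling event above
file_rss_kib = 41072     # file-rss from the OOMKilling event above
print((anon_rss_kib + file_rss_kib) / 1024**2)  # ~5.0 GiB: 4 GiB buffer pool + ~1 GiB XCom cache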

We should consider tuning the XCom cache properly for K8s in the future, but for now let's keep the default values and decrease the InnoDB buffer pool size by the XCom amount (to keep free RAM for growth). Tnx!
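
A minimal sketch of the proposed sizing rule (hypothetical helper, not the charm's actual code): take 75% of the memory the pod can really use and reserve the XCom cache on top of it.

GIB = 1024 ** 3

def innodb_buffer_pool_size(available_bytes: int,
                            xcom_cache_bytes: int = 1 * GIB,
                            ratio: float = 0.75) -> int:
    """75% of the memory available to the pod, minus the group replication
    XCom cache, so mysqld keeps headroom under the cgroup/node limit."""
    return max(int(available_bytes * ratio) - xcom_cache_bytes, 0)

print(innodb_buffer_pool_size(5 * GIB) / GIB)   # constrained case: 2.75 GiB instead of 4 GiB
print(innodb_buffer_pool_size(12 * GIB) / GIB)  # standard case: 8.0 GiB instead of 11 GiB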

taurus-forever commented 1 year ago

@paulomach, testing on GKE shows way better stability results, but it still failed due to a killed model controller (separate/next topic). Please finish the draft to a merge-ready state and let's merge it into edge to continue testing. Tnx!