In this run (revision 1224) the apiserver fell over after consuming too much memory; see the top output below.
top - 11:52:27 up 2:35, 6 users, load average: 209.30, 200.36, 175.74
Tasks: 233 total, 8 running, 176 sleeping, 0 stopped, 1 zombie
%Cpu(s): 4.3 us, 95.4 sy, 0.0 ni, 0.0 id, 0.2 wa, 0.0 hi, 0.0 si, 0.1 st
KiB Mem : 99.8/16424260 [ ]
KiB Swap: 0.0/0 [ ]
PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
27609 root 20 0 10.883g 9.253g 0 S 5.9 59.1 7:34.26 /snap/microk8s/1224/kube-apiserver --cert-dir=/var/snap/microk8s/1224/certs --service-cluster-ip-range=10.152.183.0/24 --authorization-mode=AlwaysAllow --basic-auth-file=/var/snap/microk8s/1224/credent+
30368 ubuntu 20 0 5830132 2.348g 0 S 29.7 15.0 4:30.34 /usr/lib/jvm/jre-1.8.0-openjdk/bin/java -Xms2g -Xmx2g -XX:+UseConcMarkSweepGC -XX:CMSInitiatingOccupancyFraction=75 -XX:+UseCMSInitiatingOccupancyOnly -XX:+AlwaysPreTouch -server -Xss1m -Djava.awt.head+
6691 ubuntu 20 0 4792136 1.296g 0 S 17.3 8.3 4:41.96 /opt/jdk-11.0.1/bin/java -Xms1g -Xmx1g -XX:+UseConcMarkSweepGC -XX:CMSInitiatingOccupancyFraction=75 -XX:+UseCMSInitiatingOccupancyOnly -XX:+AlwaysPreTouch -Xss1m -Djava.awt.headless=true -Dfile.encodi+
23899 ubuntu 20 0 2347996 1.143g 0 D 9.2 7.3 4:33.22 /usr/share/kibana/bin/../node/bin/node --no-warnings /usr/share/kibana/bin/../src/cli --cpu.cgroup.path.override=/ --cpuacct.cgroup.path.override=/ --elasticsearch.url=http://elasticsearch-logging:9200+
25255 nobody 20 0 469004 273172 0 S 7.2 1.7 2:28.74 /bin/prometheus --config.file=/etc/prometheus/prometheus.yml --storage.tsdb.path=/prometheus --storage.tsdb.retention=2d --storage.tsdb.no-lockfile --web.enable-lifecycle --web.route-prefix=/
21888 root 20 0 2470220 74916 0 S 7.2 0.5 1:44.04 /usr/bin/ruby2.3 /usr/local/bin/fluentd --no-supervisor -q
24734 root 20 0 265672 65156 0 S 10.1 0.4 1:34.83 /usr/bin/ruby2.3 /usr/local/bin/fluentd --no-supervisor -q
1302 root 20 0 1799300 52444 0 S 11.1 0.3 8:52.26 /snap/microk8s/1224/kubelet --kubeconfig=/var/snap/microk8s/1224/credentials/kubelet.config --cert-dir=/var/snap/microk8s/1224/certs --client-ca-file=/var/snap/microk8s/1224/certs/ca.crt --anonymous-au+
25910 root 20 0 1896808 49384 0 S 3.3 0.3 1:48.70 cilium-agent --config-dir=/tmp/cilium/config-map
24351 root 20 0 2686168 40028 0 S 0.7 0.2 5:00.86 /snap/microk8s/1224/bin/containerd --config /var/snap/microk8s/1224/args/containerd.toml --root /var/snap/microk8s/common/var/lib/containerd --state /var/snap/microk8s/common/run/containerd --address /+
6820 www-data 20 0 379388 39972 168 R 1.3 0.2 0:07.00 nginx: worker process
6819 www-data 20 0 379388 39968 168 R 1.0 0.2 0:06.76 nginx: worker process
6821 www-data 20 0 379388 39964 168 R 0.7 0.2 0:06.63 nginx: worker process
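To see whether the growth is gradual or a sudden spike, a loop like the following can record the apiserver RSS over time (a rough sketch; the log path and interval are arbitrary):

```
# Append a timestamped kube-apiserver RSS (KiB) sample every 30 seconds.
while true; do
  echo "$(date --iso-8601=seconds) $(ps -C kube-apiserver -o rss= | head -n1)" >> /tmp/apiserver-rss.log
  sleep 30
done
```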
In this run (revision 1224) the apiserver fell over and stopped accepting CRDs. That might be a symptom rather than the cause.
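A quick way to confirm that symptom is to create a throwaway CRD and wait for it to become Established (a sketch; probes.example.com is just a placeholder name, and the v1beta1 API matches the 1.17 apiserver in this build):

```
cat <<'EOF' | microk8s.kubectl apply -f -
apiVersion: apiextensions.k8s.io/v1beta1
kind: CustomResourceDefinition
metadata:
  name: probes.example.com
spec:
  group: example.com
  scope: Namespaced
  names:
    plural: probes
    singular: probe
    kind: Probe
  versions:
  - name: v1
    served: true
    storage: true
EOF
microk8s.kubectl wait --for condition=established --timeout=60s crd/probes.example.com
```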
This run is also with revision 1224.
This is something we see often, a "database is locked" error:
Feb 28 14:30:08 ip-172-31-29-124 microk8s.daemon-apiserver[6789]: , Header: map[Content-Type:[text/plain; charset=utf-8] X-Content-Type-Options:[nosniff]]
Feb 28 14:30:08 ip-172-31-29-124 microk8s.daemon-apiserver[6789]: I0228 14:30:08.132505 6789 controller.go:127] OpenAPI AggregationController: action for item v1beta1.custom.metrics.k8s.io: Rate Limited Requeue.
Feb 28 14:30:49 ip-172-31-29-124 microk8s.daemon-apiserver[6789]: I0228 14:30:49.856155 6789 trace.go:116] Trace[294739315]: "SQL Commit" (started: 2020-02-28 14:30:46.847885843 +0000 UTC m=+644.396058439) (total time: 3.00823271s):
Feb 28 14:30:49 ip-172-31-29-124 microk8s.daemon-apiserver[6789]: Trace[294739315]: [3.00823271s] [3.007355697s] END
Feb 28 14:30:49 ip-172-31-29-124 microk8s.daemon-apiserver[6789]: I0228 14:30:49.856186 6789 trace.go:116] Trace[25102652]: "GuaranteedUpdate etcd3: *core.Endpoints" (started: 2020-02-28 14:30:46.847652948 +0000 UTC m=+644.395825540) (total time: 3.00852049s):
Feb 28 14:30:49 ip-172-31-29-124 microk8s.daemon-apiserver[6789]: Trace[25102652]: [3.00852049s] [3.00829551s] END
Feb 28 14:30:49 ip-172-31-29-124 microk8s.daemon-apiserver[6789]: E0228 14:30:49.856227 6789 status.go:71] apiserver received an error that is not an metav1.Status: protocol.Error{Code:5, Message:"database is locked"}
Feb 28 14:30:49 ip-172-31-29-124 microk8s.daemon-apiserver[6789]: I0228 14:30:49.856418 6789 trace.go:116] Trace[69458220]: "Update" url:/api/v1/namespaces/kube-system/endpoints/kube-scheduler,user-agent:kube-scheduler/v1.17.3 (linux/amd64) kubernetes/ba1468d/leader-e
lection,client:127.0.0.1 (started: 2020-02-28 14:30:46.847545316 +0000 UTC m=+644.395717907) (total time: 3.00885046s):
Feb 28 14:30:49 ip-172-31-29-124 microk8s.daemon-apiserver[6789]: Trace[69458220]: [3.00885046s] [3.008783483s] END
Feb 28 14:30:50 ip-172-31-29-124 microk8s.daemon-apiserver[6789]: I0228 14:30:50.453746 6789 trace.go:116] Trace[1817270438]: "SQL Commit" (started: 2020-02-28 14:30:48.180777646 +0000 UTC m=+645.728950290) (total time: 2.272940733s):
Feb 28 14:30:50 ip-172-31-29-124 microk8s.daemon-apiserver[6789]: Trace[1817270438]: [2.272934786s] [2.257838015s] op key /registry/masterleases/172.31.29.124 op 2
Feb 28 14:30:50 ip-172-31-29-124 microk8s.daemon-apiserver[6789]: I0228 14:30:50.453803 6789 trace.go:116] Trace[1610631168]: "GuaranteedUpdate etcd3: *v1.Endpoints" (started: 2020-02-28 14:30:48.171034105 +0000 UTC m=+645.719206760) (total time: 2.282755749s):
Feb 28 14:30:50 ip-172-31-29-124 microk8s.daemon-apiserver[6789]: Trace[1610631168]: [2.282735349s] [2.272998623s] Transaction committed
Feb 28 14:30:50 ip-172-31-29-124 microk8s.daemon-apiserver[6789]: I0228 14:30:50.464375 6789 trace.go:116] Trace[1191678931]: "SQL Commit" (started: 2020-02-28 14:30:47.83715896 +0000 UTC m=+645.385331543) (total time: 2.627176352s):
Feb 28 14:30:50 ip-172-31-29-124 microk8s.daemon-apiserver[6789]: Trace[1191678931]: [2.627170732s] [2.626375994s] op key /registry/services/endpoints/kube-system/kube-controller-manager op 2
Feb 28 14:30:50 ip-172-31-29-124 microk8s.daemon-apiserver[6789]: I0228 14:30:50.464444 6789 trace.go:116] Trace[185374398]: "GuaranteedUpdate etcd3: *core.Endpoints" (started: 2020-02-28 14:30:47.836918235 +0000 UTC m=+645.385090827) (total time: 2.627510605s):
Feb 28 14:30:50 ip-172-31-29-124 microk8s.daemon-apiserver[6789]: Trace[185374398]: [2.627477223s] [2.627242442s] Transaction committed
Feb 28 14:30:50 ip-172-31-29-124 microk8s.daemon-apiserver[6789]: I0228 14:30:50.464527 6789 trace.go:116] Trace[469694742]: "Update" url:/api/v1/namespaces/kube-system/endpoints/kube-controller-manager,user-agent:kube-controller-manager/v1.17.3 (linux/amd64) kubernet
es/ba1468d/leader-election,client:127.0.0.1 (started: 2020-02-28 14:30:47.836787033 +0000 UTC m=+645.384977655) (total time: 2.627700092s):
Feb 28 14:30:50 ip-172-31-29-124 microk8s.daemon-apiserver[6789]: Trace[469694742]: [2.62765364s] [2.627588513s] Object stored in database
Feb 28 14:30:51 ip-172-31-29-124 microk8s.daemon-apiserver[6789]: E0228 14:30:51.858730 6789 status.go:71] apiserver received an error that is not an metav1.Status: driver: bad connection
Feb 28 14:30:51 ip-172-31-29-124 microk8s.daemon-apiserver[6789]: get current revision
Feb 28 14:30:51 ip-172-31-29-124 microk8s.daemon-apiserver[6789]: k8s.io/kubernetes/vendor/github.com/freeekanayaka/kvsql/db.currentRevision
Feb 28 14:30:51 ip-172-31-29-124 microk8s.daemon-apiserver[6789]: /root/microk8s/parts/microk8s/install/go/src/github.com/kubernetes/kubernetes/_output/local/go/src/k8s.io/kubernetes/vendor/github.com/freeekanayaka/kvsql/db/kv.go:375
Feb 28 14:30:51 ip-172-31-29-124 microk8s.daemon-apiserver[6789]: k8s.io/kubernetes/vendor/github.com/freeekanayaka/kvsql/db.listTx
Feb 28 14:30:51 ip-172-31-29-124 microk8s.daemon-apiserver[6789]: /root/microk8s/parts/microk8s/install/go/src/github.com/kubernetes/kubernetes/_output/local/go/src/k8s.io/kubernetes/vendor/github.com/freeekanayaka/kvsql/db/kv.go:101
Feb 28 14:30:51 ip-172-31-29-124 microk8s.daemon-apiserver[6789]: k8s.io/kubernetes/vendor/github.com/freeekanayaka/kvsql/db.(*DB).List.func1
Feb 28 14:30:51 ip-172-31-29-124 microk8s.daemon-apiserver[6789]: /root/microk8s/parts/microk8s/install/go/src/github.com/kubernetes/kubernetes/_output/local/go/src/k8s.io/kubernetes/vendor/github.com/freeekanayaka/kvsql/db/kv.go:83
Feb 28 14:30:51 ip-172-31-29-124 microk8s.daemon-apiserver[6789]: k8s.io/kubernetes/vendor/github.com/freeekanayaka/kvsql/db.(*DB).tx.func1
Feb 28 14:30:51 ip-172-31-29-124 microk8s.daemon-apiserver[6789]: /root/microk8s/parts/microk8s/install/go/src/github.com/kubernetes/kubernetes/_output/local/go/src/k8s.io/kubernetes/vendor/github.com/freeekanayaka/kvsql/db/db.go:67
Feb 28 14:30:51 ip-172-31-29-124 microk8s.daemon-apiserver[6789]: k8s.io/kubernetes/vendor/github.com/freeekanayaka/kvsql/db.retry
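To get a rough feel for how frequently these errors show up over a run, grepping the apiserver journal works (a sketch, assuming the usual snap service naming):

```
# Count "database is locked" and "driver: bad connection" lines since the last boot.
journalctl -b -u snap.microk8s.daemon-apiserver --no-pager \
  | grep -cE 'database is locked|driver: bad connection'
```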
This time the apiserver memory consumption on VM3 was too high:
top - 14:42:07 up 5:25, 5 users, load average: 88.49, 100.34, 60.33
Tasks: 259 total, 2 running, 207 sleeping, 0 stopped, 1 zombie
%Cpu(s): 0.6 us, 54.7 sy, 0.0 ni, 0.0 id, 44.2 wa, 0.0 hi, 0.1 si, 0.4 st
%Node0 : 0.6 us, 54.7 sy, 0.0 ni, 0.0 id, 44.2 wa, 0.0 hi, 0.2 si, 0.4 st
KiB Mem : 16424260 total, 154636 free, 16101404 used, 168220 buff/cache
KiB Swap: 0 total, 0 free, 0 used. 70972 avail Mem
PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
4091 root 20 0 13.016g 0.011t 0 S 20.9 73.1 4:37.18 /snap/microk8s/1224/kube-apiserver --cert-dir=/var/snap/microk8s/1224/certs --service-cluster-ip-range=10.152.183.0/24 --authorization-mode=AlwaysAllow --basic-auth-file=/var/snap/microk8s/1224/credent+
13506 ubuntu 20 0 4800036 1.293g 0 S 0.6 8.3 1:52.00 /opt/jdk-11.0.1/bin/java -Xms1g -Xmx1g -XX:+UseConcMarkSweepGC -XX:CMSInitiatingOccupancyFraction=75 -XX:+UseCMSInitiatingOccupancyOnly -XX:+AlwaysPreTouch -Xss1m -Djava.awt.headless=true -Dfile.encodi+
27051 ubuntu 20 0 1908576 817140 0 D 0.3 5.0 1:03.64 /usr/share/kibana/bin/../node/bin/node --no-warnings /usr/share/kibana/bin/../src/cli --cpu.cgroup.path.override=/ --cpuacct.cgroup.path.override=/ --elasticsearch.url=http://elasticsearch-logging:9200+
28458 nobody 20 0 412368 364364 0 S 1.5 2.2 0:37.01 /bin/prometheus --config.file=/etc/prometheus/prometheus.yml --storage.tsdb.path=/prometheus --storage.tsdb.retention=2d --storage.tsdb.no-lockfile --web.enable-lifecycle --web.route-prefix=/
29449 root 20 0 2628552 154240 0 S 0.6 0.9 0:21.18 java -Djava.security.egd=file:/dev/./urandom -cp . org.springframework.boot.loader.PropertiesLauncher
25620 root 20 0 2599324 76096 0 S 0.6 0.5 0:10.12 /usr/bin/ruby2.3 /usr/local/bin/fluentd --no-supervisor -q
27922 root 20 0 256456 54048 0 S 0.6 0.3 0:05.45 /usr/bin/ruby2.3 /usr/local/bin/fluentd --no-supervisor -q
8466 root 20 0 2167768 43468 0 S 15.3 0.3 4:19.13 /snap/microk8s/1224/kubelet --kubeconfig=/var/snap/microk8s/1224/credentials/kubelet.config --cert-dir=/var/snap/microk8s/1224/certs --client-ca-file=/var/snap/microk8s/1224/certs/ca.crt --anonymous-au+
14073 www-data 20 0 379452 40216 472 S 0.0 0.2 0:00.63 nginx: worker process
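When the apiserver gets into this state it would also be useful to grab a heap profile before it is killed; a sketch, assuming profiling is still enabled (the default) on this build:

```
# Dump the apiserver heap profile through the API and show the top allocation sites.
microk8s.kubectl get --raw /debug/pprof/heap > /tmp/apiserver-heap.pprof
go tool pprof -top /tmp/apiserver-heap.pprof
```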
This is addressed.
@freeekanayaka here are a few data points we can use when we start working on the issues we found with MicroK8s+dqlite.
The setup:
- sudo snap install microk8s --classic --channel=edge/test-dqlite
- microk8s.add-node and microk8s.join on the other VMs to form the cluster
- sudo microk8s.enable registry dns dashboard prometheus jaeger fluentd knative
- then microk8s.reset

Usually 1-2 repetitions is enough to get the cluster into a bad state. The log tarballs are produced with microk8s.inspect. VM1 is the one we use as the client.

@tvansteenburgh @stgraber addressing this issue is important for this integration.
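For reference, the enable/reset cycle can be scripted roughly like this (a sketch of the steps above, not the exact harness; the repetition count is a placeholder and cluster joining is still done manually per VM):

```
#!/bin/bash
# Rough reproduction loop: install from the dqlite test channel, enable the add-ons,
# wait for the cluster to settle, then reset and repeat.
set -euo pipefail

sudo snap install microk8s --classic --channel=edge/test-dqlite
# microk8s.add-node prints the join command to run with microk8s.join on the other VMs.

for i in 1 2; do
  sudo microk8s.enable registry dns dashboard prometheus jaeger fluentd knative
  microk8s.status --wait-ready
  sudo microk8s.reset
done
```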