michal-gubricky opened this issue 5 months ago
It seems the problem has somehow been fixed for now by PS. Take a look at the attached graphs below: for example, disk backend commit duration dropped from 9 s to a maximum of 1.5 s, and there was also a significant drop in total leader elections per day.
- Control plane nodes - I/O time (ms)
- Etcd cluster - Total leader elections (per day)
- Etcd cluster - Disk backend commit duration (seconds)
From my testing of the options listed in this issue: increasing the heartbeat interval and election timeout had no effect, and the pods kept restarting in the same way. Option 2 was not tested. The third option does seem to have an effect: after moving etcd to a separate disk, the total-leader-elections metric dropped and pod restarts became almost non-existent.
The OpenStackMachineTemplate for the control plane was adjusted as follows:
```yaml
apiVersion: infrastructure.cluster.x-k8s.io/v1alpha7
kind: OpenStackMachineTemplate
metadata:
  name: ${PREFIX}-${CLUSTER_NAME}-control-plane-${CONTROL_PLANE_MACHINE_GEN}
spec:
  template:
    spec:
      flavor: ${OPENSTACK_CONTROL_PLANE_MACHINE_FLAVOR}
      serverGroupID: ${OPENSTACK_SRVGRP_CONTROLLER}
      image: ${OPENSTACK_IMAGE_NAME}
      sshKeyName: ${OPENSTACK_SSH_KEY_NAME}
      cloudName: ${OPENSTACK_CLOUD}
      identityRef:
        name: ${CLUSTER_NAME}-cloud-config
        kind: Secret
      securityGroups:
        - name: ${PREFIX}-allow-ssh
        - name: ${PREFIX}-allow-icmp
        - name: ${PREFIX}-${CLUSTER_NAME}-cilium
      additionalBlockDevices:
        - name: etcd-volume
          sizeGiB: 10
          storage:
            type: Volume
```
Then, on each control-plane node, move the etcd data onto the new disk:

```shell
ssh ubuntu@<control-plane-node-public-ip> -i terraform/.deploy.id_rsa.gx-scs
sudo mkfs.ext4 /dev/sdb        # format the disk
sudo mkdir /mnt/etcd-data      # create a mount point
sudo mount /dev/sdb /mnt/etcd-data  # mount the disk
echo '/dev/sdb /mnt/etcd-data ext4 defaults 0 0' | sudo tee -a /etc/fstab  # ensure the mount persists across reboots
sudo systemctl stop kubelet
sudo cp -aR /var/lib/etcd/* /mnt/etcd-data/  # copy the existing etcd data
```

Next, change the data directory from /var/lib/etcd/ to /mnt/etcd-data/ in the etcd configuration (adjust it in /etc/kubernetes/manifests/etcd.yaml), start the kubelet again with `sudo systemctl start kubelet`, and check the etcd disk WAL fsync duration metric (it should drop within a few minutes).
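For reference, a minimal sketch of the /etc/kubernetes/manifests/etcd.yaml adjustment, assuming a standard kubeadm static-pod manifest (field names below follow that layout; they are not taken from this cluster). The simplest variant leaves `--data-dir` and the container's volumeMount untouched and only points the hostPath volume at the new disk:

```yaml
# Sketch only -- assumes the kubeadm default where etcd mounts a hostPath
# volume named etcd-data at /var/lib/etcd inside the container. Repointing
# the hostPath at the new mount means no etcd flags need to change.
spec:
  volumes:
    - name: etcd-data
      hostPath:
        path: /mnt/etcd-data   # was: /var/lib/etcd
        type: DirectoryOrCreate
```

If you change `--data-dir` instead, the corresponding volumeMount and hostPath must be updated consistently, otherwise etcd starts with an empty data directory.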
etcd uses Raft. The standard Raft implementation is a pure heartbeat-based protocol: if a follower does not hear from the leader within the election timeout, it assumes the leader is dead and starts an election. A slow fsync can therefore block a follower long enough to make it believe the leader is dead.
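To illustrate the failure mode, here is a minimal sketch in Go (the function name and numbers are illustrative, not etcd internals): a follower compares the time since the last processed heartbeat against its election timeout, so a disk stall that delays heartbeat processing past the timeout triggers an election.

```go
package main

import "fmt"

// shouldStartElection models the Raft follower check: if no heartbeat has
// been processed within the election timeout, the follower assumes the
// leader is dead and becomes a candidate. (Illustrative sketch, not etcd code.)
func shouldStartElection(sinceLastHeartbeatMs, electionTimeoutMs int) bool {
	return sinceLastHeartbeatMs > electionTimeoutMs
}

func main() {
	const electionTimeoutMs = 1000 // etcd's default election timeout is 1000 ms

	// A 9 s fsync stall (the worst case seen in the graphs) blocks heartbeat
	// processing far past the timeout, so the follower starts an election.
	fmt.Println(shouldStartElection(9000, electionTimeoutMs))

	// A healthy follower sees heartbeats roughly every 100 ms and stays calm.
	fmt.Println(shouldStartElection(90, electionTimeoutMs))
}
```

This is why moving the WAL to a faster disk reduces leader elections: it shrinks the fsync stalls that push followers past the timeout.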
KaaS is dysfunctional because of high latency in gx-scs, which causes the following problems:
- KaaS V1 e2e Zuul builds fail due to the high latency
Related metrics dashboards (observed from the Harbor cluster over the last 7 days, source: https://monitoring.scs.community/):
The issue will be resolved by the "s" flavors, SCS-2V-4-20s and SCS-4V-16-100s, which will be available "really soon" (based on info from Ralf (PS)).
In the meantime, we can try to work around the current issue as follows:
Until this issue is resolved, we should probably silence the etcd alerts in monitoring.