michal-gubricky opened this issue 5 months ago
It seems the problem has somehow been fixed for now by PS. Take a look at the attached graphs below: for example, disk backend commit duration dropped from 9 s to a maximum of 1.5 s, and there was also a significant drop in total leader elections per day.
- Control plane nodes - I/O time (ms)
- Etcd cluster - Total leader elections (per day)
- Etcd cluster - Disk backend commit duration (seconds)
From my testing of the options listed in this issue: increasing the heartbeat interval and election timeout had no effect, and the pods kept restarting in the same way. Option 2 was not tested. The third option does seem to have an effect: after moving etcd to a separate disk, the total-leader-elections metric dropped and pod restarts became almost non-existent.
The OpenStackMachineTemplate for the control plane was adjusted as follows:
```yaml
apiVersion: infrastructure.cluster.x-k8s.io/v1alpha7
kind: OpenStackMachineTemplate
metadata:
  name: ${PREFIX}-${CLUSTER_NAME}-control-plane-${CONTROL_PLANE_MACHINE_GEN}
spec:
  template:
    spec:
      flavor: ${OPENSTACK_CONTROL_PLANE_MACHINE_FLAVOR}
      serverGroupID: ${OPENSTACK_SRVGRP_CONTROLLER}
      image: ${OPENSTACK_IMAGE_NAME}
      sshKeyName: ${OPENSTACK_SSH_KEY_NAME}
      cloudName: ${OPENSTACK_CLOUD}
      identityRef:
        name: ${CLUSTER_NAME}-cloud-config
        kind: Secret
      securityGroups:
        - name: ${PREFIX}-allow-ssh
        - name: ${PREFIX}-allow-icmp
        - name: ${PREFIX}-${CLUSTER_NAME}-cilium
      additionalBlockDevices:
        - name: etcd-volume
          sizeGiB: 10
          storage:
            type: Volume
```
Then, on each control-plane node, move the etcd data onto the new disk:

```shell
ssh ubuntu@<control-plane-node-public-ip> -i terraform/.deploy.id_rsa.gx-scs
sudo mkfs.ext4 /dev/sdb        # format the disk
sudo mkdir /mnt/etcd-data      # create a mount point
sudo mount /dev/sdb /mnt/etcd-data  # mount the disk
echo '/dev/sdb /mnt/etcd-data ext4 defaults 0 0' | sudo tee -a /etc/fstab  # ensure the mount persists across reboots
sudo systemctl stop kubelet
sudo cp -aR /var/lib/etcd/* /mnt/etcd-data/  # copy the existing etcd data
```

Next, change the data directory from /var/lib/etcd/ to /mnt/etcd-data/ in the etcd configuration (adjust it in /etc/kubernetes/manifests/etcd.yaml), start the kubelet again with `sudo systemctl start kubelet`, and check the etcd disk WAL fsync duration metric (it should drop within a few minutes).
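For reference, a minimal sketch of the /etc/kubernetes/manifests/etcd.yaml adjustment, assuming a standard kubeadm static-pod manifest (field names below follow that layout; they are not taken from this cluster). The simplest variant leaves `--data-dir` and the container's volumeMount untouched and only points the hostPath volume at the new disk:

```yaml
# Sketch only -- assumes the kubeadm default where etcd mounts a hostPath
# volume named etcd-data at /var/lib/etcd inside the container. Repointing
# the hostPath at the new mount means no etcd flags need to change.
spec:
  volumes:
    - name: etcd-data
      hostPath:
        path: /mnt/etcd-data   # was: /var/lib/etcd
        type: DirectoryOrCreate
```

If you change `--data-dir` instead, the corresponding volumeMount and hostPath must be updated consistently, otherwise etcd starts with an empty data directory.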
etcd uses Raft. The standard Raft implementation is a pure heartbeat-based protocol: if a follower does not hear from the leader within the election timeout, it assumes the leader is dead and starts an election. A slow fsync can therefore block a follower long enough to make it believe the leader is dead.
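To illustrate the failure mode, here is a minimal sketch in Go (the function name and numbers are illustrative, not etcd internals): a follower compares the time since the last processed heartbeat against its election timeout, so a disk stall that delays heartbeat processing past the timeout triggers an election.

```go
package main

import "fmt"

// shouldStartElection models the Raft follower check: if no heartbeat has
// been processed within the election timeout, the follower assumes the
// leader is dead and becomes a candidate. (Illustrative sketch, not etcd code.)
func shouldStartElection(sinceLastHeartbeatMs, electionTimeoutMs int) bool {
	return sinceLastHeartbeatMs > electionTimeoutMs
}

func main() {
	const electionTimeoutMs = 1000 // etcd's default election timeout is 1000 ms

	// A 9 s fsync stall (the worst case seen in the graphs) blocks heartbeat
	// processing far past the timeout, so the follower starts an election.
	fmt.Println(shouldStartElection(9000, electionTimeoutMs))

	// A healthy follower sees heartbeats roughly every 100 ms and stays calm.
	fmt.Println(shouldStartElection(90, electionTimeoutMs))
}
```

This is why moving the WAL to a faster disk reduces leader elections: it shrinks the fsync stalls that push followers past the timeout.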
KaaS is dysfunctional because of high latency in gx-scs, which causes the following problems:
- KaaS V1 e2e Zuul builds fail due to the high latency
Related metrics dashboards (observed from the Harbor cluster over the last 7 days, source: https://monitoring.scs.community/):
The issue will be resolved by the "s" flavors, SCS-2V-4-20s and SCS-4V-16-100s, which will be available "really soon" (based on info from Ralf (PS)).
In the meantime, we can try to work around the current issue as follows:
Until this issue is resolved, we should probably silence the etcd alerts in monitoring.