SovereignCloudStack / issues

This repository is used for issues that are cross-repository or not bound to a specific repository.
https://github.com/orgs/SovereignCloudStack/projects/6

KaaS is dysfunctional because of high latency in gx-scs #588

Open michal-gubricky opened 5 months ago

michal-gubricky commented 5 months ago

KaaS is dysfunctional because of high latency in gx-scs, which causes the following problems:

In the meantime, we can try to work around the current issue as follows:

Until this issue is resolved, we should probably silence etcd alerts in monitoring.
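One way to silence the alerts temporarily is to create a time-limited silence through Alertmanager's v2 API (`POST /api/v2/silences`). The sketch below only builds the request body. The helper name `build_etcd_silence`, the regex `etcd.*` (which assumes the alert names start with `etcd`), and the author value are assumptions for illustration, not taken from the actual monitoring setup.

```python
from datetime import datetime, timedelta, timezone


def build_etcd_silence(hours, author, now=None):
    """Build a JSON body for Alertmanager's POST /api/v2/silences.

    Hypothetical helper: the matcher assumes etcd alert names start with "etcd".
    """
    start = now or datetime.now(timezone.utc)
    end = start + timedelta(hours=hours)
    return {
        "matchers": [
            # Regex matcher on the alert name (assumed naming scheme).
            {"name": "alertname", "value": "etcd.*", "isRegex": True, "isEqual": True}
        ],
        "startsAt": start.isoformat(),
        "endsAt": end.isoformat(),
        "createdBy": author,
        "comment": "Silenced while high latency in gx-scs is investigated (issue #588)",
    }
```

A bounded duration means the silence expires on its own, so the alerts come back if the latency problem is still there.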

michal-gubricky commented 5 months ago

It seems the problem has somehow been fixed for now by PS. Take a look at the attached graphs below. For example, the disk backend commit duration dropped from 9 seconds to a maximum of 1.5 seconds. The total number of leader elections per day also dropped significantly.

I tested the options listed in the issue. Increasing the heartbeat interval and election timeout (option 1) had no effect: the pods kept restarting just as before. Option 2 was not tested. Option 3 does seem to help: after moving etcd to a separate disk, the total number of leader elections dropped and pod restarts almost disappeared.
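For reference, the first option corresponds to etcd's `--heartbeat-interval` and `--election-timeout` flags (both in milliseconds, defaults 100 and 1000). A minimal sketch of how such values could be derived from a measured peer round-trip time, loosely following etcd's tuning guidance (heartbeat around 0.5 to 1.5 times the RTT, election timeout at least 10 times the RTT, capped at 50000 ms). The helper names and the 1.5 and 10 multipliers are illustrative assumptions, not the values used in this test:

```python
def suggest_etcd_timeouts(rtt_ms):
    """Return suggested (heartbeat_ms, election_ms) for a measured peer RTT.

    Hypothetical helper; multipliers are assumptions based on etcd's tuning guide.
    """
    heartbeat = max(100, round(rtt_ms * 1.5))  # never below etcd's 100 ms default
    election = min(50000, heartbeat * 10)      # etcd caps the election timeout at 50 s
    return heartbeat, election


def etcd_flags(rtt_ms):
    """Render the suggestion as etcd command-line flags."""
    heartbeat, election = suggest_etcd_timeouts(rtt_ms)
    return [f"--heartbeat-interval={heartbeat}", f"--election-timeout={election}"]
```

For example, `etcd_flags(200)` yields a 300 ms heartbeat and a 3000 ms election timeout. As noted above, tuning these values did not help in this case.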

The procedure chosen for the third option: