canonical / lxd

Powerful system container and virtual machine manager
https://canonical.com/lxd
GNU Affero General Public License v3.0

doc: cluster-healing clarifications #13374

Open iatrou opened 5 months ago

iatrou commented 5 months ago

https://documentation.ubuntu.com/lxd/en/latest/api-extensions/#cluster-healing

Please clarify what the "threshold" unit is. Seconds? Something else? Furthermore, the default value is 0, disabled. Please provide a recommended value when it's enabled. If there are values that are known to be problematic (e.g. causing false positives) please call it out.


Document: api-extensions.md

github-actions[bot] commented 5 months ago

Heads up @ru-fu - the "Documentation" label was applied to this issue.

ru-fu commented 5 months ago

> https://documentation.ubuntu.com/lxd/en/latest/api-extensions/#cluster-healing
>
> Please clarify what the "threshold" unit is. Seconds? Something else?

It's seconds - see the documentation for the configuration option (which is linked): https://documentation.ubuntu.com/lxd/en/latest/server/#server-cluster:cluster.healing_threshold

There's also some more information here.
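
For reference, a minimal sketch of reading and setting the option with the `lxc` CLI (the 60-second value is purely illustrative, not a recommendation):

```bash
# Show the current cluster healing threshold (0 means healing is disabled)
lxc config get cluster.healing_threshold

# Enable automatic healing once a member has been unresponsive for 60 seconds
# (the value is in seconds; pick one that suits your environment)
lxc config set cluster.healing_threshold 60
```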

> Furthermore, the default value is 0, disabled. Please provide a recommended value when it's enabled. If there are values that are known to be problematic (e.g. causing false positives) please call it out.

Not sure if we can give recommendations, since it depends on the individual setup, size of the cluster, network speed ... Any ideas @tomponline ? Or pointers to who can give more input?

tomponline commented 5 months ago

Thanks @ru-fu

@iatrou, @ru-fu is correct that `cluster.healing_threshold` is environment- and workload-specific.

I would not suggest setting it too low, as any short-lived network issue, or short-lived high load on a single host that causes it to stop responding to heartbeats in a timely manner, could trigger the cluster-healing mechanism for that member.

So my recommendation is to set it as high as the organisation's availability targets allow, to reduce the chance of short-lived "events" triggering an unwanted forced evacuation of a cluster member.
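
To make that trade-off concrete (a sketch only; the 300-second value is illustrative, not an official recommendation), one approach is to observe how long transient "offline" events last in normal operation and then pick a threshold well above that:

```bash
# Watch how often members are transiently reported as offline during
# normal operation (load spikes, upgrades, brief network blips, ...)
lxc cluster list

# Set the threshold comfortably above the longest transient outage observed,
# while still meeting your availability targets
lxc config set cluster.healing_threshold 300
```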

I would also like to draw your attention to some known issues with cluster healing and evacuation:

https://github.com/canonical/lxd/issues/13083 - this is also a risk if the cluster healing threshold is set too low: any manual maintenance window that involves cleanly rebooting a cluster member can result in the member being considered "offline" and evacuated, meaning that its workloads are started up elsewhere. When the member is back up again, it has to be manually restored to service using `lxc cluster restore`, which then moves the workloads back (with disruption to the workloads).
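
One way to sidestep that scenario during planned maintenance (a sketch, assuming a member named `server02`; adjust to your setup) is to evacuate and restore the member deliberately around the reboot, rather than letting healing decide:

```bash
# Planned maintenance: evacuate the member yourself so healing never fires
lxc cluster evacuate server02

# ... reboot / maintain server02 ...

# Bring it back into service and move its workloads back
lxc cluster restore server02
```

Alternatively, temporarily setting `cluster.healing_threshold` to 0 for the duration of the maintenance window disables automatic healing entirely.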

https://github.com/canonical/lxd/issues/12526 - this issue relates to both cluster healing and manual workload recovery (using `lxc move`) when a cluster member becomes partitioned from the rest of the cluster at the network level but is still running. The problem is that the workloads on the original cluster member are not terminated when the network is partitioned, and cluster healing can then cause a workload to be moved to another member and started up while the original workload is still running.

The original workload is most likely hung, because the network partition would likely also block access to the shared storage pool. Things become more complicated with disaggregated MicroCloud setups where the Ceph traffic goes over a different interface than the cluster management traffic, so it is possible to have a cluster network partition while both cluster members still have access to the underlying shared storage system.

In these cases, in my initial experimentation, Ceph should still prevent the workload's volume from being active on both cluster members at the same time, but if this does happen for some reason then we can expect disk corruption for both copies of the workload.

Further work is required in LXD to look into whether we can use stricter Ceph locks on workload volumes to be certain of avoiding this, and whether we can add logic such that the partitioned member kills its active workloads.
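
As a starting point for that investigation, the Ceph CLI can show which clients currently hold locks or watches on the RBD image backing a workload's volume (the pool and image names below are hypothetical and depend on how the storage pool is configured):

```bash
# List advisory locks held on the image backing a given instance volume
rbd lock ls lxd-pool/container_myinstance

# Show which clients currently have the image open (watchers)
rbd status lxd-pool/container_myinstance
```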