elastic / cloud-on-k8s

Elastic Cloud on Kubernetes

Removing nodes that can't join the cluster #7905

Open barkbay opened 3 months ago

barkbay commented 3 months ago

It is not possible for the operator to remove nodes which either:

The reason is that the operator must first retrieve the node id in order to call the shutdown API:

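    for _, node := range leavingNodes {
        // Resolve the Elasticsearch node ID; this fails if the node is not a cluster member.
        nodeID, err := ns.lookupNodeID(node)
        if err != nil {
            return err
        }
        // ... nodeID is then used to call the shutdown API
    }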

But if the node cannot join the cluster, its ID cannot be retrieved, and consequently the operator cannot use the shutdown API.

A symptom of this issue is the following message:

node xxxx-es-xxxx-0 currently not member of the cluster
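To make the failure mode concrete, here is a minimal, self-contained sketch of the pattern. It is not the operator's actual code: `lookupNodeID`, `requestShutdowns`, `errNotMember` and the `members` map (node name to node ID, standing in for what the cluster reports) are hypothetical names, and the error wording simply mirrors the symptom above.

```go
package main

import (
	"errors"
	"fmt"
)

// errNotMember mirrors the symptom quoted in this issue; the error type the
// operator really uses is an assumption.
var errNotMember = errors.New("currently not member of the cluster")

// lookupNodeID is a hypothetical stand-in for ns.lookupNodeID: it resolves a
// node (Pod) name to the ID the node received when it joined the cluster.
func lookupNodeID(members map[string]string, name string) (string, error) {
	id, ok := members[name]
	if !ok {
		return "", fmt.Errorf("node %s %w", name, errNotMember)
	}
	return id, nil
}

// requestShutdowns collects the IDs a shutdown request could be sent for.
// Like the loop quoted above, it returns on the first failed lookup, so a
// single node that never joined blocks the whole downscale.
func requestShutdowns(members map[string]string, leaving []string) ([]string, error) {
	var ids []string
	for _, name := range leaving {
		id, err := lookupNodeID(members, name)
		if err != nil {
			return ids, err
		}
		ids = append(ids, id)
	}
	return ids, nil
}

func main() {
	// Only node 1 has joined; node 0 is stuck and has no ID.
	members := map[string]string{"xxxx-es-xxxx-1": "node-id-1"}
	_, err := requestShutdowns(members, []string{"xxxx-es-xxxx-0", "xxxx-es-xxxx-1"})
	fmt.Println(err)
}
```

In this sketch the healthy node is never reached because the stuck node is processed first and its lookup error aborts the loop.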

This situation was already discussed in this thread, but we never concluded what the most appropriate behaviour would be in that case.

How to solve this?

Data integrity should be one of the operator's top priorities. We should therefore skip the shutdown API and remove the node if and only if we are confident that doing so will not result in data loss.

There are a few situations where I think this can be done:

An alternative would be to improve the shutdown API so we can use the node external id: https://github.com/elastic/elasticsearch/issues/88222

Workaround

In the meantime, I think the only workaround is to manually and gradually downscale the nodeSet by reducing the size of the underlying StatefulSet until the operator can recover. Note that this solution is not ideal: we want StatefulSets to remain an implementation detail that users do not have to manipulate directly.
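As a rough illustration of what "gradually" means here, the sketch below only computes a plan; it does not talk to Kubernetes. It relies on two standard StatefulSet behaviours: Pods are named `<statefulset>-<ordinal>`, and scaling down removes the highest ordinal first. `scaleDownPlan` and `step` are hypothetical names, and applying each step (for example with `kubectl scale`) and waiting for the operator in between is left to the user.

```go
package main

import "fmt"

// step describes one manual downscale action on the StatefulSet.
type step struct {
	Replicas   int32  // replica count to set at this step
	RemovedPod string // Pod that Kubernetes removes as a result
}

// scaleDownPlan lists single-replica steps from current down to target.
// StatefulSets remove the highest ordinal first, so the Pod removed when
// going to r replicas is <statefulSet>-<r>.
func scaleDownPlan(statefulSet string, current, target int32) []step {
	var plan []step
	for r := current - 1; r >= target && r >= 0; r-- {
		plan = append(plan, step{
			Replicas:   r,
			RemovedPod: fmt.Sprintf("%s-%d", statefulSet, r),
		})
	}
	return plan
}

func main() {
	// The StatefulSet name is a placeholder in the style of the log line above.
	for _, s := range scaleDownPlan("xxxx-es-xxxx", 3, 1) {
		fmt.Printf("set replicas=%d (removes %s), then check whether the operator recovers\n",
			s.Replicas, s.RemovedPod)
	}
}
```

Because the removed Pod is always the one with the highest ordinal, a stuck node with a low ordinal can only be reached after the Pods above it have been removed.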

pebrc commented 3 months ago

@barkbay I agree with all of your points above. While browsing our documentation recently I found these instructions:

https://www.elastic.co/guide/en/cloud-on-k8s/current/k8s-common-problems.html#k8s-common-problems-scale-down

They currently target one specific problem, but the workaround they describe goes in the same direction as a potential workaround for the problem discussed here. I wonder if we should add similar instructions for this problem?