elastic / cloud-on-k8s

Elastic Cloud on Kubernetes

[BUG] Elastic operator mistakenly issues statefulset scaledown leading to unexpected node shutdown and data migration #5274

Open srteam2020 opened 2 years ago

srteam2020 commented 2 years ago

Bug Report

We find that the elastic operator can mistakenly scale down the elasticsearch statefulset when it reads stale data, which leads to unexpected node shutdown and data migration.

What did you do? We ran a simple workload with the elastic operator that:

  1. create an elasticsearch cluster elasticsearch with 2 replicas
  2. scale down elasticsearch to 1 replica by setting spec.nodeSets[0].count to 1
  3. scale up elasticsearch back to 2 replicas by setting spec.nodeSets[0].count to 2
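For concreteness, steps 2 and 3 amount to updating spec.nodeSets[0].count on the elasticsearch resource. A rough sketch in Go, assuming the ECK v1 API types (import path github.com/elastic/cloud-on-k8s/pkg/apis/elasticsearch/v1) and an already-constructed controller-runtime client; this is illustrative, not the actual test harness code:

```go
package workload

import (
	"context"

	esv1 "github.com/elastic/cloud-on-k8s/pkg/apis/elasticsearch/v1"
	"k8s.io/apimachinery/pkg/types"
	"sigs.k8s.io/controller-runtime/pkg/client"
)

// scaleNodeSet sets spec.nodeSets[0].count on the cluster named "elasticsearch"
// and pushes the update: count=1 corresponds to step 2, count=2 to step 3.
func scaleNodeSet(ctx context.Context, c client.Client, count int32) error {
	var es esv1.Elasticsearch
	if err := c.Get(ctx, types.NamespacedName{Namespace: "default", Name: "elasticsearch"}, &es); err != nil {
		return err
	}
	es.Spec.NodeSets[0].Count = count
	return c.Update(ctx, &es)
}
```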

What did you expect to see? After scaling up, the elasticsearch cluster should have 2 replicas running stably.

What did you see instead? Under which circumstances? We find that after the operator scales elasticsearch back up to 2 replicas, one of the elasticsearch pods is accidentally deleted by the operator after an operator restart. The unexpected pod deletion is caused by the operator seeing a stale cluster state from an apiserver after its restart. More concretely, after the operator restarts (due to a node crash), it connects to a different apiserver than before the restart. The newly connected apiserver has a stale view in which elasticsearch's spec.nodeSets[0].count field is still 1. The operator therefore sets the statefulset's spec.replicas to 1 in BuildStatefulSet, which further leads to (1) elasticsearch node shutdown and data migration in ReconcileShutdowns and (2) the deletion of the second elasticsearch pod hosted by the statefulset.

Environment

Additional information: We have not found an easy way to fix this staleness problem. Instead, we find that ClientDisableCacheFor (https://github.com/kubernetes-sigs/controller-runtime/blob/f236f0345ad2933912ebf34bfcf0f93620769654/pkg/manager/manager.go#L244) from controller-runtime might help, as it can disable caching for specific resource types. However, eliminating staleness comes at a cost: the controller then has to query the apiserver (and etcd) every time it issues a Get or List for those types, which can introduce extra performance overhead.
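As a rough sketch of that option, assuming the controller-runtime revision linked above (where manager.Options exposes ClientDisableCacheFor) and illustrative ECK import paths; this is not the operator's actual wiring:

```go
package main

import (
	esv1 "github.com/elastic/cloud-on-k8s/pkg/apis/elasticsearch/v1"
	appsv1 "k8s.io/api/apps/v1"
	"k8s.io/apimachinery/pkg/runtime"
	clientgoscheme "k8s.io/client-go/kubernetes/scheme"
	ctrl "sigs.k8s.io/controller-runtime"
	"sigs.k8s.io/controller-runtime/pkg/client"
)

func main() {
	// Register the core types plus the Elasticsearch CRD types.
	scheme := runtime.NewScheme()
	_ = clientgoscheme.AddToScheme(scheme)
	_ = esv1.AddToScheme(scheme)

	// Gets/Lists for the types listed below bypass the informer cache and go
	// straight to the apiserver, at the cost of extra apiserver/etcd load.
	mgr, err := ctrl.NewManager(ctrl.GetConfigOrDie(), ctrl.Options{
		Scheme: scheme,
		ClientDisableCacheFor: []client.Object{
			&esv1.Elasticsearch{},
			&appsv1.StatefulSet{},
		},
	})
	if err != nil {
		panic(err)
	}

	// ... register reconcilers on mgr as usual, then run it.
	if err := mgr.Start(ctrl.SetupSignalHandler()); err != nil {
		panic(err)
	}
}
```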

sebgl commented 2 years ago

Thanks for raising this (and thanks for the awesome work on sieve)! I'm happy that you're looking into this sort of bug.

Relates https://github.com/elastic/cloud-on-k8s/issues/4956 (general stale read problems in controllers), https://github.com/elastic/cloud-on-k8s/issues/4950 (test the operator with sieve) and https://github.com/kubernetes/kubernetes/issues/59848 (upstream k8s issue).

In terms of priority, this is likely less dramatic than it seems, but it can still happen and hence deserves to be fixed.

Disabling the cache seems to be a rather extreme option? An option that seems more reasonable to me would be to ensure the initial List (before Watch) is populated from a quorum read rather than from the apiserver cache; I'm not sure how easy that is to set up, though.
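For reference, a List issued without a resourceVersion is served as a consistent (quorum) read from etcd, whereas the reflector's initial List typically uses resourceVersion=0 and is answered from the apiserver watch cache. A minimal sketch of such a direct, uncached List using a non-caching controller-runtime client follows; it does not show how to plug this into the informer's initial List, which is the open question:

```go
package main

import (
	"context"
	"fmt"

	appsv1 "k8s.io/api/apps/v1"
	ctrl "sigs.k8s.io/controller-runtime"
	"sigs.k8s.io/controller-runtime/pkg/client"
)

// listStatefulSetsQuorum lists StatefulSets directly against the apiserver.
// Because no resourceVersion is set on the request, the apiserver answers
// with a consistent (quorum) read from etcd instead of its watch cache.
func listStatefulSetsQuorum(ctx context.Context, namespace string) (*appsv1.StatefulSetList, error) {
	// client.New returns a non-caching client that always talks to the apiserver.
	c, err := client.New(ctrl.GetConfigOrDie(), client.Options{})
	if err != nil {
		return nil, err
	}
	var list appsv1.StatefulSetList
	if err := c.List(ctx, &list, client.InNamespace(namespace)); err != nil {
		return nil, err
	}
	return &list, nil
}

func main() {
	list, err := listStatefulSetsQuorum(context.Background(), "default")
	if err != nil {
		panic(err)
	}
	fmt.Printf("observed %d StatefulSets\n", len(list.Items))
}
```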

srteam2020 commented 2 years ago

Hello @sebgl, thanks for the reply.

Yes, disabling the cache seems to be an extreme option. We are also looking into how to enforce that the initial List is populated from a quorum read, but have not found a simple solution yet.

julienlau commented 2 years ago

Is it specific to this Kubernetes version, or did you just not have the time/resources to test other configurations?

srteam2020 commented 2 years ago

The problem is not specific to this k8s version, because the "time-travel" problem affects multiple k8s versions. And, per the most recent response in the original issue, a controller can still go back in time after a restart even in the most recent version. Our default testing environment is k8s 1.18.9, but we can definitely test other versions of k8s.