Open srteam2020 opened 2 years ago
Thanks for raising this (and thanks for the awesome work on sieve)! I'm happy that you're looking into this sort of bug.
Relates to https://github.com/elastic/cloud-on-k8s/issues/4956 (general stale read problems in controllers), https://github.com/elastic/cloud-on-k8s/issues/4950 (test the operator with sieve) and https://github.com/kubernetes/kubernetes/issues/59848 (upstream k8s issue).
In terms of priority, this is likely less dramatic than it seems, but it can still happen and hence deserves to be fixed.
Disabling the cache seems to be a rather extreme option? A more reasonable option, to me, would be to ensure the initial List (before Watch) is populated from a quorum read rather than from the apiserver cache; I'm not sure how easy that is to set up, though.
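To illustrate the quorum-read idea, here is a minimal client-go sketch (the function name and the choice of statefulsets are just for the example; the hard part of wiring this into the reflector's initial List is not shown). A List with resourceVersion left unset must be served from etcd via a quorum read, while resourceVersion "0" lets the apiserver answer from its possibly stale watch cache, which is what reflectors use for their initial List by default.

```go
package example

import (
	"context"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
)

// listStatefulSetsQuorum lists statefulsets with ResourceVersion left unset,
// which the apiserver must serve as a consistent (quorum) read from etcd.
// ResourceVersion "0" would instead allow an answer from the apiserver's
// possibly stale watch cache.
func listStatefulSetsQuorum(ctx context.Context, cs kubernetes.Interface, ns string) error {
	_, err := cs.AppsV1().StatefulSets(ns).List(ctx, metav1.ListOptions{
		ResourceVersion: "", // unset => quorum read
	})
	return err
}
```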
Hello @sebgl, thanks for the reply.
Yes, disabling the cache seems to be an extreme option. We are also looking into how to ensure the initial List is populated from a quorum read, but we have not found a simple solution yet.
Is it specific to this Kubernetes version, or did you just not have the time/resources to test other configurations?
The problem is not specific to this k8s version: the "time-travel" issue affects multiple k8s versions, and according to the most recent response in the original issue, a controller can still go back in time after a restart even in the most recent version. Our default testing environment is k8s 1.18.9, but we can definitely test other versions of k8s.
Bug Report
We find that, when reading stale data, the elastic operator can mistakenly scale down the elasticsearch statefulset, which leads to unexpected node shutdown and data migration.
What did you do? We ran a simple workload with the elastic operator that:
1. creates elasticsearch with 2 replicas
2. scales elasticsearch down to 1 replica by setting spec.nodeSets[0].count to 1
3. scales elasticsearch back up to 2 replicas by setting spec.nodeSets[0].count to 2 (see the sketch below)
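For reference, a rough Go sketch of the resource manipulated in these steps (the import path, resource name, namespace, and Elasticsearch version are assumptions for illustration; the workload only toggles spec.nodeSets[0].count):

```go
package example

import (
	esv1 "github.com/elastic/cloud-on-k8s/pkg/apis/elasticsearch/v1" // import path assumed
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

// exampleElasticsearch builds the Elasticsearch resource used by the workload;
// the steps above toggle Spec.NodeSets[0].Count between 2, 1, and 2 again.
func exampleElasticsearch(count int32) esv1.Elasticsearch {
	return esv1.Elasticsearch{
		ObjectMeta: metav1.ObjectMeta{Name: "elasticsearch", Namespace: "default"},
		Spec: esv1.ElasticsearchSpec{
			Version:  "7.10.0", // version is illustrative only
			NodeSets: []esv1.NodeSet{{Name: "default", Count: count}},
		},
	}
}
```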
What did you expect to see? After scaling up, the elasticsearch cluster should have 2 replicas running stably.
What did you see instead? Under which circumstances? We find that after the operator scales elasticsearch back up to 2 replicas, one of the elasticsearch pods is accidentally deleted by the operator following an operator restart. The unexpected pod deletion is caused by the operator seeing a stale cluster state from an apiserver after its restart. More concretely, after the operator restarts (due to a node crash), it connects to a different apiserver than before the restart. The newly connected apiserver has a stale view where the elasticsearch resource's spec.nodeSets[0].count field is still 1. Thus, the operator sets the statefulset's spec.replicas to 1 in BuildStatefulSet, which further leads to (1) elasticsearch node shutdown and data migration in ReconcileShutdowns and (2) the deletion of the second elasticsearch pod hosted by the statefulset.
Environment
ECK version:
660bc92fbfca469af552a833d8f6a4834c629649
Kubernetes information: k8s 1.18.9 (our default testing environment)
Logs: We have seen the following logs showing that the operator shuts down the elasticsearch node and scales down the statefulset:
Additional information: We have not found an easy way to fix this staleness problem. Instead, we find that the ClientDisableCacheFor option from controller-runtime (https://github.com/kubernetes-sigs/controller-runtime/blob/f236f0345ad2933912ebf34bfcf0f93620769654/pkg/manager/manager.go#L244) might help, as it can disable caching for certain types of resources. However, eliminating staleness comes with a cost: the controller then has to query the apiserver (and etcd) every time it issues a Get or List, which can introduce extra performance overhead.
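As a rough sketch of that option (assuming the Elasticsearch types are registered in the manager's scheme; the ECK import path below is an assumption), the manager could be configured so that reads of Elasticsearch objects bypass the cache:

```go
package example

import (
	esv1 "github.com/elastic/cloud-on-k8s/pkg/apis/elasticsearch/v1" // import path assumed
	ctrl "sigs.k8s.io/controller-runtime"
	"sigs.k8s.io/controller-runtime/pkg/client"
)

// newUncachedESManager builds a manager whose client never caches Elasticsearch
// objects: every Get/List for that type is served by the apiserver (and etcd)
// instead of the informer cache, trading extra latency for fresh reads.
func newUncachedESManager() (ctrl.Manager, error) {
	return ctrl.NewManager(ctrl.GetConfigOrDie(), ctrl.Options{
		ClientDisableCacheFor: []client.Object{&esv1.Elasticsearch{}},
	})
}
```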