coreos / etcd-operator

etcd operator creates/configures/manages etcd clusters atop Kubernetes
https://coreos.com/blog/introducing-the-etcd-operator.html
Apache License 2.0

etcd-client k8s service fails to start after etcd restore operation #2087

Open eLco opened 5 years ago

eLco commented 5 years ago

name: Bug Report labels: kind/bug

What happened:

I'm trying to restore etcd from an S3 backup made by backup-operator. I deploy a fresh, empty etcd cluster using the etcd-operator Helm chart: https://github.com/helm/charts/tree/master/stable/etcd-operator

After that I create an EtcdRestore CR, and restore-operator begins the restore operation. During that operation, restore-operator tries to clean up the pods and service of the initial etcd cluster, and etcd-operator itself should recreate them. However, it fails to recreate the client service named "clusterName-client", because that service still exists at that moment. The etcd-operator code tries to create the service only once and ignores the "IsAlreadyExists" error, so it silently passes over the failure. https://github.com/coreos/etcd-operator/blob/master/pkg/util/k8sutil/k8sutil.go#L189
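To make the failure mode concrete, here is a minimal sketch in Go. It does not use the real Kubernetes client: `fakeServices`, `createOnce`, and `errAlreadyExists` are hypothetical stand-ins, and the service name `example-client` is made up. It also assumes the old service is removed only after the create attempt; the report does not say what removes it.

```go
package main

import (
	"errors"
	"fmt"
)

// errAlreadyExists stands in for the Kubernetes "AlreadyExists" API error.
var errAlreadyExists = errors.New("already exists")

// fakeServices is a hypothetical in-memory stand-in for the Service API.
// The bool value records whether a service is pending removal.
type fakeServices map[string]bool

// Create fails with errAlreadyExists while any object holds the name,
// including one that is pending removal.
func (s fakeServices) Create(name string) error {
	if _, ok := s[name]; ok {
		return errAlreadyExists
	}
	s[name] = false
	return nil
}

// Delete only marks the service as pending removal (assumed behavior).
func (s fakeServices) Delete(name string) {
	if _, ok := s[name]; ok {
		s[name] = true
	}
}

// Finalize removes every service that is pending removal.
func (s fakeServices) Finalize() {
	for name, pending := range s {
		if pending {
			delete(s, name)
		}
	}
}

// createOnce mirrors the behavior described in the report: one attempt,
// with errAlreadyExists treated as success.
func createOnce(s fakeServices, name string) error {
	if err := s.Create(name); err != nil && !errors.Is(err, errAlreadyExists) {
		return err
	}
	return nil
}

func main() {
	svcs := fakeServices{"example-client": false}
	svcs.Delete("example-client")             // cleanup starts
	err := createOnce(svcs, "example-client") // recreated too early
	svcs.Finalize()                           // old service disappears
	_, exists := svcs["example-client"]
	fmt.Println("error:", err, "service exists:", exists)
	// prints "error: <nil> service exists: false"
}
```

The create call reports success, yet no service is left afterwards, which matches the silent failure described above.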

https://github.com/coreos/etcd-operator/commit/be0c3acb50a902acd73960bd61221e80f50bdcb6#diff-46acd69e36758f5f5c27664b895b2bc3

There is no longer a DeleteCollection operation on services. There are two services, "clusterName" and "clusterName-client", but the code deletes only the one named "clusterName". For pods this works fine, because pods have unique names and new pods can be created with the same prefix. The restore works on Kubernetes 1.13 and fails on Kubernetes 1.14; I couldn't find what changed in delete operations between those releases.
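One possible direction, sketched below in Go, is to delete both services by name during cleanup. This is only an illustration under stated assumptions, not the project's actual fix: `serviceStore`, `clusterServiceNames`, and `deleteClusterServices` are hypothetical names, and the in-memory map stands in for the Kubernetes Service API.

```go
package main

import "fmt"

// serviceStore is a hypothetical in-memory stand-in for the Service API.
type serviceStore map[string]struct{}

// clusterServiceNames lists the two services named in the report:
// "clusterName" and "clusterName-client".
func clusterServiceNames(clusterName string) []string {
	return []string{clusterName, clusterName + "-client"}
}

// deleteClusterServices removes both services by name and returns the
// names that were actually present.
func deleteClusterServices(store serviceStore, clusterName string) []string {
	var deleted []string
	for _, name := range clusterServiceNames(clusterName) {
		if _, ok := store[name]; ok {
			delete(store, name)
			deleted = append(deleted, name)
		}
	}
	return deleted
}

func main() {
	store := serviceStore{"example": {}, "example-client": {}}
	fmt.Println(deleteClusterServices(store, "example"), len(store))
	// prints "[example example-client] 0"
}
```

Deleting by explicit name avoids relying on a collection delete, at the cost of having to keep the list of names in sync with whatever services the operator creates.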

What you expected to happen:

Both etcd k8s services are deployed after restore operation.

How to reproduce it (as minimally and precisely as possible):

Deploy etcd-operator from the official Helm chart on Kubernetes 1.14.2, make a backup to S3 using backup-operator, redeploy etcd-operator, and create an EtcdRestore CR pointing to the backup on S3.

Anything else we need to know?:

etcd-operator version: v0.9.4

Restore operation works fine on Kubernetes 1.13.6

Environment: