StackStorm / stackstorm-k8s

K8s Helm Chart that codifies a StackStorm (aka "IFTTT for Ops" https://stackstorm.com/) High Availability fleet as a simple-to-use, reproducible infrastructure-as-code app
https://helm.stackstorm.com/
Apache License 2.0

'stable/etcd-operator' is not really stable for [coordination] #94

Closed — arm4b closed this issue 3 years ago

arm4b commented 4 years ago

We saw the following in the logs when the cluster couldn't recover, or even start cleanly, after all etcd pods were killed:

level=warning msg="all etcd pods are dead." cluster-name=etcd-cluster cluster-namespace=default pkg=cluster

etcd-operator does not recover from this situation: https://github.com/coreos/etcd-operator/blob/8347d27afa18b6c76d4a8bb85ad56a2e60927018/pkg/cluster/cluster.go#L248-L252

Researching further, it looks like there are quite a few cases where etcd-operator can't recover on its own:


Since this backend is needed only for short-lived coordination locks, should we consider switching to Redis, or even a single-instance etcd like it was before (https://github.com/StackStorm/stackstorm-ha/pull/52)?

trstruth commented 4 years ago

:(

danielburrell commented 4 years ago

Further to this, the etcd-operator project has just been archived and is now in maintenance mode. It's not good to have unmaintained components slowly rot over time. Can we drop the etcd-operator dependency somehow?

I wrote about this earlier here: https://forum.stackstorm.com/t/etcd-operator-project-archived/1140

arm4b commented 4 years ago

@danielburrell Thanks for letting us know about etcd-operator's archived state.

Yes, the next step would be identifying another coordination backend that has good Helm charts and works well. The best bet is Redis for now, but it could also be Memcached or another alternative from https://docs.openstack.org/tooz/latest/user/drivers.html
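For illustration, swapping the coordination backend mostly comes down to pointing StackStorm's tooz connection URL at Redis instead of etcd. A minimal sketch of the relevant `st2.conf` section, assuming a Redis service reachable at `redis:6379` (that hostname and port are placeholders, not chart defaults):

```ini
# st2.conf -- coordination backend used for short-lived distributed locks
# (sketch; host/port below are assumptions for a hypothetical Redis service)
[coordination]
url = redis://redis:6379
```

Any tooz-supported driver URL (e.g. `etcd3://`, `memcached://`) could go in the same place, which is what makes trying alternatives relatively cheap.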