Scaling down the statefulset to 0 and then scaling back up to 3 returns the etcd cluster to a healthy state, given there is a snapshot to restore from. However, we should look into this flaky behaviour. Perhaps the prestop-hook.sh is not behaving as expected?
Here are the logs for one of the etcd nodes before restarting:
2020-07-22 18:43:11.635930 W | rafthttp: lost the TCP streaming connection with peer 1cf9e1c4e2097dd3 (stream MsgApp v2 reader)
raft2020/07/22 18:43:11 INFO: f576434d574a84a6 switched to configuration voters=(2087948137985048019 3186439143105623338)
2020-07-22 18:43:11.636132 W | rafthttp: lost the TCP streaming connection with peer 2c388184a735f52a (stream MsgApp v2 reader)
2020-07-22 18:43:11.636152 I | etcdserver/membership: removed member f576434d574a84a6 from cluster 28db4cf9c4be26fd
2020-07-22 18:43:11.636450 W | rafthttp: lost the TCP streaming connection with peer 2c388184a735f52a (stream Message reader)
2020-07-22 18:43:11.642567 E | rafthttp: failed to dial 2c388184a735f52a on stream MsgApp v2 (the member has been permanently removed from the cluster)
2020-07-22 18:43:11.642583 I | rafthttp: peer 2c388184a735f52a became inactive (message send to peer failed)
2020-07-22 18:43:11.642600 E | etcdserver: the member has been permanently removed from the cluster
2020-07-22 18:43:11.642606 I | etcdserver: the data-dir used by this member must be removed.
2020-07-22 18:43:11.642650 I | rafthttp: stopped HTTP pipelining with peer 2c388184a735f52a
2020-07-22 18:43:11.642748 I | rafthttp: stopped HTTP pipelining with peer 1cf9e1c4e2097dd3
2020-07-22 18:43:11.642755 I | rafthttp: stopping peer 1cf9e1c4e2097dd3...
2020-07-22 18:43:11.643134 I | rafthttp: closed the TCP streaming connection with peer 1cf9e1c4e2097dd3 (stream MsgApp v2 writer)
2020-07-22 18:43:11.643146 I | rafthttp: stopped streaming with peer 1cf9e1c4e2097dd3 (writer)
2020-07-22 18:43:11.643609 I | rafthttp: closed the TCP streaming connection with peer 1cf9e1c4e2097dd3 (stream Message writer)
2020-07-22 18:43:11.643621 I | rafthttp: stopped streaming with peer 1cf9e1c4e2097dd3 (writer)
2020-07-22 18:43:11.643736 I | rafthttp: stopped HTTP pipelining with peer 1cf9e1c4e2097dd3
2020-07-22 18:43:11.643796 E | rafthttp: failed to dial 1cf9e1c4e2097dd3 on stream MsgApp v2 (context canceled)
2020-07-22 18:43:11.643804 I | rafthttp: peer 1cf9e1c4e2097dd3 became inactive (message send to peer failed)
2020-07-22 18:43:11.643816 I | rafthttp: stopped streaming with peer 1cf9e1c4e2097dd3 (stream MsgApp v2 reader)
2020-07-22 18:43:11.643874 W | rafthttp: lost the TCP streaming connection with peer 1cf9e1c4e2097dd3 (stream Message reader)
2020-07-22 18:43:11.643887 I | rafthttp: stopped streaming with peer 1cf9e1c4e2097dd3 (stream Message reader)
2020-07-22 18:43:11.643895 I | rafthttp: stopped peer 1cf9e1c4e2097dd3
2020-07-22 18:43:11.643901 I | rafthttp: stopping peer 2c388184a735f52a...
2020-07-22 18:43:11.644686 I | rafthttp: closed the TCP streaming connection with peer 2c388184a735f52a (stream MsgApp v2 writer)
2020-07-22 18:43:11.644696 I | rafthttp: stopped streaming with peer 2c388184a735f52a (writer)
2020-07-22 18:43:11.645004 I | rafthttp: closed the TCP streaming connection with peer 2c388184a735f52a (stream Message writer)
2020-07-22 18:43:11.645011 I | rafthttp: stopped streaming with peer 2c388184a735f52a (writer)
2020-07-22 18:43:11.645248 I | rafthttp: stopped HTTP pipelining with peer 2c388184a735f52a
2020-07-22 18:43:11.645262 I | rafthttp: stopped streaming with peer 2c388184a735f52a (stream MsgApp v2 reader)
2020-07-22 18:43:11.645319 I | rafthttp: stopped streaming with peer 2c388184a735f52a (stream Message reader)
2020-07-22 18:43:11.645329 I | rafthttp: stopped peer 2c388184a735f52a
May be related to https://github.com/bitnami/charts/issues/1908
Thanks, @mboutet, @rgarcia89. We will try to reproduce it and look into it.
Hi @mboutet @rgarcia89
I was responsible for these changes. They were meant to address this issue, since the snapshotter wasn't working properly when there was only one replica. However, I don't see how these changes could break this...
This error suggests that the "$ETCD_DATA_DIR/member_id" file was not correctly created:
Error: bad member ID arg (strconv.ParseUint: parsing "": invalid syntax), expecting ID in Hex
That file is created using this function that hasn't been modified recently:
store_member_id() {
    while ! etcdctl $AUTH_OPTIONS member list; do sleep 1; done
    etcdctl $AUTH_OPTIONS member list | grep -w "$HOSTNAME" | awk '{ print $1}' | awk -F "," '{ print $1}' > "$ETCD_DATA_DIR/member_id"
    echo "==> Stored member id: $(cat ${ETCD_DATA_DIR}/member_id)" 1>&3 2>&4
    exit 0
}
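That failure mode would be consistent with this function writing an empty file: if grep -w "$HOSTNAME" matches nothing (for example because the member was already removed from the cluster), the redirection still creates member_id with empty contents, and a later etcdctl member update call then receives an empty ID, which matches the "bad member ID arg ... expecting ID in Hex" error above. A minimal defensive variant could validate the extracted ID before storing it (a sketch only, not the chart's actual code):
store_member_id() {
    # Wait until the member list can be retrieved at all
    while ! etcdctl $AUTH_OPTIONS member list; do sleep 1; done
    # Extract this member's ID (first comma-separated field of the matching row)
    local member_id
    member_id="$(etcdctl $AUTH_OPTIONS member list | grep -w "$HOSTNAME" | awk -F "," '{ print $1 }')"
    if [[ -z "$member_id" ]]; then
        # Hypothetical guard: refuse to write an empty ID instead of failing later
        echo "==> Could not determine member id for $HOSTNAME" 1>&3 2>&4
        return 1
    fi
    echo "$member_id" > "$ETCD_DATA_DIR/member_id"
    echo "==> Stored member id: $member_id" 1>&3 2>&4
}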
I need to continue looking into this
Acknowledged with bitnami/etcd:3.4.9-debian-10-r45 on K8s 1.18.3.
We drained the node where etcd-0 was located. K8s moved etcd-0 to another node, but it hangs in CrashLoopBackOff.
2020-07-23 07:25:52.759821 I | etcdserver/membership: added member 45a18acb10aa275e [http://etcd-2.etcd-headless.default.svc.cluster.local:2380] to cluster 9e98e654be1e22d7 from store
2020-07-23 07:25:52.759843 I | etcdserver/membership: added member 936ce633ac273d75 [http://etcd-1.etcd-headless.default.svc.cluster.local:2380] to cluster 9e98e654be1e22d7 from store
2020-07-23 07:25:52.759851 I | etcdserver/membership: added member df3b2df95cd5fd29 [http://etcd-0.etcd-headless.default.svc.cluster.local:2380] to cluster 9e98e654be1e22d7 from store
2020-07-23 07:25:52.763332 W | auth: simple token is not cryptographically signed
2020-07-23 07:25:52.777193 I | rafthttp: starting peer 45a18acb10aa275e...
2020-07-23 07:25:52.777266 I | rafthttp: started HTTP pipelining with peer 45a18acb10aa275e
2020-07-23 07:25:52.777665 I | rafthttp: started streaming with peer 45a18acb10aa275e (writer)
2020-07-23 07:25:52.777829 I | rafthttp: started streaming with peer 45a18acb10aa275e (writer)
2020-07-23 07:25:52.780638 I | rafthttp: started streaming with peer 45a18acb10aa275e (stream MsgApp v2 reader)
2020-07-23 07:25:52.780679 I | rafthttp: started streaming with peer 45a18acb10aa275e (stream Message reader)
2020-07-23 07:25:52.781007 I | rafthttp: started peer 45a18acb10aa275e
2020-07-23 07:25:52.781054 I | rafthttp: added peer 45a18acb10aa275e
2020-07-23 07:25:52.781068 I | rafthttp: starting peer 936ce633ac273d75...
2020-07-23 07:25:52.781670 I | rafthttp: started HTTP pipelining with peer 936ce633ac273d75
2020-07-23 07:25:52.782023 I | rafthttp: started streaming with peer 936ce633ac273d75 (writer)
2020-07-23 07:25:52.782158 I | rafthttp: started streaming with peer 936ce633ac273d75 (writer)
2020-07-23 07:25:52.783324 I | rafthttp: started peer 936ce633ac273d75
2020-07-23 07:25:52.783520 I | rafthttp: added peer 936ce633ac273d75
2020-07-23 07:25:52.783564 I | etcdserver: starting server... [version: 3.4.9, cluster version: to_be_decided]
2020-07-23 07:25:52.783687 I | rafthttp: started streaming with peer 936ce633ac273d75 (stream Message reader)
2020-07-23 07:25:52.783982 I | rafthttp: started streaming with peer 936ce633ac273d75 (stream MsgApp v2 reader)
2020-07-23 07:25:52.785088 E | etcdserver: the member has been permanently removed from the cluster
2020-07-23 07:25:52.785152 I | etcdserver: the data-dir used by this member must be removed.
2020-07-23 07:25:52.785233 I | etcdserver: aborting publish because server is stopped
2020-07-23 07:25:52.785264 I | rafthttp: stopping peer 45a18acb10aa275e...
2020-07-23 07:25:52.785281 I | rafthttp: stopped streaming with peer 45a18acb10aa275e (writer)
2020-07-23 07:25:52.785290 I | rafthttp: stopped streaming with peer 45a18acb10aa275e (writer)
2020-07-23 07:25:52.785349 I | rafthttp: stopped HTTP pipelining with peer 45a18acb10aa275e
2020-07-23 07:25:52.785380 I | rafthttp: stopped streaming with peer 45a18acb10aa275e (stream MsgApp v2 reader)
2020-07-23 07:25:52.785400 I | rafthttp: stopped streaming with peer 45a18acb10aa275e (stream Message reader)
2020-07-23 07:25:52.785405 I | rafthttp: stopped peer 45a18acb10aa275e
2020-07-23 07:25:52.785410 I | rafthttp: stopping peer 936ce633ac273d75...
2020-07-23 07:25:52.785424 I | rafthttp: stopped streaming with peer 936ce633ac273d75 (writer)
2020-07-23 07:25:52.785435 I | rafthttp: stopped streaming with peer 936ce633ac273d75 (writer)
2020-07-23 07:25:52.785461 I | rafthttp: stopped HTTP pipelining with peer 936ce633ac273d75
2020-07-23 07:25:52.785501 I | rafthttp: stopped streaming with peer 936ce633ac273d75 (stream MsgApp v2 reader)
2020-07-23 07:25:52.785550 I | rafthttp: stopped streaming with peer 936ce633ac273d75 (stream Message reader)
2020-07-23 07:25:52.785565 I | rafthttp: stopped peer 936ce633ac273d75
2020-07-23 07:25:52.790550 I | embed: listening for peers on [::]:2380
2020-07-23 07:25:52.790843 E | rafthttp: failed to find member 45a18acb10aa275e in cluster 9e98e654be1e22d7
2020-07-23 07:25:52.790945 E | rafthttp: failed to find member 45a18acb10aa275e in cluster 9e98e654be1e22d7
Acknowledged the same problem, but triggered manually: with 3 replicas, if you delete one pod it comes back up, but if you delete two, those two start failing in a loop and never come up. Images used: bitnami/etcd:3.4.9-debian-10-r52, bitnami/etcd:3.4.9; chart: 4.8.12.
Hi, I suffer from the same issue...
You can go back and use --version 4.8.10 to get it running again.
I meant I'm suffering from the same issue: whenever a pod dies, it can't re-join the cluster.
I know, but have you rolled back to the old Helm chart version and set the cluster state variable to existing? I added this a few merge requests ago. With that fix it works, at least as long as you stay below Helm chart version 4.8.11.
I just tried that; the effect on the rollout looks like this: the first node goes down, gets replaced, and joins the cluster fine. Then the second node goes down and can't rejoin the cluster... The rollout doesn't reach the third node...
Can you show me your helm install command as well as a kubectl describe of the etcd pods that are not starting?
The upgrade command:
helm upgrade etcd-jenkins bitnami/etcd -f values-production.yaml --set etcd.initialClusterState=existing --version 4.8.10
Describing the failed pod:
Name: etcd-jenkins-1
Namespace: default
Priority: 0
Node: ip-IP.ec2.internal/IP
Start Time: Thu, 23 Jul 2020 11:33:17 +0000
Labels: app.kubernetes.io/instance=etcd-jenkins
app.kubernetes.io/managed-by=Helm
app.kubernetes.io/name=etcd
controller-revision-hash=etcd-jenkins-b9478bbd9
helm.sh/chart=etcd-4.8.10
statefulset.kubernetes.io/pod-name=etcd-jenkins-1
Annotations: kubernetes.io/psp: eks.privileged
prometheus.io/port: 2379
prometheus.io/scrape: true
Status: Running
IP: IP
IPs:
IP: IP
Controlled By: StatefulSet/etcd-jenkins
Containers:
etcd:
Container ID: docker://a4a0a111ea9a11abbbcaedc5ec4179782415ba355d97c93c928205c24c385003
Image: docker.io/bitnami/etcd:3.4.9-debian-10-r46
Image ID: docker-pullable://bitnami/etcd@sha256:4369300e9c2f55312bf059a44235c00f86c121fd3c6f9f33ee5cfdfd773ea76d
Ports: 2379/TCP, 2380/TCP
Host Ports: 0/TCP, 0/TCP
Command:
/scripts/setup.sh
State: Waiting
Reason: CrashLoopBackOff
Last State: Terminated
Reason: Completed
Exit Code: 0
Started: Thu, 23 Jul 2020 11:39:54 +0000
Finished: Thu, 23 Jul 2020 11:40:01 +0000
Ready: False
Restart Count: 6
Liveness: exec [/scripts/probes.sh] delay=60s timeout=5s period=30s #success=1 #failure=5
Readiness: exec [/scripts/probes.sh] delay=60s timeout=5s period=10s #success=1 #failure=5
Environment:
BITNAMI_DEBUG: true
MY_POD_IP: (v1:status.podIP)
MY_POD_NAME: etcd-jenkins-1 (v1:metadata.name)
ETCDCTL_API: 3
ETCD_NAME: $(MY_POD_NAME)
ETCD_DATA_DIR: /bitnami/etcd/data
ETCD_ADVERTISE_CLIENT_URLS: http://$(MY_POD_NAME).etcd-jenkins-headless.default.svc.cluster.local:2379
ETCD_LISTEN_CLIENT_URLS: http://0.0.0.0:2379
ETCD_INITIAL_ADVERTISE_PEER_URLS: https://$(MY_POD_NAME).etcd-jenkins-headless.default.svc.cluster.local:2380
ETCD_LISTEN_PEER_URLS: https://0.0.0.0:2380
ETCD_INITIAL_CLUSTER_TOKEN: etcd-cluster-k8s
ETCD_INITIAL_CLUSTER_STATE: existing
ETCD_INITIAL_CLUSTER: etcd-jenkins-0=https://etcd-jenkins-0.etcd-jenkins-headless.default.svc.cluster.local:2380,etcd-jenkins-1=https://etcd-jenkins-1.etcd-jenkins-headless.default.svc.cluster.local:2380,etcd-jenkins-2=https://etcd-jenkins-2.etcd-jenkins-headless.default.svc.cluster.local:2380,
ALLOW_NONE_AUTHENTICATION: yes
ETCD_PEER_AUTO_TLS: true
Mounts:
/bitnami/etcd from data (rw)
/scripts/prestop-hook.sh from scripts (rw,path="prestop-hook.sh")
/scripts/probes.sh from scripts (rw,path="probes.sh")
/scripts/setup.sh from scripts (rw,path="setup.sh")
/var/run/secrets/kubernetes.io/serviceaccount from default-token-qbjnq (ro)
Conditions:
Type Status
Initialized True
Ready False
ContainersReady False
PodScheduled True
Volumes:
data:
Type: PersistentVolumeClaim (a reference to a PersistentVolumeClaim in the same namespace)
ClaimName: data-etcd-jenkins-1
ReadOnly: false
scripts:
Type: ConfigMap (a volume populated by a ConfigMap)
Name: etcd-jenkins-scripts
Optional: false
default-token-qbjnq:
Type: Secret (a volume populated by a Secret)
SecretName: default-token-qbjnq
Optional: false
QoS Class: BestEffort
Node-Selectors: <none>
Tolerations: node.kubernetes.io/not-ready:NoExecute for 300s
node.kubernetes.io/unreachable:NoExecute for 300s
Events:
Type Reason Age From Message
---- ------ ---- ---- -------
Normal Scheduled <unknown> default-scheduler Successfully assigned default/etcd-jenkins-1 to ip-IP.ec2.internal
Normal SuccessfulAttachVolume 9m15s attachdetach-controller AttachVolume.Attach succeeded for volume "pvc-3bcaa7a1-9bbe-48ed-bfdf-75775c1ee7b2"
Normal Pulled 7m18s (x5 over 9m13s) kubelet, ip-IP.ec2.internal Container image "docker.io/bitnami/etcd:3.4.9-debian-10-r46" already present on machine
Normal Created 7m18s (x5 over 9m13s) kubelet, ip-IP-ec2.internal Created container etcd
Normal Started 7m18s (x5 over 9m13s) kubelet, ip-IP.ec2.internal Started container etcd
Warning BackOff 4m8s (x21 over 8m55s) kubelet, ip-IP.ec2.internal Back-off restarting failed container
Same Issue here with 4.8.10:
Helm List:
etcd 1 Thu Jul 23 11:46:05 2020 DEPLOYED etcd-4.8.10 3.4.9 default
Install command:
helm install --name etcd bitnami/etcd -f etcdvalues.yaml --version 4.8.10
Pod description:
Namespace: default
Priority: 0
Node: perftest-w6/10.83.19.18
Start Time: Thu, 23 Jul 2020 11:48:12 +0000
Labels: app.kubernetes.io/instance=etcd
app.kubernetes.io/managed-by=Tiller
app.kubernetes.io/name=etcd
controller-revision-hash=etcd-85f5c67bf
helm.sh/chart=etcd-4.8.10
statefulset.kubernetes.io/pod-name=etcd-0
Annotations: prometheus.io/port: 2379
prometheus.io/scrape: true
Status: Running
IP: 10.244.8.215
IPs:
IP: 10.244.8.215
Controlled By: StatefulSet/etcd
Containers:
etcd:
Container ID: docker://e581a007d63b7021d35c24ce39fc234fbfe2102ffe41308d667bea04ce32280a
Image: docker.io/bitnami/etcd:3.4.9-debian-10-r46
Image ID: docker-pullable://bitnami/etcd@sha256:4369300e9c2f55312bf059a44235c00f86c121fd3c6f9f33ee5cfdfd773ea76d
Ports: 2379/TCP, 2380/TCP
Host Ports: 0/TCP, 0/TCP
Command:
/scripts/setup.sh
State: Waiting
Reason: CrashLoopBackOff
Last State: Terminated
Reason: Completed
Exit Code: 0
Started: Thu, 23 Jul 2020 11:49:16 +0000
Finished: Thu, 23 Jul 2020 11:49:16 +0000
Ready: False
Restart Count: 3
Liveness: exec [/scripts/probes.sh] delay=60s timeout=5s period=30s #success=1 #failure=5
Readiness: exec [/scripts/probes.sh] delay=60s timeout=5s period=10s #success=1 #failure=5
Environment:
BITNAMI_DEBUG: false
MY_POD_IP: (v1:status.podIP)
MY_POD_NAME: etcd-0 (v1:metadata.name)
ETCDCTL_API: 3
ETCD_NAME: $(MY_POD_NAME)
ETCD_DATA_DIR: /bitnami/etcd/data
ETCD_ADVERTISE_CLIENT_URLS: http://$(MY_POD_NAME).etcd-headless.default.svc.cluster.local:2379
ETCD_LISTEN_CLIENT_URLS: http://0.0.0.0:2379
ETCD_INITIAL_ADVERTISE_PEER_URLS: http://$(MY_POD_NAME).etcd-headless.default.svc.cluster.local:2380
ETCD_LISTEN_PEER_URLS: http://0.0.0.0:2380
ETCD_INITIAL_CLUSTER_TOKEN: etcd-cluster-k8s
ETCD_INITIAL_CLUSTER_STATE: new
ETCD_INITIAL_CLUSTER: etcd-0=http://etcd-0.etcd-headless.default.svc.cluster.local:2380,etcd-1=http://etcd-1.etcd-headless.default.svc.cluster.local:2380,etcd-2=http://etcd-2.etcd-headless.default.svc.cluster.local:2380,
ALLOW_NONE_AUTHENTICATION: yes
Mounts:
/bitnami/etcd from data (rw)
/init-snapshot from init-snapshot-volume (rw)
/scripts/prestop-hook.sh from scripts (rw,path="prestop-hook.sh")
/scripts/probes.sh from scripts (rw,path="probes.sh")
/scripts/setup.sh from scripts (rw,path="setup.sh")
/snapshots from snapshot-volume (rw)
/var/run/secrets/kubernetes.io/serviceaccount from default-token-pdqcx (ro)
Conditions:
Type Status
Initialized True
Ready False
ContainersReady False
PodScheduled True
Volumes:
data:
Type: PersistentVolumeClaim (a reference to a PersistentVolumeClaim in the same namespace)
ClaimName: data-etcd-0
ReadOnly: false
scripts:
Type: ConfigMap (a volume populated by a ConfigMap)
Name: etcd-scripts
Optional: false
init-snapshot-volume:
Type: PersistentVolumeClaim (a reference to a PersistentVolumeClaim in the same namespace)
ClaimName: snapshots
ReadOnly: false
snapshot-volume:
Type: PersistentVolumeClaim (a reference to a PersistentVolumeClaim in the same namespace)
ClaimName: etcd-snapshotter
ReadOnly: false
default-token-pdqcx:
Type: Secret (a volume populated by a Secret)
SecretName: default-token-pdqcx
Optional: false
QoS Class: BestEffort
Node-Selectors: <none>
Tolerations: node.kubernetes.io/not-ready:NoExecute for 300s
node.kubernetes.io/unreachable:NoExecute for 300s
Events:
Type Reason Age From Message
---- ------ ---- ---- -------
Normal Scheduled <unknown> default-scheduler Successfully assigned default/etcd-0 to perftest-w6
Normal SuccessfulAttachVolume 79s attachdetach-controller AttachVolume.Attach succeeded for volume "pvc-47558fe6-7104-4b0d-a7d1-6e0a9218dc78"
Normal SuccessfulAttachVolume 79s attachdetach-controller AttachVolume.Attach succeeded for volume "pvc-26bdb3aa-fde0-4d61-9359-163fab6e5cf8"
Normal Pulled 17s (x4 over 70s) kubelet, perftest-w6 Container image "docker.io/bitnami/etcd:3.4.9-debian-10-r46" already present on machine
Normal Created 17s (x4 over 70s) kubelet, perftest-w6 Created container etcd
Normal Started 16s (x4 over 69s) kubelet, perftest-w6 Started container etcd
Warning BackOff 1s (x8 over 68s) kubelet, perftest-w6 Back-off restarting failed container
Pod Log:
2020-07-23 11:48:44.317120 I | rafthttp: stopped streaming with peer 45a18acb10aa275e (writer)
2020-07-23 11:48:44.317317 I | rafthttp: stopped streaming with peer 45a18acb10aa275e (writer)
2020-07-23 11:48:44.317364 I | rafthttp: stopped HTTP pipelining with peer 45a18acb10aa275e
2020-07-23 11:48:44.317489 I | rafthttp: stopped streaming with peer 45a18acb10aa275e (stream MsgApp v2 reader)
2020-07-23 11:48:44.317501 I | rafthttp: stopped streaming with peer 45a18acb10aa275e (stream Message reader)
2020-07-23 11:48:44.317505 I | rafthttp: stopped peer 45a18acb10aa275e
2020-07-23 11:48:44.317509 I | rafthttp: stopping peer 936ce633ac273d75...
2020-07-23 11:48:44.317515 I | rafthttp: stopped streaming with peer 936ce633ac273d75 (writer)
2020-07-23 11:48:44.317521 I | rafthttp: stopped streaming with peer 936ce633ac273d75 (writer)
2020-07-23 11:48:44.317534 I | rafthttp: stopped HTTP pipelining with peer 936ce633ac273d75
2020-07-23 11:48:44.317544 I | rafthttp: stopped streaming with peer 936ce633ac273d75 (stream MsgApp v2 reader)
2020-07-23 11:48:44.317551 I | rafthttp: stopped streaming with peer 936ce633ac273d75 (stream Message reader)
2020-07-23 11:48:44.317555 I | rafthttp: stopped peer 936ce633ac273d75
@dk-do if your cluster is not new, the cluster state variable should be existing
@Alexc0007 I don't see any issue with your config :-/
I also don't see any issues with my config; however, I have exactly the same issue as @dk-do. I also tried older versions, and nothing works... (I tried 4.8.9 and 4.8.7)
btw, I've implemented a more efficient way for changing the cluster state variable on upgrade:
{{- if .Release.IsInstall }}
- name: ETCD_INITIAL_CLUSTER_STATE
  value: new
{{- else }}
- name: ETCD_INITIAL_CLUSTER_STATE
  value: existing
{{- end }}
In this case, you don't need to specify it in values.
@sfesfizh
btw, I've implemented a more efficient way for changing the cluster state variable on upgrade:
{{- if .Release.IsInstall }}
- name: ETCD_INITIAL_CLUSTER_STATE
  value: new
{{- else }}
- name: ETCD_INITIAL_CLUSTER_STATE
  value: existing
{{- end }}
In this case, you don't need to specify it in values.
I don't know if we should open a PR/new issue for that (since it's a little off-topic for the problem here). Anyway, I was also thinking that this would be a better way to handle the initial cluster state. However, I wonder if it would be more robust to handle the ETCD_INITIAL_CLUSTER_STATE logic in the entrypoint, in case one or more etcd nodes restart before a first upgrade is performed. Otherwise, the restarted node will think that the cluster is new whereas it is in fact existing.
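As a rough illustration, such entrypoint logic could derive the state from the presence of a previous data directory rather than from a chart value (a sketch under that assumption, not the chart's actual setup.sh):
# Sketch: decide the initial cluster state from the persistent data directory.
if [[ -d "${ETCD_DATA_DIR}/member" ]]; then
    # Data from a previous deployment exists, so rejoin rather than bootstrap.
    export ETCD_INITIAL_CLUSTER_STATE="existing"
else
    export ETCD_INITIAL_CLUSTER_STATE="new"
fi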
@rgarcia89
What did I do? Currently we are investigating a disaster recovery / worst-case scenario.
So we deleted the old deployment:
helm delete etcd --purge
And created a new cluster with these settings in values.yaml:
startFromSnapshot:
  enabled: true
  ## Existing PVC containing the etcd snapshot
  ##
  existingClaim: snapshots
  ## Snapshot filename
  ##
  snapshotFilename: db
Three pods came up successfully and all components worked fine with etcd. Then, to check whether this issue was solved in chart version 4.8.10, we deleted pod etcd-0. It came up again but was in CrashLoopBackOff and never started. We only have these log entries:
raft2020/07/23 12:50:34 INFO: df3b2df95cd5fd29 switched to configuration voters=(5017444064630024030 10623118730666130805 16085501043111492905)
2020-07-23 12:50:34.291575 I | etcdserver/membership: added member df3b2df95cd5fd29 [http://etcd-0.etcd-headless.default.svc.cluster.local:2380] to cluster 9e98e654be1e22d7
raft2020/07/23 12:50:34 INFO: raft.node: df3b2df95cd5fd29 elected leader 45a18acb10aa275e at term 23
2020-07-23 12:50:34.293342 E | etcdserver: the member has been permanently removed from the cluster
2020-07-23 12:50:34.293355 I | etcdserver: the data-dir used by this member must be removed.
2020-07-23 12:50:34.293392 E | etcdserver: publish error: etcdserver: request cancelled
2020-07-23 12:50:34.293410 E | etcdserver: publish error: etcdserver: request cancelled
2020-07-23 12:50:34.293420 I | etcdserver: aborting publish because server is stopped
2020-07-23 12:50:34.293479 I | rafthttp: stopping peer 45a18acb10aa275e...
2020-07-23 12:50:34.293501 I | rafthttp: stopped streaming with peer 45a18acb10aa275e (writer)
2020-07-23 12:50:34.293514 I | rafthttp: stopped streaming with peer 45a18acb10aa275e (writer)
2020-07-23 12:50:34.293981 I | rafthttp: stopped HTTP pipelining with peer 45a18acb10aa275e
2020-07-23 12:50:34.294008 I | rafthttp: stopped streaming with peer 45a18acb10aa275e (stream MsgApp v2 reader)
2020-07-23 12:50:34.294018 I | rafthttp: stopped streaming with peer 45a18acb10aa275e (stream Message reader)
2020-07-23 12:50:34.294023 I | rafthttp: stopped peer 45a18acb10aa275e
2020-07-23 12:50:34.294027 I | rafthttp: stopping peer 936ce633ac273d75...
2020-07-23 12:50:34.294038 I | rafthttp: stopped streaming with peer 936ce633ac273d75 (writer)
2020-07-23 12:50:34.294046 I | rafthttp: stopped streaming with peer 936ce633ac273d75 (writer)
2020-07-23 12:50:34.294056 I | rafthttp: stopped HTTP pipelining with peer 936ce633ac273d75
2020-07-23 12:50:34.294072 I | rafthttp: stopped streaming with peer 936ce633ac273d75 (stream MsgApp v2 reader)
2020-07-23 12:50:34.294085 I | rafthttp: stopped streaming with peer 936ce633ac273d75 (stream Message reader)
2020-07-23 12:50:34.294089 I | rafthttp: stopped peer 936ce633ac273d75
2020-07-23 12:50:34.303961 W | rafthttp: failed to process raft message (raft: stopped)
Update: after scaling the statefulset down and up again, everything came up without errors. But the pod still has this setting:
ETCD_INITIAL_CLUSTER_STATE: new
When does it change to EXISTING?
When does it change to EXISTING?
It must be changed manually in the values file, or configured via --set, before the upgrade.
I've provided a small workaround above for how to avoid changing it manually.
@dk-do it is something that you can just place in your values file. Like @sfesfizh said, after the first deployment of the etcd cluster you have to set it to existing. After that, further deployments/upgrades should work without any issues. At least that is how it works for me.
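For illustration, the flow described above amounts to something like this (hypothetical commands, reusing the etcd.initialClusterState key already shown in this thread):
# First install: bootstrap a brand-new cluster
helm install etcd bitnami/etcd -f values.yaml --set etcd.initialClusterState=new
# Every later upgrade, once the cluster already exists
helm upgrade etcd bitnami/etcd -f values.yaml --set etcd.initialClusterState=existing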
There is one more important thing to take into consideration: I am using persistent volumes, which means my etcd data is on an "external" disk. When the pods come up, they read this data and use it to decide whether this is a new or an existing cluster.
If I delete those disks and re-deploy etcd, it creates a new cluster without any issues (with the cluster state variable set to "new", of course), but with existing disks that already contain etcd data, scaling down and up is impossible...
There is one more important thing to take into consideration: I am using persistent volumes, which means my etcd data is on an "external" disk. When the pods come up, they read this data and use it to decide whether this is a new or an existing cluster.
If I delete those disks and re-deploy etcd, it creates a new cluster without any issues (with the cluster state variable set to "new", of course), but with existing disks that already contain etcd data, scaling down and up is impossible...
+1, same issue for me.
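For reference, "deleting those disks" as described above would amount to removing the PVCs before re-deploying (hypothetical commands; the PVC names follow the data-<pod> pattern visible in the describe output earlier in this thread):
# Destroys all etcd data: remove the persistent volume claims so the next
# deployment finds no previous data directory and bootstraps a new cluster.
kubectl delete pvc data-etcd-jenkins-0 data-etcd-jenkins-1 data-etcd-jenkins-2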
Hi everyone,
I found an issue due to the env vars from the existing cluster not being properly set when restarting a container. That was preventing the pod from joining the cluster (even when disaster recovery is not necessary). I just created a PR to address it.
Please feel free to give the solution a try
Hi @juan131, I'll gladly test it out... but I guess there is no chart version with the current changes yet.
@juan131, I applied your changes manually... but it doesn't seem to change anything...
Hi @Alexc0007
You need to clone the repo and apply the changes since there's no version published yet.
I applied your changes manually... but it doesn't seem to change anything...
Did you try the steps I mentioned in the PR's description? The pods should have been able to rejoin the cluster after being restarted.
Hi, I just looked at the commit and made the same changes in my configmap... (I didn't clone the repo) and it didn't help.
@Alexc0007 I really can't follow why Helm chart version 4.8.10 is not working for you. Just out of interest, have you deployed a new cluster, set the env to existing cluster afterwards, applied the changes to the cluster, and then tried to delete a pod to see if it joins again?
I am on this version, and everything works for me.
Hi @rgarcia89, as I explained above, I did install a new cluster (version 4.8.10), then changed the cluster state to existing, which automatically starts a rollout. The first node is usually replaced OK, then the second node won't re-join the cluster... The rollout doesn't reach the third node...
So I guess this is closed based on a fix that only I tried, and it didn't work for me? Has anyone else in this thread tried this fix?
Hi @Alexc0007 @rgarcia89
Could you please give the latest version we just released (4.9.1) a try?
$ helm repo update
...Successfully got an update from the "bitnami" chart repository
Update Complete. ⎈ Happy Helming!⎈
$ helm search repo bitnami/etcd
NAME CHART VERSION APP VERSION DESCRIPTION
bitnami/etcd 4.9.1 3.4.10 etcd is a distributed key value store that prov...
@juan131 it is not working. I just ran a test: installed with 3.4.10-debian-10-r1, then tried to upgrade to version 3.4.10-debian-10-r4. The pod does not come up anymore (see the screenshot). Even setting the cluster state to existing does not fix the issue.
The pod stays in a loop.
Also, in my helm repo I am only seeing version 4.9.0:
[raulgs@raulgs-xm1 etcd]$ helm repo update
Hang tight while we grab the latest from your chart repositories...
...Successfully got an update from the "bitnami" chart repository
...Successfully got an update from the "stable" chart repository
Update Complete. ⎈ Happy Helming!⎈
[raulgs@raulgs-xm1 etcd]$ helm search repo bitnami/etcd
NAME CHART VERSION APP VERSION DESCRIPTION
bitnami/etcd 4.9.0 3.4.10 etcd is a distributed key value store that prov...
I have been trying things out a bit. It seems to be related to the image and Helm chart version.
If I deploy the helm chart with this command
helm upgrade --install --namespace raul --wait -f values/raul.yaml --version 4.8.4 etcd bitnami/etcd
using the following values file: https://pastebin.com/YA3nHcYL
and then afterwards, for example, upgrade to version 3.4.9-debian-10-r54 with cluster state existing,
so using this values file: https://pastebin.com/wvR6hzSu
things are working fine.
However, it is not working with image 3.4.10 and also not with newer helm chart versions...
Hi @rgarcia89
It could be related to these changes:
Possibly. I don't know what commands are run that could be failing because of this change. However, the output of the image looks like this:
[raulgs@raulgs-xm1 ~]$ klogs -f pod/etcd-2
==> Bash debug is off
==> Detected data from previous deployments...
==> Adding new member to existing cluster...
Hi everyone, I can confirm that after switching to image 3.4.9-debian-10-r54 with chart version 4.9.1, everything works well. I created a fresh cluster with the image above:
helm install etcd-jenkins bitnami/etcd -f values-production.yaml --set etcd.initialClusterState=new --version 4.9.1
then upgraded as follows:
helm upgrade etcd-jenkins bitnami/etcd -f values-production.yaml --set etcd.initialClusterState=existing --version 4.9.1
Then a rollout started and completed successfully: all old pods were terminated and replaced by new ones that joined the existing cluster. This is good. Thanks to @rgarcia89!
I'm glad you were able to use the latest version of the chart without issues, @Alexc0007. Did you try the same version of the chart but switching to the latest image 3.4.10-debian-10-r1?
Installing the chart from scratch, I found no issues with the latest image/chart.
Hi, I didn't try the latest image; I'll try it later and report back.
Chart version etcd-6.1.2 with image 3.4.15-debian-10-r14 still has the same problem.
values.yml change: set enabled: false to disable rbac.
Run helm install etcd-test ./etcd; the pod etcd-test-0 is running normally.
Then change values.yml L201 to replicaCount: 7 and run helm upgrade etcd-test ./etcd to increase the cluster size. This step finishes successfully and pods etcd-test-{1..6} are running normally.
But when I decrease replicaCount to 5 and run helm upgrade to apply the change, etcd-test-4 goes into a CrashLoopBackOff state and the other pods are not updated.
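For reference, the rbac change described above corresponds to the auth.rbac.enabled key that also appears later in this thread; a command-line equivalent would look roughly like this (hypothetical invocation against a local chart checkout):
# Install from the local chart directory with RBAC authentication disabled
helm install etcd-test ./etcd --set auth.rbac.enabled=false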
Hi @KagurazakaNyaa
Then change values.yml L201 to replicaCount: 7 and run helm upgrade etcd-test ./etcd to increase the cluster size. This step finishes successfully and pods etcd-test-{1..6} are running normally.
Note that with this new major version, it's not mandatory to scale the solution using helm upgrade ...; you can use kubectl scale ..., which is simpler and faster, see:
But when I decrease replicaCount to 5 and run helm upgrade to apply the change, etcd-test-4 goes into a CrashLoopBackOff state and the other pods are not updated.
Could you share the logs of the etcd-test-4 pod? Also, could you try to decrease using kubectl scale ... and let us know if you find any issue in that case? See:
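For illustration, scaling down directly with kubectl (hypothetical, assuming the statefulset is named etcd-test) would look like this:
# Scale down one replica at a time and wait for each step to settle
kubectl scale --replicas=6 statefulset/etcd-test
kubectl rollout status statefulset/etcd-test
kubectl scale --replicas=5 statefulset/etcd-test
kubectl rollout status statefulset/etcd-test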
Thanks for your reply @juan131
I tried to reproduce my operation, and the etcd-test-4 pod log looks like this:
etcd 07:22:59.66
etcd 07:22:59.66 Welcome to the Bitnami etcd container
etcd 07:22:59.66 Subscribe to project updates by watching https://github.com/bitnami/bitnami-docker-etcd
etcd 07:22:59.66 Submit issues and feature requests at https://github.com/bitnami/bitnami-docker-etcd/issues
etcd 07:22:59.66
etcd 07:22:59.66 INFO ==> ** Starting etcd setup **
etcd 07:22:59.67 INFO ==> Validating settings in ETCD_* env vars..
etcd 07:22:59.67 WARN ==> You set the environment variable ALLOW_NONE_AUTHENTICATION=yes. For safety reasons, do not use this flag in a production environment.
etcd 07:22:59.67 INFO ==> Initializing etcd
etcd 07:22:59.68 INFO ==> Detected data from previous deployments
etcd 07:23:09.75 INFO ==> Updating member in existing cluster
Error: bad member ID arg (strconv.ParseUint: parsing "": invalid syntax), expecting ID in Hex
I will delete this cluster and try using kubectl scale to change the scale.
But the initial purpose of my test is to verify automatic recovery after several nodes of the etcd cluster are dynamically updated or deleted when the Kubernetes cluster is upgraded or migrated.
In the initial test, I used the command kubectl delete pod etcd-test-1, and the same problem occurred. I also tried to delete the corresponding PVC and re-execute the command, but it still had no effect.
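For reference, that delete-and-retry test would look roughly like this (hypothetical commands; the PVC name assumes the chart's default data-<pod> naming seen earlier in this thread):
kubectl delete pod etcd-test-1
# Optionally also drop the member's persistent data; the PVC deletion only
# completes once the pod that mounts it is gone.
kubectl delete pvc data-etcd-test-1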
Hi @KagurazakaNyaa
etcd 07:23:09.75 INFO ==> Updating member in existing cluster
Error: bad member ID arg (strconv.ParseUint: parsing "": invalid syntax), expecting ID in Hex
I wasn't able to reproduce the error above ⏫ . This is what I did:
$ helm install etcd bitnami/etcd --set replicaCount=3
$ kubectl get pods -w
NAME READY STATUS RESTARTS AGE
etcd-0 0/1 Pending 0 0s
etcd-1 0/1 Pending 0 0s
etcd-2 0/1 Pending 0 0s
...
etcd-2 1/1 Running 0 75s
etcd-0 1/1 Running 0 78s
etcd-1 1/1 Running 0 84s
Then I created an etcd-client pod as per the installation notes and accessed it:
$ kubectl run etcd-client --restart='Never' --image docker.io/bitnami/etcd:3.4.15-debian-10-r14 --env ROOT_PASSWORD=$(kubectl get secret --namespace default etcd -o jsonpath="{.data.etcd-root-password}" | base64 --decode) --env ETCDCTL_ENDPOINTS="etcd.default.svc.cluster.local:2379" --namespace default --command -- sleep infinity
$ kubectl exec -it etcd-client -- bash
$ etcdctl member list
45a18acb10aa275e, started, etcd-2, http://etcd-2.etcd-headless.default.svc.cluster.local:2380, http://etcd-2.etcd-headless.default.svc.cluster.local:2379, false
936ce633ac273d75, started, etcd-1, http://etcd-1.etcd-headless.default.svc.cluster.local:2380, http://etcd-1.etcd-headless.default.svc.cluster.local:2379, false
df3b2df95cd5fd29, started, etcd-0, http://etcd-0.etcd-headless.default.svc.cluster.local:2380, http://etcd-0.etcd-headless.default.svc.cluster.local:2379, false
current_replicas=3
desired_replicas=7
while [[ current_replicas -lt desired_replicas ]]; do
kubectl scale --replicas=$((current_replicas + 1)) statefulset/etcd
kubectl rollout status statefulset/etcd
current_replicas=$((current_replicas + 1))
done
$ kubectl get pods -w
NAME READY STATUS RESTARTS AGE
etcd-0 1/1 Running 0 11m
etcd-1 1/1 Running 0 11m
etcd-2 1/1 Running 0 11m
etcd-3 0/1 Pending 0 2s
...
etcd-6 0/1 Running 0 14s
etcd-6 1/1 Running 0 76s
$ kubectl exec -it etcd-client -- etcdctl member list
2bc27f0bc39445f7, started, etcd-4, http://etcd-4.etcd-headless.default.svc.cluster.local:2380, http://etcd-4.etcd-headless.default.svc.cluster.local:2379, false
37d45ca3d0f2410f, started, etcd-3, http://etcd-3.etcd-headless.default.svc.cluster.local:2380, http://etcd-3.etcd-headless.default.svc.cluster.local:2379, false
45a18acb10aa275e, started, etcd-2, http://etcd-2.etcd-headless.default.svc.cluster.local:2380, http://etcd-2.etcd-headless.default.svc.cluster.local:2379, false
936ce633ac273d75, started, etcd-1, http://etcd-1.etcd-headless.default.svc.cluster.local:2380, http://etcd-1.etcd-headless.default.svc.cluster.local:2379, false
d0dfbfbac07bfe6b, started, etcd-5, http://etcd-5.etcd-headless.default.svc.cluster.local:2380, http://etcd-5.etcd-headless.default.svc.cluster.local:2379, false
df3b2df95cd5fd29, started, etcd-0, http://etcd-0.etcd-headless.default.svc.cluster.local:2380, http://etcd-0.etcd-headless.default.svc.cluster.local:2379, false
ebbcdfb56ab0db90, started, etcd-6, http://etcd-6.etcd-headless.default.svc.cluster.local:2380, http://etcd-6.etcd-headless.default.svc.cluster.local:2379, false
Then, delete one of the pods (e.g. etcd-2) and check with the etcd-client what happens. In theory, it should be:
$ kubectl delete pod etcd-2
$ kubectl exec -it etcd-client -- etcdctl member list
2bc27f0bc39445f7, started, etcd-4, http://etcd-4.etcd-headless.default.svc.cluster.local:2380, http://etcd-4.etcd-headless.default.svc.cluster.local:2379, false
37d45ca3d0f2410f, started, etcd-3, http://etcd-3.etcd-headless.default.svc.cluster.local:2380, http://etcd-3.etcd-headless.default.svc.cluster.local:2379, false
936ce633ac273d75, started, etcd-1, http://etcd-1.etcd-headless.default.svc.cluster.local:2380, http://etcd-1.etcd-headless.default.svc.cluster.local:2379, false
d0dfbfbac07bfe6b, started, etcd-5, http://etcd-5.etcd-headless.default.svc.cluster.local:2380, http://etcd-5.etcd-headless.default.svc.cluster.local:2379, false
df3b2df95cd5fd29, started, etcd-0, http://etcd-0.etcd-headless.default.svc.cluster.local:2380, http://etcd-0.etcd-headless.default.svc.cluster.local:2379, false
ebbcdfb56ab0db90, started, etcd-6, http://etcd-6.etcd-headless.default.svc.cluster.local:2380, http://etcd-6.etcd-headless.default.svc.cluster.local:2379, false
$ kubectl get pod etcd-2 -w
NAME READY STATUS RESTARTS AGE
etcd-2 0/1 ContainerCreating 0 9s
etcd-2 0/1 Running 0 10s
etcd-2 1/1 Running 0 74s
$ kubectl exec -it etcd-client -- etcdctl member list
2bc27f0bc39445f7, started, etcd-4, http://etcd-4.etcd-headless.default.svc.cluster.local:2380, http://etcd-4.etcd-headless.default.svc.cluster.local:2379, false
37d45ca3d0f2410f, started, etcd-3, http://etcd-3.etcd-headless.default.svc.cluster.local:2380, http://etcd-3.etcd-headless.default.svc.cluster.local:2379, false
38c9726f082cf87d, started, etcd-2, http://etcd-2.etcd-headless.default.svc.cluster.local:2380, http://etcd-2.etcd-headless.default.svc.cluster.local:2379, false
936ce633ac273d75, started, etcd-1, http://etcd-1.etcd-headless.default.svc.cluster.local:2380, http://etcd-1.etcd-headless.default.svc.cluster.local:2379, false
d0dfbfbac07bfe6b, started, etcd-5, http://etcd-5.etcd-headless.default.svc.cluster.local:2380, http://etcd-5.etcd-headless.default.svc.cluster.local:2379, false
df3b2df95cd5fd29, started, etcd-0, http://etcd-0.etcd-headless.default.svc.cluster.local:2380, http://etcd-0.etcd-headless.default.svc.cluster.local:2379, false
ebbcdfb56ab0db90, started, etcd-6, http://etcd-6.etcd-headless.default.svc.cluster.local:2380, http://etcd-6.etcd-headless.default.svc.cluster.local:2379, false
current_replicas=7
desired_replicas=5
while [[ current_replicas -gt desired_replicas ]]; do
kubectl scale --replicas=$((current_replicas - 1)) statefulset/etcd
kubectl rollout status statefulset/etcd
current_replicas=$((current_replicas - 1))
done
$ kubectl get pods -w
NAME READY STATUS RESTARTS AGE
etcd-0 1/1 Running 0 23m
etcd-1 1/1 Running 0 23m
etcd-2 1/1 Running 0 3m6s
etcd-3 1/1 Running 0 11m
etcd-4 1/1 Running 0 10m
etcd-6 0/1 Terminating 0 9m16s
...
etcd-5 0/1 Terminating 0 9m23s
$ kubectl exec -it etcd-client -- etcdctl member list
2bc27f0bc39445f7, started, etcd-4, http://etcd-4.etcd-headless.default.svc.cluster.local:2380, http://etcd-4.etcd-headless.default.svc.cluster.local:2379, false
37d45ca3d0f2410f, started, etcd-3, http://etcd-3.etcd-headless.default.svc.cluster.local:2380, http://etcd-3.etcd-headless.default.svc.cluster.local:2379, false
38c9726f082cf87d, started, etcd-2, http://etcd-2.etcd-headless.default.svc.cluster.local:2380, http://etcd-2.etcd-headless.default.svc.cluster.local:2379, false
936ce633ac273d75, started, etcd-1, http://etcd-1.etcd-headless.default.svc.cluster.local:2380, http://etcd-1.etcd-headless.default.svc.cluster.local:2379, false
df3b2df95cd5fd29, started, etcd-0, http://etcd-0.etcd-headless.default.svc.cluster.local:2380, http://etcd-0.etcd-headless.default.svc.cluster.local:2379, false
Hi @juan131, I think I found the reason. After disabling rbac, any re-created pods cannot join the cluster.
Hi @KagurazakaNyaa
I wasn't able to reproduce that either...
$ helm install etcd bitnami/etcd --set replicaCount=3 --set auth.rbac.enabled=false
$ kubectl get pods -w
NAME READY STATUS RESTARTS AGE
etcd-0 0/1 Pending 0 0s
etcd-1 0/1 Pending 0 0s
etcd-2 0/1 Pending 0 0s
...
etcd-2 1/1 Running 0 75s
etcd-0 1/1 Running 0 78s
etcd-1 1/1 Running 0 84s
$ kubectl delete pod etcd-1
$ kubectl get pods -w
etcd-1 1/1 Terminating 0 2m14s
etcd-1 0/1 Terminating 0 2m20s
etcd-1 0/1 Terminating 0 2m25s
etcd-1 0/1 Terminating 0 2m25s
etcd-1 0/1 Pending 0 0s
etcd-1 0/1 Pending 0 0s
etcd-1 0/1 ContainerCreating 0 0s
etcd-1 0/1 Running 0 6s
etcd-1 1/1 Running 0 74s
$ kubectl logs etcd-1
...
etcd 14:39:18.92 INFO ==> ** Starting etcd setup **
etcd 14:39:18.93 INFO ==> Validating settings in ETCD_* env vars..
etcd 14:39:18.93 WARN ==> You set the environment variable ALLOW_NONE_AUTHENTICATION=yes. For safety reasons, do not use this flag in a production environment.
etcd 14:39:18.94 INFO ==> Initializing etcd
etcd 14:39:18.95 INFO ==> Detected data from previous deployments
etcd 14:39:29.25 INFO ==> Adding new member to existing cluster
etcd 14:39:34.50 INFO ==> ** etcd setup finished! **
etcd 14:39:34.52 INFO ==> ** Starting etcd **
...
$ kubectl exec -it etcd-client -- etcdctl member list
211c7cde9e20cbd9, started, etcd-1, http://etcd-1.etcd-headless.default.svc.cluster.local:2380, http://etcd-1.etcd-headless.default.svc.cluster.local:2379, false
45a18acb10aa275e, started, etcd-2, http://etcd-2.etcd-headless.default.svc.cluster.local:2380, http://etcd-2.etcd-headless.default.svc.cluster.local:2379, false
df3b2df95cd5fd29, started, etcd-0, http://etcd-0.etcd-headless.default.svc.cluster.local:2380, http://etcd-0.etcd-headless.default.svc.cluster.local:2379, false
The 1st pod (etcd-prod-0) has been evicted, but it has not come back so far.
k logs -f etcd-prod-0
etcd 08:56:58.90
etcd 08:56:58.91 Welcome to the Bitnami etcd container
etcd 08:56:58.91 Subscribe to project updates by watching https://github.com/bitnami/bitnami-docker-etcd
etcd 08:56:58.92 Submit issues and feature requests at https://github.com/bitnami/bitnami-docker-etcd/issues
etcd 08:56:58.92
etcd 08:56:58.92 INFO ==> ** Starting etcd setup **
etcd 08:56:58.95 INFO ==> Validating settings in ETCD_* env vars..
etcd 08:56:58.96 WARN ==> You set the environment variable ALLOW_NONE_AUTHENTICATION=yes. For safety reasons, do not use this flag in a production environment.
etcd 08:56:58.97 INFO ==> Initializing etcd
etcd 08:56:59.00 INFO ==> Detected data from previous deployments
etcd 08:56:59.29 INFO ==> Updating member in existing cluster
Error: bad member ID arg (strconv.ParseUint: parsing "": invalid syntax), expecting ID in Hex
Does this mean data corruption of the 1st member? If so, how do I recover?
Note that I enabled disasterRecovery in my Helm releases:
disasterRecovery:
  enabled: true
  cronjob:
    schedule: "*/30 * * * *"
    historyLimit: 1
    ## @param disasterRecovery.cronjob.snapshotHistoryLimit Number of etcd snapshots to retain, tagged by date
    ##
    snapshotHistoryLimit: 3
    resources:
      limits:
        cpu: 500m
        memory: 1Gi
I fixed the issue (above) by triggering a rolling update:
kubectl rollout restart statefulset/etcd-prod
Which chart: etcd-4.8.14
Describe the bug: Issuing kubectl rollout restart on a 3-node etcd statefulset results in the last node going into CrashLoopBackOff.
To Reproduce (steps to reproduce the behavior):
1. Deploy the chart with etcd.initialClusterState=existing.
2. Issue kubectl rollout restart on the statefulset.
Expected behavior: Each etcd node should smoothly restart.
Version of Helm and Kubernetes:
helm version:
kubectl version:
Additional context: I experienced the bug the other day and created an issue (that I closed since I was not able to reproduce): https://github.com/bitnami/charts/issues/3158