Closed by ckotzbauer 4 years ago
So I suspect this behaviour is related to the change to the `readinessProbe` as part of #586.
Would you be able to provide the output of `kubectl get events` and `kubectl describe <elasticsearch master pod>`?
Also, as a general point of interest: if you're pinning to a custom 6.4.2 image, what do you aim to accomplish by bumping the chart version? The only real reason to bump the chart version is to pick up changes related to new versions and bug fixes.
Yes, I also think that #586 is the reason for this.
The events are not visible anymore and I can't restart the cluster right now. If you really need them, I can do it this evening...
I did not explicitly upgrade the chart because there was an update. Rather, I made changes to the chart settings yesterday and thereby implicitly upgraded the chart to the newest version, as I don't pin the chart version unless it's really needed.
> The events are not visible anymore and I can't restart the cluster right now. If you really need them, I can do it this evening...
Ok, I'll see if I can replicate locally, and I'll come back if I need the events...
> But I made changes to the chart settings yesterday and thereby implicitly upgraded the chart to the newest version, as I don't pin the chart version unless it's really needed.
Ah, ok... So for now, I'd suggest pinning the chart version to a previous version so that any future config updates don't result in a full restart.
Yes, that's the plan. If you need more info, I will try to get it 😉
@ckotzbauer OK, I've done a bit more testing locally, and I don't think the change in `readinessProbe` is necessarily the issue, as I was able to reproduce the loss of availability on the `elasticsearch-master` service just by deleting the current master pod.
That got me looking through the config and history, and I came across this issue: https://github.com/elastic/helm-charts/issues/63.
By setting `masterTerminationFix: true`, I didn't observe any service disruption when either deleting the master pod or doing a `helm upgrade` to a new chart version.
Do you want to try applying that config item?
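For reference, a minimal sketch of that values change (the key name is taken from the discussion above; it enables the master-termination workaround from elastic/helm-charts#63):

```
# values.yaml fragment (illustrative)
masterTerminationFix: true
```

This would then be applied with an ordinary `helm upgrade` against your release, passing the updated values file.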
Hm, okay, that's an interesting point. I know this behaviour too, where the election takes too long and the cluster is down if only the master is deleted... I can try this.
But I (personally) don't think this causes the problem of K8s doing the rollout too fast. The new pods are marked as ready, so K8s thinks it can go on. The intention of the `readinessProbe` implementation in the StatefulSet is to prevent K8s from doing this, by delaying the ready state until the cluster has recovered. And that seems to be the part that no longer works. Or did I miss something here...? :thinking:
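The contract described above can be sketched roughly like this (a minimal sketch with assumed function names and JSON shape; the chart's real probe script is more involved and also handles the initial-startup case):

```shell
#!/bin/bash
# Sketch of the readiness gate's intent: report "not ready" (non-zero
# exit in the real probe) until cluster health is green, so the kubelet
# keeps the pod out of the Service and the StatefulSet rollout pauses.

is_green() {
  # $1: JSON body of GET /_cluster/health
  case "$1" in
    *'"status":"green"'*) return 0 ;;
    *) return 1 ;;
  esac
}

# In a real probe this body would come from something like:
#   curl -s "http://localhost:9200/_cluster/health"
# Here we use a canned response to keep the sketch self-contained.
health='{"cluster_name":"es","status":"yellow","relocating_shards":2}'

if is_green "$health"; then
  echo "ready"
else
  echo "not ready"
fi
```

If this gate is broken, pods report ready while shards are still recovering, which matches the "too fast" rollout observed here.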
@ckotzbauer So a quick update from me.
We ran some more tests internally, and were able to reproduce the "too fast" rollout on an internal 7.7.0 cluster which has a non-zero amount of data.
After a bit of digging, it looks like a set of quotes was missed on the initial `curl` command that included the `&`, which was effectively terminating the command early...
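A minimal sketch of that class of bug, using a stand-in function rather than the chart's actual probe script (the URL and parameter names here are illustrative):

```shell
#!/bin/bash
# In bash, an unquoted '&' is the background control operator, so an
# unquoted URL containing '&' is split at that point.

fetch() {
  # Stand-in for curl that just prints the URL it actually received.
  echo "$1"
}

# Broken: the command line is cut at '&'. fetch runs in the background
# with a truncated URL, and 'timeout=50s' is parsed as a separate,
# no-op variable assignment.
fetch http://localhost:9200/_cluster/health?wait_for_status=green&timeout=50s
wait  # let the backgrounded fetch finish printing

# Fixed: quote the URL so '&' stays part of the string.
fetch "http://localhost:9200/_cluster/health?wait_for_status=green&timeout=50s"
```

So the health check never actually waited on `wait_for_status=green`, which would explain the probe passing early.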
I've opened #638 which re-works that behaviour, and I'm running some more tests internally...
We'll probably be looking to do a patch release for this pretty quickly.
However, my recommendation to you, as you're running a 6.4.2 image, would be to pin to the 6.8.x version of the chart.
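If it helps, the pin can be expressed at upgrade time (release and repo names here are illustrative, and you'd pick whichever 6.8.x patch release is current):

```
# Pin the chart version explicitly, so routine value changes don't
# implicitly pull in a newer chart release.
helm upgrade elasticsearch elastic/elasticsearch --version 6.8.1 -f values.yaml
```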
Thank you so much for digging in, @fatmcgav. I really appreciate it!
This sounds promising. I will pin my chart to the 6.8.x version and wait for the patch release.
Thanks again for your work!
@ckotzbauer The fix in #638 has been merged, and back-ported to the 6.8 and 7.7 branches for inclusion on the next minor release.
Chart version: 7.7.0
Kubernetes version: 1.17.1
Kubernetes provider: On-prem
Helm Version: 3.2.0
Output of `helm get release`:
```
USER VALUES
esConfig:
  elasticsearch.yml: |
    action.auto_create_index: "-hd-*"
esJavaOpts: -Xmx5g -Xms5g
esMajorVersion: 6
image: "private-image-based-on-official-oss:6.4.2"
imagePullPolicy: IfNotPresent
imagePullSecrets:
  - name: some-credentials
imageTag: 6.4.2
ingress:
  annotations:
    ingress.kubernetes.io/ssl-redirect: "true"
    kubernetes.io/ingress.class: nginx
    nginx.ingress.kubernetes.io/auth-realm: Authentication Required
    nginx.ingress.kubernetes.io/auth-secret: dev-elasticsearch-ingress-auth
    nginx.ingress.kubernetes.io/auth-type: basic
    nginx.ingress.kubernetes.io/proxy-body-size: 60m
  enabled: true
  hosts:
    - ""
  path: /
  tls:
    - hosts:
        - ""
lifecycle:
  postStart:
    exec:
      command:
        - bash
        - -c
        - |
          #!/bin/bash
          cd /usr/share/elasticsearch/plugins/xxx
          /opt/jdk-10.0.2/bin/jar -cf config.jar config.cfg
          chmod 777 config.jar
persistence:
  enabled: true
podSecurityPolicy:
  create: false
rbac:
  create: true
resources:
  limits:
    cpu: 2000m
    memory: 8Gi
  requests:
    cpu: 200m
    memory: 8Gi
sysctlInitContainer:
  enabled: false
volumeClaimTemplate:
  accessModes:
    - ReadWriteOnce
  resources:
    requests:
      storage: 100Gi
  storageClassName: rook-ceph-cephfs
```
Describe the bug: I updated the chart to the newest version 7.7.0 and expected the three Elasticsearch nodes to be updated one after another, waiting until the cluster is green again (in the past, the most recently restarted pod was not ready until the cluster was green again). Now the pods became ready after a few minutes and Kubernetes moved on too quickly, so the cluster was red and down.
Steps to reproduce:
Expected behavior: The readiness probe works as expected and marks the pod as not ready until the cluster is green again.
Any additional context: I updated my release often in the past without such problems, but it happened today with the new version.