elastic / helm-charts

You know, for Kubernetes
Apache License 2.0

elastic/helm-charts/elasticsearch: Readiness probe failed: Waiting for elasticsearch cluster to become ready #783

Closed melissajenner22 closed 3 years ago

melissajenner22 commented 4 years ago

Chart version: 7.7.1
Kubernetes version: 1.16
Kubernetes provider: EKS
Helm Version: 2.16.10

helm get release output

e.g. helm get elasticsearch (replace elasticsearch with the name of your helm release)

Be careful to obfuscate any secrets (credentials, tokens, public IPs, ...) that could be visible in the output before copy-pasting.

If you find any secrets in plain text in the helm get release output, you should use Kubernetes Secrets to manage them in a secure way (see the Security Example).
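
For reference, a minimal sketch along the lines of the chart's security example (the secret name elastic-credentials is arbitrary): create a Secret and reference it from extraEnvs rather than putting credentials in values.yaml.

```
kubectl create secret generic elastic-credentials \
  --from-literal=username=elastic \
  --from-literal=password='<redacted>'
```

```yaml
extraEnvs:
  - name: ELASTIC_USERNAME
    valueFrom:
      secretKeyRef:
        name: elastic-credentials
        key: username
  - name: ELASTIC_PASSWORD
    valueFrom:
      secretKeyRef:
        name: elastic-credentials
        key: password
```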

Output of helm get release:

```
$ helm get elasticsearch
REVISION: 1
RELEASED: Fri Aug 14 14:15:21 2020
CHART: elasticsearch-7.7.1
USER-SUPPLIED VALUES:
{}

COMPUTED VALUES:
antiAffinity: hard
antiAffinityTopologyKey: kubernetes.io/hostname
clusterHealthCheckParams: wait_for_status=green&timeout=1s
clusterName: elasticsearch
envFrom: []
esConfig: {}
esJavaOpts: -Xmx1g -Xms1g
esMajorVersion: ""
extraContainers: []
extraEnvs: []
extraInitContainers: []
extraVolumeMounts: []
extraVolumes: []
fsGroup: ""
fullnameOverride: ""
httpPort: 9200
image: docker.elastic.co/elasticsearch/elasticsearch
imagePullPolicy: IfNotPresent
imagePullSecrets: []
imageTag: 7.7.1
ingress:
  annotations: {}
  enabled: false
  hosts:
  - chart-example.local
  path: /
  tls: []
initResources: {}
keystore: []
labels: {}
lifecycle: {}
masterService: ""
masterTerminationFix: false
maxUnavailable: 1
minimumMasterNodes: 1
nameOverride: ""
networkHost: 0.0.0.0
nodeAffinity: {}
nodeGroup: master
nodeSelector: {}
persistence:
  annotations: {}
  enabled: true
podAnnotations: {}
podManagementPolicy: Parallel
podSecurityContext:
  fsGroup: 1000
  runAsUser: 1000
podSecurityPolicy:
  create: false
  name: ""
  spec:
    fsGroup:
      rule: RunAsAny
    privileged: true
    runAsUser:
      rule: RunAsAny
    seLinux:
      rule: RunAsAny
    supplementalGroups:
      rule: RunAsAny
    volumes:
    - secret
    - configMap
    - persistentVolumeClaim
priorityClassName: ""
protocol: http
rbac:
  create: false
  serviceAccountName: ""
readinessProbe:
  failureThreshold: 3
  initialDelaySeconds: 10
  periodSeconds: 10
  successThreshold: 3
  timeoutSeconds: 5
replicas: 1
resources:
  limits:
    cpu: 1000m
    memory: 2Gi
  requests:
    cpu: 1000m
    memory: 2Gi
roles:
  data: "true"
  ingest: "true"
  master: "true"
schedulerName: ""
secretMounts: []
securityContext:
  capabilities:
    drop:
    - ALL
  runAsNonRoot: true
  runAsUser: 1000
service:
  annotations: {}
  httpPortName: http
  labels: {}
  labelsHeadless: {}
  loadBalancerIP: ""
  loadBalancerSourceRanges: []
  nodePort: ""
  transportPortName: transport
  type: ClusterIP
sidecarResources: {}
sysctlInitContainer:
  enabled: true
sysctlVmMaxMapCount: 262144
terminationGracePeriod: 120
tolerations: []
transportPort: 9300
updateStrategy: RollingUpdate
volumeClaimTemplate:
  accessModes:
  - ReadWriteOnce
  resources:
    requests:
      storage: 30Gi

HOOKS:
---
# elasticsearch-lgztd-test
apiVersion: v1
kind: Pod
metadata:
  name: "elasticsearch-lgztd-test"
  annotations:
    "helm.sh/hook": test-success
spec:
  securityContext:
    fsGroup: 1000
    runAsUser: 1000
  containers:
  - name: "elasticsearch-ctvif-test"
    image: "docker.elastic.co/elasticsearch/elasticsearch:7.7.1"
    command:
    - "sh"
    - "-c"
    - |
      #!/usr/bin/env bash -e
      curl -XGET --fail 'elasticsearch-master:9200/_cluster/health?wait_for_status=green&timeout=1s'
  restartPolicy: Never

MANIFEST:
---
# Source: elasticsearch/templates/poddisruptionbudget.yaml
apiVersion: policy/v1beta1
kind: PodDisruptionBudget
metadata:
  name: "elasticsearch-master-pdb"
spec:
  maxUnavailable: 1
  selector:
    matchLabels:
      app: "elasticsearch-master"
---
# Source: elasticsearch/templates/service.yaml
kind: Service
apiVersion: v1
metadata:
  name: elasticsearch-master
  labels:
    heritage: "Tiller"
    release: "elasticsearch"
    chart: "elasticsearch"
    app: "elasticsearch-master"
  annotations:
    {}
spec:
  type: ClusterIP
  selector:
    heritage: "Tiller"
    release: "elasticsearch"
    chart: "elasticsearch"
    app: "elasticsearch-master"
  ports:
  - name: http
    protocol: TCP
    port: 9200
  - name: transport
    protocol: TCP
    port: 9300
---
# Source: elasticsearch/templates/service.yaml
kind: Service
apiVersion: v1
metadata:
  name: elasticsearch-master-headless
  labels:
    heritage: "Tiller"
    release: "elasticsearch"
    chart: "elasticsearch"
    app: "elasticsearch-master"
  annotations:
    service.alpha.kubernetes.io/tolerate-unready-endpoints: "true"
spec:
  clusterIP: None # This is needed for statefulset hostnames like elasticsearch-0 to resolve
  # Create endpoints also if the related pod isn't ready
  publishNotReadyAddresses: true
  selector:
    app: "elasticsearch-master"
  ports:
  - name: http
    port: 9200
  - name: transport
    port: 9300
---
# Source: elasticsearch/templates/statefulset.yaml
apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: elasticsearch-master
  labels:
    heritage: "Tiller"
    release: "elasticsearch"
    chart: "elasticsearch"
    app: "elasticsearch-master"
  annotations:
    esMajorVersion: "7"
spec:
  serviceName: elasticsearch-master-headless
  selector:
    matchLabels:
      app: "elasticsearch-master"
  replicas: 1
  podManagementPolicy: Parallel
  updateStrategy:
    type: RollingUpdate
  volumeClaimTemplates:
  - metadata:
      name: elasticsearch-master
    spec:
      accessModes:
      - ReadWriteOnce
      resources:
        requests:
          storage: 30Gi
  template:
    metadata:
      name: "elasticsearch-master"
      labels:
        heritage: "Tiller"
        release: "elasticsearch"
        chart: "elasticsearch"
        app: "elasticsearch-master"
      annotations:
    spec:
      securityContext:
        fsGroup: 1000
        runAsUser: 1000
      affinity:
        podAntiAffinity:
          requiredDuringSchedulingIgnoredDuringExecution:
          - labelSelector:
              matchExpressions:
              - key: app
                operator: In
                values:
                - "elasticsearch-master"
            topologyKey: kubernetes.io/hostname
      terminationGracePeriodSeconds: 120
      volumes:
      initContainers:
      - name: configure-sysctl
        securityContext:
          runAsUser: 0
          privileged: true
        image: "docker.elastic.co/elasticsearch/elasticsearch:7.7.1"
        imagePullPolicy: "IfNotPresent"
        command: ["sysctl", "-w", "vm.max_map_count=262144"]
        resources:
          {}
      containers:
      - name: "elasticsearch"
        securityContext:
          capabilities:
            drop:
            - ALL
          runAsNonRoot: true
          runAsUser: 1000
        image: "docker.elastic.co/elasticsearch/elasticsearch:7.7.1"
        imagePullPolicy: "IfNotPresent"
        readinessProbe:
          exec:
            command:
            - sh
            - -c
            - |
              #!/usr/bin/env bash -e
              # If the node is starting up wait for the cluster to be ready (request params: "wait_for_status=green&timeout=1s" )
              # Once it has started only check that the node itself is responding
              START_FILE=/tmp/.es_start_file

              http () {
                local path="${1}"
                local args="${2}"
                set -- -XGET -s

                if [ "$args" != "" ]; then
                  set -- "$@" $args
                fi

                if [ -n "${ELASTIC_USERNAME}" ] && [ -n "${ELASTIC_PASSWORD}" ]; then
                  set -- "$@" -u "${ELASTIC_USERNAME}:${ELASTIC_PASSWORD}"
                fi

                curl --output /dev/null -k "$@" "http://127.0.0.1:9200${path}"
              }

              if [ -f "${START_FILE}" ]; then
                echo 'Elasticsearch is already running, lets check the node is healthy'
                HTTP_CODE=$(http "/" "-w %{http_code}")
                RC=$?
                if [[ ${RC} -ne 0 ]]; then
                  echo "curl --output /dev/null -k -XGET -s -w '%{http_code}' \${BASIC_AUTH} http://127.0.0.1:9200/ failed with RC ${RC}"
                  exit ${RC}
                fi
                # ready if HTTP code 200, 503 is tolerable if ES version is 6.x
                if [[ ${HTTP_CODE} == "200" ]]; then
                  exit 0
                elif [[ ${HTTP_CODE} == "503" && "7" == "6" ]]; then
                  exit 0
                else
                  echo "curl --output /dev/null -k -XGET -s -w '%{http_code}' \${BASIC_AUTH} http://127.0.0.1:9200/ failed with HTTP code ${HTTP_CODE}"
                  exit 1
                fi
              else
                echo 'Waiting for elasticsearch cluster to become ready (request params: "wait_for_status=green&timeout=1s" )'
                if http "/_cluster/health?wait_for_status=green&timeout=1s" "--fail" ; then
                  touch ${START_FILE}
                  exit 0
                else
                  echo 'Cluster is not yet ready (request params: "wait_for_status=green&timeout=1s" )'
                  exit 1
                fi
              fi
          failureThreshold: 3
          initialDelaySeconds: 10
          periodSeconds: 10
          successThreshold: 3
          timeoutSeconds: 5
        ports:
        - name: http
          containerPort: 9200
        - name: transport
          containerPort: 9300
        resources:
          limits:
            cpu: 1000m
            memory: 2Gi
          requests:
            cpu: 1000m
            memory: 2Gi
        env:
        - name: node.name
          valueFrom:
            fieldRef:
              fieldPath: metadata.name
        - name: cluster.initial_master_nodes
          value: "elasticsearch-master-0,"
        - name: discovery.seed_hosts
          value: "elasticsearch-master-headless"
        - name: cluster.name
          value: "elasticsearch"
        - name: network.host
          value: "0.0.0.0"
        - name: ES_JAVA_OPTS
          value: "-Xmx1g -Xms1g"
        - name: node.data
          value: "true"
        - name: node.ingest
          value: "true"
        - name: node.master
          value: "true"
        volumeMounts:
        - name: "elasticsearch-master"
          mountPath: /usr/share/elasticsearch/data
```

Describe the bug:

$ kubectl get pods
NAME                     READY   STATUS    RESTARTS   AGE
elasticsearch-master-0   0/1     Running   0          10m

$ kubectl describe pod elasticsearch-master-0
Events:
  Type     Reason                  Age                 From                                                 Message
  ----     ------                  ----                ----                                                 -------
  Normal   Scheduled               11m                 default-scheduler                                    Successfully assigned elk/elasticsearch-master-0 to ip-10-107-1-247.us-west-2.compute.internal
  Normal   SuccessfulAttachVolume  10m                 attachdetach-controller                              AttachVolume.Attach succeeded for volume "pvc-71b0adf3-3731-4891-97ef-83f4a192a929"
  Normal   Pulled                  10m                 kubelet, ip-101-17-11-247.us-west-2.compute.internal  Container image "docker.elastic.co/elasticsearch/elasticsearch:7.7.1" already present on machine
  Normal   Created                 10m                 kubelet, ip-101-17-11-247.us-west-2.compute.internal  Created container configure-sysctl
  Normal   Started                 10m                 kubelet, ip-101-17-11-247.us-west-2.compute.internal  Started container configure-sysctl
  Normal   Pulled                  10m                 kubelet, ip-101-17-11-247.us-west-2.compute.internal  Container image "docker.elastic.co/elasticsearch/elasticsearch:7.7.1" already present on machine
  Normal   Created                 10m                 kubelet, ip-101-17-11-247.us-west-2.compute.internal  Created container elasticsearch
  Normal   Started                 10m                 kubelet, ip-101-17-11-247.us-west-2.compute.internal  Started container elasticsearch
  Warning  Unhealthy               20s (x60 over 10m)  kubelet, ip-101-17-11-247.us-west-2.compute.internal  Readiness probe failed: Waiting for elasticsearch cluster to become ready (request params: "wait_for_status=green&timeout=1s" )
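
The probe message alone doesn't say why the cluster isn't green; checking the Elasticsearch logs and the health API directly gives more detail (a sketch, using the pod name and elk namespace from the events above):

```
kubectl logs elasticsearch-master-0 -n elk | tail -n 50
kubectl exec -n elk elasticsearch-master-0 -- \
  curl -s 'http://127.0.0.1:9200/_cluster/health?pretty'
```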

**Steps to reproduce:**

1. helm install --name elasticsearch ./elasticsearch --namespace elk

**Expected behavior:**

**Provide logs and/or server output (if relevant):**

*Be careful to obfuscate any secrets (credentials, tokens, public IPs, ...) that could be visible in the output before copy-pasting*


**Any additional context:**

melissajenner22 commented 4 years ago

I deleted 7.7.1 and installed 7.8.1. I got the same error.

Error: Readiness probe failed: Waiting for elasticsearch cluster to become ready (request params: "wait_for_status=green&timeout=1s" )

I use the chart at https://github.com/elastic/helm-charts, version 7.8.1, to install Elasticsearch into a Kubernetes cluster. Helm version: 2.16.10. Kubernetes version: 1.16 (EKS).

I use the default settings, except that I changed the number of replicas from 3 to 1 and minimumMasterNodes from 2 to 1 due to resource limitations.

git diff
--- a/elasticsearch/values.yaml
+++ b/elasticsearch/values.yaml

-replicas: 3
-minimumMasterNodes: 2
+replicas: 1
+minimumMasterNodes: 1

helm install --name elasticsearch ./elasticsearch --namespace elk
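
The same overrides can also be passed on the command line instead of editing values.yaml (Helm 2 syntax, matching the install command above):

```
helm install --name elasticsearch ./elasticsearch --namespace elk \
  --set replicas=1 \
  --set minimumMasterNodes=1
```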

$ kubectl get pods
NAME                     READY   STATUS    RESTARTS   AGE
elasticsearch-master-0   0/1     Running   0          8m22s

kubectl describe pod elasticsearch-master-0

Events:
  Type     Reason                  Age               From                                                 Message
  ----     ------                  ----              ----                                                 -------
  Normal   Scheduled               45s               default-scheduler                                    Successfully assigned elk/elasticsearch-master-0 to ip-11-111-1-111.us-west-2.compute.internal
  Normal   SuccessfulAttachVolume  42s               attachdetach-controller                              AttachVolume.Attach succeeded for volume "pvc-71b0adf3-3731-4891-97ef-83f4a192a929"
  Normal   Pulled                  35s               kubelet, ip-11-111-1-111.us-west-2.compute.internal  Container image "docker.elastic.co/elasticsearch/elasticsearch:7.8.1" already present on machine
  Normal   Created                 35s               kubelet, ip-11-111-1-111.us-west-2.compute.internal  Created container configure-sysctl
  Normal   Started                 35s               kubelet, ip-11-111-1-111.us-west-2.compute.internal  Started container configure-sysctl
  Normal   Pulled                  35s               kubelet, ip-11-111-1-111.us-west-2.compute.internal  Container image "docker.elastic.co/elasticsearch/elasticsearch:7.8.1" already present on machine
  Normal   Created                 35s               kubelet, ip-11-111-1-111.us-west-2.compute.internal  Created container elasticsearch
  Normal   Started                 34s               kubelet, ip-11-111-1-111.us-west-2.compute.internal  Started container elasticsearch
  Warning  Unhealthy               5s (x2 over 17s)  kubelet, ip-11-111-1-111.us-west-2.compute.internal  Readiness probe failed: Waiting for elasticsearch cluster to become ready (request params: "wait_for_status=green&timeout=1s" )
Cluster is not yet ready (request params: "wait_for_status=green&timeout=1s" )

vikas4cloud commented 4 years ago

Is this happening because you edited something in the helm-charts/elasticsearch/values.yaml file?

melissajenner22 commented 4 years ago

Instead of using the local source code I cloned from github.com/elastic/helm-charts, I used the commands below to install Elasticsearch:

$ helm repo add elastic https://helm.elastic.co
"elastic" has been added to your repositories

$ kubectl create namespace elk
$ helm install --name elasticsearch --version 7.8.1 elastic/elasticsearch --namespace elk

$ kubectl describe pod elasticsearch-master-0
  Warning  Unhealthy               3m10s (x2 over 3m20s)  kubelet, ip-10-117-56-142.us-west-2.compute.internal  Readiness probe failed: Waiting for elasticsearch cluster to become ready (request params: "wait_for_status=green&timeout=1s" )
Cluster is not yet ready (request params: "wait_for_status=green&timeout=1s" )

jaykishan007 commented 4 years ago

I have encountered the same problem. Is there any fix yet?

iantheparker commented 4 years ago

I'm experiencing the same issue.

This is my config, which overrides some defaults:

replicas: 1

persistence:
  enabled: false

resources:
  requests:
    cpu: "2"
    memory: "1Gi"
  limits:
    cpu: "2"
    memory: "2Gi"

# Openshift overrides
# https://github.com/elastic/helm-charts/tree/master/elasticsearch/examples/openshift
securityContext:
  runAsUser: null

podSecurityContext:
  fsGroup: null
  runAsUser: null

sysctlInitContainer:
  enabled: false

nneul commented 4 years ago

How is the cluster supposed to go green with these charts on initial standup?

The headless service doesn't resolve in DNS until at least one node is up, but the readiness check doesn't report ready until the cluster is green. I didn't try lowering the minimum master nodes, but even with that, it's unclear how the cluster should properly start back up when none of the pods are currently up and running, both during an initial deployment and in the future if, for example, you needed to fully shut down the cluster and bring it back up with the same PVs.

On a handcrafted deployment I have, I wound up disabling the readiness check until the cluster was operational, and then rolling out an update re-enabling it.

Maybe I'm missing something and it's the same thing affecting the deployment in this issue?

If this is completely unrelated to this issue, please disregard; it just seemed like there was a likely overlap.

adinhodovic commented 4 years ago

If you're running a single-replica cluster, add the following Helm value:

clusterHealthCheckParams: "wait_for_status=yellow&timeout=1s"

The cluster status will never go green with a single replica.

The following values should work:

replicas: 1
minimumMasterNodes: 1
clusterHealthCheckParams: 'wait_for_status=yellow&timeout=1s'
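
For example, putting those values in a file and applying them (the file name single-node.yaml is arbitrary):

```yaml
# single-node.yaml
replicas: 1
minimumMasterNodes: 1
clusterHealthCheckParams: "wait_for_status=yellow&timeout=1s"
```

```
helm upgrade --install elasticsearch elastic/elasticsearch --namespace elk -f single-node.yaml
```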

mykhailosmolynets commented 4 years ago

We have faced the same issue when restoring a StatefulSet with the same PVC (we use this for test environments). It works fine if we create the Elasticsearch pod from scratch, but not after the StatefulSet is restored. Along with the error Cluster is not yet ready (request params: "wait_for_status=green&timeout=1s" ), we get the following log message:

{"type": "server", "timestamp": "2020-10-01T14:09:37,285Z", "level": "INFO", "component": "o.e.c.r.a.AllocationService", "cluster.name": "**", "node.name": "**-0", "message": "Cluster health status changed from [RED] to [YELLOW] (reason: [shards started [[**_**][0]]]).", "cluster.uuid": "**", "node.id": "****" }

The issue seems to be not in the Elasticsearch pods but in the index itself. Even if we create the index from the app with the number_of_replicas parameter set to 0, after the Elasticsearch pod is recreated with the same PVC the value is set back to 1. This command helps us bring Elasticsearch back to life:

curl -H 'Content-Type: application/json' -XPUT 'elastic_host:9200/index_name/_settings' -d '{ "index": { "number_of_replicas": 0 } }'

Maybe there is some default value for the number of replicas that gets applied when Elasticsearch restarts?
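
Elasticsearch defaults new indices to number_of_replicas: 1, so if the app recreates the index on startup, one option is an index template that defaults new indices to zero replicas so the setting survives pod recreation (an untested sketch using the legacy template API; replace elastic_host and narrow the "*" pattern as needed):

```
curl -H 'Content-Type: application/json' -XPUT 'http://elastic_host:9200/_template/zero_replicas' \
  -d '{ "index_patterns": ["*"], "settings": { "number_of_replicas": 0 } }'
```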

botelastic[bot] commented 3 years ago

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

obenziane2 commented 3 years ago

I'm having the same issue running it on AWS Fargate. Since we cannot run containers in privileged mode there, I had to disable these settings in values.yml:

sysctlInitContainer:
  enabled: false

But I end up with bootstrap check failures, with max file descriptors and vm.max_map_count being too low:

[2]: max virtual memory areas vm.max_map_count [65530] is too low, increase to at least [262144]

Any suggestions for running it on Fargate?
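
One possible workaround for the vm.max_map_count part (an untested sketch) is to stop Elasticsearch from using mmap entirely via esConfig, which removes the vm.max_map_count bootstrap check, so the sysctl init container isn't needed:

```yaml
sysctlInitContainer:
  enabled: false

esConfig:
  elasticsearch.yml: |
    node.store.allow_mmap: false
```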

sidhuko commented 3 years ago

I was having the same issue. Since our developers are coming from docker-compose, I want to make the transition as smooth as possible, so I enabled the discovery.type=single-node option by adding a singleNode: true property in values.yaml.

https://github.com/elastic/helm-charts/pull/1027

Please test whether this works for you as well and let me know any feedback. I've not written Python in quite some time!

byronmansfield commented 3 years ago

I am also running into the same issue, though in my scenario I am using the multi-node approach from their examples, with 2 replicas and ES version 6.1.4. I tried updating clusterHealthCheckParams: "wait_for_status=[yellow,green]&timeout=200s" to no avail. Nothing else is coming out of the events or logs that raises any concern.

Nimesh36 commented 3 years ago

I was having the same issue. Earlier I had set a password with a length of fewer than 20 characters, but after setting a password with a 20-character length, the pod status turned ready in about 100s.

IsaackRasmussen commented 3 years ago

I can confirm that adding a password completes the initialization. I didn't know of the 20-character limit and only had it by coincidence; good to know.

vitalyrychkov commented 3 years ago

Thank you, this could be a comment for the variable in the values.yaml

jmlrt commented 3 years ago

I could not reproduce the initial issue from @melissajenner22 on an EKS cluster with the same config (except using version 7.13.2 for the chart/Elasticsearch).

jmlrt commented 3 years ago

There seem to be a lot of different issues related to deploying a single-node cluster.

Note that replicas: 1 should be the only value required to deploy a single-node cluster.

Most cases where the cluster status never goes green with replicas: 1 are because a cluster with more nodes was previously deployed using the same PVC.

Indeed, helm delete doesn't delete the PVCs, so if you have deployed a default 3-node cluster, then deleted it and redeployed a single-node cluster with the same name, it will reuse the disk of the first node of the 3-node cluster and will expect 2 other nodes to be using the other disks.
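
In that case the usual fix is to delete the leftover PVCs before reinstalling. A sketch, assuming the default elasticsearch-master naming shown above (double-check the names first; this deletes the old data):

```
kubectl get pvc -n elk
kubectl delete pvc \
  elasticsearch-master-elasticsearch-master-0 \
  elasticsearch-master-elasticsearch-master-1 \
  elasticsearch-master-elasticsearch-master-2 \
  -n elk
```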

jmlrt commented 3 years ago

I'm closing this issue as I can't reproduce the original error and I think that https://github.com/elastic/helm-charts/issues/783#issuecomment-874316380 should address most of the other errors.

If you still have Readiness probe failed: Waiting for elasticsearch cluster to become ready errors with a single-node cluster that are not related to a previously existing PVC, please open a new bug report with all the details of your environment.

chance2021 commented 3 years ago

> If you're running a single-replica cluster, add the following Helm value:
>
> clusterHealthCheckParams: "wait_for_status=yellow&timeout=1s"
>
> The cluster status will never go green with a single replica.
>
> The following values should work:
>
> replicas: 1
> minimumMasterNodes: 1
> clusterHealthCheckParams: 'wait_for_status=yellow&timeout=1s'

This works for me! Thanks! Just make sure to replace the existing clusterHealthCheckParams setting rather than simply pasting the block from the post; otherwise it will be overridden by the original setting in the YAML file.

dev852old commented 3 years ago

Try changing the permission settings on the persistent volume folder to 777. I was able to reproduce this with:

ElasticsearchException[failed to bind service]; nested: AccessDeniedException[/usr/share/elasticsearch/data/nodes];
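
Instead of opening the host folder to 777, the chart's extraInitContainers value can also be used to fix ownership of the data volume. The snippet below is an untested sketch; the volume name elasticsearch-master matches the default volumeClaimTemplate shown earlier in this issue:

```yaml
extraInitContainers:
  - name: fix-data-permissions
    image: busybox
    # Run as root only for the chown; the main container keeps running as UID 1000
    securityContext:
      runAsUser: 0
    command: ["sh", "-c", "chown -R 1000:1000 /usr/share/elasticsearch/data"]
    volumeMounts:
      - name: elasticsearch-master
        mountPath: /usr/share/elasticsearch/data
```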