elastic / cloud-on-k8s

Elastic Cloud on Kubernetes

Readiness probe not working with custom container port #8042

Open ibalat opened 2 weeks ago

ibalat commented 2 weeks ago

Hi, I have upgraded to 2.14, and it uses a new readiness script that checks port 8080. But we use port 9200 for the elasticsearch container. How can I customize the script, or is there another solution to this problem? I replaced readiness-port-script.sh with the old readiness-probe-script.sh, but that does not work because of missing environment variables. As a result, the pod is NotReady.

readiness-port-script.sh: nc -z -v -w5 127.0.0.1 8080

The old version used a probe script, but it no longer works because of missing environment variables (#8009).

ashuec90 commented 1 week ago

I am also facing this issue, right after disabling xpack.security.http.ssl.enabled; the readiness probe is failing with a connection error: nc: connect to 127.0.0.1 port 8080 (tcp) failed: Connection refused

pebrc commented 1 week ago

@ibalat the container ports are irrelevant for the readiness check. It uses port 8080 on localhost. Don't update the readiness port script, don't override the readiness probe. See the guidance in our docs https://www.elastic.co/guide/en/cloud-on-k8s/current/k8s-readiness.html#k8s_elasticsearch_versions_8_2_0_and_later

@ashuec90 this likely just means that your Elasticsearch pod is not ready. Check the logs.
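For context, the probe ECK installs for Elasticsearch 8.2+ is an exec probe, not a port mapping; a sketch of what it looks like is below (the script path, threshold, and timing values here are illustrative assumptions, not the exact manifest ECK generates; the script body is the `nc` one-liner quoted at the top of this issue):

```yaml
# Sketch only: an exec readiness probe that runs inside the container
# and checks localhost:8080, so no containerPort needs to be declared.
readinessProbe:
  exec:
    command:
      - bash
      - -c
      - /mnt/elastic-internal/scripts/readiness-port-script.sh   # nc -z -v -w5 127.0.0.1 8080
  failureThreshold: 3
  periodSeconds: 5
  timeoutSeconds: 5
```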

ibalat commented 1 week ago

@pebrc but there is no exposed 8080 port in the container, so how can it check a non-existent port? The 8080 check returns connection refused. The container's open connections are listed below:

127.0.0.1:57592 - 127.0.0.1:9200
127.0.0.1:34228 - 127.0.0.1:9200
127.0.0.1:37662 - 127.0.0.1:9200
127.0.0.1:57608 - 127.0.0.1:9200
127.0.0.1:52186 - 127.0.0.1:9200
127.0.0.1:34212 - 127.0.0.1:9200
127.0.0.1:38878 - 127.0.0.1:9200
127.0.0.1:38888 - 127.0.0.1:9200
127.0.0.1:59948 - 127.0.0.1:9200
127.0.0.1:52174 - 127.0.0.1:9200
127.0.0.1:37654 - 127.0.0.1:9200
127.0.0.1:54398 - 127.0.0.1:9200
Mason0920 commented 1 week ago

@pebrc The document says not to override the readiness probe, but I resolved the issue by overriding it as shown below.

        - name: elasticsearch
          readinessProbe:
            exec:
              command:
                - bash
                - -c
                - |
                  nc -z -v -w5 127.0.0.1 9200 #port change
pebrc commented 1 week ago

@Mason0920 this is not doing what you think it is doing and you are putting the safe operation of the ECK operator at risk with this change.

Elasticsearch will respond on 9200 as soon as the process is running. A main motivation for the new readiness probe was to make sure that we consider nodes ready only when they have actually joined the cluster. Checking 9200 does not guarantee that. See the following issues for more context: https://github.com/elastic/cloud-on-k8s/issues/916 and https://github.com/elastic/cloud-on-k8s/issues/5449

@ibalat 8080 does not need to be exposed as a container port because this is an exec probe that runs locally inside the running container and uses localhost, as mentioned before.
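The point of confusion in this thread is that `nc -z` only tests whether anything is listening on a local socket; Elasticsearch itself opens the 8080 readiness socket on localhost once the node is ready, which is why it never appears as a declared container port. A minimal self-contained sketch of that behavior (plain Python sockets, nothing ECK-specific):

```python
import socket

def port_check(host: str, port: int, timeout: float = 5.0) -> bool:
    """Rough equivalent of `nc -z -w5 host port`: True iff something is listening."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:  # connection refused, timeout, etc.
        return False

# Pick a port that nothing listens on: bind to port 0, note the number, close.
probe = socket.socket()
probe.bind(("127.0.0.1", 0))
free_port = probe.getsockname()[1]
probe.close()

# No listener yet: the check is refused, like the readiness probe before
# the node has joined the cluster.
print(port_check("127.0.0.1", free_port))   # False

# Once a process listens on the port (as Elasticsearch does on 8080 when
# the node is ready), the very same check succeeds.
listener = socket.socket()
listener.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
listener.bind(("127.0.0.1", free_port))
listener.listen(1)
print(port_check("127.0.0.1", free_port))   # True
listener.close()
```

So a "Connection refused" from the probe means the readiness socket is not open, i.e. the node is not ready, not that a port declaration is missing.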

ibalat commented 1 week ago

@pebrc I have only a data pod, and this pod's container does not have port 8080. OK, I am not expecting to see 8080 in the exposed port list, but in that case, how can the 8080 check succeed? Am I missing something? And my cluster is running properly.

Image version: 8.13.4


127.0.0.1:57592 - 127.0.0.1:9200
127.0.0.1:34228 - 127.0.0.1:9200
127.0.0.1:37662 - 127.0.0.1:9200
127.0.0.1:57608 - 127.0.0.1:9200
127.0.0.1:52186 - 127.0.0.1:9200
pebrc commented 1 week ago

@ibalat I cannot reproduce your issue. We are testing ECK on a continuous basis in CI and don't have any related issues. From what you have shared so far I am sceptical that we have a bug here. Can you share the YAML file you used to create the Elasticsearch cluster please, just in case I missed something?

ibalat commented 1 week ago

sure, I can share.

Elastic setup:

resource "kubectl_manifest" "elasticsearch_dev" {
  yaml_body = file("${path.module}/config/elasticsearch/elasticsearch.yaml")
}

elasticsearch.yaml

apiVersion: elasticsearch.k8s.elastic.co/v1
kind: Elasticsearch
metadata: 
  name: elasticsearch-dev
  namespace: elasticsearch
spec:
  version: 8.13.4
  nodeSets:
  - name: data
    count: 1
    config:
      node.store.allow_mmap: false
      xpack.security.enabled: false
      xpack.ml.enabled: false
      xpack.graph.enabled: false
      xpack.watcher.enabled: false
      bootstrap.memory_lock: true
    volumeClaimTemplates:
    - metadata: 
        name: elasticsearch-data
      spec: 
        accessModes:
        - ReadWriteOnce
        resources:
          requests:
            storage: 5Gi
        storageClassName: gp3
    podTemplate:
      spec: 
        containers:
        - name: elasticsearch
          resources:
            requests: 
              memory: 2800Mi
              cpu: "1"
            limits: 
              memory: 2800Mi
              cpu: "1200m"
        nodeSelector:
            elasticsearch: green
        tolerations: 
        - key: elasticsearch
          value: green
          effect: NoSchedule
  http: 
    tls:
      selfSignedCertificate: 
        disabled: true
    service:
      spec: 
        type: NodePort

ECK operator:

eck_operator_helm_config = {
    max_history      = 3
    version          = "2.14.0"
    create_namespace = true
    values           = [file("${path.module}/config/eck/eck-operator.yaml")]
  }

eck-operator.yaml

image:
  pullPolicy: "Always"
harveyjing commented 6 days ago

I have the same problem. I ran three replicas, and one pod is down because of this readiness probe: nc: connect to 127.0.0.1 port 8080 (tcp) failed: Connection refused. I set it up with just the two commands below, no other customizations.

helm install elastic-operator elastic/eck-operator -n elastic-stack --create-namespace --set=installCRDs=true

apiVersion: elasticsearch.k8s.elastic.co/v1
kind: Elasticsearch
metadata:
  name: quickstart
  namespace: elastic-stack
spec:
  version: 8.15.0
  nodeSets:
  - name: default
    count: 3
    config:
      node.store.allow_mmap: false

According to this post https://www.elastic.co/blog/red-elasticsearch-cluster-panic-no-longer, I got this message:

curl -u "elastic:$PASSWORD" -k "https://localhost:9200/_cluster/allocation/explain?pretty"
{
  "note" : "No shard was specified in the explain API request, so this response explains a randomly chosen unassigned shard. There may be other unassigned shards in this cluster which cannot be assigned for different reasons. It may not be possible to assign this shard until one of the other shards is assigned correctly. To explain the allocation of other shards (whether assigned or unassigned) you must specify the target shard in the request to this API.",
  "index" : ".ds-filebeat-8.15.0-2024.09.09-000001",
  "shard" : 0,
  "primary" : true,
  "current_state" : "unassigned",
  "unassigned_info" : {
    "reason" : "NODE_LEFT",
    "at" : "2024-09-12T02:15:15.456Z",
    "details" : "node_left [0IKKAc4kRZm2qfN40WlJVA]",
    "last_allocation_status" : "no_valid_shard_copy"
  },
  "can_allocate" : "no_valid_shard_copy",
  "allocate_explanation" : "Elasticsearch can't allocate this shard because all the copies of its data in the cluster are stale or corrupt. Elasticsearch will allocate this shard when a node containing a good copy of its data joins the cluster. If no such node is available, restore this index from a recent snapshot.",
  "node_allocation_decisions" : [
    {
      "node_id" : "9o1iHmAGTo-vcZs4s3SIew",
      "node_name" : "quickstart-es-default-2",
      "transport_address" : "10.244.1.78:9300",
      "node_attributes" : {
        "ml.allocated_processors_double" : "4.0",
        "ml.machine_memory" : "2147483648",
        "ml.config_version" : "12.0.0",
        "ml.max_jvm_size" : "1073741824",
        "ml.allocated_processors" : "4",
        "xpack.installed" : "true",
        "transform.config_version" : "10.0.0",
        "k8s_node_name" : "worker03"
      },
      "roles" : [
        "data",
        "data_cold",
        "data_content",
        "data_frozen",
        "data_hot",
        "data_warm",
        "ingest",
        "master",
        "ml",
        "remote_cluster_client",
        "transform"
      ],
      "node_decision" : "no",
      "store" : {
        "in_sync" : false,
        "allocation_id" : "28a5a2__RVyR_VUV0ahEVA"
      }
    },
    {
      "node_id" : "_kAiqcc0RwCOysr3fypbsA",
      "node_name" : "quickstart-es-default-1",
      "transport_address" : "10.244.3.238:9300",
      "node_attributes" : {
        "ml.allocated_processors_double" : "4.0",
        "ml.machine_memory" : "2147483648",
        "ml.config_version" : "12.0.0",
        "ml.max_jvm_size" : "1073741824",
        "ml.allocated_processors" : "4",
        "xpack.installed" : "true",
        "transform.config_version" : "10.0.0",
        "k8s_node_name" : "worker01"
      },
      "roles" : [
        "data",
        "data_cold",
        "data_content",
        "data_frozen",
        "data_hot",
        "data_warm",
        "ingest",
        "master",
        "ml",
        "remote_cluster_client",
        "transform"
      ],
      "node_decision" : "no",
      "store" : {
        "in_sync" : false,
        "allocation_id" : "auY6WVS9SJKipfWHRD2Xqg"
      }
    }
  ]
}
harveyjing commented 6 days ago

I found the root cause. After I restarted the problem pod, I got the log message Failed to load persistent cache. According to this post https://opster.com/analysis/elasticsearch-failed-to-close-persistent-cache-index/, it points to disk space issues, and I remembered that the initial PV size was only 1GB. After I increased the PV size, the pod is up and running!!!

ibalat commented 2 hours ago

@harveyjing I don't have any disk or other problems; my cluster is green, everything is OK, all pods are running, and there are no error messages, even though port 8080 does not exist. I added some cluster details below:

nc -z -v -w5 127.0.0.1 8080
nc: connect to 127.0.0.1 port 8080 (tcp) failed: Connection refused

/_cluster/health?filter_path=status,*_shards:

{"status":"green","active_primary_shards":1610,"active_shards":1999,"relocating_shards":0,"initializing_shards":0,"unassigned_shards":0,"delayed_unassigned_shards":0}
harveyjing commented 1 hour ago

I can connect to this port. My version is:

NAMESPACE       NAME         HEALTH   NODES   VERSION   PHASE   AGE
elastic-stack   quickstart   green    3       8.15.0    Ready   9d

@ibalat