ibalat opened 2 weeks ago
I am also facing this issue, just after disabling xpack.security.http.ssl.enabled; the readiness probe is failing with a connection error:
nc: connect to 127.0.0.1 port 8080 (tcp) failed: Connection refused
@ibalat the container ports are irrelevant for the readiness check. It uses port 8080 on localhost. Don't update the readiness port script, don't override the readiness probe. See the guidance in our docs https://www.elastic.co/guide/en/cloud-on-k8s/current/k8s-readiness.html#k8s_elasticsearch_versions_8_2_0_and_later
@ashuec90 this likely just means that your Elasticsearch pod is not ready. Check the logs.
@pebrc but there is no exposed 8080 port on the container, so how does it check a non-existent port? The 8080 check returns connection refused. The container's open ports are listed below:
127.0.0.1:57592 - 127.0.0.1:9200
127.0.0.1:34228 - 127.0.0.1:9200
127.0.0.1:37662 - 127.0.0.1:9200
127.0.0.1:57608 - 127.0.0.1:9200
127.0.0.1:52186 - 127.0.0.1:9200
127.0.0.1:34212 - 127.0.0.1:9200
127.0.0.1:38878 - 127.0.0.1:9200
127.0.0.1:38888 - 127.0.0.1:9200
127.0.0.1:59948 - 127.0.0.1:9200
127.0.0.1:52174 - 127.0.0.1:9200
127.0.0.1:37654 - 127.0.0.1:9200
127.0.0.1:54398 - 127.0.0.1:9200
@pebrc The document says not to override the readiness probe, but I resolved the issue by overriding it as shown below.
- name: elasticsearch
  readinessProbe:
    exec:
      command:
      - bash
      - -c
      - |
        nc -z -v -w5 127.0.0.1 9200 # port change
@Mason0920 this is not doing what you think it is doing, and you are putting the safe operation of the ECK operator at risk with this change.
Elasticsearch will respond on 9200 as soon as the process is running. A main motivation for the new readiness probe was to make sure that we consider nodes ready only when they have actually joined the cluster. Checking 9200 does not guarantee that. See the following issues for more context: https://github.com/elastic/cloud-on-k8s/issues/916 and https://github.com/elastic/cloud-on-k8s/issues/5449
@ibalat 8080 does not need to be exposed as a container port because this is an exec probe that runs locally inside the running container and uses localhost, as mentioned before.
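To illustrate the distinction: an exec probe runs a command inside the container's own network namespace, so the port it dials never has to appear in the containerPorts list; only a tcpSocket (or httpGet) probe is dialed by the kubelet from outside. A minimal sketch of the two shapes (illustrative only, not the exact spec the ECK operator generates; probe timings are placeholders):

```yaml
# Sketch, not the ECK-generated pod spec. The exec probe below succeeds or
# fails based on the command's exit code, entirely inside the container:
readinessProbe:
  exec:
    command: ["bash", "-c", "nc -z -v -w5 127.0.0.1 8080"]
  periodSeconds: 5      # illustrative values
  failureThreshold: 3

# A network probe, by contrast, would require the port to be reachable
# from the kubelet and is the kind of check that a containerPort matters for:
# readinessProbe:
#   tcpSocket:
#     port: 8080
```

This is why the 8080 check can work even though 8080 never shows up in the container's exposed-port list.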
@pebrc I have only a data pod, and this pod's container does not have port 8080. Okay, I am not expecting to see 8080 in the exposed-port list, but in that situation, how can the 8080 check work? Am I missing something? And my cluster is running properly.
Image version: 8.13.4
127.0.0.1:57592 - 127.0.0.1:9200
127.0.0.1:34228 - 127.0.0.1:9200
127.0.0.1:37662 - 127.0.0.1:9200
127.0.0.1:57608 - 127.0.0.1:9200
127.0.0.1:52186 - 127.0.0.1:9200
@ibalat I cannot reproduce your issue. We are testing ECK on a continuous basis in CI and don't have any related issues. From what you have shared so far I am sceptical that we have a bug here. Can you share the YAML file you used to create the Elasticsearch cluster please, just in case I missed something?
sure, I can share.
Elastic setup:
resource "kubectl_manifest" "elasticsearch_dev" {
  yaml_body = file("${path.module}/config/elasticsearch/elasticsearch.yaml")
}
elasticsearch.yaml
apiVersion: elasticsearch.k8s.elastic.co/v1
kind: Elasticsearch
metadata:
  name: elasticsearch-dev
  namespace: elasticsearch
spec:
  version: 8.13.4
  nodeSets:
  - name: data
    count: 1
    config:
      node.store.allow_mmap: false
      xpack.security.enabled: false
      xpack.ml.enabled: false
      xpack.graph.enabled: false
      xpack.watcher.enabled: false
      bootstrap.memory_lock: true
    volumeClaimTemplates:
    - metadata:
        name: elasticsearch-data
      spec:
        accessModes:
        - ReadWriteOnce
        resources:
          requests:
            storage: 5Gi
        storageClassName: gp3
    podTemplate:
      spec:
        containers:
        - name: elasticsearch
          resources:
            requests:
              memory: 2800Mi
              cpu: "1"
            limits:
              memory: 2800Mi
              cpu: "1200m"
        nodeSelector:
          elasticsearch: green
        tolerations:
        - key: elasticsearch
          value: green
          effect: NoSchedule
  http:
    tls:
      selfSignedCertificate:
        disabled: true
    service:
      spec:
        type: NodePort
ECK operator:
eck_operator_helm_config = {
  max_history      = 3
  version          = "2.14.0"
  create_namespace = true
  values           = [file("${path.module}/config/eck/eck-operator.yaml")]
}
eck-operator.yaml
image:
  pullPolicy: "Always"
Have the same problem. I ran three replicas; one pod is down because of this readiness probe: nc: connect to 127.0.0.1 port 8080 (tcp) failed: Connection refused
I set it up with just two commands, with no other customizations.
helm install elastic-operator elastic/eck-operator -n elastic-stack --create-namespace --set=installCRDs=true
apiVersion: elasticsearch.k8s.elastic.co/v1
kind: Elasticsearch
metadata:
  name: quickstart
  namespace: elastic-stack
spec:
  version: 8.15.0
  nodeSets:
  - name: default
    count: 3
    config:
      node.store.allow_mmap: false
According to this post https://www.elastic.co/blog/red-elasticsearch-cluster-panic-no-longer, I got this message:
curl -u "elastic:$PASSWORD" -k "https://localhost:9200/_cluster/allocation/explain?pretty"
{
  "note" : "No shard was specified in the explain API request, so this response explains a randomly chosen unassigned shard. There may be other unassigned shards in this cluster which cannot be assigned for different reasons. It may not be possible to assign this shard until one of the other shards is assigned correctly. To explain the allocation of other shards (whether assigned or unassigned) you must specify the target shard in the request to this API.",
  "index" : ".ds-filebeat-8.15.0-2024.09.09-000001",
  "shard" : 0,
  "primary" : true,
  "current_state" : "unassigned",
  "unassigned_info" : {
    "reason" : "NODE_LEFT",
    "at" : "2024-09-12T02:15:15.456Z",
    "details" : "node_left [0IKKAc4kRZm2qfN40WlJVA]",
    "last_allocation_status" : "no_valid_shard_copy"
  },
  "can_allocate" : "no_valid_shard_copy",
  "allocate_explanation" : "Elasticsearch can't allocate this shard because all the copies of its data in the cluster are stale or corrupt. Elasticsearch will allocate this shard when a node containing a good copy of its data joins the cluster. If no such node is available, restore this index from a recent snapshot.",
  "node_allocation_decisions" : [
    {
      "node_id" : "9o1iHmAGTo-vcZs4s3SIew",
      "node_name" : "quickstart-es-default-2",
      "transport_address" : "10.244.1.78:9300",
      "node_attributes" : {
        "ml.allocated_processors_double" : "4.0",
        "ml.machine_memory" : "2147483648",
        "ml.config_version" : "12.0.0",
        "ml.max_jvm_size" : "1073741824",
        "ml.allocated_processors" : "4",
        "xpack.installed" : "true",
        "transform.config_version" : "10.0.0",
        "k8s_node_name" : "worker03"
      },
      "roles" : [
        "data",
        "data_cold",
        "data_content",
        "data_frozen",
        "data_hot",
        "data_warm",
        "ingest",
        "master",
        "ml",
        "remote_cluster_client",
        "transform"
      ],
      "node_decision" : "no",
      "store" : {
        "in_sync" : false,
        "allocation_id" : "28a5a2__RVyR_VUV0ahEVA"
      }
    },
    {
      "node_id" : "_kAiqcc0RwCOysr3fypbsA",
      "node_name" : "quickstart-es-default-1",
      "transport_address" : "10.244.3.238:9300",
      "node_attributes" : {
        "ml.allocated_processors_double" : "4.0",
        "ml.machine_memory" : "2147483648",
        "ml.config_version" : "12.0.0",
        "ml.max_jvm_size" : "1073741824",
        "ml.allocated_processors" : "4",
        "xpack.installed" : "true",
        "transform.config_version" : "10.0.0",
        "k8s_node_name" : "worker01"
      },
      "roles" : [
        "data",
        "data_cold",
        "data_content",
        "data_frozen",
        "data_hot",
        "data_warm",
        "ingest",
        "master",
        "ml",
        "remote_cluster_client",
        "transform"
      ],
      "node_decision" : "no",
      "store" : {
        "in_sync" : false,
        "allocation_id" : "auY6WVS9SJKipfWHRD2Xqg"
      }
    }
  ]
}
I found the root cause. After I restarted the failing pod, I got the log message "Failed to load persistent cache". According to this post https://opster.com/analysis/elasticsearch-failed-to-close-persistent-cache-index/, that points to disk-space problems, and I remembered that the initial PV size was only 1GB. After I increased the PV size, the pod is up and running!
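For anyone hitting the same disk-pressure symptom: with ECK you normally don't patch the PVC by hand; you raise the claim size in the Elasticsearch spec and let the operator propagate it to the PVCs. A sketch, assuming a StorageClass with allowVolumeExpansion: true (names, counts, and sizes below are illustrative):

```yaml
# Sketch only: grow the data volume by editing the claim template in the
# Elasticsearch resource; requires a StorageClass that permits expansion.
spec:
  nodeSets:
  - name: default
    count: 3
    volumeClaimTemplates:
    - metadata:
        name: elasticsearch-data   # default claim name used by ECK
      spec:
        accessModes:
        - ReadWriteOnce
        resources:
          requests:
            storage: 5Gi           # raised from the original 1Gi
```

If the StorageClass does not allow expansion, the operator will reject the size change and the claim has to be recreated instead.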
@harveyjing I don't have a disk problem or anything like that; my cluster is green, everything is ok, all pods are running, and there are no error messages, although port 8080 does not exist. I added some cluster details below:
nc -z -v -w5 127.0.0.1 8080
nc: connect to 127.0.0.1 port 8080 (tcp) failed: Connection refused
/_cluster/health?filter_path=status,*_shards:
{"status":"green","active_primary_shards":1610,"active_shards":1999,"relocating_shards":0,"initializing_shards":0,"unassigned_shards":0,"delayed_unassigned_shards":0}
I can connect to this port. My version is:
NAMESPACE      NAME        HEALTH  NODES  VERSION  PHASE  AGE
elastic-stack  quickstart  green   3      8.15.0   Ready  9d
@ibalat
Hi, I have upgraded to 2.14 and it uses a new readiness script that checks port 8080. But we use port 9200 for the elasticsearch container. How can I customize the script, or is there another solution to this problem? I had replaced readiness-port-script.sh with the old readiness-probe-script.sh, but it does not work because of missing environment variables. As it stands, the pod is NotReady.
readiness-port-script.sh:
nc -z -v -w5 127.0.0.1 8080
The old version used a probe script, but that no longer works because of the missing environment variables (#8009)