elastic / helm-charts


Elasticsearch Master - Cannot resolve elasticsearch-master-headless #145

Closed DandyDeveloper closed 5 years ago

DandyDeveloper commented 5 years ago

Chart version: latest

Kubernetes version: 1.12.7

Kubernetes provider: Bare Metal

Helm Version: 2.14

Values.yaml:

elasticsearch-master:
  enabled: true
  nodeSelector: {role: elasticsearch}
  roles:
    master: "true"
    ingest: "false"
    data: "false"

Describe the bug: After successfully deploying the 3 masters, I removed one to test recovery, but the deleted master cannot recover.

The service is running and the other masters are still healthy, but the deleted pod cannot resolve the headless service DNS (or anything else, for that matter); see the logs below.

Steps to reproduce (a command sketch follows the list):

  1. Helm install chart
  2. Wait for all masters to start (2/2)
  3. Delete a pod
  4. Pod will sit in 1/2 state, unable to resolve any hosts.
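
For reference, the reproduction roughly boils down to the commands below (Helm 2 syntax to match Helm 2.14; the release name, chart reference, and label selector are assumptions, since the values above suggest the chart may be wrapped as a subchart):

  # 1. Install the chart (assumes the elastic repo has been added)
  helm install --name elasticsearch-master elastic/elasticsearch -f values.yaml

  # 2. Watch the three master pods come up
  kubectl get pods -l app=elasticsearch-master -w

  # 3. Delete one master to test recovery
  kubectl delete pod elasticsearch-master-0

  # 4. The replacement pod stays unready and logs UnknownHostException (see below)
  kubectl logs elasticsearch-master-0 -c elasticsearch | grep UnknownHostException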

Expected behavior: The pod should recover successfully.

Provide logs and/or server output (if relevant):

"stacktrace": ["java.net.UnknownHostException: elasticsearch-master-headless",
"at java.net.InetAddress$CachedAddresses.get(InetAddress.java:797) ~[?:?]",
"at java.net.InetAddress.getAllByName0(InetAddress.java:1505) ~[?:?]",
"at java.net.InetAddress.getAllByName(InetAddress.java:1364) ~[?:?]",
"at java.net.InetAddress.getAllByName(InetAddress.java:1298) ~[?:?]",
"at org.elasticsearch.transport.TcpTransport.parse(TcpTransport.java:536) ~[elasticsearch-7.1.0.jar:7.1.0]",
"at org.elasticsearch.transport.TcpTransport.addressesFromString(TcpTransport.java:489) ~[elasticsearch-7.1.0.jar:7.1.0]",
"at org.elasticsearch.transport.TransportService.addressesFromString(TransportService.java:744) ~[elasticsearch-7.1.0.jar:7.1.0]",
"at org.elasticsearch.discovery.SeedHostsResolver.lambda$resolveHostsLists$0(SeedHostsResolver.java:143) ~[elasticsearch-7.1.0.jar:7.1.0]",
"at java.util.concurrent.FutureTask.run(FutureTask.java:264) ~[?:?]",
"at org.elasticsearch.common.util.concurrent.ThreadContext$ContextPreservingRunnable.run(ThreadContext.java:681) ~[elasticsearch-7.1.0.jar:7.1.0]",
"at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128) [?:?]",
"at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628) [?:?]",
"at java.lang.Thread.run(Thread.java:835) [?:?]"] }
{"type": "server", "timestamp": "2019-06-02T08:44:01,706+0000", "level": "WARN", "component": "o.e.d.SeedHostsResolver", "cluster.name": "elasticsearch", "node.name": "elasticsearch-master-0",  "message": "failed to resolve host [elasticsearch-master-headless]" , 
Crazybus commented 5 years ago

Hey @DandyDeveloper!

This is a weird one. If the headless service weren't resolvable at all, the cluster shouldn't have been able to form in the first place, so this is an odd place for it to start failing.

For what it's worth, I can't reproduce this on GKE (it would also be caught by our automated integration tests). The cluster can also recover from deleting all of the pods at once, which I have tested a lot.

Some questions from my side:

  1. Can you give me the spec for one of the pods? I want to see if maybe your bare metal provider is injecting any funny DNS settings. I'd like to see the output of kubectl get pod elasticsearch-master-0 -o yaml
  2. The output of the Helm release: helm get elasticsearch-master
  3. Can you check whether or not the Elasticsearch cluster is actually forming properly? By running curl localhost:9200/_cluster/health?pretty=true from inside of one of the pods.
  4. "Pod will sit in 1/2 state, unable to resolve any hosts.". Is this when the pod is shutting down or starting back up again?
  5. Output of ping elasticsearch-master-headless run from within one of the running containers (a consolidated command sketch follows below).
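
Put together, a sketch of the requested checks (questions 1-3 and 5); the pod names and the container name elasticsearch are assumptions based on the chart defaults:

  # 1. Full pod spec, to check for injected DNS settings (dnsPolicy, dnsConfig)
  kubectl get pod elasticsearch-master-0 -o yaml

  # 2. The rendered Helm release (manifest and values)
  helm get elasticsearch-master

  # 3. Cluster health from inside one of the pods
  kubectl exec elasticsearch-master-1 -c elasticsearch -- curl -s 'localhost:9200/_cluster/health?pretty=true'

  # 5. DNS resolution of the headless service from a running container
  #    (if ping is not in the image, getent hosts elasticsearch-master-headless is an alternative)
  kubectl exec elasticsearch-master-1 -c elasticsearch -- ping -c 3 elasticsearch-master-headless
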
DandyDeveloper commented 5 years ago

@Crazybus Sorry for not getting this to you sooner. I had a look at the specific node this was running against, and it was in fact an issue with the node itself.

Everything on the node was effectively unable to resolve DNS because the node wasn't appropriately provisioned to access the kube router.

After fixing this and killing the pod, it started to behave. The data node also came up successfully. Sorry to have wasted your time with this.
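
For anyone debugging a similar node-level DNS problem, one way to confirm whether a particular node can reach cluster DNS is to pin a throwaway pod to it and resolve the headless service from there. A sketch, where the node name is a placeholder and busybox:1.28 is just a convenient image that ships nslookup:

  # Schedule a one-off pod on the suspect node and resolve the headless service
  kubectl run dns-test --rm -it --restart=Never --image=busybox:1.28 \
    --overrides='{"apiVersion": "v1", "spec": {"nodeName": "<suspect-node>"}}' \
    -- nslookup elasticsearch-master-headless

  # If the test pod runs in a different namespace, use the fully qualified name instead:
  #   nslookup elasticsearch-master-headless.<namespace>.svc.cluster.local

A healthy node returns the pod IPs of all masters; a broken one times out, matching the UnknownHostException in the logs above.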

Crazybus commented 5 years ago

Great to hear and thank you for following up!

100cm commented 5 years ago

Same problem here. How did you fix this?

DandyDeveloper commented 5 years ago

@100cm As mentioned above, our problem was an on-prem networking issue, nothing to do with the chart itself.

daichi703n commented 4 years ago

I faced the same issue. In my case, firewalld was blocking DNS requests. Disabling firewalld (or permitting 53/udp and 53/tcp) fixed it.
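
If disabling firewalld outright is undesirable, opening the DNS ports on the affected nodes should also work (a sketch; depending on the CNI, additional ports may need to be opened as well):

  # Permit DNS traffic through firewalld on the node
  firewall-cmd --permanent --add-port=53/udp
  firewall-cmd --permanent --add-port=53/tcp
  firewall-cmd --reload

  # Or, as above, disable firewalld on the node entirely
  systemctl disable --now firewalld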