apache-spark-on-k8s / kubernetes-HDFS

Repository holding configuration files for running an HDFS cluster in Kubernetes
Apache License 2.0

Unresolved address; Host Details : local host is: "hdfs-namenode-0.hdfs-namenode.default.svc.cluster.local"; destination host is: (unknown):0; #53

Open ALiBaBa-Jimmy opened 6 years ago

ALiBaBa-Jimmy commented 6 years ago

When I install the namenode with helm according to your docs, the following errors appear in the log and the namenode restarts repeatedly:

java.io.IOException: Failed on local exception: java.net.SocketException: Unresolved address; Host Details : local host is: "hdfs-namenode-0.hdfs-namenode.default.svc.cluster.local"; destination host is: (unknown):0;
    at org.apache.hadoop.net.NetUtils.wrapException(NetUtils.java:773)
    at org.apache.hadoop.ipc.Server.bind(Server.java:425)
    at org.apache.hadoop.ipc.Server$Listener.<init>(Server.java:574)
    at org.apache.hadoop.ipc.Server.<init>(Server.java:2215)
    at org.apache.hadoop.ipc.RPC$Server.<init>(RPC.java:938)
    at org.apache.hadoop.ipc.ProtobufRpcEngine$Server.<init>(ProtobufRpcEngine.java:534)
    at org.apache.hadoop.ipc.ProtobufRpcEngine.getServer(ProtobufRpcEngine.java:509)
    at org.apache.hadoop.ipc.RPC$Builder.build(RPC.java:783)
    at org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.<init>(NameNodeRpcServer.java:344)
    at org.apache.hadoop.hdfs.server.namenode.NameNode.createRpcServer(NameNode.java:673)
    at org.apache.hadoop.hdfs.server.namenode.NameNode.initialize(NameNode.java:646)
    at org.apache.hadoop.hdfs.server.namenode.NameNode.<init>(NameNode.java:811)
    at org.apache.hadoop.hdfs.server.namenode.NameNode.<init>(NameNode.java:795)
    at org.apache.hadoop.hdfs.server.namenode.NameNode.createNameNode(NameNode.java:1488)
    at org.apache.hadoop.hdfs.server.namenode.NameNode.main(NameNode.java:1554)
Caused by: java.net.SocketException: Unresolved address
    at sun.nio.ch.Net.translateToSocketException(Net.java:131)
    at sun.nio.ch.Net.translateException(Net.java:157)
    at sun.nio.ch.Net.translateException(Net.java:163)
    at sun.nio.ch.ServerSocketAdaptor.bind(ServerSocketAdaptor.java:76)
    at org.apache.hadoop.ipc.Server.bind(Server.java:408)
    ... 13 more
Caused by: java.nio.channels.UnresolvedAddressException
    at sun.nio.ch.Net.checkAddress(Net.java:101)
    at sun.nio.ch.ServerSocketChannelImpl.bind(ServerSocketChannelImpl.java:218)
    at sun.nio.ch.ServerSocketAdaptor.bind(ServerSocketAdaptor.java:74)
    ... 14 more

hdfs-namenode-0   0/1   CrashLoopBackOff   22   9h   10.196.36.165   10.196.36.165
hdfs-namenode-1   0/1   CrashLoopBackOff   22   9h   10.196.36.162   10.196.36.162

This is the /etc/hosts on my node machine:

[wangdanfeng5@A01-R20-I36-165-0964488 ~]$ cat /etc/hosts

#127.0.0.1 A01-R20-I36-165-0964488.JD.LOCAL localhost.localdomain localhost
127.0.0.1 localhost.localdomain localhost
10.196.36.162 A01-R20-I36-162-0964483.JD.LOCAL
10.196.36.165 A01-R20-I36-165-0964488.JD.LOCAL

Could you give me some advice about this?

ALiBaBa-Jimmy commented 6 years ago

@kimoonkim

kimoonkim commented 6 years ago

Hi @ALiBaBa-Jimmy, thanks for trying out k8s HDFS and sorry about the trouble you went through.

This seems like a kube-dns issue. Do you know if your cluster has a healthy kube-dns? If you're not sure, you may want to try the steps in https://kubernetes.io/docs/tasks/debug-application-cluster/debug-service/#does-the-service-work-by-dns
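As a quick first check (a sketch, assuming the busybox:1.28 image is pullable in your cluster; nslookup in newer busybox images is known to misbehave against cluster DNS), you could try resolving a well-known service and then the namenode FQDN from the error message, from inside the cluster:

$ kubectl run -it --rm dns-test --image=busybox:1.28 --restart=Never -- nslookup kubernetes.default
$ kubectl run -it --rm dns-test --image=busybox:1.28 --restart=Never -- nslookup hdfs-namenode-0.hdfs-namenode.default.svc.cluster.local

If the first lookup fails, cluster DNS itself is broken; if only the second fails, the problem is specific to the hdfs-namenode service name, namespace, or cluster domain.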

Also, can you post the exact command line you used to launch the helm chart?

Thanks.

nenggangpan commented 5 years ago

@kimoonkim I ran into exactly the same issue, and I am pretty sure my DNS is correct. My k8s version is 1.12, and the DNS is CoreDNS.
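For reference, one way to sanity-check CoreDNS (assuming the standard k8s-app=kube-dns label that stock CoreDNS deployments carry) is to confirm its pods are Running and its logs are clean:

$ kubectl get pods -n kube-system -l k8s-app=kube-dns
$ kubectl logs -n kube-system -l k8s-app=kube-dns | tail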

cosmin-ionita commented 5 years ago

I have the exact same issue.

grmaltby commented 5 years ago

I encountered this same issue. In my case it was caused by the cluster domain name not being the default "cluster.local", which appears to be the expectation hardcoded in charts/hdfs-k8s/templates/_helpers.tpl. My "fix" replaced one hardcoded value with another:

--- a/charts/hdfs-k8s/templates/_helpers.tpl
+++ b/charts/hdfs-k8s/templates/_helpers.tpl
@@ -163,7 +163,7 @@ The HDFS config file should specify FQDN of services. Otherwise, Kerberos
 login may fail.
 */}}
 {{- define "svc-domain" -}}
-{{- printf "%s.svc.cluster.local" .Release.Namespace -}}
+{{- printf "%s.svc.aiscluster.local" .Release.Namespace -}}
 {{- end -}}

The incorrect value ends up in the hdfs-config ConfigMap and in the core-site.xml it delivers (which run.sh unconditionally copies over any other tweaks to /etc/hadoop/core-site.xml).
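If you are unsure what your cluster's actual domain is, one quick way to see it (a sketch, again assuming a pullable busybox image) is to print the DNS search path a pod receives; the domain comes from the kubelet's clusterDomain setting:

$ kubectl run -it --rm resolv-check --image=busybox:1.28 --restart=Never -- cat /etc/resolv.conf
# the cluster domain appears in the search line,
# e.g. "search default.svc.aiscluster.local svc.aiscluster.local aiscluster.local"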