apache-spark-on-k8s / spark

Apache Spark enhanced with native Kubernetes scheduler back-end: NOTE this repository is being ARCHIVED as all new development for the kubernetes scheduler back-end is now on https://github.com/apache/spark/
https://spark.apache.org/
Apache License 2.0
612 stars 118 forks source link

org.apache.spark.SparkException: Executor cannot find driver pod #558

Open apurvann opened 7 years ago

apurvann commented 7 years ago

I am trying to run the SparkPi command on my kubernetes cluster which has kubernetes 1.8 and following the official documentation. The command for running looks like this:

bin/spark-submit \
  --deploy-mode cluster \
  --master k8s://http://localhost:8080 \
  --kubernetes-namespace default \
  --conf spark.executor.instances=2 \
  --conf spark.app.name=spark-pi \
  --conf spark.kubernetes.driver.docker.image=kubespark/spark-driver-py:v2.2.0-kubernetes-0.5.0 \
  --conf spark.kubernetes.executor.docker.image=kubespark/spark-executor-py:v2.2.0-kubernetes-0.5.0 \
  --jars local:///opt/spark/examples/jars/spark-examples_2.11-2.2.0-k8s-0.5.0.jar \
  local:///opt/spark/examples/src/main/python/pi.py 10

However, I am running into this error whenever I try to run the SparkPi example.

io.fabric8.kubernetes.client.KubernetesClientException: Operation: [get]  for kind: [Pod]  with name: [spark-pi-1511532181271-driver]  in namespace: [default]  failed.
    at io.fabric8.kubernetes.client.KubernetesClientException.launderThrowable(KubernetesClientException.java:62)
    at io.fabric8.kubernetes.client.KubernetesClientException.launderThrowable(KubernetesClientException.java:71)
    at io.fabric8.kubernetes.client.dsl.base.BaseOperation.getMandatory(BaseOperation.java:228)
    at io.fabric8.kubernetes.client.dsl.base.BaseOperation.get(BaseOperation.java:184)
    at org.apache.spark.scheduler.cluster.k8s.KubernetesClusterSchedulerBackend.liftedTree1$1(KubernetesClusterSchedulerBackend.scala:74)
    at org.apache.spark.scheduler.cluster.k8s.KubernetesClusterSchedulerBackend.<init>(KubernetesClusterSchedulerBackend.scala:72)
    at org.apache.spark.scheduler.cluster.k8s.KubernetesClusterManager.createSchedulerBackend(KubernetesClusterManager.scala:142)
    at org.apache.spark.SparkContext$.org$apache$spark$SparkContext$$createTaskScheduler(SparkContext.scala:2764)
    at org.apache.spark.SparkContext.<init>(SparkContext.scala:501)
    at org.apache.spark.api.java.JavaSparkContext.<init>(JavaSparkContext.scala:58)
    at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
    at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:62)
    at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
    at java.lang.reflect.Constructor.newInstance(Constructor.java:423)
    at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:247)
    at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)
    at py4j.Gateway.invoke(Gateway.java:236)
    at py4j.commands.ConstructorCommand.invokeConstructor(ConstructorCommand.java:80)
    at py4j.commands.ConstructorCommand.execute(ConstructorCommand.java:69)
    at py4j.GatewayConnection.run(GatewayConnection.java:214)
    at java.lang.Thread.run(Thread.java:748)
Caused by: java.net.UnknownHostException: kubernetes.default.svc: Try again
    at java.net.Inet6AddressImpl.lookupAllHostAddr(Native Method)
    at java.net.InetAddress$2.lookupAllHostAddr(InetAddress.java:928)
    at java.net.InetAddress.getAddressesFromNameService(InetAddress.java:1323)
    at java.net.InetAddress.getAllByName0(InetAddress.java:1276)
    at java.net.InetAddress.getAllByName(InetAddress.java:1192)
    at java.net.InetAddress.getAllByName(InetAddress.java:1126)
    at okhttp3.Dns$1.lookup(Dns.java:39)
    at okhttp3.internal.connection.RouteSelector.resetNextInetSocketAddress(RouteSelector.java:171)
    at okhttp3.internal.connection.RouteSelector.nextProxy(RouteSelector.java:137)
    at okhttp3.internal.connection.RouteSelector.next(RouteSelector.java:82)
    at okhttp3.internal.connection.StreamAllocation.findConnection(StreamAllocation.java:171)
    at okhttp3.internal.connection.StreamAllocation.findHealthyConnection(StreamAllocation.java:121)
    at okhttp3.internal.connection.StreamAllocation.newStream(StreamAllocation.java:100)
    at okhttp3.internal.connection.ConnectInterceptor.intercept(ConnectInterceptor.java:42)
    at okhttp3.internal.http.RealInterceptorChain.proceed(RealInterceptorChain.java:92)
    at okhttp3.internal.http.RealInterceptorChain.proceed(RealInterceptorChain.java:67)
    at okhttp3.internal.cache.CacheInterceptor.intercept(CacheInterceptor.java:93)
    at okhttp3.internal.http.RealInterceptorChain.proceed(RealInterceptorChain.java:92)
    at okhttp3.internal.http.RealInterceptorChain.proceed(RealInterceptorChain.java:67)
    at okhttp3.internal.http.BridgeInterceptor.intercept(BridgeInterceptor.java:93)
    at okhttp3.internal.http.RealInterceptorChain.proceed(RealInterceptorChain.java:92)
    at okhttp3.internal.http.RetryAndFollowUpInterceptor.intercept(RetryAndFollowUpInterceptor.java:120)
    at okhttp3.internal.http.RealInterceptorChain.proceed(RealInterceptorChain.java:92)
    at okhttp3.internal.http.RealInterceptorChain.proceed(RealInterceptorChain.java:67)
    at io.fabric8.kubernetes.client.utils.HttpClientUtils$2.intercept(HttpClientUtils.java:93)
    at okhttp3.internal.http.RealInterceptorChain.proceed(RealInterceptorChain.java:92)
    at okhttp3.internal.http.RealInterceptorChain.proceed(RealInterceptorChain.java:67)
    at okhttp3.RealCall.getResponseWithInterceptorChain(RealCall.java:185)
    at okhttp3.RealCall.execute(RealCall.java:69)
    at io.fabric8.kubernetes.client.dsl.base.OperationSupport.handleResponse(OperationSupport.java:377)
    at io.fabric8.kubernetes.client.dsl.base.OperationSupport.handleResponse(OperationSupport.java:343)
    at io.fabric8.kubernetes.client.dsl.base.OperationSupport.handleGet(OperationSupport.java:312)
    at io.fabric8.kubernetes.client.dsl.base.OperationSupport.handleGet(OperationSupport.java:295)
    at io.fabric8.kubernetes.client.dsl.base.BaseOperation.handleGet(BaseOperation.java:783)
    at io.fabric8.kubernetes.client.dsl.base.BaseOperation.getMandatory(BaseOperation.java:217)
    ... 18 more

2017-11-24 14:03:11 ERROR SparkContext:91 - Error initializing SparkContext.
org.apache.spark.SparkException: Executor cannot find driver pod
    at org.apache.spark.scheduler.cluster.k8s.KubernetesClusterSchedulerBackend.liftedTree1$1(KubernetesClusterSchedulerBackend.scala:78)
    at org.apache.spark.scheduler.cluster.k8s.KubernetesClusterSchedulerBackend.<init>(KubernetesClusterSchedulerBackend.scala:72)
    at org.apache.spark.scheduler.cluster.k8s.KubernetesClusterManager.createSchedulerBackend(KubernetesClusterManager.scala:142)
    at org.apache.spark.SparkContext$.org$apache$spark$SparkContext$$createTaskScheduler(SparkContext.scala:2764)
    at org.apache.spark.SparkContext.<init>(SparkContext.scala:501)
    at org.apache.spark.api.java.JavaSparkContext.<init>(JavaSparkContext.scala:58)
    at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
    at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:62)
    at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
    at java.lang.reflect.Constructor.newInstance(Constructor.java:423)
    at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:247)
    at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)
    at py4j.Gateway.invoke(Gateway.java:236)
    at py4j.commands.ConstructorCommand.invokeConstructor(ConstructorCommand.java:80)
    at py4j.commands.ConstructorCommand.execute(ConstructorCommand.java:69)
    at py4j.GatewayConnection.run(GatewayConnection.java:214)
    at java.lang.Thread.run(Thread.java:748)
Caused by: io.fabric8.kubernetes.client.KubernetesClientException: Operation: [get]  for kind: [Pod]  with name: [spark-pi-1511532181271-driver]  in namespace: [default]  failed.
    at io.fabric8.kubernetes.client.KubernetesClientException.launderThrowable(KubernetesClientException.java:62)
    at io.fabric8.kubernetes.client.KubernetesClientException.launderThrowable(KubernetesClientException.java:71)
    at io.fabric8.kubernetes.client.dsl.base.BaseOperation.getMandatory(BaseOperation.java:228)
    at io.fabric8.kubernetes.client.dsl.base.BaseOperation.get(BaseOperation.java:184)
    at org.apache.spark.scheduler.cluster.k8s.KubernetesClusterSchedulerBackend.liftedTree1$1(KubernetesClusterSchedulerBackend.scala:74)
    ... 16 more
Caused by: java.net.UnknownHostException: kubernetes.default.svc: Try again
    at java.net.Inet6AddressImpl.lookupAllHostAddr(Native Method)
    at java.net.InetAddress$2.lookupAllHostAddr(InetAddress.java:928)
    at java.net.InetAddress.getAddressesFromNameService(InetAddress.java:1323)
    at java.net.InetAddress.getAllByName0(InetAddress.java:1276)
    at java.net.InetAddress.getAllByName(InetAddress.java:1192)
    at java.net.InetAddress.getAllByName(InetAddress.java:1126)
    at okhttp3.Dns$1.lookup(Dns.java:39)
    at okhttp3.internal.connection.RouteSelector.resetNextInetSocketAddress(RouteSelector.java:171)
    at okhttp3.internal.connection.RouteSelector.nextProxy(RouteSelector.java:137)
    at okhttp3.internal.connection.RouteSelector.next(RouteSelector.java:82)
    at okhttp3.internal.connection.StreamAllocation.findConnection(StreamAllocation.java:171)
    at okhttp3.internal.connection.StreamAllocation.findHealthyConnection(StreamAllocation.java:121)
    at okhttp3.internal.connection.StreamAllocation.newStream(StreamAllocation.java:100)
    at okhttp3.internal.connection.ConnectInterceptor.intercept(ConnectInterceptor.java:42)
    at okhttp3.internal.http.RealInterceptorChain.proceed(RealInterceptorChain.java:92)
    at okhttp3.internal.http.RealInterceptorChain.proceed(RealInterceptorChain.java:67)
    at okhttp3.internal.cache.CacheInterceptor.intercept(CacheInterceptor.java:93)
    at okhttp3.internal.http.RealInterceptorChain.proceed(RealInterceptorChain.java:92)
    at okhttp3.internal.http.RealInterceptorChain.proceed(RealInterceptorChain.java:67)
    at okhttp3.internal.http.BridgeInterceptor.intercept(BridgeInterceptor.java:93)
    at okhttp3.internal.http.RealInterceptorChain.proceed(RealInterceptorChain.java:92)
    at okhttp3.internal.http.RetryAndFollowUpInterceptor.intercept(RetryAndFollowUpInterceptor.java:120)
    at okhttp3.internal.http.RealInterceptorChain.proceed(RealInterceptorChain.java:92)
    at okhttp3.internal.http.RealInterceptorChain.proceed(RealInterceptorChain.java:67)
    at io.fabric8.kubernetes.client.utils.HttpClientUtils$2.intercept(HttpClientUtils.java:93)
    at okhttp3.internal.http.RealInterceptorChain.proceed(RealInterceptorChain.java:92)
    at okhttp3.internal.http.RealInterceptorChain.proceed(RealInterceptorChain.java:67)
    at okhttp3.RealCall.getResponseWithInterceptorChain(RealCall.java:185)
    at okhttp3.RealCall.execute(RealCall.java:69)
    at io.fabric8.kubernetes.client.dsl.base.OperationSupport.handleResponse(OperationSupport.java:377)
    at io.fabric8.kubernetes.client.dsl.base.OperationSupport.handleResponse(OperationSupport.java:343)
    at io.fabric8.kubernetes.client.dsl.base.OperationSupport.handleGet(OperationSupport.java:312)
    at io.fabric8.kubernetes.client.dsl.base.OperationSupport.handleGet(OperationSupport.java:295)
    at io.fabric8.kubernetes.client.dsl.base.BaseOperation.handleGet(BaseOperation.java:783)
    at io.fabric8.kubernetes.client.dsl.base.BaseOperation.getMandatory(BaseOperation.java:217)
    ... 18 more

Anyone ran into this error before?

apurvann commented 7 years ago

I checked the pre-requisites. I have all the permissions with the default service account. Also, kube-dns is working perfectly fine. I have used https://github.com/kubernetes-incubator/kubespray to spin up my cluster. Do I need to provide some sort of token manually to make it run?

yanglei99 commented 7 years ago

I hit the same problem on my Kubernetes 1.7.4 environment.

I see that KUBERNETES_MASTER_INTERNAL_URL = "https://kubernetes.default.svc" is defined in https://github.com/apache-spark-on-k8s/spark/blob/branch-2.2-kubernetes/resource-managers/kubernetes/core/src/main/scala/org/apache/spark/deploy/k8s/constants.scala and used directly when creating k8s client, https://github.com/apache-spark-on-k8s/spark/blob/branch-2.2-kubernetes/resource-managers/kubernetes/core/src/main/scala/org/apache/spark/scheduler/cluster/k8s/KubernetesClusterManager.scala I wonder if we should allow configuration be passed using --conf to change from default value.

foxish commented 7 years ago

Thanks for the bug report. Does the kubernetes svc not exist in your cases in the default namespace, or is it that kube-dns doesn't resolve it?

yanglei99 commented 7 years ago

In my case, it is not verified error:

Caused by: javax.net.ssl.SSLPeerUnverifiedException: Hostname kubernetes.default.svc not verified: certificate: sha256/OgiOei5ibrXB0RwLonCjihq+GuPkGj5BqNkWC0K1u2k= DN: CN=172.21.0.1 subjectAltNames: [172.21.0.1, 10.176.215.15, 169.46.7.238]

yanglei99 commented 7 years ago

I realized that I hit this issue: https://github.com/apache-spark-on-k8s/spark/issues/543

liyinan926 commented 7 years ago

Found this https://github.com/fabric8io/fabric8/issues/6229.

liyinan926 commented 7 years ago

And this https://github.com/spring-cloud-incubator/spring-cloud-kubernetes/issues/137. Seems there's a work around by setting environment variable KUBERNETES_MASTER to kubernetes.default. Can you try setting it in the driver pod? You can use --conf spark.kubernetes.driverEnv.KUBERNETES_MASTER=kubernetes.default.

yanglei99 commented 6 years ago

@liyinan926. Your suggestion did not work. The code I linked above showing the Kubernetes Client is created from the constant KUBERNETES_MASTER_INTERNAL_URL = "https://kubernetes.default.svc" directly. Yes, if it can honor sparkConf.get, it will be much better.

val kubernetesClient = SparkKubernetesClientFactory.createKubernetesClient( KUBERNETES_MASTER_INTERNAL_URL, Some(sparkConf.get(KUBERNETES_NAMESPACE)), APISERVER_AUTH_DRIVER_MOUNTED_CONF_PREFIX, sparkConf, Some(new File(Config.KUBERNETES_SERVICE_ACCOUNT_TOKEN_PATH)), Some(new File(Config.KUBERNETES_SERVICE_ACCOUNT_CA_CRT_PATH)))

fanzhen commented 6 years ago

@yanglei99 ,Yes. @liyinan926 's suggestion did not work out. The KUBERNETES_MASTER_INTERNAL_URL is hardcoded for some convenience I think, but It doesn't work in my environment also. As far as I'm concerned, using a sparkConf.get method like "spark.master" is much better. I tried this way and it works well.

yanglei99 commented 6 years ago

@fanzhen, first I assume you make it to work by changing the code above and recompile. Second, I do like the internal URL approach as it is used in pod, than the spark-submit client. I wonder if a simple fix is to use "kubernetes.default.svc.cluster.local" as default, while allowing a spark config overwrite.

liyinan926 commented 6 years ago

As @yanglei99 said, the URL used by the pods to connect to the API server is internal, while spark.master is the external URL used by the submission client.

@apurva3000 @yanglei99 @fanzhen what is the DNS name of the Kubernetes service in your clusters?

liyinan926 commented 6 years ago

Hm, looks like even if we don't specify the master when creating the client, the fabric8 kubernetes-client will use https://kubernetes.default.svc by default. See https://github.com/fabric8io/kubernetes-client/blob/master/kubernetes-client/src/main/java/io/fabric8/kubernetes/client/Config.java#L113. So a solution could be to not set the hard-coded master in the Config object used when creating the client. This gives users the option to use environment variable KUBERNETES_MASTER to set the correct value. @foxish @ash211 @mccheah @kimoonkim @erikerlandson.

yanglei99 commented 6 years ago

@liyinan926, "kubernetes.default.svc" is hardcoded today to create Kubernetes Client. Code links are in above discussions. In my environment it is resolved to "kubernetes.default.svc.cluster.local".

fanzhen commented 6 years ago

@liyinan926 , I agree with the solution you figured out. Giving users the option to use environment variable KUBERNETES_MASTER can be more flexible to different environments.

apurvann commented 6 years ago

I believe it should be kubernetes.default.svc.cluster.local looking at the following

kubectl exec -it busybox -- nslookup kubernetes.default.svc.cluster.local
Server:    10.233.0.3
Address 1: 10.233.0.3 kube-dns.kube-system.svc.cluster.local

Name:      kubernetes.default.svc.cluster.local
Address 1: 10.233.0.1 kubernetes.default.svc.cluster.local

And

kubectl exec -it busybox -- cat /etc/resolv.conf
nameserver 10.233.0.3
search default.svc.cluster.local svc.cluster.local cluster.local openstacklocal
options ndots:5

So, I changed the code and built it again, so that it gets the URL from the sparkConf. However, am still hitting the same error:

Caused by: java.net.UnknownHostException: kubernetes.default.svc.cluster.local: Try again

yanglei99 commented 6 years ago

I realized that I can use the ip addresses, both the master url (https://ip:port) and the dns resolved ip (https://172.21.0.1) as the internal master URL. However, I can not use any of the host names. Besides, https://kubernetes.default.svc (in the current code), https://kubernetes.default.svc.cluster.local, I have also tried https://kubernetes, none worked.

apurvann commented 6 years ago

@yanglei99 are you able to use the kubernetes master URL as the internal URL i.e. <k8s-apiserver-host>:<k8s-apiserver-port> ?

I tried http://localhost:8080 , 127.0.0.1:8080 and also the DNS resolved IP. Now I keep getting java.net.ConnectException: Connection refused (Connection refused) errors

@liyinan926 @mccheah @foxish

yanglei99 commented 6 years ago

@apurva3000, I am not sure how you use the "master URL". First, in spark-submit, --master can use both proxy URL and the master URL, however as the driver code is running in pod, it can only use the master URL, as it does not have the submission environment's proxy setup. Second, when spark-submit is using master URL, in my environment, I also need to set spark.kubernetes.authenticate.submission.oauthToken. Then, as you mentioned using "http://localhost:8080" is it what is returned from "kubectl cluster-info" (e.g. Kubernetes master is running at http://localhost:8080") ? Help these helps.

apurvann commented 6 years ago

@yanglei99 Yes the kubectl cluster-info gives me http://localhost:8080 and I use it as my --master, as instructed in the official docs. However, using this or 127.0.0.1 with the port leads to the connection refused error.

yanglei99 commented 6 years ago

@apurva3000, can you share the new failure stack trace? One issue I see is , as the ip is localhost or 127.0.0.1, I am not sure driver pod will resolve it to the API server... One thing you can try is to use "http://10.233.0.1" for driver pod's KUBERNETES_MASTER_INTERNAL_URL as it is listed as the resolved IP address for your "kubernetes.default.svc.cluster.local". I assume it is using http protocol as your cluster-info returns http. However if that does not work try http://10.233.0.1. You will still need to use "--master k8s://http://localhost:8080" as it is for the submission side.

apurvann commented 6 years ago

@liyinan926 Still doesn't work and here is what the stacktrace looks like:


2017-12-07 08:38:48 ERROR KubernetesClusterSchedulerBackend:91 - Executor cannot find driver pod.
io.fabric8.kubernetes.client.KubernetesClientException: Operation: [get]  for kind: [Pod]  with name: [spark-pi-1512635912762-driver]  in namespace: [default]  failed.
    at io.fabric8.kubernetes.client.KubernetesClientException.launderThrowable(KubernetesClientException.java:62)
    at io.fabric8.kubernetes.client.KubernetesClientException.launderThrowable(KubernetesClientException.java:71)
    at io.fabric8.kubernetes.client.dsl.base.BaseOperation.getMandatory(BaseOperation.java:228)
    at io.fabric8.kubernetes.client.dsl.base.BaseOperation.get(BaseOperation.java:184)
    at org.apache.spark.scheduler.cluster.k8s.KubernetesClusterSchedulerBackend.liftedTree1$1(KubernetesClusterSchedulerBackend.scala:74)
    at org.apache.spark.scheduler.cluster.k8s.KubernetesClusterSchedulerBackend.<init>(KubernetesClusterSchedulerBackend.scala:72)
    at org.apache.spark.scheduler.cluster.k8s.KubernetesClusterManager.createSchedulerBackend(KubernetesClusterManager.scala:142)
    at org.apache.spark.SparkContext$.org$apache$spark$SparkContext$$createTaskScheduler(SparkContext.scala:2764)
    at org.apache.spark.SparkContext.<init>(SparkContext.scala:501)
    at org.apache.spark.api.java.JavaSparkContext.<init>(JavaSparkContext.scala:58)
    at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
    at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:62)
    at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
    at java.lang.reflect.Constructor.newInstance(Constructor.java:423)
    at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:247)
    at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)
    at py4j.Gateway.invoke(Gateway.java:236)
    at py4j.commands.ConstructorCommand.invokeConstructor(ConstructorCommand.java:80)
    at py4j.commands.ConstructorCommand.execute(ConstructorCommand.java:69)
    at py4j.GatewayConnection.run(GatewayConnection.java:214)
    at java.lang.Thread.run(Thread.java:748)
Caused by: java.net.SocketTimeoutException: connect timed out
    at java.net.PlainSocketImpl.socketConnect(Native Method)
    at java.net.AbstractPlainSocketImpl.doConnect(AbstractPlainSocketImpl.java:350)
    at java.net.AbstractPlainSocketImpl.connectToAddress(AbstractPlainSocketImpl.java:206)
    at java.net.AbstractPlainSocketImpl.connect(AbstractPlainSocketImpl.java:188)
    at java.net.SocksSocketImpl.connect(SocksSocketImpl.java:392)
    at java.net.Socket.connect(Socket.java:589)
    at okhttp3.internal.platform.Platform.connectSocket(Platform.java:124)
    at okhttp3.internal.connection.RealConnection.connectSocket(RealConnection.java:223)
    at okhttp3.internal.connection.RealConnection.connect(RealConnection.java:149)
    at okhttp3.internal.connection.StreamAllocation.findConnection(StreamAllocation.java:195)
    at okhttp3.internal.connection.StreamAllocation.findHealthyConnection(StreamAllocation.java:121)
    at okhttp3.internal.connection.StreamAllocation.newStream(StreamAllocation.java:100)
    at okhttp3.internal.connection.ConnectInterceptor.intercept(ConnectInterceptor.java:42)
    at okhttp3.internal.http.RealInterceptorChain.proceed(RealInterceptorChain.java:92)
    at okhttp3.internal.http.RealInterceptorChain.proceed(RealInterceptorChain.java:67)
    at okhttp3.internal.cache.CacheInterceptor.intercept(CacheInterceptor.java:93)
    at okhttp3.internal.http.RealInterceptorChain.proceed(RealInterceptorChain.java:92)
    at okhttp3.internal.http.RealInterceptorChain.proceed(RealInterceptorChain.java:67)
    at okhttp3.internal.http.BridgeInterceptor.intercept(BridgeInterceptor.java:93)
    at okhttp3.internal.http.RealInterceptorChain.proceed(RealInterceptorChain.java:92)
    at okhttp3.internal.http.RetryAndFollowUpInterceptor.intercept(RetryAndFollowUpInterceptor.java:120)
    at okhttp3.internal.http.RealInterceptorChain.proceed(RealInterceptorChain.java:92)
    at okhttp3.internal.http.RealInterceptorChain.proceed(RealInterceptorChain.java:67)
    at io.fabric8.kubernetes.client.utils.HttpClientUtils$2.intercept(HttpClientUtils.java:93)
    at okhttp3.internal.http.RealInterceptorChain.proceed(RealInterceptorChain.java:92)
    at okhttp3.internal.http.RealInterceptorChain.proceed(RealInterceptorChain.java:67)
    at okhttp3.RealCall.getResponseWithInterceptorChain(RealCall.java:185)
    at okhttp3.RealCall.execute(RealCall.java:69)
    at io.fabric8.kubernetes.client.dsl.base.OperationSupport.handleResponse(OperationSupport.java:377)
    at io.fabric8.kubernetes.client.dsl.base.OperationSupport.handleResponse(OperationSupport.java:343)
    at io.fabric8.kubernetes.client.dsl.base.OperationSupport.handleGet(OperationSupport.java:312)
    at io.fabric8.kubernetes.client.dsl.base.OperationSupport.handleGet(OperationSupport.java:295)
    at io.fabric8.kubernetes.client.dsl.base.BaseOperation.handleGet(BaseOperation.java:783)
    at io.fabric8.kubernetes.client.dsl.base.BaseOperation.getMandatory(BaseOperation.java:217)
    ... 18 more
2017-12-07 08:38:48 ERROR SparkContext:91 - Error initializing SparkContext.
org.apache.spark.SparkException: Executor cannot find driver pod
    at org.apache.spark.scheduler.cluster.k8s.KubernetesClusterSchedulerBackend.liftedTree1$1(KubernetesClusterSchedulerBackend.scala:78)
    at org.apache.spark.scheduler.cluster.k8s.KubernetesClusterSchedulerBackend.<init>(KubernetesClusterSchedulerBackend.scala:72)
    at org.apache.spark.scheduler.cluster.k8s.KubernetesClusterManager.createSchedulerBackend(KubernetesClusterManager.scala:142)
    at org.apache.spark.SparkContext$.org$apache$spark$SparkContext$$createTaskScheduler(SparkContext.scala:2764)
    at org.apache.spark.SparkContext.<init>(SparkContext.scala:501)
    at org.apache.spark.api.java.JavaSparkContext.<init>(JavaSparkContext.scala:58)
    at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
    at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:62)
    at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
    at java.lang.reflect.Constructor.newInstance(Constructor.java:423)
    at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:247)
    at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)
    at py4j.Gateway.invoke(Gateway.java:236)
    at py4j.commands.ConstructorCommand.invokeConstructor(ConstructorCommand.java:80)
    at py4j.commands.ConstructorCommand.execute(ConstructorCommand.java:69)
    at py4j.GatewayConnection.run(GatewayConnection.java:214)
    at java.lang.Thread.run(Thread.java:748)
Caused by: io.fabric8.kubernetes.client.KubernetesClientException: Operation: [get]  for kind: [Pod]  with name: [spark-pi-1512635912762-driver]  in namespace: [default]  failed.
    at io.fabric8.kubernetes.client.KubernetesClientException.launderThrowable(KubernetesClientException.java:62)
    at io.fabric8.kubernetes.client.KubernetesClientException.launderThrowable(KubernetesClientException.java:71)
    at io.fabric8.kubernetes.client.dsl.base.BaseOperation.getMandatory(BaseOperation.java:228)
    at io.fabric8.kubernetes.client.dsl.base.BaseOperation.get(BaseOperation.java:184)
    at org.apache.spark.scheduler.cluster.k8s.KubernetesClusterSchedulerBackend.liftedTree1$1(KubernetesClusterSchedulerBackend.scala:74)
    ... 16 more
Caused by: java.net.SocketTimeoutException: connect timed out
    at java.net.PlainSocketImpl.socketConnect(Native Method)
    at java.net.AbstractPlainSocketImpl.doConnect(AbstractPlainSocketImpl.java:350)
    at java.net.AbstractPlainSocketImpl.connectToAddress(AbstractPlainSocketImpl.java:206)
    at java.net.AbstractPlainSocketImpl.connect(AbstractPlainSocketImpl.java:188)
    at java.net.SocksSocketImpl.connect(SocksSocketImpl.java:392)
    at java.net.Socket.connect(Socket.java:589)
    at okhttp3.internal.platform.Platform.connectSocket(Platform.java:124)
    at okhttp3.internal.connection.RealConnection.connectSocket(RealConnection.java:223)
    at okhttp3.internal.connection.RealConnection.connect(RealConnection.java:149)
    at okhttp3.internal.connection.StreamAllocation.findConnection(StreamAllocation.java:195)
    at okhttp3.internal.connection.StreamAllocation.findHealthyConnection(StreamAllocation.java:121)
    at okhttp3.internal.connection.StreamAllocation.newStream(StreamAllocation.java:100)
    at okhttp3.internal.connection.ConnectInterceptor.intercept(ConnectInterceptor.java:42)
    at okhttp3.internal.http.RealInterceptorChain.proceed(RealInterceptorChain.java:92)
    at okhttp3.internal.http.RealInterceptorChain.proceed(RealInterceptorChain.java:67)
    at okhttp3.internal.cache.CacheInterceptor.intercept(CacheInterceptor.java:93)
    at okhttp3.internal.http.RealInterceptorChain.proceed(RealInterceptorChain.java:92)
    at okhttp3.internal.http.RealInterceptorChain.proceed(RealInterceptorChain.java:67)
    at okhttp3.internal.http.BridgeInterceptor.intercept(BridgeInterceptor.java:93)
    at okhttp3.internal.http.RealInterceptorChain.proceed(RealInterceptorChain.java:92)
    at okhttp3.internal.http.RetryAndFollowUpInterceptor.intercept(RetryAndFollowUpInterceptor.java:120)
    at okhttp3.internal.http.RealInterceptorChain.proceed(RealInterceptorChain.java:92)
    at okhttp3.internal.http.RealInterceptorChain.proceed(RealInterceptorChain.java:67)
    at io.fabric8.kubernetes.client.utils.HttpClientUtils$2.intercept(HttpClientUtils.java:93)
    at okhttp3.internal.http.RealInterceptorChain.proceed(RealInterceptorChain.java:92)
    at okhttp3.internal.http.RealInterceptorChain.proceed(RealInterceptorChain.java:67)
    at okhttp3.RealCall.getResponseWithInterceptorChain(RealCall.java:185)
    at okhttp3.RealCall.execute(RealCall.java:69)
    at io.fabric8.kubernetes.client.dsl.base.OperationSupport.handleResponse(OperationSupport.java:377)
    at io.fabric8.kubernetes.client.dsl.base.OperationSupport.handleResponse(OperationSupport.java:343)
    at io.fabric8.kubernetes.client.dsl.base.OperationSupport.handleGet(OperationSupport.java:312)
    at io.fabric8.kubernetes.client.dsl.base.OperationSupport.handleGet(OperationSupport.java:295)
    at io.fabric8.kubernetes.client.dsl.base.BaseOperation.handleGet(BaseOperation.java:783)
    at io.fabric8.kubernetes.client.dsl.base.BaseOperation.getMandatory(BaseOperation.java:217)
    ... 18 more
2017-12-07 08:38:48 INFO  AbstractConnector:310 - Stopped Spark@482e5524{HTTP/1.1,[http/1.1]}{0.0.0.0:4040}
2017-12-07 08:38:48 INFO  SparkUI:54 - Stopped Spark web UI at http://spark-pi-1512635912762-driver-svc.default.svc.cluster.local:4040
2017-12-07 08:38:48 INFO  MapOutputTrackerMasterEndpoint:54 - MapOutputTrackerMasterEndpoint stopped!
2017-12-07 08:38:48 INFO  MemoryStore:54 - MemoryStore cleared
2017-12-07 08:38:48 INFO  BlockManager:54 - BlockManager stopped
2017-12-07 08:38:48 INFO  BlockManagerMaster:54 - BlockManagerMaster stopped
2017-12-07 08:38:48 WARN  MetricsSystem:66 - Stopping a MetricsSystem that is not running
2017-12-07 08:38:48 INFO  OutputCommitCoordinator$OutputCommitCoordinatorEndpoint:54 - OutputCommitCoordinator stopped!
2017-12-07 08:38:48 INFO  SparkContext:54 - Successfully stopped SparkContext
Traceback (most recent call last):
  File "/opt/spark/examples/src/main/python/pi.py", line 32, in <module>
    .appName("PythonPi")\
  File "/opt/spark/python/lib/pyspark.zip/pyspark/sql/session.py", line 169, in getOrCreate
  File "/opt/spark/python/lib/pyspark.zip/pyspark/context.py", line 334, in getOrCreate
  File "/opt/spark/python/lib/pyspark.zip/pyspark/context.py", line 118, in __init__
  File "/opt/spark/python/lib/pyspark.zip/pyspark/context.py", line 180, in _do_init
  File "/opt/spark/python/lib/pyspark.zip/pyspark/context.py", line 273, in _initialize_context
  File "/opt/spark/python/lib/py4j-0.10.4-src.zip/py4j/java_gateway.py", line 1401, in __call__
  File "/opt/spark/python/lib/py4j-0.10.4-src.zip/py4j/protocol.py", line 319, in get_return_value
py4j.protocol.Py4JJavaError: An error occurred while calling None.org.apache.spark.api.java.JavaSparkContext.
: org.apache.spark.SparkException: Executor cannot find driver pod
    at org.apache.spark.scheduler.cluster.k8s.KubernetesClusterSchedulerBackend.liftedTree1$1(KubernetesClusterSchedulerBackend.scala:78)
    at org.apache.spark.scheduler.cluster.k8s.KubernetesClusterSchedulerBackend.<init>(KubernetesClusterSchedulerBackend.scala:72)
    at org.apache.spark.scheduler.cluster.k8s.KubernetesClusterManager.createSchedulerBackend(KubernetesClusterManager.scala:142)
    at org.apache.spark.SparkContext$.org$apache$spark$SparkContext$$createTaskScheduler(SparkContext.scala:2764)
    at org.apache.spark.SparkContext.<init>(SparkContext.scala:501)
    at org.apache.spark.api.java.JavaSparkContext.<init>(JavaSparkContext.scala:58)
    at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
    at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:62)
    at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
    at java.lang.reflect.Constructor.newInstance(Constructor.java:423)
    at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:247)
    at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)
    at py4j.Gateway.invoke(Gateway.java:236)
    at py4j.commands.ConstructorCommand.invokeConstructor(ConstructorCommand.java:80)
    at py4j.commands.ConstructorCommand.execute(ConstructorCommand.java:69)
    at py4j.GatewayConnection.run(GatewayConnection.java:214)
    at java.lang.Thread.run(Thread.java:748)
Caused by: io.fabric8.kubernetes.client.KubernetesClientException: Operation: [get]  for kind: [Pod]  with name: [spark-pi-1512635912762-driver]  in namespace: [default]  failed.
    at io.fabric8.kubernetes.client.KubernetesClientException.launderThrowable(KubernetesClientException.java:62)
    at io.fabric8.kubernetes.client.KubernetesClientException.launderThrowable(KubernetesClientException.java:71)
    at io.fabric8.kubernetes.client.dsl.base.BaseOperation.getMandatory(BaseOperation.java:228)
    at io.fabric8.kubernetes.client.dsl.base.BaseOperation.get(BaseOperation.java:184)
    at org.apache.spark.scheduler.cluster.k8s.KubernetesClusterSchedulerBackend.liftedTree1$1(KubernetesClusterSchedulerBackend.scala:74)
    ... 16 more
Caused by: java.net.SocketTimeoutException: connect timed out
    at java.net.PlainSocketImpl.socketConnect(Native Method)
    at java.net.AbstractPlainSocketImpl.doConnect(AbstractPlainSocketImpl.java:350)
    at java.net.AbstractPlainSocketImpl.connectToAddress(AbstractPlainSocketImpl.java:206)
    at java.net.AbstractPlainSocketImpl.connect(AbstractPlainSocketImpl.java:188)
    at java.net.SocksSocketImpl.connect(SocksSocketImpl.java:392)
    at java.net.Socket.connect(Socket.java:589)
    at okhttp3.internal.platform.Platform.connectSocket(Platform.java:124)
    at okhttp3.internal.connection.RealConnection.connectSocket(RealConnection.java:223)
    at okhttp3.internal.connection.RealConnection.connect(RealConnection.java:149)
    at okhttp3.internal.connection.StreamAllocation.findConnection(StreamAllocation.java:195)
    at okhttp3.internal.connection.StreamAllocation.findHealthyConnection(StreamAllocation.java:121)
    at okhttp3.internal.connection.StreamAllocation.newStream(StreamAllocation.java:100)
    at okhttp3.internal.connection.ConnectInterceptor.intercept(ConnectInterceptor.java:42)
    at okhttp3.internal.http.RealInterceptorChain.proceed(RealInterceptorChain.java:92)
    at okhttp3.internal.http.RealInterceptorChain.proceed(RealInterceptorChain.java:67)
    at okhttp3.internal.cache.CacheInterceptor.intercept(CacheInterceptor.java:93)
    at okhttp3.internal.http.RealInterceptorChain.proceed(RealInterceptorChain.java:92)
    at okhttp3.internal.http.RealInterceptorChain.proceed(RealInterceptorChain.java:67)
    at okhttp3.internal.http.BridgeInterceptor.intercept(BridgeInterceptor.java:93)
    at okhttp3.internal.http.RealInterceptorChain.proceed(RealInterceptorChain.java:92)
    at okhttp3.internal.http.RetryAndFollowUpInterceptor.intercept(RetryAndFollowUpInterceptor.java:120)
    at okhttp3.internal.http.RealInterceptorChain.proceed(RealInterceptorChain.java:92)
    at okhttp3.internal.http.RealInterceptorChain.proceed(RealInterceptorChain.java:67)
    at io.fabric8.kubernetes.client.utils.HttpClientUtils$2.intercept(HttpClientUtils.java:93)
    at okhttp3.internal.http.RealInterceptorChain.proceed(RealInterceptorChain.java:92)
    at okhttp3.internal.http.RealInterceptorChain.proceed(RealInterceptorChain.java:67)
    at okhttp3.RealCall.getResponseWithInterceptorChain(RealCall.java:185)
    at okhttp3.RealCall.execute(RealCall.java:69)
    at io.fabric8.kubernetes.client.dsl.base.OperationSupport.handleResponse(OperationSupport.java:377)
    at io.fabric8.kubernetes.client.dsl.base.OperationSupport.handleResponse(OperationSupport.java:343)
    at io.fabric8.kubernetes.client.dsl.base.OperationSupport.handleGet(OperationSupport.java:312)
    at io.fabric8.kubernetes.client.dsl.base.OperationSupport.handleGet(OperationSupport.java:295)
    at io.fabric8.kubernetes.client.dsl.base.BaseOperation.handleGet(BaseOperation.java:783)
    at io.fabric8.kubernetes.client.dsl.base.BaseOperation.getMandatory(BaseOperation.java:217)
    ... 18 more

Am starting to wonder a few things now, when i do kubectl cluster-info, I get KubeDNS is running at http://localhost:8080/api/v1/namespaces/kube-system/services/kube-dns/proxy And when I try to do a curl at that address, I get this:

{
  "kind": "Status",
  "apiVersion": "v1",
  "metadata": {

  },
  "status": "Failure",
  "message": "no endpoints available for service \"kube-dns\"",
  "reason": "ServiceUnavailable",
  "code": 503
}

Do you know if this is normal, or do you also see a similar kind of message when you try to curl ?

liyinan926 commented 6 years ago

@apurva3000 it seems there's some problem with the service kube-dns in your cluster. Can you run kubectl get ep kube-dns --namespace=kube-system and see what the output looks like?

apurvann commented 6 years ago

I was able to get the endpoints as well, now I have destroyed that cluster and created a new one. There might have been some problems with kube-dns there for sure because now everything works

liyinan926 commented 6 years ago

Cool. The kube-dns add-on is required. I think we need to add a check to the submission process and fail the submission if the add-on does not exist or is not healthy.