Closed apurvann closed 6 years ago
In the driver pod, the driver attempts to connect to `kubernetes.default.svc`, which in turn requires kube-dns to be installed.
Also, can we try this with Scala's SparkPi instead of Python?
Just to add one more data point: I also saw this same error occurring during an integration test off #514. I ran it manually inside IntelliJ and saw it happening, and the minikube had kube-dns running.
```
$ kubectl logs -n 7f8bc76fe2084a6e9259bc491f85f3d9 spark-test-app-1507154737198-driver
2017-10-04 22:04:04 ERROR SparkContext:91 - Error initializing SparkContext.
java.nio.channels.UnresolvedAddressException
    at sun.nio.ch.Net.checkAddress(Net.java:101)
    at sun.nio.ch.ServerSocketChannelImpl.bind(ServerSocketChannelImpl.java:218)
    at io.netty.channel.socket.nio.NioServerSocketChannel.doBind(NioServerSocketChannel.java:127)
    at io.netty.channel.AbstractChannel$AbstractUnsafe.bind(AbstractChannel.java:501)
    at io.netty.channel.DefaultChannelPipeline$HeadContext.bind(DefaultChannelPipeline.java:1218)
    at io.netty.channel.AbstractChannelHandlerContext.invokeBind(AbstractChannelHandlerContext.java:496)
    at io.netty.channel.AbstractChannelHandlerContext.bind(AbstractChannelHandlerContext.java:481)
    at io.netty.channel.DefaultChannelPipeline.bind(DefaultChannelPipeline.java:965)
    at io.netty.channel.AbstractChannel.bind(AbstractChannel.java:210)
    at io.netty.bootstrap.AbstractBootstrap$2.run(AbstractBootstrap.java:353)
    at io.netty.util.concurrent.SingleThreadEventExecutor.runAllTasks(SingleThreadEventExecutor.java:399)
    at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:446)
    at io.netty.util.concurrent.SingleThreadEventExecutor$2.run(SingleThreadEventExecutor.java:131)
    at io.netty.util.concurrent.DefaultThreadFactory$DefaultRunnableDecorator.run(DefaultThreadFactory.java:144)
    at java.lang.Thread.run(Thread.java:748)
Exception in thread "main" java.nio.channels.UnresolvedAddressException
    at sun.nio.ch.Net.checkAddress(Net.java:101)
    at sun.nio.ch.ServerSocketChannelImpl.bind(ServerSocketChannelImpl.java:218)
    at io.netty.channel.socket.nio.NioServerSocketChannel.doBind(NioServerSocketChannel.java:127)
    at io.netty.channel.AbstractChannel$AbstractUnsafe.bind(AbstractChannel.java:501)
    at io.netty.channel.DefaultChannelPipeline$HeadContext.bind(DefaultChannelPipeline.java:1218)
    at io.netty.channel.AbstractChannelHandlerContext.invokeBind(AbstractChannelHandlerContext.java:496)
    at io.netty.channel.AbstractChannelHandlerContext.bind(AbstractChannelHandlerContext.java:481)
    at io.netty.channel.DefaultChannelPipeline.bind(DefaultChannelPipeline.java:965)
    at io.netty.channel.AbstractChannel.bind(AbstractChannel.java:210)
    at io.netty.bootstrap.AbstractBootstrap$2.run(AbstractBootstrap.java:353)
    at io.netty.util.concurrent.SingleThreadEventExecutor.runAllTasks(SingleThreadEventExecutor.java:399)
    at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:446)
    at io.netty.util.concurrent.SingleThreadEventExecutor$2.run(SingleThreadEventExecutor.java:131)
    at io.netty.util.concurrent.DefaultThreadFactory$DefaultRunnableDecorator.run(DefaultThreadFactory.java:144)
    at java.lang.Thread.run(Thread.java:748)
2017-10-04 22:04:04 INFO SparkContext:54 - Successfully stopped SparkContext
```
Just to checkpoint a working state where I have not seen this error: as of merging #414, this error did not exist. It seems to have been introduced by more recent pushes, possibly #483. @mccheah @foxish
Can we double-check that the headless service is actually being created in the cases where this error occurs? `kubectl get svc`, `kubectl describe svc`, etc.
Sure, let me try to reproduce this and double-check if the headless service exists.
@mccheah In my case, I do not have kube-dns running, and I don't know how to enable it because we use Magnum to deploy the cluster. I did run the SparkPi example using Scala, but no luck either.
Can we see the stack trace for the Scala job?
Actually, @kimoonkim has given us a stack trace already - would like to see if it's the same as what @apurva3000 is seeing.
In DriverAddressConfigurationStep.scala, the driver hostname is:

```scala
val driverHostname = s"${driverService.getMetadata.getName}.$namespace.svc.cluster.local"
```

so kubelet should use:

```
cluster_domain=cluster.local
```

But I think this should be passed in via config.
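To illustrate the suggestion, here is a minimal sketch of what a configurable cluster domain could look like. This is hypothetical code, not the actual `DriverAddressConfigurationStep` implementation; the object and parameter names are illustrative.

```scala
// Hypothetical sketch: derive the driver hostname with the cluster
// domain taken from configuration instead of hard-coding
// "cluster.local". The config key and defaults are assumptions.
object DriverHostname {
  val DefaultClusterDomain = "cluster.local"

  def derive(serviceName: String,
             namespace: String,
             clusterDomain: String = DefaultClusterDomain): String =
    s"$serviceName.$namespace.svc.$clusterDomain"
}
```

A cluster built with a non-default `--cluster-domain` on the kubelet could then pass its own domain through Spark configuration rather than failing DNS resolution.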
I just tried to reproduce this in the same way as before, but this time it succeeded. Here's the `kubectl describe service` output:
```
$ kubectl describe service -n 31c2f493de2641369b2761a0e44fa28c spark-test-app-1508285351080-driver-svc
Name:              spark-test-app-1508285351080-driver-svc
Namespace:         31c2f493de2641369b2761a0e44fa28c
Labels:            <none>
Annotations:       <none>
Selector:          spark-app-locator=05bb57e0f1084af2b2370d6c04d1c376,spark-app-selector=spark-37307be4c1664e9ea0ee6794c374817d,spark-role=driver
Type:              ClusterIP
IP:                None
Port:              driver-rpc-port 7078/TCP
Endpoints:         172.17.0.4:7078
Port:              blockmanager 7079/TCP
Endpoints:         172.17.0.4:7079
Session Affinity:  None
Events:            <none>
```
I wonder if we have a race condition here. If kube-dns is slow to add an entry for the headless service, then the driver fails while attempting to listen on the address. Maybe the driver should just use the underlying pod IP address instead of the service address for listening? (I don't know if this suggestion makes sense)
re: @kimoonkim's question: @foxish, doesn't the service address resolve to the appropriate pod IP address? And could it be the case that there is such a race condition? I also don't understand why it would sometimes fail and other times not.
@mccheah I am seeing the same stack trace
```
2017-10-18 08:11:59 ERROR SparkContext:91 - Error initializing SparkContext.
java.nio.channels.UnresolvedAddressException
    at sun.nio.ch.Net.checkAddress(Net.java:101)
    at sun.nio.ch.ServerSocketChannelImpl.bind(ServerSocketChannelImpl.java:218)
    at io.netty.channel.socket.nio.NioServerSocketChannel.doBind(NioServerSocketChannel.java:127)
    at io.netty.channel.AbstractChannel$AbstractUnsafe.bind(AbstractChannel.java:501)
    at io.netty.channel.DefaultChannelPipeline$HeadContext.bind(DefaultChannelPipeline.java:1218)
    at io.netty.channel.AbstractChannelHandlerContext.invokeBind(AbstractChannelHandlerContext.java:496)
    at io.netty.channel.AbstractChannelHandlerContext.bind(AbstractChannelHandlerContext.java:481)
    at io.netty.channel.DefaultChannelPipeline.bind(DefaultChannelPipeline.java:965)
    at io.netty.channel.AbstractChannel.bind(AbstractChannel.java:210)
    at io.netty.bootstrap.AbstractBootstrap$2.run(AbstractBootstrap.java:353)
    at io.netty.util.concurrent.SingleThreadEventExecutor.runAllTasks(SingleThreadEventExecutor.java:399)
    at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:446)
    at io.netty.util.concurrent.SingleThreadEventExecutor$2.run(SingleThreadEventExecutor.java:131)
    at io.netty.util.concurrent.DefaultThreadFactory$DefaultRunnableDecorator.run(DefaultThreadFactory.java:144)
    at java.lang.Thread.run(Thread.java:748)
2017-10-18 08:11:59 INFO SparkContext:54 - Successfully stopped SparkContext
Exception in thread "main" java.nio.channels.UnresolvedAddressException
    at sun.nio.ch.Net.checkAddress(Net.java:101)
    at sun.nio.ch.ServerSocketChannelImpl.bind(ServerSocketChannelImpl.java:218)
    at io.netty.channel.socket.nio.NioServerSocketChannel.doBind(NioServerSocketChannel.java:127)
    at io.netty.channel.AbstractChannel$AbstractUnsafe.bind(AbstractChannel.java:501)
    at io.netty.channel.DefaultChannelPipeline$HeadContext.bind(DefaultChannelPipeline.java:1218)
    at io.netty.channel.AbstractChannelHandlerContext.invokeBind(AbstractChannelHandlerContext.java:496)
    at io.netty.channel.AbstractChannelHandlerContext.bind(AbstractChannelHandlerContext.java:481)
    at io.netty.channel.DefaultChannelPipeline.bind(DefaultChannelPipeline.java:965)
    at io.netty.channel.AbstractChannel.bind(AbstractChannel.java:210)
    at io.netty.bootstrap.AbstractBootstrap$2.run(AbstractBootstrap.java:353)
    at io.netty.util.concurrent.SingleThreadEventExecutor.runAllTasks(SingleThreadEventExecutor.java:399)
    at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:446)
    at io.netty.util.concurrent.SingleThreadEventExecutor$2.run(SingleThreadEventExecutor.java:131)
    at io.netty.util.concurrent.DefaultThreadFactory$DefaultRunnableDecorator.run(DefaultThreadFactory.java:144)
    at java.lang.Thread.run(Thread.java:748)
```
Hope this helps.
Regarding the race condition suggested by @kimoonkim, the submission client creates the driver pod before creating the headless service along with the other Kubernetes resources the driver depends on. @apurva3000, as @mccheah suggested, can you run `kubectl get svc` to see if the service was indeed created when the error occurred?
@liyinan926 Sure, that ordering exists. But does it mean the headless service's DNS entry is actually created in kube-dns before the driver pod starts? The client code, or even the k8s API server, may just send an asynchronous request to kube-dns without waiting for the DNS entry creation to complete.
@kimoonkim AFAIK, the endpoints controller watches for services and creates corresponding endpoints when applicable. In this case, the headless service has a selector, so the endpoints controller will create the endpoints once it sees the service.
But the endpoint backing the service in this case is the driver pod itself. So is it possible that whoever watches the endpoints, or kube-dns, is slower than the driver JVM and creates the DNS entry too late? It's not the first time that slow DNS has affected us, so I wouldn't be surprised.
@kimoonkim I think it's possible that the driver pod starts and tries to bind to the host name derived from the service name before the endpoint controller creates the endpoint and modifies the DNS configuration if the latter is slow.
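One way to see the race concretely: the driver could, in principle, poll until the service name resolves before binding. This is a hypothetical sketch, not code from the submission client or driver; the object name, attempt counts, and delays are all illustrative.

```scala
import java.net.InetAddress
import scala.util.{Failure, Success, Try}

// Hypothetical sketch: wait for kube-dns to publish the headless-service
// entry before the driver tries to bind, so a slow DNS update does not
// surface as java.nio.channels.UnresolvedAddressException.
object DnsReadiness {
  def waitForDns(hostname: String,
                 maxAttempts: Int = 30,
                 delayMillis: Long = 1000L): Boolean =
    (1 to maxAttempts).exists { attempt =>
      Try(InetAddress.getByName(hostname)) match {
        case Success(_) => true // name resolves; safe to bind
        case Failure(_) =>
          // not yet resolvable; back off and retry
          if (attempt < maxAttempts) Thread.sleep(delayMillis)
          false
      }
    }
}
```

Such a loop would mask the race but adds startup latency; binding to the pod IP directly (discussed below in the thread) avoids the DNS dependency entirely.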
@liyinan926 Okay, I tried describing the service as well, and this is what I get:

```
Name:              spark-pi-1508314311201-driver-svc
Namespace:         default
Labels:            <none>
Annotations:       <none>
Selector:          spark-app-selector=spark-c3d3cf27405f4997ab445817ea0e1ec9,spark-role=driver
Type:              ClusterIP
IP:                None
Port:              driver-rpc-port 7078/TCP
TargetPort:        7078/TCP
Endpoints:
Port:              blockmanager 7079/TCP
TargetPort:        7079/TCP
Endpoints:
Session Affinity:  None
Events:            <none>
```
So the endpoints are clearly blank in my output. What exactly am I missing here? The fact that I do not have kube-dns running? (And also the fact that I do not know how to enable that in OpenStack Magnum.)
Oh, if kube-dns is not running, then I think this won't work at all. I'm curious why your cluster does not have kube-dns; I thought kube-dns is a required component for k8s these days.
Yes, without the kube-dns addon, this won't work. The driver won't be able to resolve the fully-qualified driver service name (`{driver-service}.{namespace}.svc.cluster.local`) to the corresponding IP address it is trying to bind to.
@kimoonkim @mccheah @ifilonenko I think @kimoonkim's point above about using the underlying driver pod's IP address makes sense, since the derived driver hostname (`${driverService.getMetadata.getName}.$namespace.svc.cluster.local`) resolves to that IP address anyway. By using the IP address directly, we can avoid the possible race condition. With this, the submission client still creates the headless service, but sets `spark.driver.bindAddress` to the IP address of the driver pod. `spark.driver.host` will still use the derived hostname.
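The proposed split can be sketched as follows. The two property keys are real Spark settings, but the helper itself and the way `podIP` would be obtained (e.g. a downward-API env var populated from `status.podIP`) are assumptions for illustration, not the actual submission-client code.

```scala
// Hypothetical sketch of the proposed configuration split.
object DriverAddressProposal {
  def driverAddressConf(serviceName: String,
                        namespace: String,
                        podIP: String): Map[String, String] = Map(
    // Executors still reach the driver via the stable headless-service
    // DNS name...
    "spark.driver.host" -> s"$serviceName.$namespace.svc.cluster.local",
    // ...but the driver binds its listening sockets to the pod IP
    // directly, so startup no longer depends on the DNS entry existing.
    "spark.driver.bindAddress" -> podIP
  )
}
```

The design point is that binding is local to the pod and never needs DNS, while advertising a service name keeps a stable address for executors even though pod IPs are ephemeral.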
I am trying to follow the official documentation for running the SparkPi example and am encountering the following problem.
We create the Kubernetes cluster using Magnum (https://wiki.openstack.org/wiki/Magnum)
and run the example with the following command:
Has anyone run into this before? Which address does it refer to, considering I have already given it the master IP?
```
2017-10-11 16:35:52 INFO SparkContext:54 - Running Spark version 2.2.0-k8s-0.4.0
2017-10-11 16:35:52 WARN NativeCodeLoader:62 - Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
2017-10-11 16:35:52 INFO SparkContext:54 - Submitted application: PythonPi
2017-10-11 16:35:52 INFO SecurityManager:54 - Changing view acls to: root
2017-10-11 16:35:52 INFO SecurityManager:54 - Changing modify acls to: root
2017-10-11 16:35:52 INFO SecurityManager:54 - Changing view acls groups to:
2017-10-11 16:35:52 INFO SecurityManager:54 - Changing modify acls groups to:
2017-10-11 16:35:52 INFO SecurityManager:54 - SecurityManager: authentication disabled; ui acls disabled; users with view permissions: Set(root); groups with view permissions: Set(); users with modify permissions: Set(root); groups with modify permissions: Set()
2017-10-11 16:35:53 ERROR SparkContext:91 - Error initializing SparkContext.
java.nio.channels.UnresolvedAddressException
    at sun.nio.ch.Net.checkAddress(Net.java:101)
    at sun.nio.ch.ServerSocketChannelImpl.bind(ServerSocketChannelImpl.java:218)
    at io.netty.channel.socket.nio.NioServerSocketChannel.doBind(NioServerSocketChannel.java:127)
    at io.netty.channel.AbstractChannel$AbstractUnsafe.bind(AbstractChannel.java:501)
    at io.netty.channel.DefaultChannelPipeline$HeadContext.bind(DefaultChannelPipeline.java:1218)
    at io.netty.channel.AbstractChannelHandlerContext.invokeBind(AbstractChannelHandlerContext.java:496)
    at io.netty.channel.AbstractChannelHandlerContext.bind(AbstractChannelHandlerContext.java:481)
    at io.netty.channel.DefaultChannelPipeline.bind(DefaultChannelPipeline.java:965)
    at io.netty.channel.AbstractChannel.bind(AbstractChannel.java:210)
    at io.netty.bootstrap.AbstractBootstrap$2.run(AbstractBootstrap.java:353)
    at io.netty.util.concurrent.SingleThreadEventExecutor.runAllTasks(SingleThreadEventExecutor.java:399)
    at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:446)
    at io.netty.util.concurrent.SingleThreadEventExecutor$2.run(SingleThreadEventExecutor.java:131)
    at io.netty.util.concurrent.DefaultThreadFactory$DefaultRunnableDecorator.run(DefaultThreadFactory.java:144)
    at java.lang.Thread.run(Thread.java:748)
2017-10-11 16:35:53 INFO SparkContext:54 - Successfully stopped SparkContext
Traceback (most recent call last):
  File "/opt/spark/examples/src/main/python/pi.py", line 32, in <module>
    .appName("PythonPi")\
  File "/opt/spark/python/lib/pyspark.zip/pyspark/sql/session.py", line 169, in getOrCreate
  File "/opt/spark/python/lib/pyspark.zip/pyspark/context.py", line 334, in getOrCreate
  File "/opt/spark/python/lib/pyspark.zip/pyspark/context.py", line 118, in __init__
  File "/opt/spark/python/lib/pyspark.zip/pyspark/context.py", line 180, in _do_init
  File "/opt/spark/python/lib/pyspark.zip/pyspark/context.py", line 273, in _initialize_context
  File "/opt/spark/python/lib/py4j-0.10.4-src.zip/py4j/java_gateway.py", line 1401, in __call__
  File "/opt/spark/python/lib/py4j-0.10.4-src.zip/py4j/protocol.py", line 319, in get_return_value
py4j.protocol.Py4JJavaError: An error occurred while calling None.org.apache.spark.api.java.JavaSparkContext.
: java.nio.channels.UnresolvedAddressException
    at sun.nio.ch.Net.checkAddress(Net.java:101)
    at sun.nio.ch.ServerSocketChannelImpl.bind(ServerSocketChannelImpl.java:218)
    at io.netty.channel.socket.nio.NioServerSocketChannel.doBind(NioServerSocketChannel.java:127)
    at io.netty.channel.AbstractChannel$AbstractUnsafe.bind(AbstractChannel.java:501)
    at io.netty.channel.DefaultChannelPipeline$HeadContext.bind(DefaultChannelPipeline.java:1218)
    at io.netty.channel.AbstractChannelHandlerContext.invokeBind(AbstractChannelHandlerContext.java:496)
    at io.netty.channel.AbstractChannelHandlerContext.bind(AbstractChannelHandlerContext.java:481)
    at io.netty.channel.DefaultChannelPipeline.bind(DefaultChannelPipeline.java:965)
    at io.netty.channel.AbstractChannel.bind(AbstractChannel.java:210)
    at io.netty.bootstrap.AbstractBootstrap$2.run(AbstractBootstrap.java:353)
    at io.netty.util.concurrent.SingleThreadEventExecutor.runAllTasks(SingleThreadEventExecutor.java:399)
    at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:446)
    at io.netty.util.concurrent.SingleThreadEventExecutor$2.run(SingleThreadEventExecutor.java:131)
    at io.netty.util.concurrent.DefaultThreadFactory$DefaultRunnableDecorator.run(DefaultThreadFactory.java:144)
    at java.lang.Thread.run(Thread.java:748)
Exception in thread "main" org.apache.spark.SparkUserAppException: User application exited with 1
    at org.apache.spark.deploy.PythonRunner$.main(PythonRunner.scala:97)
    at org.apache.spark.deploy.PythonRunner.main(PythonRunner.scala)
```