apache-spark-on-k8s / spark

Apache Spark enhanced with native Kubernetes scheduler back-end: NOTE this repository is being ARCHIVED as all new development for the kubernetes scheduler back-end is now on https://github.com/apache/spark/
https://spark.apache.org/
Apache License 2.0

SparkPi Example: java.nio.channels.UnresolvedAddressException #523

Closed apurvann closed 6 years ago

apurvann commented 7 years ago

I am trying to follow the official documentation for running the SparkPi Example and encountering the following problem.

We create the kubernetes cluster using Magnum (https://wiki.openstack.org/wiki/Magnum)

And run the example as per the following command:

bin/spark-submit \
  --deploy-mode cluster \
  --master k8s://https://<MY-KUBE-MASTER-IP>:6443 \
  --kubernetes-namespace default \
  --conf spark.executor.instances=2 \
  --conf spark.app.name=spark-pi \
  --conf spark.kubernetes.driver.docker.image=kubespark/spark-driver-py:v2.2.0-kubernetes-0.4.0 \
  --conf spark.kubernetes.executor.docker.image=kubespark/spark-executor-py:v2.2.0-kubernetes-0.4.0 \
  --jars local:///opt/spark/examples/jars/spark-examples_2.11-2.2.0-k8s-0.4.0.jar \
    local:///opt/spark/examples/src/main/python/pi.py 10

Has anyone run into this before? Which address does it refer to, considering I have already given it the master IP?

2017-10-11 16:35:52 INFO SparkContext:54 - Running Spark version 2.2.0-k8s-0.4.0
2017-10-11 16:35:52 WARN NativeCodeLoader:62 - Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
2017-10-11 16:35:52 INFO SparkContext:54 - Submitted application: PythonPi
2017-10-11 16:35:52 INFO SecurityManager:54 - Changing view acls to: root
2017-10-11 16:35:52 INFO SecurityManager:54 - Changing modify acls to: root
2017-10-11 16:35:52 INFO SecurityManager:54 - Changing view acls groups to:
2017-10-11 16:35:52 INFO SecurityManager:54 - Changing modify acls groups to:
2017-10-11 16:35:52 INFO SecurityManager:54 - SecurityManager: authentication disabled; ui acls disabled; users with view permissions: Set(root); groups with view permissions: Set(); users with modify permissions: Set(root); groups with modify permissions: Set()
2017-10-11 16:35:53 ERROR SparkContext:91 - Error initializing SparkContext.
java.nio.channels.UnresolvedAddressException
    at sun.nio.ch.Net.checkAddress(Net.java:101)
    at sun.nio.ch.ServerSocketChannelImpl.bind(ServerSocketChannelImpl.java:218)
    at io.netty.channel.socket.nio.NioServerSocketChannel.doBind(NioServerSocketChannel.java:127)
    at io.netty.channel.AbstractChannel$AbstractUnsafe.bind(AbstractChannel.java:501)
    at io.netty.channel.DefaultChannelPipeline$HeadContext.bind(DefaultChannelPipeline.java:1218)
    at io.netty.channel.AbstractChannelHandlerContext.invokeBind(AbstractChannelHandlerContext.java:496)
    at io.netty.channel.AbstractChannelHandlerContext.bind(AbstractChannelHandlerContext.java:481)
    at io.netty.channel.DefaultChannelPipeline.bind(DefaultChannelPipeline.java:965)
    at io.netty.channel.AbstractChannel.bind(AbstractChannel.java:210)
    at io.netty.bootstrap.AbstractBootstrap$2.run(AbstractBootstrap.java:353)
    at io.netty.util.concurrent.SingleThreadEventExecutor.runAllTasks(SingleThreadEventExecutor.java:399)
    at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:446)
    at io.netty.util.concurrent.SingleThreadEventExecutor$2.run(SingleThreadEventExecutor.java:131)
    at io.netty.util.concurrent.DefaultThreadFactory$DefaultRunnableDecorator.run(DefaultThreadFactory.java:144)
    at java.lang.Thread.run(Thread.java:748)
2017-10-11 16:35:53 INFO SparkContext:54 - Successfully stopped SparkContext
Traceback (most recent call last):
  File "/opt/spark/examples/src/main/python/pi.py", line 32, in <module>
    .appName("PythonPi")\
  File "/opt/spark/python/lib/pyspark.zip/pyspark/sql/session.py", line 169, in getOrCreate
  File "/opt/spark/python/lib/pyspark.zip/pyspark/context.py", line 334, in getOrCreate
  File "/opt/spark/python/lib/pyspark.zip/pyspark/context.py", line 118, in __init__
  File "/opt/spark/python/lib/pyspark.zip/pyspark/context.py", line 180, in _do_init
  File "/opt/spark/python/lib/pyspark.zip/pyspark/context.py", line 273, in _initialize_context
  File "/opt/spark/python/lib/py4j-0.10.4-src.zip/py4j/java_gateway.py", line 1401, in __call__
  File "/opt/spark/python/lib/py4j-0.10.4-src.zip/py4j/protocol.py", line 319, in get_return_value
py4j.protocol.Py4JJavaError: An error occurred while calling None.org.apache.spark.api.java.JavaSparkContext.
: java.nio.channels.UnresolvedAddressException
    at sun.nio.ch.Net.checkAddress(Net.java:101)
    at sun.nio.ch.ServerSocketChannelImpl.bind(ServerSocketChannelImpl.java:218)
    at io.netty.channel.socket.nio.NioServerSocketChannel.doBind(NioServerSocketChannel.java:127)
    at io.netty.channel.AbstractChannel$AbstractUnsafe.bind(AbstractChannel.java:501)
    at io.netty.channel.DefaultChannelPipeline$HeadContext.bind(DefaultChannelPipeline.java:1218)
    at io.netty.channel.AbstractChannelHandlerContext.invokeBind(AbstractChannelHandlerContext.java:496)
    at io.netty.channel.AbstractChannelHandlerContext.bind(AbstractChannelHandlerContext.java:481)
    at io.netty.channel.DefaultChannelPipeline.bind(DefaultChannelPipeline.java:965)
    at io.netty.channel.AbstractChannel.bind(AbstractChannel.java:210)
    at io.netty.bootstrap.AbstractBootstrap$2.run(AbstractBootstrap.java:353)
    at io.netty.util.concurrent.SingleThreadEventExecutor.runAllTasks(SingleThreadEventExecutor.java:399)
    at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:446)
    at io.netty.util.concurrent.SingleThreadEventExecutor$2.run(SingleThreadEventExecutor.java:131)
    at io.netty.util.concurrent.DefaultThreadFactory$DefaultRunnableDecorator.run(DefaultThreadFactory.java:144)
    at java.lang.Thread.run(Thread.java:748)
Exception in thread "main" org.apache.spark.SparkUserAppException: User application exited with 1
    at org.apache.spark.deploy.PythonRunner$.main(PythonRunner.scala:97)
    at org.apache.spark.deploy.PythonRunner.main(PythonRunner.scala)

mccheah commented 7 years ago

In the driver pod, the driver attempts to connect to kubernetes.default.svc, which in turn requires kube-dns to be installed.
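
One quick way to check this from inside the driver pod is to try resolving the cluster-internal names directly. This is only an illustrative diagnostic, not part of the project:

import java.net.{InetAddress, UnknownHostException}

object DnsCheck {
  def main(args: Array[String]): Unit = {
    Seq("kubernetes.default.svc", "kubernetes.default.svc.cluster.local").foreach { name =>
      try {
        // Resolution succeeds only if kube-dns (or an equivalent DNS add-on) is serving the pod.
        println(s"$name -> ${InetAddress.getByName(name).getHostAddress}")
      } catch {
        case e: UnknownHostException => println(s"$name could not be resolved: $e")
      }
    }
  }
}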

mccheah commented 7 years ago

Also can we try this with Scala's SparkPi instead of Python?

kimoonkim commented 7 years ago

Just to add one more data point, I also saw this same error during an integration test off #514. I ran it manually inside IntelliJ and saw it happening, and the minikube cluster had kube-dns running.

$ kubectl logs -n 7f8bc76fe2084a6e9259bc491f85f3d9 spark-test-app-1507154737198-driver
2017-10-04 22:04:04 ERROR SparkContext:91 - Error initializing SparkContext.
java.nio.channels.UnresolvedAddressException
    at sun.nio.ch.Net.checkAddress(Net.java:101)
    at sun.nio.ch.ServerSocketChannelImpl.bind(ServerSocketChannelImpl.java:218)
    at io.netty.channel.socket.nio.NioServerSocketChannel.doBind(NioServerSocketChannel.java:127)
    at io.netty.channel.AbstractChannel$AbstractUnsafe.bind(AbstractChannel.java:501)
    at io.netty.channel.DefaultChannelPipeline$HeadContext.bind(DefaultChannelPipeline.java:1218)
    at io.netty.channel.AbstractChannelHandlerContext.invokeBind(AbstractChannelHandlerContext.java:496)
    at io.netty.channel.AbstractChannelHandlerContext.bind(AbstractChannelHandlerContext.java:481)
    at io.netty.channel.DefaultChannelPipeline.bind(DefaultChannelPipeline.java:965)
    at io.netty.channel.AbstractChannel.bind(AbstractChannel.java:210)
    at io.netty.bootstrap.AbstractBootstrap$2.run(AbstractBootstrap.java:353)
    at io.netty.util.concurrent.SingleThreadEventExecutor.runAllTasks(SingleThreadEventExecutor.java:399)
    at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:446)
    at io.netty.util.concurrent.SingleThreadEventExecutor$2.run(SingleThreadEventExecutor.java:131)
    at io.netty.util.concurrent.DefaultThreadFactory$DefaultRunnableDecorator.run(DefaultThreadFactory.java:144)
    at java.lang.Thread.run(Thread.java:748)
Exception in thread "main" java.nio.channels.UnresolvedAddressException
    at sun.nio.ch.Net.checkAddress(Net.java:101)
    at sun.nio.ch.ServerSocketChannelImpl.bind(ServerSocketChannelImpl.java:218)
    at io.netty.channel.socket.nio.NioServerSocketChannel.doBind(NioServerSocketChannel.java:127)
    at io.netty.channel.AbstractChannel$AbstractUnsafe.bind(AbstractChannel.java:501)
    at io.netty.channel.DefaultChannelPipeline$HeadContext.bind(DefaultChannelPipeline.java:1218)
    at io.netty.channel.AbstractChannelHandlerContext.invokeBind(AbstractChannelHandlerContext.java:496)
    at io.netty.channel.AbstractChannelHandlerContext.bind(AbstractChannelHandlerContext.java:481)
    at io.netty.channel.DefaultChannelPipeline.bind(DefaultChannelPipeline.java:965)
    at io.netty.channel.AbstractChannel.bind(AbstractChannel.java:210)
    at io.netty.bootstrap.AbstractBootstrap$2.run(AbstractBootstrap.java:353)
    at io.netty.util.concurrent.SingleThreadEventExecutor.runAllTasks(SingleThreadEventExecutor.java:399)
    at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:446)
    at io.netty.util.concurrent.SingleThreadEventExecutor$2.run(SingleThreadEventExecutor.java:131)
    at io.netty.util.concurrent.DefaultThreadFactory$DefaultRunnableDecorator.run(DefaultThreadFactory.java:144)
2017-10-04 22:04:04 INFO SparkContext:54 - Successfully stopped SparkContext
    at java.lang.Thread.run(Thread.java:748)

ifilonenko commented 7 years ago

Just to checkpoint a working state where I have not seen this error: upon merging #414 this error did not exist. It seems to have been introduced by the most recent pushes, maybe #483. @mccheah @foxish

mccheah commented 7 years ago

Can we double check that the headless service is actually being created in the cases where this error occurs? kubectl get svc, kubectl describe svc, etc.

kimoonkim commented 7 years ago

Sure, let me try to reproduce this and double-check if the headless service exists.

apurvann commented 7 years ago

@mccheah In my case, I do not have kube-dns running, and I don't know how to enable it because we use Magnum to deploy the cluster. I did run the SparkPi example using Scala, but no luck there either.

mccheah commented 7 years ago

Can we see the stack trace for the Scala job?

mccheah commented 7 years ago

Actually, @kimoonkim has given us a stack trace already - would like to see if it's the same as what @apurva3000 is seeing.

SOmeONee commented 7 years ago

In DriverAddressConfigurationStep.scala, the driver hostname is: val driverHostname = s"${driverService.getMetadata.getName}.$namespace.svc.cluster.local", so the kubelet should be configured with cluster_domain=cluster.local.

But I think this should be configurable rather than hard-coded.
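
As a rough sketch of what that could look like (the spark.kubernetes.driver.service.clusterDomain key below is hypothetical, not an existing setting):

// Sketch only: derive the driver hostname from a configurable cluster domain
// instead of the hard-coded "cluster.local". The config key is hypothetical.
def driverHostname(serviceName: String, namespace: String, conf: org.apache.spark.SparkConf): String = {
  val clusterDomain = conf.get("spark.kubernetes.driver.service.clusterDomain", "cluster.local")
  s"$serviceName.$namespace.svc.$clusterDomain"
}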

kimoonkim commented 7 years ago

I just tried to reproduce this in the same way as before, but this time it succeeded. Here's kubectl describe service output:

$ kubectl describe service -n 31c2f493de2641369b2761a0e44fa28c spark-test-app-1508285351080-driver-svc
Name:           spark-test-app-1508285351080-driver-svc
Namespace:      31c2f493de2641369b2761a0e44fa28c
Labels:         <none>
Annotations:        <none>
Selector:       spark-app-locator=05bb57e0f1084af2b2370d6c04d1c376,spark-app-selector=spark-37307be4c1664e9ea0ee6794c374817d,spark-role=driver
Type:           ClusterIP
IP:         None
Port:           driver-rpc-port 7078/TCP
Endpoints:      172.17.0.4:7078
Port:           blockmanager    7079/TCP
Endpoints:      172.17.0.4:7079
Session Affinity:   None
Events:         <none>

I wonder if we have a race condition here. If kube-dns is slow to add an entry for the headless service, then the driver fails while attempting to listen on the address. Maybe the driver should just use the underlying pod IP address instead of the service address for listening? (I don't know if this suggestion makes sense)

ifilonenko commented 7 years ago

re: @kimoonkim [@foxish doesn't the service address resolve to the appropriate pod IP address? And could it be the case that there is such a race condition? I also don't understand why it would sometimes fail and other times not]

apurvann commented 7 years ago

@mccheah I am seeing the same stack trace

2017-10-18 08:11:59 ERROR SparkContext:91 - Error initializing SparkContext.
java.nio.channels.UnresolvedAddressException
    at sun.nio.ch.Net.checkAddress(Net.java:101)
    at sun.nio.ch.ServerSocketChannelImpl.bind(ServerSocketChannelImpl.java:218)
    at io.netty.channel.socket.nio.NioServerSocketChannel.doBind(NioServerSocketChannel.java:127)
    at io.netty.channel.AbstractChannel$AbstractUnsafe.bind(AbstractChannel.java:501)
    at io.netty.channel.DefaultChannelPipeline$HeadContext.bind(DefaultChannelPipeline.java:1218)
    at io.netty.channel.AbstractChannelHandlerContext.invokeBind(AbstractChannelHandlerContext.java:496)
    at io.netty.channel.AbstractChannelHandlerContext.bind(AbstractChannelHandlerContext.java:481)
    at io.netty.channel.DefaultChannelPipeline.bind(DefaultChannelPipeline.java:965)
    at io.netty.channel.AbstractChannel.bind(AbstractChannel.java:210)
    at io.netty.bootstrap.AbstractBootstrap$2.run(AbstractBootstrap.java:353)
    at io.netty.util.concurrent.SingleThreadEventExecutor.runAllTasks(SingleThreadEventExecutor.java:399)
    at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:446)
    at io.netty.util.concurrent.SingleThreadEventExecutor$2.run(SingleThreadEventExecutor.java:131)
    at io.netty.util.concurrent.DefaultThreadFactory$DefaultRunnableDecorator.run(DefaultThreadFactory.java:144)
    at java.lang.Thread.run(Thread.java:748)
2017-10-18 08:11:59 INFO  SparkContext:54 - Successfully stopped SparkContext
Exception in thread "main" java.nio.channels.UnresolvedAddressException
    at sun.nio.ch.Net.checkAddress(Net.java:101)
    at sun.nio.ch.ServerSocketChannelImpl.bind(ServerSocketChannelImpl.java:218)
    at io.netty.channel.socket.nio.NioServerSocketChannel.doBind(NioServerSocketChannel.java:127)
    at io.netty.channel.AbstractChannel$AbstractUnsafe.bind(AbstractChannel.java:501)
    at io.netty.channel.DefaultChannelPipeline$HeadContext.bind(DefaultChannelPipeline.java:1218)
    at io.netty.channel.AbstractChannelHandlerContext.invokeBind(AbstractChannelHandlerContext.java:496)
    at io.netty.channel.AbstractChannelHandlerContext.bind(AbstractChannelHandlerContext.java:481)
    at io.netty.channel.DefaultChannelPipeline.bind(DefaultChannelPipeline.java:965)
    at io.netty.channel.AbstractChannel.bind(AbstractChannel.java:210)
    at io.netty.bootstrap.AbstractBootstrap$2.run(AbstractBootstrap.java:353)
    at io.netty.util.concurrent.SingleThreadEventExecutor.runAllTasks(SingleThreadEventExecutor.java:399)
    at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:446)
    at io.netty.util.concurrent.SingleThreadEventExecutor$2.run(SingleThreadEventExecutor.java:131)
    at io.netty.util.concurrent.DefaultThreadFactory$DefaultRunnableDecorator.run(DefaultThreadFactory.java:144)
    at java.lang.Thread.run(Thread.java:748)

Hope this helps.

liyinan926 commented 7 years ago

Regarding the race condition suggested by @kimoonkim, the submission client creates the driver pod before creating the headless service along with other Kubernetes resources the driver depends on. @apurva3000 as @mccheah suggested, can you run kubectl get svc to see if the service was indeed created when the error occurred?

kimoonkim commented 7 years ago

@liyinan926 Sure, that ordering exists. But does it mean the headless service DNS entry is actually created in kube-dns before the driver pod starts? The client code or even k8s API server may just send an asynchronous request to kube-dns without waiting for the dns entry create operation to complete.

liyinan926 commented 7 years ago

@kimoonkim AFAIK, the endpoints controller watches for services and creates corresponding endpoints when applicable. In this case, the headless service has a selector, so the endpoints controller will create the endpoints once it sees the service.

kimoonkim commented 7 years ago

But the endpoint backing the service in this case is the driver pod itself. So is it possible that whoever watches the endpoint, or kube-dns, is slower than the driver JVM and creates the DNS entry too late? It would not be the first time that slow DNS has affected us, so I wouldn't be surprised.

liyinan926 commented 7 years ago

@kimoonkim I think it's possible that, if the latter is slow, the driver pod starts and tries to bind to the hostname derived from the service name before the endpoints controller has created the endpoint and the DNS entry has been updated.
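
If that is what is happening, one possible mitigation would be to wait, before binding, until the service name actually resolves. A minimal sketch, purely illustrative and not part of the project:

import java.net.InetAddress
import scala.util.Try

// Sketch only: poll DNS until the headless service name resolves, or give up after maxAttempts.
def waitForDns(hostname: String, maxAttempts: Int = 30, intervalMillis: Long = 1000L): Boolean = {
  (1 to maxAttempts).exists { _ =>
    val resolved = Try(InetAddress.getByName(hostname)).isSuccess
    if (!resolved) Thread.sleep(intervalMillis)
    resolved
  }
}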

apurvann commented 7 years ago

@liyinan926 Okay, I tried describing the service as well, and this is what I get:

Name:              spark-pi-1508314311201-driver-svc
Namespace:         default
Labels:            <none>
Annotations:       <none>
Selector:          spark-app-selector=spark-c3d3cf27405f4997ab445817ea0e1ec9,spark-role=driver
Type:              ClusterIP
IP:                None
Port:              driver-rpc-port  7078/TCP
TargetPort:        7078/TCP
Endpoints:         
Port:              blockmanager  7079/TCP
TargetPort:        7079/TCP
Endpoints:         
Session Affinity:  None
Events:            <none>

So the endpoints are clearly blank in my output. What exactly am I missing here? The fact that I do not have kube-dns running? (And also the fact that I do not know how to enable it in OpenStack Magnum.)

kimoonkim commented 7 years ago

Oh, if kube-dns is not running, then I think this won't work at all. I'm curious why your cluster does not have kube-dns; I thought it was a required component of k8s these days.

liyinan926 commented 7 years ago

Yes, without the kube-dns addon, this won't work. The driver won't be able to resolve the fully-qualified driver service name ({driver-service}.{namespace}.svc.cluster.local) to the corresponding IP address it is trying to bind to.

liyinan926 commented 7 years ago

@kimoonkim @mccheah @ifilonenko I think @kimoonkim's suggestion above of using the underlying driver pod's IP address makes sense, since the derived driver hostname (${driverService.getMetadata.getName}.$namespace.svc.cluster.local) resolves to that IP address anyway. By using the IP address directly, we can avoid the possible race condition. With this, the submission client still creates the headless service, but sets spark.driver.bindAddress to the IP address of the driver pod; spark.driver.host will still use the derived hostname.
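
A rough sketch of that idea, assuming the pod IP is exposed to the driver container via the downward API (the SPARK_DRIVER_POD_IP environment variable name is illustrative, and sparkConf, driverServiceName, and namespace stand in for values the submission client already has):

// Sketch only: bind to the pod IP (no DNS lookup needed), while executors still
// reach the driver through the headless service hostname.
val driverServiceHostname = s"$driverServiceName.$namespace.svc.cluster.local"
sparkConf.set("spark.driver.host", driverServiceHostname)

// SPARK_DRIVER_POD_IP is assumed to be injected via the downward API (status.podIP).
sys.env.get("SPARK_DRIVER_POD_IP").foreach { podIp =>
  sparkConf.set("spark.driver.bindAddress", podIp)
}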