adelbertc opened this issue 6 years ago
As suspected, it was due to the driver container not exposing the ports on creation. While the service selected the driver pod as expected, the driver did not have any ports open for the service to proxy requests to.
Filed PR: https://github.com/apache-spark-on-k8s/spark/pull/618
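For anyone hitting this before the fix lands, a quick way to confirm the symptom on a running application (just a sketch; the pod and service names below are taken from the reproduction later in this thread, substitute your own) is to compare the ports declared by the driver container with the endpoints published by the headless service:
# Driver container declares no ports (matches the "Port: <none>" in kubectl describe)
kubectl get pod spark-pi-1520454766917-driver -o jsonpath='{.spec.containers[0].ports}'
# The headless service still lists the pod as an endpoint, because endpoint ports
# come from the service's targetPort, not from the container's declared ports
kubectl get endpoints spark-pi-1520454766917-driver-svc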
We've seen a similar issue under certain network overlay configurations. In our case the problem seemed to be that, because only a headless service (with no cluster IP) gets created, the fully qualified cluster hostname for the service would not resolve, while the bare service name did. This only happened under Romana; it worked fine under other network overlays such as Flannel, so this issue may be specific to certain overlays.
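For anyone debugging a similar setup, one way to compare the two lookups from inside the cluster (a sketch only; the hostname is the one from the stack trace below, and any image that ships nslookup will do, tutum/dnsutils is just one common choice):
# Fully qualified cluster hostname (did not resolve for us under Romana)
kubectl run dns-check --rm -it --restart=Never --image=tutum/dnsutils -- \
  nslookup spark-pi-1518117806498-driver-svc.default.svc.cluster.local
# Bare service name (resolved fine; the throwaway pod must be in the same namespace as the service)
kubectl run dns-check --rm -it --restart=Never --image=tutum/dnsutils -- \
  nslookup spark-pi-1518117806498-driver-svc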
Here's our stack trace for comparison:
2018-02-08 19:24:04 WARN TransportClientFactory:166 - DNS resolution for spark-pi-1518117806498-driver-svc.default.svc.cluster.local:7078 took 5008 ms
Exception in thread "main" java.lang.reflect.UndeclaredThrowableException
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1713)
at org.apache.spark.deploy.SparkHadoopUtil.runAsSparkUser(SparkHadoopUtil.scala:66)
at org.apache.spark.executor.CoarseGrainedExecutorBackend$.run(CoarseGrainedExecutorBackend.scala:188)
at org.apache.spark.executor.CoarseGrainedExecutorBackend$.main(CoarseGrainedExecutorBackend.scala:284)
at org.apache.spark.executor.CoarseGrainedExecutorBackend.main(CoarseGrainedExecutorBackend.scala)
Caused by: org.apache.spark.SparkException: Exception thrown in awaitResult:
at org.apache.spark.util.ThreadUtils$.awaitResult(ThreadUtils.scala:205)
at org.apache.spark.rpc.RpcTimeout.awaitResult(RpcTimeout.scala:75)
at org.apache.spark.rpc.RpcEnv.setupEndpointRefByURI(RpcEnv.scala:100)
at org.apache.spark.executor.CoarseGrainedExecutorBackend$$anonfun$run$1.apply$mcV$sp(CoarseGrainedExecutorBackend.scala:202)
at org.apache.spark.deploy.SparkHadoopUtil$$anon$2.run(SparkHadoopUtil.scala:67)
at org.apache.spark.deploy.SparkHadoopUtil$$anon$2.run(SparkHadoopUtil.scala:66)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:422)
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1698)
... 4 more
Caused by: java.io.IOException: Failed to connect to spark-pi-1518117806498-driver-svc.default.svc.cluster.local:7078
at org.apache.spark.network.client.TransportClientFactory.createClient(TransportClientFactory.java:232)
at org.apache.spark.network.client.TransportClientFactory.createClient(TransportClientFactory.java:182)
at org.apache.spark.rpc.netty.NettyRpcEnv.createClient(NettyRpcEnv.scala:197)
at org.apache.spark.rpc.netty.Outbox$$anon$1.call(Outbox.scala:194)
at org.apache.spark.rpc.netty.Outbox$$anon$1.call(Outbox.scala:190)
at java.util.concurrent.FutureTask.run(FutureTask.java:266)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:748)
Caused by: java.nio.channels.UnresolvedAddressException
at sun.nio.ch.Net.checkAddress(Net.java:101)
at sun.nio.ch.SocketChannelImpl.connect(SocketChannelImpl.java:622)
at io.netty.channel.socket.nio.NioSocketChannel.doConnect(NioSocketChannel.java:242)
at io.netty.channel.nio.AbstractNioChannel$AbstractNioUnsafe.connect(AbstractNioChannel.java:205)
at io.netty.channel.DefaultChannelPipeline$HeadContext.connect(DefaultChannelPipeline.java:1226)
at io.netty.channel.AbstractChannelHandlerContext.invokeConnect(AbstractChannelHandlerContext.java:540)
at io.netty.channel.AbstractChannelHandlerContext.connect(AbstractChannelHandlerContext.java:525)
at io.netty.channel.ChannelOutboundHandlerAdapter.connect(ChannelOutboundHandlerAdapter.java:47)
at io.netty.channel.AbstractChannelHandlerContext.invokeConnect(AbstractChannelHandlerContext.java:540)
at io.netty.channel.AbstractChannelHandlerContext.connect(AbstractChannelHandlerContext.java:525)
at io.netty.channel.ChannelDuplexHandler.connect(ChannelDuplexHandler.java:50)
at io.netty.channel.AbstractChannelHandlerContext.invokeConnect(AbstractChannelHandlerContext.java:540)
at io.netty.channel.AbstractChannelHandlerContext.connect(AbstractChannelHandlerContext.java:525)
at io.netty.channel.AbstractChannelHandlerContext.connect(AbstractChannelHandlerContext.java:507)
at io.netty.channel.DefaultChannelPipeline.connect(DefaultChannelPipeline.java:970)
at io.netty.channel.AbstractChannel.connect(AbstractChannel.java:215)
at io.netty.bootstrap.Bootstrap$2.run(Bootstrap.java:166)
at io.netty.util.concurrent.SingleThreadEventExecutor.runAllTasks(SingleThreadEventExecutor.java:399)
at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:446)
at io.netty.util.concurrent.SingleThreadEventExecutor$2.run(SingleThreadEventExecutor.java:131)
at io.netty.util.concurrent.DefaultThreadFactory$DefaultRunnableDecorator.run(DefaultThreadFactory.java:144)
... 1 more
I'm seeing the exact same thing on Kubernetes 1.9 with Flannel, using spark v2.2.0-kubernetes-0.5.0
Command Run:
spark-submit \
--deploy-mode cluster \
--class org.apache.spark.examples.SparkPi \
--master k8s://https://x.x.x.x:6443 \
--kubernetes-namespace default \
--conf spark.executor.instances=5 \
--conf spark.app.name=spark-pi \
--conf spark.kubernetes.driver.docker.image=kubespark/spark-driver:v2.2.0-kubernetes-0.5.0 \
--conf spark.kubernetes.executor.docker.image=kubespark/spark-executor:v2.2.0-kubernetes-0.5.0 \
local:///opt/spark/examples/jars/spark-examples_2.11-2.2.0-k8s-0.5.0.jar
$ kubectl describe service spark-pi-1520454766917-driver-svc
Name: spark-pi-1520454766917-driver-svc
Namespace: default
Labels: <none>
Annotations: <none>
Selector: spark-app-selector=spark-97c6f27b591a48c88c2a7ae9a64f5af2,spark-role=driver
Type: ClusterIP
IP: None
Port: driver-rpc-port 7078/TCP
TargetPort: 7078/TCP
Endpoints: 10.244.4.16:7078
Port: blockmanager 7079/TCP
TargetPort: 7079/TCP
Endpoints: 10.244.4.16:7079
Session Affinity: None
Events: <none>
$ kubectl describe pod spark-pi-1520454766917-driver
Name: spark-pi-1520454766917-driver
Namespace: default
Node: rd-cluster-5/10.176.49.113
Start Time: Wed, 07 Mar 2018 15:32:48 -0500
Labels: spark-app-selector=spark-97c6f27b591a48c88c2a7ae9a64f5af2
spark-role=driver
Annotations: spark-app-name=spark-pi
Status: Running
IP: 10.244.4.16
Containers:
spark-kubernetes-driver:
Container ID: docker://07e7d70d94c1df7b24323e202f4c9f2ac0cbc18402f8c67f56f575909f385856
Image: kubespark/spark-driver:v2.2.0-kubernetes-0.5.0
Image ID: docker-pullable://kubespark/spark-driver@sha256:7539c28c4dbfc55addf5c3f439d26b964c4b04cd780a066c708242867356eb0b
Port: <none>
State: Running
Started: Wed, 07 Mar 2018 15:32:50 -0500
Ready: True
Restart Count: 0
Limits:
memory: 1408Mi
Requests:
cpu: 1
memory: 1Gi
$ kubectl logs spark-pi-1520454766917-driver
++ id -u
+ myuid=0
++ id -g
+ mygid=0
++ getent passwd 0
+ uidentry=root:x:0:0:root:/root:/bin/ash
+ '[' -z root:x:0:0:root:/root:/bin/ash ']'
+ /sbin/tini -s -- /bin/sh -c 'SPARK_CLASSPATH="${SPARK_HOME}/jars/*" && env | grep SPARK_JAVA_OPT_ | sed '\''s/[^=]*=\(.*\)/\1/g'\'' > /tmp/java_opts.txt && readarray -t SPARK_DRIVER_JAVA_OPTS < /tmp/java_opts.txt && if ! [ -z ${SPARK_MOUNTED_CLASSPATH+x} ]; then SPARK_CLASSPATH="$SPARK_MOUNTED_CLASSPATH:$SPARK_CLASSPATH"; fi && if ! [ -z ${SPARK_SUBMIT_EXTRA_CLASSPATH+x} ]; then SPARK_CLASSPATH="$SPARK_SUBMIT_EXTRA_CLASSPATH:$SPARK_CLASSPATH"; fi && if ! [ -z ${SPARK_EXTRA_CLASSPATH+x} ]; then SPARK_CLASSPATH="$SPARK_EXTRA_CLASSPATH:$SPARK_CLASSPATH"; fi && if ! [ -z ${SPARK_MOUNTED_FILES_DIR+x} ]; then cp -R "$SPARK_MOUNTED_FILES_DIR/." .; fi && if ! [ -z ${SPARK_MOUNTED_FILES_FROM_SECRET_DIR} ]; then cp -R "$SPARK_MOUNTED_FILES_FROM_SECRET_DIR/." .; fi && ${JAVA_HOME}/bin/java "${SPARK_DRIVER_JAVA_OPTS[@]}" -cp $SPARK_CLASSPATH -Xms$SPARK_DRIVER_MEMORY -Xmx$SPARK_DRIVER_MEMORY -Dspark.driver.bindAddress=$SPARK_DRIVER_BIND_ADDRESS $SPARK_DRIVER_CLASS $SPARK_DRIVER_ARGS'
2018-03-07 20:32:50 INFO SparkContext:54 - Running Spark version 2.2.0-k8s-0.5.0
2018-03-07 20:32:51 WARN NativeCodeLoader:62 - Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
2018-03-07 20:32:51 WARN SparkConf:66 - In Spark 1.0 and later spark.local.dir will be overridden by the value set by the cluster manager (via SPARK_LOCAL_DIRS in mesos/standalone and LOCAL_DIRS in YARN).
2018-03-07 20:32:51 INFO SparkContext:54 - Submitted application: Spark Pi
2018-03-07 20:32:51 INFO SecurityManager:54 - Changing view acls to: root
2018-03-07 20:32:51 INFO SecurityManager:54 - Changing modify acls to: root
2018-03-07 20:32:51 INFO SecurityManager:54 - Changing view acls groups to:
2018-03-07 20:32:51 INFO SecurityManager:54 - Changing modify acls groups to:
2018-03-07 20:32:51 INFO SecurityManager:54 - SecurityManager: authentication disabled; ui acls disabled; users with view permissions: Set(root); groups with view permissions: Set(); users with modify permissions: Set(root); groups with modify permissions: Set()
2018-03-07 20:32:51 INFO Utils:54 - Successfully started service 'sparkDriver' on port 7078.
2018-03-07 20:32:51 INFO SparkEnv:54 - Registering MapOutputTracker
2018-03-07 20:32:51 INFO SparkEnv:54 - Registering BlockManagerMaster
2018-03-07 20:32:51 INFO BlockManagerMasterEndpoint:54 - Using org.apache.spark.storage.DefaultTopologyMapper for getting topology information
2018-03-07 20:32:51 INFO BlockManagerMasterEndpoint:54 - BlockManagerMasterEndpoint up
2018-03-07 20:32:51 INFO DiskBlockManager:54 - Created local directory at /mnt/tmp/spark-local/spark-38688487-2c90-4b12-a24d-6a1ebb69dae1/blockmgr-addce8bd-8f9c-4274-929a-bd6020efec55
2018-03-07 20:32:51 INFO MemoryStore:54 - MemoryStore started with capacity 408.9 MB
2018-03-07 20:32:51 INFO SparkEnv:54 - Registering OutputCommitCoordinator
2018-03-07 20:32:51 INFO log:192 - Logging initialized @1738ms
2018-03-07 20:32:51 INFO Server:345 - jetty-9.3.z-SNAPSHOT
2018-03-07 20:32:52 INFO Server:403 - Started @1809ms
2018-03-07 20:32:52 INFO AbstractConnector:270 - Started ServerConnector@6c345c5f{HTTP/1.1,[http/1.1]}{0.0.0.0:4040}
2018-03-07 20:32:52 INFO Utils:54 - Successfully started service 'SparkUI' on port 4040.
2018-03-07 20:32:52 INFO ContextHandler:781 - Started o.s.j.s.ServletContextHandler@5d25e6bb{/jobs,null,AVAILABLE,@Spark}
2018-03-07 20:32:52 INFO ContextHandler:781 - Started o.s.j.s.ServletContextHandler@4a3e3e8b{/jobs/json,null,AVAILABLE,@Spark}
2018-03-07 20:32:52 INFO ContextHandler:781 - Started o.s.j.s.ServletContextHandler@71104a4{/jobs/job,null,AVAILABLE,@Spark}
2018-03-07 20:32:52 INFO ContextHandler:781 - Started o.s.j.s.ServletContextHandler@3c9168dc{/jobs/job/json,null,AVAILABLE,@Spark}
2018-03-07 20:32:52 INFO ContextHandler:781 - Started o.s.j.s.ServletContextHandler@549621f3{/stages,null,AVAILABLE,@Spark}
2018-03-07 20:32:52 INFO ContextHandler:781 - Started o.s.j.s.ServletContextHandler@32232e55{/stages/json,null,AVAILABLE,@Spark}
2018-03-07 20:32:52 INFO ContextHandler:781 - Started o.s.j.s.ServletContextHandler@37ebc9d8{/stages/stage,null,AVAILABLE,@Spark}
2018-03-07 20:32:52 INFO ContextHandler:781 - Started o.s.j.s.ServletContextHandler@6e9319f{/stages/stage/json,null,AVAILABLE,@Spark}
2018-03-07 20:32:52 INFO ContextHandler:781 - Started o.s.j.s.ServletContextHandler@7bf9b098{/stages/pool,null,AVAILABLE,@Spark}
2018-03-07 20:32:52 INFO ContextHandler:781 - Started o.s.j.s.ServletContextHandler@77307458{/stages/pool/json,null,AVAILABLE,@Spark}
2018-03-07 20:32:52 INFO ContextHandler:781 - Started o.s.j.s.ServletContextHandler@290b1b2e{/storage,null,AVAILABLE,@Spark}
2018-03-07 20:32:52 INFO ContextHandler:781 - Started o.s.j.s.ServletContextHandler@33617539{/storage/json,null,AVAILABLE,@Spark}
2018-03-07 20:32:52 INFO ContextHandler:781 - Started o.s.j.s.ServletContextHandler@5db4c359{/storage/rdd,null,AVAILABLE,@Spark}
2018-03-07 20:32:52 INFO ContextHandler:781 - Started o.s.j.s.ServletContextHandler@18e7143f{/storage/rdd/json,null,AVAILABLE,@Spark}
2018-03-07 20:32:52 INFO ContextHandler:781 - Started o.s.j.s.ServletContextHandler@74cec793{/environment,null,AVAILABLE,@Spark}
2018-03-07 20:32:52 INFO ContextHandler:781 - Started o.s.j.s.ServletContextHandler@4f8969b0{/environment/json,null,AVAILABLE,@Spark}
2018-03-07 20:32:52 INFO ContextHandler:781 - Started o.s.j.s.ServletContextHandler@192f2f27{/executors,null,AVAILABLE,@Spark}
2018-03-07 20:32:52 INFO ContextHandler:781 - Started o.s.j.s.ServletContextHandler@c65a5ef{/executors/json,null,AVAILABLE,@Spark}
2018-03-07 20:32:52 INFO ContextHandler:781 - Started o.s.j.s.ServletContextHandler@b672aa8{/executors/threadDump,null,AVAILABLE,@Spark}
2018-03-07 20:32:52 INFO ContextHandler:781 - Started o.s.j.s.ServletContextHandler@ec0c838{/executors/threadDump/json,null,AVAILABLE,@Spark}
2018-03-07 20:32:52 INFO ContextHandler:781 - Started o.s.j.s.ServletContextHandler@5cc69cfe{/static,null,AVAILABLE,@Spark}
2018-03-07 20:32:52 INFO ContextHandler:781 - Started o.s.j.s.ServletContextHandler@62923ee6{/,null,AVAILABLE,@Spark}
2018-03-07 20:32:52 INFO ContextHandler:781 - Started o.s.j.s.ServletContextHandler@f19c9d2{/api,null,AVAILABLE,@Spark}
2018-03-07 20:32:52 INFO ContextHandler:781 - Started o.s.j.s.ServletContextHandler@4a067c25{/jobs/job/kill,null,AVAILABLE,@Spark}
2018-03-07 20:32:52 INFO ContextHandler:781 - Started o.s.j.s.ServletContextHandler@3bde62ff{/stages/stage/kill,null,AVAILABLE,@Spark}
2018-03-07 20:32:52 INFO SparkUI:54 - Bound SparkUI to 0.0.0.0, and started at http://spark-pi-1520454766917-driver-svc.default.svc.cluster.local:4040
2018-03-07 20:32:52 INFO SparkContext:54 - Added JAR /opt/spark/examples/jars/spark-examples_2.11-2.2.0-k8s-0.5.0.jar at spark://spark-pi-1520454766917-driver-svc.default.svc.cluster.local:7078/jars/spark-examples_2.11-2.2.0-k8s-0.5.0.jar with timestamp 1520454772095
2018-03-07 20:32:52 WARN KubernetesClusterManager:66 - The executor's init-container config map was not specified. Executors will therefore not attempt to fetch remote or submitted dependencies.
2018-03-07 20:32:52 WARN KubernetesClusterManager:66 - The executor's init-container config map key was not specified. Executors will therefore not attempt to fetch remote or submitted dependencies.
2018-03-07 20:32:53 INFO Utils:54 - Successfully started service 'org.apache.spark.network.netty.NettyBlockTransferService' on port 7079.
2018-03-07 20:32:53 INFO NettyBlockTransferService:54 - Server created on spark-pi-1520454766917-driver-svc.default.svc.cluster.local:7079
2018-03-07 20:32:53 INFO BlockManager:54 - Using org.apache.spark.storage.RandomBlockReplicationPolicy for block replication policy
2018-03-07 20:32:53 INFO BlockManagerMaster:54 - Registering BlockManager BlockManagerId(driver, spark-pi-1520454766917-driver-svc.default.svc.cluster.local, 7079, None)
2018-03-07 20:32:53 INFO BlockManagerMasterEndpoint:54 - Registering block manager spark-pi-1520454766917-driver-svc.default.svc.cluster.local:7079 with 408.9 MB RAM, BlockManagerId(driver, spark-pi-1520454766917-driver-svc.default.svc.cluster.local, 7079, None)
2018-03-07 20:32:53 INFO BlockManagerMaster:54 - Registered BlockManager BlockManagerId(driver, spark-pi-1520454766917-driver-svc.default.svc.cluster.local, 7079, None)
2018-03-07 20:32:53 INFO BlockManager:54 - Initialized BlockManager: BlockManagerId(driver, spark-pi-1520454766917-driver-svc.default.svc.cluster.local, 7079, None)
2018-03-07 20:32:53 INFO ContextHandler:781 - Started o.s.j.s.ServletContextHandler@58a2b4c{/metrics/json,null,AVAILABLE,@Spark}
2018-03-07 20:32:54 INFO KubernetesClusterSchedulerBackend:54 - Requesting a new executor, total executors is now 1
2018-03-07 20:32:54 INFO KubernetesClusterSchedulerBackend:54 - Requesting a new executor, total executors is now 2
2018-03-07 20:32:54 INFO KubernetesClusterSchedulerBackend:54 - Requesting a new executor, total executors is now 3
2018-03-07 20:32:54 INFO KubernetesClusterSchedulerBackend:54 - Requesting a new executor, total executors is now 4
2018-03-07 20:32:54 INFO KubernetesClusterSchedulerBackend:54 - Requesting a new executor, total executors is now 5
2018-03-07 20:33:22 INFO KubernetesClusterSchedulerBackend:54 - SchedulerBackend is ready for scheduling beginning after waiting maxRegisteredResourcesWaitingTime: 30000(ms)
2018-03-07 20:33:22 INFO SharedState:54 - Setting hive.metastore.warehouse.dir ('null') to the value of spark.sql.warehouse.dir ('file:/opt/spark/work-dir/spark-warehouse').
2018-03-07 20:33:22 INFO SharedState:54 - Warehouse path is 'file:/opt/spark/work-dir/spark-warehouse'.
2018-03-07 20:33:22 INFO ContextHandler:781 - Started o.s.j.s.ServletContextHandler@25e49cb2{/SQL,null,AVAILABLE,@Spark}
2018-03-07 20:33:22 INFO ContextHandler:781 - Started o.s.j.s.ServletContextHandler@23382f76{/SQL/json,null,AVAILABLE,@Spark}
2018-03-07 20:33:22 INFO ContextHandler:781 - Started o.s.j.s.ServletContextHandler@633a2e99{/SQL/execution,null,AVAILABLE,@Spark}
2018-03-07 20:33:22 INFO ContextHandler:781 - Started o.s.j.s.ServletContextHandler@5b84f14{/SQL/execution/json,null,AVAILABLE,@Spark}
2018-03-07 20:33:22 INFO ContextHandler:781 - Started o.s.j.s.ServletContextHandler@27a97e08{/static/sql,null,AVAILABLE,@Spark}
2018-03-07 20:33:23 INFO StateStoreCoordinatorRef:54 - Registered StateStoreCoordinator endpoint
2018-03-07 20:33:23 INFO SparkContext:54 - Starting job: reduce at SparkPi.scala:38
2018-03-07 20:33:23 INFO DAGScheduler:54 - Got job 0 (reduce at SparkPi.scala:38) with 2 output partitions
2018-03-07 20:33:23 INFO DAGScheduler:54 - Final stage: ResultStage 0 (reduce at SparkPi.scala:38)
2018-03-07 20:33:23 INFO DAGScheduler:54 - Parents of final stage: List()
2018-03-07 20:33:23 INFO DAGScheduler:54 - Missing parents: List()
2018-03-07 20:33:23 INFO DAGScheduler:54 - Submitting ResultStage 0 (MapPartitionsRDD[1] at map at SparkPi.scala:34), which has no missing parents
2018-03-07 20:33:23 INFO MemoryStore:54 - Block broadcast_0 stored as values in memory (estimated size 1832.0 B, free 408.9 MB)
2018-03-07 20:33:23 INFO MemoryStore:54 - Block broadcast_0_piece0 stored as bytes in memory (estimated size 1172.0 B, free 408.9 MB)
2018-03-07 20:33:23 INFO BlockManagerInfo:54 - Added broadcast_0_piece0 in memory on spark-pi-1520454766917-driver-svc.default.svc.cluster.local:7079 (size: 1172.0 B, free: 408.9 MB)
2018-03-07 20:33:23 INFO SparkContext:54 - Created broadcast 0 from broadcast at DAGScheduler.scala:1006
2018-03-07 20:33:23 INFO DAGScheduler:54 - Submitting 2 missing tasks from ResultStage 0 (MapPartitionsRDD[1] at map at SparkPi.scala:34) (first 15 tasks are for partitions Vector(0, 1))
2018-03-07 20:33:23 INFO KubernetesTaskSchedulerImpl:54 - Adding task set 0.0 with 2 tasks
2018-03-07 20:33:38 WARN KubernetesTaskSchedulerImpl:66 - Initial job has not accepted any resources; check your cluster UI to ensure that workers are registered and have sufficient resources
2018-03-07 20:33:53 WARN KubernetesTaskSchedulerImpl:66 - Initial job has not accepted any resources; check your cluster UI to ensure that workers are registered and have sufficient resources
$ kubectl logs spark-pi-1520454766917-exec-1
++ id -u
+ myuid=0
++ id -g
+ mygid=0
++ getent passwd 0
+ uidentry=root:x:0:0:root:/root:/bin/ash
+ '[' -z root:x:0:0:root:/root:/bin/ash ']'
+ /sbin/tini -s -- /bin/sh -c 'SPARK_CLASSPATH="${SPARK_HOME}/jars/*" && env | grep SPARK_JAVA_OPT_ | sed '\''s/[^=]*=\(.*\)/\1/g'\'' > /tmp/java_opts.txt && readarray -t SPARK_EXECUTOR_JAVA_OPTS < /tmp/java_opts.txt && if ! [ -z ${SPARK_MOUNTED_CLASSPATH}+x} ]; then SPARK_CLASSPATH="$SPARK_MOUNTED_CLASSPATH:$SPARK_CLASSPATH"; fi && if ! [ -z ${SPARK_EXECUTOR_EXTRA_CLASSPATH+x} ]; then SPARK_CLASSPATH="$SPARK_EXECUTOR_EXTRA_CLASSPATH:$SPARK_CLASSPATH"; fi && if ! [ -z ${SPARK_EXTRA_CLASSPATH+x} ]; then SPARK_CLASSPATH="$SPARK_EXTRA_CLASSPATH:$SPARK_CLASSPATH"; fi && if ! [ -z ${SPARK_MOUNTED_FILES_DIR+x} ]; then cp -R "$SPARK_MOUNTED_FILES_DIR/." .; fi && if ! [ -z ${SPARK_MOUNTED_FILES_FROM_SECRET_DIR+x} ]; then cp -R "$SPARK_MOUNTED_FILES_FROM_SECRET_DIR/." .; fi && ${JAVA_HOME}/bin/java "${SPARK_EXECUTOR_JAVA_OPTS[@]}" -Dspark.executor.port=$SPARK_EXECUTOR_PORT -Xms$SPARK_EXECUTOR_MEMORY -Xmx$SPARK_EXECUTOR_MEMORY -cp $SPARK_CLASSPATH org.apache.spark.executor.CoarseGrainedExecutorBackend --driver-url $SPARK_DRIVER_URL --executor-id $SPARK_EXECUTOR_ID --cores $SPARK_EXECUTOR_CORES --app-id $SPARK_APPLICATION_ID --hostname $SPARK_EXECUTOR_POD_IP'
2018-03-07 20:32:57 INFO CoarseGrainedExecutorBackend:2574 - Started daemon with process name: 15@spark-pi-1520454766917-exec-1
2018-03-07 20:32:57 INFO SignalUtils:54 - Registered signal handler for TERM
2018-03-07 20:32:57 INFO SignalUtils:54 - Registered signal handler for HUP
2018-03-07 20:32:57 INFO SignalUtils:54 - Registered signal handler for INT
2018-03-07 20:32:57 WARN NativeCodeLoader:62 - Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
2018-03-07 20:32:57 INFO SecurityManager:54 - Changing view acls to: root
2018-03-07 20:32:57 INFO SecurityManager:54 - Changing modify acls to: root
2018-03-07 20:32:57 INFO SecurityManager:54 - Changing view acls groups to:
2018-03-07 20:32:57 INFO SecurityManager:54 - Changing modify acls groups to:
2018-03-07 20:32:57 INFO SecurityManager:54 - SecurityManager: authentication disabled; ui acls disabled; users with view permissions: Set(root); groups with view permissions: Set(); users with modify permissions: Set(root); groups with modify permissions: Set()
2018-03-07 20:33:02 WARN TransportClientFactory:166 - DNS resolution for spark-pi-1520454766917-driver-svc.default.svc.cluster.local:7078 took 5008 ms
Exception in thread "main" java.lang.reflect.UndeclaredThrowableException
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1713)
at org.apache.spark.deploy.SparkHadoopUtil.runAsSparkUser(SparkHadoopUtil.scala:66)
at org.apache.spark.executor.CoarseGrainedExecutorBackend$.run(CoarseGrainedExecutorBackend.scala:188)
at org.apache.spark.executor.CoarseGrainedExecutorBackend$.main(CoarseGrainedExecutorBackend.scala:284)
at org.apache.spark.executor.CoarseGrainedExecutorBackend.main(CoarseGrainedExecutorBackend.scala)
Caused by: org.apache.spark.SparkException: Exception thrown in awaitResult:
at org.apache.spark.util.ThreadUtils$.awaitResult(ThreadUtils.scala:205)
at org.apache.spark.rpc.RpcTimeout.awaitResult(RpcTimeout.scala:75)
at org.apache.spark.rpc.RpcEnv.setupEndpointRefByURI(RpcEnv.scala:100)
at org.apache.spark.executor.CoarseGrainedExecutorBackend$$anonfun$run$1.apply$mcV$sp(CoarseGrainedExecutorBackend.scala:202)
at org.apache.spark.deploy.SparkHadoopUtil$$anon$2.run(SparkHadoopUtil.scala:67)
at org.apache.spark.deploy.SparkHadoopUtil$$anon$2.run(SparkHadoopUtil.scala:66)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:422)
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1698)
... 4 more
Caused by: java.io.IOException: Failed to connect to spark-pi-1520454766917-driver-svc.default.svc.cluster.local:7078
at org.apache.spark.network.client.TransportClientFactory.createClient(TransportClientFactory.java:232)
at org.apache.spark.network.client.TransportClientFactory.createClient(TransportClientFactory.java:182)
at org.apache.spark.rpc.netty.NettyRpcEnv.createClient(NettyRpcEnv.scala:197)
at org.apache.spark.rpc.netty.Outbox$$anon$1.call(Outbox.scala:194)
at org.apache.spark.rpc.netty.Outbox$$anon$1.call(Outbox.scala:190)
at java.util.concurrent.FutureTask.run(FutureTask.java:266)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:748)
Caused by: java.nio.channels.UnresolvedAddressException
at sun.nio.ch.Net.checkAddress(Net.java:101)
at sun.nio.ch.SocketChannelImpl.connect(SocketChannelImpl.java:622)
at io.netty.channel.socket.nio.NioSocketChannel.doConnect(NioSocketChannel.java:242)
at io.netty.channel.nio.AbstractNioChannel$AbstractNioUnsafe.connect(AbstractNioChannel.java:205)
at io.netty.channel.DefaultChannelPipeline$HeadContext.connect(DefaultChannelPipeline.java:1226)
at io.netty.channel.AbstractChannelHandlerContext.invokeConnect(AbstractChannelHandlerContext.java:540)
at io.netty.channel.AbstractChannelHandlerContext.connect(AbstractChannelHandlerContext.java:525)
at io.netty.channel.ChannelOutboundHandlerAdapter.connect(ChannelOutboundHandlerAdapter.java:47)
at io.netty.channel.AbstractChannelHandlerContext.invokeConnect(AbstractChannelHandlerContext.java:540)
at io.netty.channel.AbstractChannelHandlerContext.connect(AbstractChannelHandlerContext.java:525)
at io.netty.channel.ChannelDuplexHandler.connect(ChannelDuplexHandler.java:50)
at io.netty.channel.AbstractChannelHandlerContext.invokeConnect(AbstractChannelHandlerContext.java:540)
at io.netty.channel.AbstractChannelHandlerContext.connect(AbstractChannelHandlerContext.java:525)
at io.netty.channel.AbstractChannelHandlerContext.connect(AbstractChannelHandlerContext.java:507)
at io.netty.channel.DefaultChannelPipeline.connect(DefaultChannelPipeline.java:970)
at io.netty.channel.AbstractChannel.connect(AbstractChannel.java:215)
at io.netty.bootstrap.Bootstrap$2.run(Bootstrap.java:166)
at io.netty.util.concurrent.SingleThreadEventExecutor.runAllTasks(SingleThreadEventExecutor.java:399)
at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:446)
at io.netty.util.concurrent.SingleThreadEventExecutor$2.run(SingleThreadEventExecutor.java:131)
at io.netty.util.concurrent.DefaultThreadFactory$DefaultRunnableDecorator.run(DefaultThreadFactory.java:144)
... 1 more
My understanding is that declaring pod ports is only a formality and that these ports are not strictly necessary to set. But this might vary from cluster to cluster, and might be specific to later versions of Kubernetes. @foxish for thoughts.
I believe it can vary from cluster to cluster, and that explicitly opening the ports is the right way forward. Some deployments have pod ports open by default, which is likely why this has been working for many folks, but for clusters with a network policy of some sort where ports must be explicitly opened, the project as-is does not work.
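If you're not sure whether your cluster falls into that category, a rough first check (a sketch only; what actually enforces port rules depends on the CNI plugin and its configuration):
# Any NetworkPolicy objects that could restrict traffic to the driver pod?
kubectl get networkpolicies --all-namespaces
# Which network plugin is running? (policy enforcement differs between them)
kubectl get pods -n kube-system -o wide | grep -iE 'romana|flannel|calico|weave'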
I've noticed a similar error when setting a driver label:
--conf "spark.kubernetes.driver.label.spark-app-name=pi"
The label gets added to the service's selector, and the service hostname then fails to resolve.
kubectl describe service pi-ce2cd87021863a03986a4a98cb074128-driver
Name: pi-ce2cd87021863a03986a4a98cb074128-driver-svc
Namespace: default
Labels: <none>
Annotations: <none>
Selector: spark-app-name=pi,spark-app-selector=spark-8b71b8ec3f6843428f2b9e8ea9b1bfb6,spark-role=driver
Type: ClusterIP
IP: None
Port: driver-rpc-port 7078/TCP
TargetPort: 7078/TCP
Endpoints: 10.1.4.86:7078
Port: blockmanager 7079/TCP
TargetPort: 7079/TCP
Endpoints: 10.1.4.86:7079
Session Affinity: None
Events: <none>
Trying to run the SparkPi example per the instructions on the website, but hitting an error with the executor. When I run the spark-submit command, I see Kubernetes create the driver and executor pods as expected, but the executor pod then dies with the error shown in its logs. In the driver pod logs I see a warning repeated several times (modulo timestamp), I assume because the executor went down so there are no workers to accept resources.
Going back to the executor pod error, I see there is an org-apache-spark-examples-sparkpi-1519271450264-driver-svc service with the expected label selectors and ports. Querying for the pods with the corresponding labels, I see the driver that was mentioned earlier. Looking into the driver pod's description, however, I see that no ports are exposed.
As far as I can tell, the corresponding code that creates the driver pod also doesn't seem to expose ports, so my theory is that the executor is trying to call the service, and the service is trying to route the request but can't because the pod it selects has no ports exposed. However, I am also doubtful of this theory, since this should definitely be something that would be caught in integration tests, right?
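One way to sanity-check that theory (a sketch using the service name from my run; adjust as needed) is to confirm that the service's selector actually matches the driver pod's labels and to see what the endpoints resolve to:
# Selector the service uses to pick the driver pod
kubectl get svc org-apache-spark-examples-sparkpi-1519271450264-driver-svc -o jsonpath='{.spec.selector}'
# Labels on the driver pod, and the endpoints the service currently points at
kubectl get pods -l spark-role=driver --show-labels
kubectl get endpoints org-apache-spark-examples-sparkpi-1519271450264-driver-svc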
Any help would be appreciated.
Some more info below:
Command run
The command was run by SSH-ing into the pod specified below with k exec -it <pod name> /bin/bash. I did this since it seemed to be the easiest way to get the spark-submit binary and the examples JAR.
Logs
Here are more logs from the driver pod before it starts repeating the last log line:
Here are logs from the executor pod: