apache-spark-on-k8s / spark

Apache Spark enhanced with a native Kubernetes scheduler back-end. NOTE: this repository is being ARCHIVED, as all new development for the Kubernetes scheduler back-end now happens at https://github.com/apache/spark/
https://spark.apache.org/
Apache License 2.0

Trouble running SparkPi example on Kubernetes 1.8.x #617

Open adelbertc opened 6 years ago

adelbertc commented 6 years ago

Trying to run the SparkPi example per the instructions on the website, but hitting an error with the executor. When I run the spark-submit command, Kubernetes creates the driver and executor pods as expected, but the executor pod then dies with this error in its logs:

Caused by: java.io.IOException: Failed to connect to org-apache-spark-examples-sparkpi-1519271450264-driver-svc.dev.svc.cluster.local:7078

In the driver pod logs I see:

2018-02-22 05:42:23 WARN  KubernetesTaskSchedulerImpl:66 - Initial job has not accepted any resources; check your cluster UI to ensure that workers are registered and have sufficient resources

repeated several times (modulo the timestamp), I assume because the executor went down and so there are no workers left to accept resources.

Going back to the executor pod error, I see there is an org-apache-spark-examples-sparkpi-1519271450264-driver-svc service with the expected label selectors and ports:

$ k describe svc org-apache-spark-examples-sparkpi-1519271450264-driver-svc
Name:              org-apache-spark-examples-sparkpi-1519271450264-driver-svc
Namespace:         dev
Labels:            <none>
Annotations:       <none>
Selector:          spark-app-selector=spark-30c16e3c71b24658b245f7eeaacbb3c8,spark-role=driver
Type:              ClusterIP
IP:                None
Port:              driver-rpc-port  7078/TCP
TargetPort:        7078/TCP
Endpoints:         10.239.120.2:7078
Port:              blockmanager  7079/TCP
TargetPort:        7079/TCP

Querying for pods with the corresponding labels, I see the driver that was mentioned earlier. Looking at the driver pod's description, however, I see that no ports are exposed:

$ k get po -l spark-app-selector=spark-30c16e3c71b24658b245f7eeaacbb3c8,spark-role=driver
NAME                                                     READY     STATUS    RESTARTS   AGE
org-apache-spark-examples-sparkpi-1519271450264-driver   1/1       Running   0          1h
$ k describe po org-apache-spark-examples-sparkpi-1519271450264-driver 
Name:         org-apache-spark-examples-sparkpi-1519271450264-driver
Namespace:    dev
Node:         k8s-plain-jane-ysj5ij/192.168.50.79
Start Time:   Thu, 22 Feb 2018 03:50:34 +0000
Labels:       spark-app-selector=spark-30c16e3c71b24658b245f7eeaacbb3c8
              spark-role=driver
Annotations:  spark-app-name=org.apache.spark.examples.SparkPi
Status:       Running
IP:           10.239.120.2
Containers:
  spark-kubernetes-driver:
    Container ID:   docker://b5695ae0520cc065b6372df37e25a40804da7ebaf54ee653bf5d71ecda7c375c
    Image:          kubespark/spark-driver:v2.2.0-kubernetes-0.5.0
    Image ID:       docker-pullable://kubespark/spark-driver@sha256:7539c28c4dbfc55addf5c3f439d26b964c4b04cd780a066c708242867356eb0b
    Port:           <none>

As far as I can tell, the corresponding code that creates the driver pod also doesn't expose any ports, so my theory is that the executor calls the service, and the service tries to route the request but can't, because the pod it selects has no ports exposed. However, I'm also doubtful of this theory, since this seems like something that would definitely be caught in integration tests, right?
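For illustration, if this theory holds, the container spec of the generated driver pod would need an explicit ports section matching the service's target ports, roughly like the sketch below. This is hand-written, not what the submission client currently emits:

containers:
- name: spark-kubernetes-driver
  image: kubespark/spark-driver:v2.2.0-kubernetes-0.5.0
  ports:
  # Hypothetical: declare the two target ports of the headless service
  - name: driver-rpc-port
    containerPort: 7078
    protocol: TCP
  - name: blockmanager
    containerPort: 7079
    protocol: TCP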

Any help would be appreciated.

Some more info below:


Command run

$ /opt/spark/bin/spark-submit --verbose --deploy-mode cluster --class org.apache.spark.examples.SparkPi --master k8s://https://192.168.50.77:6443 --executor-memory 1G --kubernetes-namespace dev --conf spark.kubernetes.authenticate.driver.serviceAccountName=spark --conf spark.executor.instances=1 --conf spark.app-name=spark-pi --conf spark.kubernetes.driver.docker.image=kubespark/spark-driver:v2.2.0-kubernetes-0.5.0 --conf spark.kubernetes.executor.docker.image=kubespark/spark-executor:v2.2.0-kubernetes-0.5.0 local:///opt/spark/examples/jars/spark-examples_2.11-2.2.0-k8s-0.5.0.jar

The command is run by exec-ing into the pod specified below with k exec -it <pod name> /bin/bash; this seemed the easiest way to get access to spark-submit and the examples JAR.

overlord@k8s-belle-bastion:~/adelbert/spark$ cat spark-driver.yaml 
apiVersion: apps/v1beta2
kind: Deployment
metadata:
  name: spark-driver
spec:
  replicas: 1
  selector:
    matchLabels:
      app: spark-driver
  template:
    metadata:
      labels:
        app: spark-driver
    spec:
      serviceAccountName: spark
      containers:
      - name: spark-driver
        image: kubespark/spark-driver:v2.2.0-kubernetes-0.5.0
        command: [ "sleep", "3600" ]

Logs

Here are more logs from the driver pod before it starts repeating the last log line:

++ id -u
+ myuid=0
++ id -g
+ mygid=0
++ getent passwd 0
+ uidentry=root:x:0:0:root:/root:/bin/ash
+ '[' -z root:x:0:0:root:/root:/bin/ash ']'
+ /sbin/tini -s -- /bin/sh -c 'SPARK_CLASSPATH="${SPARK_HOME}/jars/*" &&     env | grep SPARK_JAVA_OPT_ | sed '\''s/[^=]*=\(.*\)/\1/g'\'' > /tmp/java_opts.txt &&     readarray -t SPARK_DRIVER_JAVA_OPTS < /tmp/java_opts.txt &&     if ! [ -z ${SPARK_MOUNTED_CLASSPATH+x} ]; then SPARK_CLASSPATH="$SPARK_MOUNTED_CLASSPATH:$SPARK_CLASSPATH"; fi &&     if ! [ -z ${SPARK_SUBMIT_EXTRA_CLASSPATH+x} ]; then SPARK_CLASSPATH="$SPARK_SUBMIT_EXTRA_CLASSPATH:$SPARK_CLASSPATH"; fi &&     if ! [ -z ${SPARK_EXTRA_CLASSPATH+x} ]; then SPARK_CLASSPATH="$SPARK_EXTRA_CLASSPATH:$SPARK_CLASSPATH"; fi &&     if ! [ -z ${SPARK_MOUNTED_FILES_DIR+x} ]; then cp -R "$SPARK_MOUNTED_FILES_DIR/." .; fi &&     if ! [ -z ${SPARK_MOUNTED_FILES_FROM_SECRET_DIR} ]; then cp -R "$SPARK_MOUNTED_FILES_FROM_SECRET_DIR/." .; fi &&     ${JAVA_HOME}/bin/java "${SPARK_DRIVER_JAVA_OPTS[@]}" -cp $SPARK_CLASSPATH -Xms$SPARK_DRIVER_MEMORY -Xmx$SPARK_DRIVER_MEMORY -Dspark.driver.bindAddress=$SPARK_DRIVER_BIND_ADDRESS $SPARK_DRIVER_CLASS $SPARK_DRIVER_ARGS'
2018-02-22 03:50:35 INFO  SparkContext:54 - Running Spark version 2.2.0-k8s-0.5.0
2018-02-22 03:50:36 WARN  NativeCodeLoader:62 - Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
2018-02-22 03:50:36 WARN  SparkConf:66 - In Spark 1.0 and later spark.local.dir will be overridden by the value set by the cluster manager (via SPARK_LOCAL_DIRS in mesos/standalone and LOCAL_DIRS in YARN).
2018-02-22 03:50:36 INFO  SparkContext:54 - Submitted application: Spark Pi
2018-02-22 03:50:36 INFO  SecurityManager:54 - Changing view acls to: root
2018-02-22 03:50:36 INFO  SecurityManager:54 - Changing modify acls to: root
2018-02-22 03:50:36 INFO  SecurityManager:54 - Changing view acls groups to: 
2018-02-22 03:50:36 INFO  SecurityManager:54 - Changing modify acls groups to: 
2018-02-22 03:50:36 INFO  SecurityManager:54 - SecurityManager: authentication disabled; ui acls disabled; users  with view permissions: Set(root); groups with view permissions: Set(); users  with modify permissions: Set(root); groups with modify permissions: Set()
2018-02-22 03:50:36 INFO  Utils:54 - Successfully started service 'sparkDriver' on port 7078.
2018-02-22 03:50:36 INFO  SparkEnv:54 - Registering MapOutputTracker
2018-02-22 03:50:36 INFO  SparkEnv:54 - Registering BlockManagerMaster
2018-02-22 03:50:36 INFO  BlockManagerMasterEndpoint:54 - Using org.apache.spark.storage.DefaultTopologyMapper for getting topology information
2018-02-22 03:50:36 INFO  BlockManagerMasterEndpoint:54 - BlockManagerMasterEndpoint up
2018-02-22 03:50:36 INFO  DiskBlockManager:54 - Created local directory at /mnt/tmp/spark-local/spark-85ba3b56-3dd5-4b97-afb9-2f0ca99a74a3/blockmgr-f842050a-cff3-4a19-96c3-54eab202fccd
2018-02-22 03:50:36 INFO  MemoryStore:54 - MemoryStore started with capacity 408.9 MB
2018-02-22 03:50:36 INFO  SparkEnv:54 - Registering OutputCommitCoordinator
2018-02-22 03:50:36 INFO  log:192 - Logging initialized @1548ms
2018-02-22 03:50:37 INFO  Server:345 - jetty-9.3.z-SNAPSHOT
2018-02-22 03:50:37 INFO  Server:403 - Started @1642ms
2018-02-22 03:50:37 INFO  AbstractConnector:270 - Started ServerConnector@35beb15e{HTTP/1.1,[http/1.1]}{0.0.0.0:4040}
2018-02-22 03:50:37 INFO  Utils:54 - Successfully started service 'SparkUI' on port 4040.
2018-02-22 03:50:37 INFO  ContextHandler:781 - Started o.s.j.s.ServletContextHandler@2cf23c81{/jobs,null,AVAILABLE,@Spark}
2018-02-22 03:50:37 INFO  ContextHandler:781 - Started o.s.j.s.ServletContextHandler@6b5966e1{/jobs/json,null,AVAILABLE,@Spark}
2018-02-22 03:50:37 INFO  ContextHandler:781 - Started o.s.j.s.ServletContextHandler@1568159{/jobs/job,null,AVAILABLE,@Spark}
2018-02-22 03:50:37 INFO  ContextHandler:781 - Started o.s.j.s.ServletContextHandler@3af17be2{/jobs/job/json,null,AVAILABLE,@Spark}
2018-02-22 03:50:37 INFO  ContextHandler:781 - Started o.s.j.s.ServletContextHandler@37f21974{/stages,null,AVAILABLE,@Spark}
2018-02-22 03:50:37 INFO  ContextHandler:781 - Started o.s.j.s.ServletContextHandler@6e521c1e{/stages/json,null,AVAILABLE,@Spark}
2018-02-22 03:50:37 INFO  ContextHandler:781 - Started o.s.j.s.ServletContextHandler@5d5d9e5{/stages/stage,null,AVAILABLE,@Spark}
2018-02-22 03:50:37 INFO  ContextHandler:781 - Started o.s.j.s.ServletContextHandler@1d9bec4d{/stages/stage/json,null,AVAILABLE,@Spark}
2018-02-22 03:50:37 INFO  ContextHandler:781 - Started o.s.j.s.ServletContextHandler@10c8f62{/stages/pool,null,AVAILABLE,@Spark}
2018-02-22 03:50:37 INFO  ContextHandler:781 - Started o.s.j.s.ServletContextHandler@25f7391e{/stages/pool/json,null,AVAILABLE,@Spark}
2018-02-22 03:50:37 INFO  ContextHandler:781 - Started o.s.j.s.ServletContextHandler@5ab14cb9{/storage,null,AVAILABLE,@Spark}
2018-02-22 03:50:37 INFO  ContextHandler:781 - Started o.s.j.s.ServletContextHandler@439a8f59{/storage/json,null,AVAILABLE,@Spark}
2018-02-22 03:50:37 INFO  ContextHandler:781 - Started o.s.j.s.ServletContextHandler@31024624{/storage/rdd,null,AVAILABLE,@Spark}
2018-02-22 03:50:37 INFO  ContextHandler:781 - Started o.s.j.s.ServletContextHandler@32cb636e{/storage/rdd/json,null,AVAILABLE,@Spark}
2018-02-22 03:50:37 INFO  ContextHandler:781 - Started o.s.j.s.ServletContextHandler@40dd3977{/environment,null,AVAILABLE,@Spark}
2018-02-22 03:50:37 INFO  ContextHandler:781 - Started o.s.j.s.ServletContextHandler@6a1d204a{/environment/json,null,AVAILABLE,@Spark}
2018-02-22 03:50:37 INFO  ContextHandler:781 - Started o.s.j.s.ServletContextHandler@4b6579e8{/executors,null,AVAILABLE,@Spark}
2018-02-22 03:50:37 INFO  ContextHandler:781 - Started o.s.j.s.ServletContextHandler@6c6357f9{/executors/json,null,AVAILABLE,@Spark}
2018-02-22 03:50:37 INFO  ContextHandler:781 - Started o.s.j.s.ServletContextHandler@3954d008{/executors/threadDump,null,AVAILABLE,@Spark}
2018-02-22 03:50:37 INFO  ContextHandler:781 - Started o.s.j.s.ServletContextHandler@593e824f{/executors/threadDump/json,null,AVAILABLE,@Spark}
2018-02-22 03:50:37 INFO  ContextHandler:781 - Started o.s.j.s.ServletContextHandler@6d8792db{/static,null,AVAILABLE,@Spark}
2018-02-22 03:50:37 INFO  ContextHandler:781 - Started o.s.j.s.ServletContextHandler@7859e786{/,null,AVAILABLE,@Spark}
2018-02-22 03:50:37 INFO  ContextHandler:781 - Started o.s.j.s.ServletContextHandler@314b8f2d{/api,null,AVAILABLE,@Spark}
2018-02-22 03:50:37 INFO  ContextHandler:781 - Started o.s.j.s.ServletContextHandler@5af28b27{/jobs/job/kill,null,AVAILABLE,@Spark}
2018-02-22 03:50:37 INFO  ContextHandler:781 - Started o.s.j.s.ServletContextHandler@4985cbcb{/stages/stage/kill,null,AVAILABLE,@Spark}
2018-02-22 03:50:37 INFO  SparkUI:54 - Bound SparkUI to 0.0.0.0, and started at http://org-apache-spark-examples-sparkpi-1519271450264-driver-svc.dev.svc.cluster.local:4040
2018-02-22 03:50:37 INFO  SparkContext:54 - Added JAR /opt/spark/examples/jars/spark-examples_2.11-2.2.0-k8s-0.5.0.jar at spark://org-apache-spark-examples-sparkpi-1519271450264-driver-svc.dev.svc.cluster.local:7078/jars/spark-examples_2.11-2.2.0-k8s-0.5.0.jar with timestamp 1519271437125
2018-02-22 03:50:37 WARN  KubernetesClusterManager:66 - The executor's init-container config map was not specified. Executors will therefore not attempt to fetch remote or submitted dependencies.
2018-02-22 03:50:37 WARN  KubernetesClusterManager:66 - The executor's init-container config map key was not specified. Executors will therefore not attempt to fetch remote or submitted dependencies.
2018-02-22 03:50:38 INFO  Utils:54 - Successfully started service 'org.apache.spark.network.netty.NettyBlockTransferService' on port 7079.
2018-02-22 03:50:38 INFO  NettyBlockTransferService:54 - Server created on org-apache-spark-examples-sparkpi-1519271450264-driver-svc.dev.svc.cluster.local:7079
2018-02-22 03:50:38 INFO  BlockManager:54 - Using org.apache.spark.storage.RandomBlockReplicationPolicy for block replication policy
2018-02-22 03:50:38 INFO  BlockManagerMaster:54 - Registering BlockManager BlockManagerId(driver, org-apache-spark-examples-sparkpi-1519271450264-driver-svc.dev.svc.cluster.local, 7079, None)
2018-02-22 03:50:38 INFO  BlockManagerMasterEndpoint:54 - Registering block manager org-apache-spark-examples-sparkpi-1519271450264-driver-svc.dev.svc.cluster.local:7079 with 408.9 MB RAM, BlockManagerId(driver, org-apache-spark-examples-sparkpi-1519271450264-driver-svc.dev.svc.cluster.local, 7079, None)
2018-02-22 03:50:38 INFO  BlockManagerMaster:54 - Registered BlockManager BlockManagerId(driver, org-apache-spark-examples-sparkpi-1519271450264-driver-svc.dev.svc.cluster.local, 7079, None)
2018-02-22 03:50:38 INFO  BlockManager:54 - Initialized BlockManager: BlockManagerId(driver, org-apache-spark-examples-sparkpi-1519271450264-driver-svc.dev.svc.cluster.local, 7079, None)
2018-02-22 03:50:38 INFO  ContextHandler:781 - Started o.s.j.s.ServletContextHandler@573284a5{/metrics/json,null,AVAILABLE,@Spark}
2018-02-22 03:50:39 INFO  KubernetesClusterSchedulerBackend:54 - Requesting a new executor, total executors is now 1
2018-02-22 03:51:07 INFO  KubernetesClusterSchedulerBackend:54 - SchedulerBackend is ready for scheduling beginning after waiting maxRegisteredResourcesWaitingTime: 30000(ms)
2018-02-22 03:51:07 INFO  SharedState:54 - Setting hive.metastore.warehouse.dir ('null') to the value of spark.sql.warehouse.dir ('file:/opt/spark/work-dir/spark-warehouse').
2018-02-22 03:51:07 INFO  SharedState:54 - Warehouse path is 'file:/opt/spark/work-dir/spark-warehouse'.
2018-02-22 03:51:07 INFO  ContextHandler:781 - Started o.s.j.s.ServletContextHandler@771cbb1a{/SQL,null,AVAILABLE,@Spark}
2018-02-22 03:51:07 INFO  ContextHandler:781 - Started o.s.j.s.ServletContextHandler@6b2e0f78{/SQL/json,null,AVAILABLE,@Spark}
2018-02-22 03:51:07 INFO  ContextHandler:781 - Started o.s.j.s.ServletContextHandler@64693226{/SQL/execution,null,AVAILABLE,@Spark}
2018-02-22 03:51:07 INFO  ContextHandler:781 - Started o.s.j.s.ServletContextHandler@117bcfdc{/SQL/execution/json,null,AVAILABLE,@Spark}
2018-02-22 03:51:07 INFO  ContextHandler:781 - Started o.s.j.s.ServletContextHandler@2324bfe7{/static/sql,null,AVAILABLE,@Spark}
2018-02-22 03:51:08 INFO  StateStoreCoordinatorRef:54 - Registered StateStoreCoordinator endpoint
2018-02-22 03:51:08 INFO  SparkContext:54 - Starting job: reduce at SparkPi.scala:38
2018-02-22 03:51:08 INFO  DAGScheduler:54 - Got job 0 (reduce at SparkPi.scala:38) with 2 output partitions
2018-02-22 03:51:08 INFO  DAGScheduler:54 - Final stage: ResultStage 0 (reduce at SparkPi.scala:38)
2018-02-22 03:51:08 INFO  DAGScheduler:54 - Parents of final stage: List()
2018-02-22 03:51:08 INFO  DAGScheduler:54 - Missing parents: List()
2018-02-22 03:51:08 INFO  DAGScheduler:54 - Submitting ResultStage 0 (MapPartitionsRDD[1] at map at SparkPi.scala:34), which has no missing parents
2018-02-22 03:51:08 INFO  MemoryStore:54 - Block broadcast_0 stored as values in memory (estimated size 1832.0 B, free 408.9 MB)
2018-02-22 03:51:08 INFO  MemoryStore:54 - Block broadcast_0_piece0 stored as bytes in memory (estimated size 1172.0 B, free 408.9 MB)
2018-02-22 03:51:08 INFO  BlockManagerInfo:54 - Added broadcast_0_piece0 in memory on org-apache-spark-examples-sparkpi-1519271450264-driver-svc.dev.svc.cluster.local:7079 (size: 1172.0 B, free: 408.9 MB)
2018-02-22 03:51:08 INFO  SparkContext:54 - Created broadcast 0 from broadcast at DAGScheduler.scala:1006
2018-02-22 03:51:08 INFO  DAGScheduler:54 - Submitting 2 missing tasks from ResultStage 0 (MapPartitionsRDD[1] at map at SparkPi.scala:34) (first 15 tasks are for partitions Vector(0, 1))
2018-02-22 03:51:08 INFO  KubernetesTaskSchedulerImpl:54 - Adding task set 0.0 with 2 tasks
2018-02-22 03:51:23 WARN  KubernetesTaskSchedulerImpl:66 - Initial job has not accepted any resources; check your cluster UI to ensure that workers are registered and have sufficient resources

Here are logs from the executor pod:

++ id -u
+ myuid=0
++ id -g
+ mygid=0
++ getent passwd 0
+ uidentry=root:x:0:0:root:/root:/bin/ash
+ '[' -z root:x:0:0:root:/root:/bin/ash ']'
+ /sbin/tini -s -- /bin/sh -c 'SPARK_CLASSPATH="${SPARK_HOME}/jars/*" &&     env | grep SPARK_JAVA_OPT_ | sed '\''s/[^=]*=\(.*\)/\1/g'\'' > /tmp/java_opts.txt &&     readarray -t SPARK_EXECUTOR_JAVA_OPTS < /tmp/java_opts.txt &&     if ! [ -z ${SPARK_MOUNTED_CLASSPATH}+x} ]; then SPARK_CLASSPATH="$SPARK_MOUNTED_CLASSPATH:$SPARK_CLASSPATH"; fi &&     if ! [ -z ${SPARK_EXECUTOR_EXTRA_CLASSPATH+x} ]; then SPARK_CLASSPATH="$SPARK_EXECUTOR_EXTRA_CLASSPATH:$SPARK_CLASSPATH"; fi &&     if ! [ -z ${SPARK_EXTRA_CLASSPATH+x} ]; then SPARK_CLASSPATH="$SPARK_EXTRA_CLASSPATH:$SPARK_CLASSPATH"; fi &&     if ! [ -z ${SPARK_MOUNTED_FILES_DIR+x} ]; then cp -R "$SPARK_MOUNTED_FILES_DIR/." .; fi &&     if ! [ -z ${SPARK_MOUNTED_FILES_FROM_SECRET_DIR+x} ]; then cp -R "$SPARK_MOUNTED_FILES_FROM_SECRET_DIR/." .; fi &&     ${JAVA_HOME}/bin/java "${SPARK_EXECUTOR_JAVA_OPTS[@]}" -Dspark.executor.port=$SPARK_EXECUTOR_PORT -Xms$SPARK_EXECUTOR_MEMORY -Xmx$SPARK_EXECUTOR_MEMORY -cp $SPARK_CLASSPATH org.apache.spark.executor.CoarseGrainedExecutorBackend --driver-url $SPARK_DRIVER_URL --executor-id $SPARK_EXECUTOR_ID --cores $SPARK_EXECUTOR_CORES --app-id $SPARK_APPLICATION_ID --hostname $SPARK_EXECUTOR_POD_IP'
2018-02-22 03:50:57 INFO  CoarseGrainedExecutorBackend:2574 - Started daemon with process name: 15@org-apache-spark-examples-sparkpi-1519271450264-exec-1
2018-02-22 03:50:57 INFO  SignalUtils:54 - Registered signal handler for TERM
2018-02-22 03:50:57 INFO  SignalUtils:54 - Registered signal handler for HUP
2018-02-22 03:50:57 INFO  SignalUtils:54 - Registered signal handler for INT
2018-02-22 03:50:58 WARN  NativeCodeLoader:62 - Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
2018-02-22 03:50:58 INFO  SecurityManager:54 - Changing view acls to: root
2018-02-22 03:50:58 INFO  SecurityManager:54 - Changing modify acls to: root
2018-02-22 03:50:58 INFO  SecurityManager:54 - Changing view acls groups to: 
2018-02-22 03:50:58 INFO  SecurityManager:54 - Changing modify acls groups to: 
2018-02-22 03:50:58 INFO  SecurityManager:54 - SecurityManager: authentication disabled; ui acls disabled; users  with view permissions: Set(root); groups with view permissions: Set(); users  with modify permissions: Set(root); groups with modify permissions: Set()
Exception in thread "main" java.lang.reflect.UndeclaredThrowableException
    at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1713)
    at org.apache.spark.deploy.SparkHadoopUtil.runAsSparkUser(SparkHadoopUtil.scala:66)
    at org.apache.spark.executor.CoarseGrainedExecutorBackend$.run(CoarseGrainedExecutorBackend.scala:188)
    at org.apache.spark.executor.CoarseGrainedExecutorBackend$.main(CoarseGrainedExecutorBackend.scala:284)
    at org.apache.spark.executor.CoarseGrainedExecutorBackend.main(CoarseGrainedExecutorBackend.scala)
Caused by: org.apache.spark.SparkException: Exception thrown in awaitResult: 
    at org.apache.spark.util.ThreadUtils$.awaitResult(ThreadUtils.scala:205)
    at org.apache.spark.rpc.RpcTimeout.awaitResult(RpcTimeout.scala:75)
    at org.apache.spark.rpc.RpcEnv.setupEndpointRefByURI(RpcEnv.scala:100)
    at org.apache.spark.executor.CoarseGrainedExecutorBackend$$anonfun$run$1.apply$mcV$sp(CoarseGrainedExecutorBackend.scala:202)
    at org.apache.spark.deploy.SparkHadoopUtil$$anon$2.run(SparkHadoopUtil.scala:67)
    at org.apache.spark.deploy.SparkHadoopUtil$$anon$2.run(SparkHadoopUtil.scala:66)
    at java.security.AccessController.doPrivileged(Native Method)
    at javax.security.auth.Subject.doAs(Subject.java:422)
    at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1698)
    ... 4 more
Caused by: java.io.IOException: Failed to connect to org-apache-spark-examples-sparkpi-1519271450264-driver-svc.dev.svc.cluster.local:7078
    at org.apache.spark.network.client.TransportClientFactory.createClient(TransportClientFactory.java:232)
    at org.apache.spark.network.client.TransportClientFactory.createClient(TransportClientFactory.java:182)
    at org.apache.spark.rpc.netty.NettyRpcEnv.createClient(NettyRpcEnv.scala:197)
    at org.apache.spark.rpc.netty.Outbox$$anon$1.call(Outbox.scala:194)
    at org.apache.spark.rpc.netty.Outbox$$anon$1.call(Outbox.scala:190)
    at java.util.concurrent.FutureTask.run(FutureTask.java:266)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
    at java.lang.Thread.run(Thread.java:748)
Caused by: java.nio.channels.UnresolvedAddressException
    at sun.nio.ch.Net.checkAddress(Net.java:101)
    at sun.nio.ch.SocketChannelImpl.connect(SocketChannelImpl.java:622)
    at io.netty.channel.socket.nio.NioSocketChannel.doConnect(NioSocketChannel.java:242)
    at io.netty.channel.nio.AbstractNioChannel$AbstractNioUnsafe.connect(AbstractNioChannel.java:205)
    at io.netty.channel.DefaultChannelPipeline$HeadContext.connect(DefaultChannelPipeline.java:1226)
    at io.netty.channel.AbstractChannelHandlerContext.invokeConnect(AbstractChannelHandlerContext.java:540)
    at io.netty.channel.AbstractChannelHandlerContext.connect(AbstractChannelHandlerContext.java:525)
    at io.netty.channel.ChannelOutboundHandlerAdapter.connect(ChannelOutboundHandlerAdapter.java:47)
    at io.netty.channel.AbstractChannelHandlerContext.invokeConnect(AbstractChannelHandlerContext.java:540)
    at io.netty.channel.AbstractChannelHandlerContext.connect(AbstractChannelHandlerContext.java:525)
    at io.netty.channel.ChannelDuplexHandler.connect(ChannelDuplexHandler.java:50)
    at io.netty.channel.AbstractChannelHandlerContext.invokeConnect(AbstractChannelHandlerContext.java:540)
    at io.netty.channel.AbstractChannelHandlerContext.connect(AbstractChannelHandlerContext.java:525)
    at io.netty.channel.AbstractChannelHandlerContext.connect(AbstractChannelHandlerContext.java:507)
    at io.netty.channel.DefaultChannelPipeline.connect(DefaultChannelPipeline.java:970)
    at io.netty.channel.AbstractChannel.connect(AbstractChannel.java:215)
    at io.netty.bootstrap.Bootstrap$2.run(Bootstrap.java:166)
    at io.netty.util.concurrent.SingleThreadEventExecutor.runAllTasks(SingleThreadEventExecutor.java:399)
    at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:446)
    at io.netty.util.concurrent.SingleThreadEventExecutor$2.run(SingleThreadEventExecutor.java:131)
    at io.netty.util.concurrent.DefaultThreadFactory$DefaultRunnableDecorator.run(DefaultThreadFactory.java:144)
    ... 1 more
adelbertc commented 6 years ago

As suspected, it was due to the driver container not exposing its ports on creation. While the service selected the driver pod as expected, the driver did not have any ports open for the service to route requests to.

Filed PR: https://github.com/apache-spark-on-k8s/spark/pull/618

rvesse commented 6 years ago

We've seen a similar issue under certain network overlay configurations. In our case the problem seemed to be that, because only a headless service (with no cluster IP) gets created, the fully qualified cluster hostname for the service would not resolve, while the bare service name did. This only happened under Romana; it worked fine under other network overlays, e.g. Flannel, so this issue may be specific to certain network overlays.
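One quick way to test the DNS theory is to resolve the service name from a throwaway pod inside the cluster. A minimal sketch, assuming a busybox image is available (the pod name is a placeholder; the FQDN is the one from our trace below):

apiVersion: v1
kind: Pod
metadata:
  name: dns-check        # hypothetical throwaway pod
  namespace: default
spec:
  restartPolicy: Never
  containers:
  - name: dns-check
    image: busybox:1.28
    # Resolve the headless service's cluster hostname; compare against a
    # second run that uses just the bare service name.
    command:
    - nslookup
    - spark-pi-1518117806498-driver-svc.default.svc.cluster.local

If the fully qualified name fails here while the bare service name resolves, the problem lies in the overlay/DNS setup rather than in Spark itself.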

rvesse commented 6 years ago

Here's our stack trace for comparison:


2018-02-08 19:24:04 WARN  TransportClientFactory:166 - DNS resolution for spark-pi-1518117806498-driver-svc.default.svc.cluster.local:7078 took 5008 ms
Exception in thread "main" java.lang.reflect.UndeclaredThrowableException
  at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1713)
  at org.apache.spark.deploy.SparkHadoopUtil.runAsSparkUser(SparkHadoopUtil.scala:66)
  at org.apache.spark.executor.CoarseGrainedExecutorBackend$.run(CoarseGrainedExecutorBackend.scala:188)
  at  org.apache.spark.executor.CoarseGrainedExecutorBackend$.main(CoarseGrainedExecutorBackend.scala:284)
  at org.apache.spark.executor.CoarseGrainedExecutorBackend.main(CoarseGrainedExecutorBackend.scala)
Caused by: org.apache.spark.SparkException: Exception thrown in awaitResult:
  at org.apache.spark.util.ThreadUtils$.awaitResult(ThreadUtils.scala:205)
  at org.apache.spark.rpc.RpcTimeout.awaitResult(RpcTimeout.scala:75)
  at org.apache.spark.rpc.RpcEnv.setupEndpointRefByURI(RpcEnv.scala:100)
  at org.apache.spark.executor.CoarseGrainedExecutorBackend$$anonfun$run$1.apply$mcV$sp(CoarseGrainedExecutorBackend.scala:202)
  at org.apache.spark.deploy.SparkHadoopUtil$$anon$2.run(SparkHadoopUtil.scala:67)
  at org.apache.spark.deploy.SparkHadoopUtil$$anon$2.run(SparkHadoopUtil.scala:66)
  at java.security.AccessController.doPrivileged(Native Method)
  at javax.security.auth.Subject.doAs(Subject.java:422)
  at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1698)
... 4 more
Caused by: java.io.IOException: Failed to connect to spark-pi-1518117806498-driver-svc.default.svc.cluster.local:7078
  at org.apache.spark.network.client.TransportClientFactory.createClient(TransportClientFactory.java:232)
  at org.apache.spark.network.client.TransportClientFactory.createClient(TransportClientFactory.java:182)
  at org.apache.spark.rpc.netty.NettyRpcEnv.createClient(NettyRpcEnv.scala:197)
  at org.apache.spark.rpc.netty.Outbox$$anon$1.call(Outbox.scala:194)
  at org.apache.spark.rpc.netty.Outbox$$anon$1.call(Outbox.scala:190)
  at java.util.concurrent.FutureTask.run(FutureTask.java:266)
  at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
  at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
  at java.lang.Thread.run(Thread.java:748)
Caused by: java.nio.channels.UnresolvedAddressException
  at sun.nio.ch.Net.checkAddress(Net.java:101)
  at sun.nio.ch.SocketChannelImpl.connect(SocketChannelImpl.java:622)
  at io.netty.channel.socket.nio.NioSocketChannel.doConnect(NioSocketChannel.java:242)
  at io.netty.channel.nio.AbstractNioChannel$AbstractNioUnsafe.connect(AbstractNioChannel.java:205)
  at io.netty.channel.DefaultChannelPipeline$HeadContext.connect(DefaultChannelPipeline.java:1226)
  at io.netty.channel.AbstractChannelHandlerContext.invokeConnect(AbstractChannelHandlerContext.java:540)
  at io.netty.channel.AbstractChannelHandlerContext.connect(AbstractChannelHandlerContext.java:525)
  at io.netty.channel.ChannelOutboundHandlerAdapter.connect(ChannelOutboundHandlerAdapter.java:47)
  at io.netty.channel.AbstractChannelHandlerContext.invokeConnect(AbstractChannelHandlerContext.java:540)
  at io.netty.channel.AbstractChannelHandlerContext.connect(AbstractChannelHandlerContext.java:525)
  at io.netty.channel.ChannelDuplexHandler.connect(ChannelDuplexHandler.java:50)
  at io.netty.channel.AbstractChannelHandlerContext.invokeConnect(AbstractChannelHandlerContext.java:540)
  at io.netty.channel.AbstractChannelHandlerContext.connect(AbstractChannelHandlerContext.java:525)
  at io.netty.channel.AbstractChannelHandlerContext.connect(AbstractChannelHandlerContext.java:507)
  at io.netty.channel.DefaultChannelPipeline.connect(DefaultChannelPipeline.java:970)
  at io.netty.channel.AbstractChannel.connect(AbstractChannel.java:215)
  at io.netty.bootstrap.Bootstrap$2.run(Bootstrap.java:166)
  at io.netty.util.concurrent.SingleThreadEventExecutor.runAllTasks(SingleThreadEventExecutor.java:399)
  at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:446)
  at io.netty.util.concurrent.SingleThreadEventExecutor$2.run(SingleThreadEventExecutor.java:131)
  at io.netty.util.concurrent.DefaultThreadFactory$DefaultRunnableDecorator.run(DefaultThreadFactory.java:144)
... 1 more
mlorenzana commented 6 years ago

I'm seeing exactly the same thing on Kubernetes 1.9 with Flannel, using Spark v2.2.0-kubernetes-0.5.0.

Command Run:

spark-submit \
  --deploy-mode cluster \
  --class org.apache.spark.examples.SparkPi \
  --master k8s://https://x.x.x.x:6443 \
  --kubernetes-namespace default \
  --conf spark.executor.instances=5 \
  --conf spark.app.name=spark-pi \
  --conf spark.kubernetes.driver.docker.image=kubespark/spark-driver:v2.2.0-kubernetes-0.5.0 \
  --conf spark.kubernetes.executor.docker.image=kubespark/spark-executor:v2.2.0-kubernetes-0.5.0 \
  local:///opt/spark/examples/jars/spark-examples_2.11-2.2.0-k8s-0.5.0.jar
$ kubectl describe service spark-pi-1520454766917-driver-svc
Name:              spark-pi-1520454766917-driver-svc
Namespace:         default
Labels:            <none>
Annotations:       <none>
Selector:          spark-app-selector=spark-97c6f27b591a48c88c2a7ae9a64f5af2,spark-role=driver
Type:              ClusterIP
IP:                None
Port:              driver-rpc-port  7078/TCP
TargetPort:        7078/TCP
Endpoints:         10.244.4.16:7078
Port:              blockmanager  7079/TCP
TargetPort:        7079/TCP
Endpoints:         10.244.4.16:7079
Session Affinity:  None
Events:            <none>
$ kubectl describe pod  spark-pi-1520454766917-driver 
Name:         spark-pi-1520454766917-driver
Namespace:    default
Node:         rd-cluster-5/10.176.49.113
Start Time:   Wed, 07 Mar 2018 15:32:48 -0500
Labels:       spark-app-selector=spark-97c6f27b591a48c88c2a7ae9a64f5af2
              spark-role=driver
Annotations:  spark-app-name=spark-pi
Status:       Running
IP:           10.244.4.16
Containers:
  spark-kubernetes-driver:
    Container ID:   docker://07e7d70d94c1df7b24323e202f4c9f2ac0cbc18402f8c67f56f575909f385856
    Image:          kubespark/spark-driver:v2.2.0-kubernetes-0.5.0
    Image ID:       docker-pullable://kubespark/spark-driver@sha256:7539c28c4dbfc55addf5c3f439d26b964c4b04cd780a066c708242867356eb0b
    Port:           <none>
    State:          Running
      Started:      Wed, 07 Mar 2018 15:32:50 -0500
    Ready:          True
    Restart Count:  0
    Limits:
      memory:  1408Mi
    Requests:
      cpu:     1
      memory:  1Gi
$ kubectl logs spark-pi-1520454766917-driver 
++ id -u
+ myuid=0
++ id -g
+ mygid=0
++ getent passwd 0
+ uidentry=root:x:0:0:root:/root:/bin/ash
+ '[' -z root:x:0:0:root:/root:/bin/ash ']'
+ /sbin/tini -s -- /bin/sh -c 'SPARK_CLASSPATH="${SPARK_HOME}/jars/*" &&     env | grep SPARK_JAVA_OPT_ | sed '\''s/[^=]*=\(.*\)/\1/g'\'' > /tmp/java_opts.txt &&     readarray -t SPARK_DRIVER_JAVA_OPTS < /tmp/java_opts.txt &&     if ! [ -z ${SPARK_MOUNTED_CLASSPATH+x} ]; then SPARK_CLASSPATH="$SPARK_MOUNTED_CLASSPATH:$SPARK_CLASSPATH"; fi &&     if ! [ -z ${SPARK_SUBMIT_EXTRA_CLASSPATH+x} ]; then SPARK_CLASSPATH="$SPARK_SUBMIT_EXTRA_CLASSPATH:$SPARK_CLASSPATH"; fi &&     if ! [ -z ${SPARK_EXTRA_CLASSPATH+x} ]; then SPARK_CLASSPATH="$SPARK_EXTRA_CLASSPATH:$SPARK_CLASSPATH"; fi &&     if ! [ -z ${SPARK_MOUNTED_FILES_DIR+x} ]; then cp -R "$SPARK_MOUNTED_FILES_DIR/." .; fi &&     if ! [ -z ${SPARK_MOUNTED_FILES_FROM_SECRET_DIR} ]; then cp -R "$SPARK_MOUNTED_FILES_FROM_SECRET_DIR/." .; fi &&     ${JAVA_HOME}/bin/java "${SPARK_DRIVER_JAVA_OPTS[@]}" -cp $SPARK_CLASSPATH -Xms$SPARK_DRIVER_MEMORY -Xmx$SPARK_DRIVER_MEMORY -Dspark.driver.bindAddress=$SPARK_DRIVER_BIND_ADDRESS $SPARK_DRIVER_CLASS $SPARK_DRIVER_ARGS'
2018-03-07 20:32:50 INFO  SparkContext:54 - Running Spark version 2.2.0-k8s-0.5.0
2018-03-07 20:32:51 WARN  NativeCodeLoader:62 - Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
2018-03-07 20:32:51 WARN  SparkConf:66 - In Spark 1.0 and later spark.local.dir will be overridden by the value set by the cluster manager (via SPARK_LOCAL_DIRS in mesos/standalone and LOCAL_DIRS in YARN).
2018-03-07 20:32:51 INFO  SparkContext:54 - Submitted application: Spark Pi
2018-03-07 20:32:51 INFO  SecurityManager:54 - Changing view acls to: root
2018-03-07 20:32:51 INFO  SecurityManager:54 - Changing modify acls to: root
2018-03-07 20:32:51 INFO  SecurityManager:54 - Changing view acls groups to: 
2018-03-07 20:32:51 INFO  SecurityManager:54 - Changing modify acls groups to: 
2018-03-07 20:32:51 INFO  SecurityManager:54 - SecurityManager: authentication disabled; ui acls disabled; users  with view permissions: Set(root); groups with view permissions: Set(); users  with modify permissions: Set(root); groups with modify permissions: Set()
2018-03-07 20:32:51 INFO  Utils:54 - Successfully started service 'sparkDriver' on port 7078.
2018-03-07 20:32:51 INFO  SparkEnv:54 - Registering MapOutputTracker
2018-03-07 20:32:51 INFO  SparkEnv:54 - Registering BlockManagerMaster
2018-03-07 20:32:51 INFO  BlockManagerMasterEndpoint:54 - Using org.apache.spark.storage.DefaultTopologyMapper for getting topology information
2018-03-07 20:32:51 INFO  BlockManagerMasterEndpoint:54 - BlockManagerMasterEndpoint up
2018-03-07 20:32:51 INFO  DiskBlockManager:54 - Created local directory at /mnt/tmp/spark-local/spark-38688487-2c90-4b12-a24d-6a1ebb69dae1/blockmgr-addce8bd-8f9c-4274-929a-bd6020efec55
2018-03-07 20:32:51 INFO  MemoryStore:54 - MemoryStore started with capacity 408.9 MB
2018-03-07 20:32:51 INFO  SparkEnv:54 - Registering OutputCommitCoordinator
2018-03-07 20:32:51 INFO  log:192 - Logging initialized @1738ms
2018-03-07 20:32:51 INFO  Server:345 - jetty-9.3.z-SNAPSHOT
2018-03-07 20:32:52 INFO  Server:403 - Started @1809ms
2018-03-07 20:32:52 INFO  AbstractConnector:270 - Started ServerConnector@6c345c5f{HTTP/1.1,[http/1.1]}{0.0.0.0:4040}
2018-03-07 20:32:52 INFO  Utils:54 - Successfully started service 'SparkUI' on port 4040.
2018-03-07 20:32:52 INFO  ContextHandler:781 - Started o.s.j.s.ServletContextHandler@5d25e6bb{/jobs,null,AVAILABLE,@Spark}
2018-03-07 20:32:52 INFO  ContextHandler:781 - Started o.s.j.s.ServletContextHandler@4a3e3e8b{/jobs/json,null,AVAILABLE,@Spark}
2018-03-07 20:32:52 INFO  ContextHandler:781 - Started o.s.j.s.ServletContextHandler@71104a4{/jobs/job,null,AVAILABLE,@Spark}
2018-03-07 20:32:52 INFO  ContextHandler:781 - Started o.s.j.s.ServletContextHandler@3c9168dc{/jobs/job/json,null,AVAILABLE,@Spark}
2018-03-07 20:32:52 INFO  ContextHandler:781 - Started o.s.j.s.ServletContextHandler@549621f3{/stages,null,AVAILABLE,@Spark}
2018-03-07 20:32:52 INFO  ContextHandler:781 - Started o.s.j.s.ServletContextHandler@32232e55{/stages/json,null,AVAILABLE,@Spark}
2018-03-07 20:32:52 INFO  ContextHandler:781 - Started o.s.j.s.ServletContextHandler@37ebc9d8{/stages/stage,null,AVAILABLE,@Spark}
2018-03-07 20:32:52 INFO  ContextHandler:781 - Started o.s.j.s.ServletContextHandler@6e9319f{/stages/stage/json,null,AVAILABLE,@Spark}
2018-03-07 20:32:52 INFO  ContextHandler:781 - Started o.s.j.s.ServletContextHandler@7bf9b098{/stages/pool,null,AVAILABLE,@Spark}
2018-03-07 20:32:52 INFO  ContextHandler:781 - Started o.s.j.s.ServletContextHandler@77307458{/stages/pool/json,null,AVAILABLE,@Spark}
2018-03-07 20:32:52 INFO  ContextHandler:781 - Started o.s.j.s.ServletContextHandler@290b1b2e{/storage,null,AVAILABLE,@Spark}
2018-03-07 20:32:52 INFO  ContextHandler:781 - Started o.s.j.s.ServletContextHandler@33617539{/storage/json,null,AVAILABLE,@Spark}
2018-03-07 20:32:52 INFO  ContextHandler:781 - Started o.s.j.s.ServletContextHandler@5db4c359{/storage/rdd,null,AVAILABLE,@Spark}
2018-03-07 20:32:52 INFO  ContextHandler:781 - Started o.s.j.s.ServletContextHandler@18e7143f{/storage/rdd/json,null,AVAILABLE,@Spark}
2018-03-07 20:32:52 INFO  ContextHandler:781 - Started o.s.j.s.ServletContextHandler@74cec793{/environment,null,AVAILABLE,@Spark}
2018-03-07 20:32:52 INFO  ContextHandler:781 - Started o.s.j.s.ServletContextHandler@4f8969b0{/environment/json,null,AVAILABLE,@Spark}
2018-03-07 20:32:52 INFO  ContextHandler:781 - Started o.s.j.s.ServletContextHandler@192f2f27{/executors,null,AVAILABLE,@Spark}
2018-03-07 20:32:52 INFO  ContextHandler:781 - Started o.s.j.s.ServletContextHandler@c65a5ef{/executors/json,null,AVAILABLE,@Spark}
2018-03-07 20:32:52 INFO  ContextHandler:781 - Started o.s.j.s.ServletContextHandler@b672aa8{/executors/threadDump,null,AVAILABLE,@Spark}
2018-03-07 20:32:52 INFO  ContextHandler:781 - Started o.s.j.s.ServletContextHandler@ec0c838{/executors/threadDump/json,null,AVAILABLE,@Spark}
2018-03-07 20:32:52 INFO  ContextHandler:781 - Started o.s.j.s.ServletContextHandler@5cc69cfe{/static,null,AVAILABLE,@Spark}
2018-03-07 20:32:52 INFO  ContextHandler:781 - Started o.s.j.s.ServletContextHandler@62923ee6{/,null,AVAILABLE,@Spark}
2018-03-07 20:32:52 INFO  ContextHandler:781 - Started o.s.j.s.ServletContextHandler@f19c9d2{/api,null,AVAILABLE,@Spark}
2018-03-07 20:32:52 INFO  ContextHandler:781 - Started o.s.j.s.ServletContextHandler@4a067c25{/jobs/job/kill,null,AVAILABLE,@Spark}
2018-03-07 20:32:52 INFO  ContextHandler:781 - Started o.s.j.s.ServletContextHandler@3bde62ff{/stages/stage/kill,null,AVAILABLE,@Spark}
2018-03-07 20:32:52 INFO  SparkUI:54 - Bound SparkUI to 0.0.0.0, and started at http://spark-pi-1520454766917-driver-svc.default.svc.cluster.local:4040
2018-03-07 20:32:52 INFO  SparkContext:54 - Added JAR /opt/spark/examples/jars/spark-examples_2.11-2.2.0-k8s-0.5.0.jar at spark://spark-pi-1520454766917-driver-svc.default.svc.cluster.local:7078/jars/spark-examples_2.11-2.2.0-k8s-0.5.0.jar with timestamp 1520454772095
2018-03-07 20:32:52 WARN  KubernetesClusterManager:66 - The executor's init-container config map was not specified. Executors will therefore not attempt to fetch remote or submitted dependencies.
2018-03-07 20:32:52 WARN  KubernetesClusterManager:66 - The executor's init-container config map key was not specified. Executors will therefore not attempt to fetch remote or submitted dependencies.
2018-03-07 20:32:53 INFO  Utils:54 - Successfully started service 'org.apache.spark.network.netty.NettyBlockTransferService' on port 7079.
2018-03-07 20:32:53 INFO  NettyBlockTransferService:54 - Server created on spark-pi-1520454766917-driver-svc.default.svc.cluster.local:7079
2018-03-07 20:32:53 INFO  BlockManager:54 - Using org.apache.spark.storage.RandomBlockReplicationPolicy for block replication policy
2018-03-07 20:32:53 INFO  BlockManagerMaster:54 - Registering BlockManager BlockManagerId(driver, spark-pi-1520454766917-driver-svc.default.svc.cluster.local, 7079, None)
2018-03-07 20:32:53 INFO  BlockManagerMasterEndpoint:54 - Registering block manager spark-pi-1520454766917-driver-svc.default.svc.cluster.local:7079 with 408.9 MB RAM, BlockManagerId(driver, spark-pi-1520454766917-driver-svc.default.svc.cluster.local, 7079, None)
2018-03-07 20:32:53 INFO  BlockManagerMaster:54 - Registered BlockManager BlockManagerId(driver, spark-pi-1520454766917-driver-svc.default.svc.cluster.local, 7079, None)
2018-03-07 20:32:53 INFO  BlockManager:54 - Initialized BlockManager: BlockManagerId(driver, spark-pi-1520454766917-driver-svc.default.svc.cluster.local, 7079, None)
2018-03-07 20:32:53 INFO  ContextHandler:781 - Started o.s.j.s.ServletContextHandler@58a2b4c{/metrics/json,null,AVAILABLE,@Spark}
2018-03-07 20:32:54 INFO  KubernetesClusterSchedulerBackend:54 - Requesting a new executor, total executors is now 1
2018-03-07 20:32:54 INFO  KubernetesClusterSchedulerBackend:54 - Requesting a new executor, total executors is now 2
2018-03-07 20:32:54 INFO  KubernetesClusterSchedulerBackend:54 - Requesting a new executor, total executors is now 3
2018-03-07 20:32:54 INFO  KubernetesClusterSchedulerBackend:54 - Requesting a new executor, total executors is now 4
2018-03-07 20:32:54 INFO  KubernetesClusterSchedulerBackend:54 - Requesting a new executor, total executors is now 5
2018-03-07 20:33:22 INFO  KubernetesClusterSchedulerBackend:54 - SchedulerBackend is ready for scheduling beginning after waiting maxRegisteredResourcesWaitingTime: 30000(ms)
2018-03-07 20:33:22 INFO  SharedState:54 - Setting hive.metastore.warehouse.dir ('null') to the value of spark.sql.warehouse.dir ('file:/opt/spark/work-dir/spark-warehouse').
2018-03-07 20:33:22 INFO  SharedState:54 - Warehouse path is 'file:/opt/spark/work-dir/spark-warehouse'.
2018-03-07 20:33:22 INFO  ContextHandler:781 - Started o.s.j.s.ServletContextHandler@25e49cb2{/SQL,null,AVAILABLE,@Spark}
2018-03-07 20:33:22 INFO  ContextHandler:781 - Started o.s.j.s.ServletContextHandler@23382f76{/SQL/json,null,AVAILABLE,@Spark}
2018-03-07 20:33:22 INFO  ContextHandler:781 - Started o.s.j.s.ServletContextHandler@633a2e99{/SQL/execution,null,AVAILABLE,@Spark}
2018-03-07 20:33:22 INFO  ContextHandler:781 - Started o.s.j.s.ServletContextHandler@5b84f14{/SQL/execution/json,null,AVAILABLE,@Spark}
2018-03-07 20:33:22 INFO  ContextHandler:781 - Started o.s.j.s.ServletContextHandler@27a97e08{/static/sql,null,AVAILABLE,@Spark}
2018-03-07 20:33:23 INFO  StateStoreCoordinatorRef:54 - Registered StateStoreCoordinator endpoint
2018-03-07 20:33:23 INFO  SparkContext:54 - Starting job: reduce at SparkPi.scala:38
2018-03-07 20:33:23 INFO  DAGScheduler:54 - Got job 0 (reduce at SparkPi.scala:38) with 2 output partitions
2018-03-07 20:33:23 INFO  DAGScheduler:54 - Final stage: ResultStage 0 (reduce at SparkPi.scala:38)
2018-03-07 20:33:23 INFO  DAGScheduler:54 - Parents of final stage: List()
2018-03-07 20:33:23 INFO  DAGScheduler:54 - Missing parents: List()
2018-03-07 20:33:23 INFO  DAGScheduler:54 - Submitting ResultStage 0 (MapPartitionsRDD[1] at map at SparkPi.scala:34), which has no missing parents
2018-03-07 20:33:23 INFO  MemoryStore:54 - Block broadcast_0 stored as values in memory (estimated size 1832.0 B, free 408.9 MB)
2018-03-07 20:33:23 INFO  MemoryStore:54 - Block broadcast_0_piece0 stored as bytes in memory (estimated size 1172.0 B, free 408.9 MB)
2018-03-07 20:33:23 INFO  BlockManagerInfo:54 - Added broadcast_0_piece0 in memory on spark-pi-1520454766917-driver-svc.default.svc.cluster.local:7079 (size: 1172.0 B, free: 408.9 MB)
2018-03-07 20:33:23 INFO  SparkContext:54 - Created broadcast 0 from broadcast at DAGScheduler.scala:1006
2018-03-07 20:33:23 INFO  DAGScheduler:54 - Submitting 2 missing tasks from ResultStage 0 (MapPartitionsRDD[1] at map at SparkPi.scala:34) (first 15 tasks are for partitions Vector(0, 1))
2018-03-07 20:33:23 INFO  KubernetesTaskSchedulerImpl:54 - Adding task set 0.0 with 2 tasks
2018-03-07 20:33:38 WARN  KubernetesTaskSchedulerImpl:66 - Initial job has not accepted any resources; check your cluster UI to ensure that workers are registered and have sufficient resources
2018-03-07 20:33:53 WARN  KubernetesTaskSchedulerImpl:66 - Initial job has not accepted any resources; check your cluster UI to ensure that workers are registered and have sufficient resources
$ kubectl logs spark-pi-1520454766917-exec-1
++ id -u
+ myuid=0
++ id -g
+ mygid=0
++ getent passwd 0
+ uidentry=root:x:0:0:root:/root:/bin/ash
+ '[' -z root:x:0:0:root:/root:/bin/ash ']'
+ /sbin/tini -s -- /bin/sh -c 'SPARK_CLASSPATH="${SPARK_HOME}/jars/*" &&     env | grep SPARK_JAVA_OPT_ | sed '\''s/[^=]*=\(.*\)/\1/g'\'' > /tmp/java_opts.txt &&     readarray -t SPARK_EXECUTOR_JAVA_OPTS < /tmp/java_opts.txt &&     if ! [ -z ${SPARK_MOUNTED_CLASSPATH}+x} ]; then SPARK_CLASSPATH="$SPARK_MOUNTED_CLASSPATH:$SPARK_CLASSPATH"; fi &&     if ! [ -z ${SPARK_EXECUTOR_EXTRA_CLASSPATH+x} ]; then SPARK_CLASSPATH="$SPARK_EXECUTOR_EXTRA_CLASSPATH:$SPARK_CLASSPATH"; fi &&     if ! [ -z ${SPARK_EXTRA_CLASSPATH+x} ]; then SPARK_CLASSPATH="$SPARK_EXTRA_CLASSPATH:$SPARK_CLASSPATH"; fi &&     if ! [ -z ${SPARK_MOUNTED_FILES_DIR+x} ]; then cp -R "$SPARK_MOUNTED_FILES_DIR/." .; fi &&     if ! [ -z ${SPARK_MOUNTED_FILES_FROM_SECRET_DIR+x} ]; then cp -R "$SPARK_MOUNTED_FILES_FROM_SECRET_DIR/." .; fi &&     ${JAVA_HOME}/bin/java "${SPARK_EXECUTOR_JAVA_OPTS[@]}" -Dspark.executor.port=$SPARK_EXECUTOR_PORT -Xms$SPARK_EXECUTOR_MEMORY -Xmx$SPARK_EXECUTOR_MEMORY -cp $SPARK_CLASSPATH org.apache.spark.executor.CoarseGrainedExecutorBackend --driver-url $SPARK_DRIVER_URL --executor-id $SPARK_EXECUTOR_ID --cores $SPARK_EXECUTOR_CORES --app-id $SPARK_APPLICATION_ID --hostname $SPARK_EXECUTOR_POD_IP'
2018-03-07 20:32:57 INFO  CoarseGrainedExecutorBackend:2574 - Started daemon with process name: 15@spark-pi-1520454766917-exec-1
2018-03-07 20:32:57 INFO  SignalUtils:54 - Registered signal handler for TERM
2018-03-07 20:32:57 INFO  SignalUtils:54 - Registered signal handler for HUP
2018-03-07 20:32:57 INFO  SignalUtils:54 - Registered signal handler for INT
2018-03-07 20:32:57 WARN  NativeCodeLoader:62 - Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
2018-03-07 20:32:57 INFO  SecurityManager:54 - Changing view acls to: root
2018-03-07 20:32:57 INFO  SecurityManager:54 - Changing modify acls to: root
2018-03-07 20:32:57 INFO  SecurityManager:54 - Changing view acls groups to: 
2018-03-07 20:32:57 INFO  SecurityManager:54 - Changing modify acls groups to: 
2018-03-07 20:32:57 INFO  SecurityManager:54 - SecurityManager: authentication disabled; ui acls disabled; users  with view permissions: Set(root); groups with view permissions: Set(); users  with modify permissions: Set(root); groups with modify permissions: Set()
2018-03-07 20:33:02 WARN  TransportClientFactory:166 - DNS resolution for spark-pi-1520454766917-driver-svc.default.svc.cluster.local:7078 took 5008 ms
Exception in thread "main" java.lang.reflect.UndeclaredThrowableException
    at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1713)
    at org.apache.spark.deploy.SparkHadoopUtil.runAsSparkUser(SparkHadoopUtil.scala:66)
    at org.apache.spark.executor.CoarseGrainedExecutorBackend$.run(CoarseGrainedExecutorBackend.scala:188)
    at org.apache.spark.executor.CoarseGrainedExecutorBackend$.main(CoarseGrainedExecutorBackend.scala:284)
    at org.apache.spark.executor.CoarseGrainedExecutorBackend.main(CoarseGrainedExecutorBackend.scala)
Caused by: org.apache.spark.SparkException: Exception thrown in awaitResult: 
    at org.apache.spark.util.ThreadUtils$.awaitResult(ThreadUtils.scala:205)
    at org.apache.spark.rpc.RpcTimeout.awaitResult(RpcTimeout.scala:75)
    at org.apache.spark.rpc.RpcEnv.setupEndpointRefByURI(RpcEnv.scala:100)
    at org.apache.spark.executor.CoarseGrainedExecutorBackend$$anonfun$run$1.apply$mcV$sp(CoarseGrainedExecutorBackend.scala:202)
    at org.apache.spark.deploy.SparkHadoopUtil$$anon$2.run(SparkHadoopUtil.scala:67)
    at org.apache.spark.deploy.SparkHadoopUtil$$anon$2.run(SparkHadoopUtil.scala:66)
    at java.security.AccessController.doPrivileged(Native Method)
    at javax.security.auth.Subject.doAs(Subject.java:422)
    at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1698)
    ... 4 more
Caused by: java.io.IOException: Failed to connect to spark-pi-1520454766917-driver-svc.default.svc.cluster.local:7078
    at org.apache.spark.network.client.TransportClientFactory.createClient(TransportClientFactory.java:232)
    at org.apache.spark.network.client.TransportClientFactory.createClient(TransportClientFactory.java:182)
    at org.apache.spark.rpc.netty.NettyRpcEnv.createClient(NettyRpcEnv.scala:197)
    at org.apache.spark.rpc.netty.Outbox$$anon$1.call(Outbox.scala:194)
    at org.apache.spark.rpc.netty.Outbox$$anon$1.call(Outbox.scala:190)
    at java.util.concurrent.FutureTask.run(FutureTask.java:266)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
    at java.lang.Thread.run(Thread.java:748)
Caused by: java.nio.channels.UnresolvedAddressException
    at sun.nio.ch.Net.checkAddress(Net.java:101)
    at sun.nio.ch.SocketChannelImpl.connect(SocketChannelImpl.java:622)
    at io.netty.channel.socket.nio.NioSocketChannel.doConnect(NioSocketChannel.java:242)
    at io.netty.channel.nio.AbstractNioChannel$AbstractNioUnsafe.connect(AbstractNioChannel.java:205)
    at io.netty.channel.DefaultChannelPipeline$HeadContext.connect(DefaultChannelPipeline.java:1226)
    at io.netty.channel.AbstractChannelHandlerContext.invokeConnect(AbstractChannelHandlerContext.java:540)
    at io.netty.channel.AbstractChannelHandlerContext.connect(AbstractChannelHandlerContext.java:525)
    at io.netty.channel.ChannelOutboundHandlerAdapter.connect(ChannelOutboundHandlerAdapter.java:47)
    at io.netty.channel.AbstractChannelHandlerContext.invokeConnect(AbstractChannelHandlerContext.java:540)
    at io.netty.channel.AbstractChannelHandlerContext.connect(AbstractChannelHandlerContext.java:525)
    at io.netty.channel.ChannelDuplexHandler.connect(ChannelDuplexHandler.java:50)
    at io.netty.channel.AbstractChannelHandlerContext.invokeConnect(AbstractChannelHandlerContext.java:540)
    at io.netty.channel.AbstractChannelHandlerContext.connect(AbstractChannelHandlerContext.java:525)
    at io.netty.channel.AbstractChannelHandlerContext.connect(AbstractChannelHandlerContext.java:507)
    at io.netty.channel.DefaultChannelPipeline.connect(DefaultChannelPipeline.java:970)
    at io.netty.channel.AbstractChannel.connect(AbstractChannel.java:215)
    at io.netty.bootstrap.Bootstrap$2.run(Bootstrap.java:166)
    at io.netty.util.concurrent.SingleThreadEventExecutor.runAllTasks(SingleThreadEventExecutor.java:399)
    at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:446)
    at io.netty.util.concurrent.SingleThreadEventExecutor$2.run(SingleThreadEventExecutor.java:131)
    at io.netty.util.concurrent.DefaultThreadFactory$DefaultRunnableDecorator.run(DefaultThreadFactory.java:144)
    ... 1 more
mccheah commented 6 years ago

My understanding is that opening pod ports is only a formality and that these ports are not actually necessary to set. But this might vary from cluster to cluster, and might be specific to later versions of Kubernetes. @foxish for thoughts.

adelbertc commented 6 years ago

I believe it can vary from cluster to cluster, and that explicitly opening ports is the right way to move forward. Some deployments have pod ports open by default, which is likely why it has been working for many folks; but for those who have some network policy in place under which ports must be explicitly opened, the project as-is does not work in those clusters. A hypothetical example of such a policy is sketched below.
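For illustration, in a cluster where traffic must be explicitly allowed, one would need something like the following NetworkPolicy on top of any container port declarations. This is entirely hypothetical and assumes executor pods carry a spark-role=executor label, mirroring the spark-role=driver label we saw on the driver:

apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: allow-executors-to-driver   # hypothetical name
  namespace: default
spec:
  podSelector:
    matchLabels:
      spark-role: driver
  ingress:
  - from:
    - podSelector:
        matchLabels:
          spark-role: executor      # assumed executor label
    ports:
    - protocol: TCP
      port: 7078    # driver RPC
    - protocol: TCP
      port: 7079    # block manager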

emcfarlane commented 6 years ago

I've noticed a similar error when setting a driver label:

--conf "spark.kubernetes.driver.label.spark-app-name=pi"

The label gets added to the service's selector, and the service then can't resolve.

kubectl describe service pi-ce2cd87021863a03986a4a98cb074128-driver-svc
Name:              pi-ce2cd87021863a03986a4a98cb074128-driver-svc
Namespace:         default
Labels:            <none>
Annotations:       <none>
Selector:          spark-app-name=pi,spark-app-selector=spark-8b71b8ec3f6843428f2b9e8ea9b1bfb6,spark-role=driver
Type:              ClusterIP
IP:                None
Port:              driver-rpc-port  7078/TCP
TargetPort:        7078/TCP
Endpoints:         10.1.4.86:7078
Port:              blockmanager  7079/TCP
TargetPort:        7079/TCP
Endpoints:         10.1.4.86:7079
Session Affinity:  None
Events:            <none>