apache-spark-on-k8s / spark

Apache Spark enhanced with a native Kubernetes scheduler back-end. NOTE: this repository is being ARCHIVED, as all new development for the Kubernetes scheduler back-end now happens at https://github.com/apache/spark/
https://spark.apache.org/
Apache License 2.0

Executor can not connect to driver service #547

Closed benni82 closed 6 years ago

benni82 commented 6 years ago

The driver pod:

> kubectl exec -it spark-pi-1510312707340-driver -- netstat -ntlp
Active Internet connections (only servers)
Proto Recv-Q Send-Q Local Address           Foreign Address         State       PID/Program name
tcp        0      0 127.0.0.1:7078          0.0.0.0:*               LISTEN      16/java
tcp        0      0 127.0.0.1:7079          0.0.0.0:*               LISTEN      16/java
tcp        0      0 0.0.0.0:4040            0.0.0.0:*               LISTEN      16/java

The executor logs:

Exception in thread "main" java.lang.reflect.UndeclaredThrowableException
    at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1713)
    at org.apache.spark.deploy.SparkHadoopUtil.runAsSparkUser(SparkHadoopUtil.scala:66)
    at org.apache.spark.executor.CoarseGrainedExecutorBackend$.run(CoarseGrainedExecutorBackend.scala:188)
    at org.apache.spark.executor.CoarseGrainedExecutorBackend$.main(CoarseGrainedExecutorBackend.scala:284)
    at org.apache.spark.executor.CoarseGrainedExecutorBackend.main(CoarseGrainedExecutorBackend.scala)
Caused by: org.apache.spark.SparkException: Exception thrown in awaitResult:
    at org.apache.spark.util.ThreadUtils$.awaitResult(ThreadUtils.scala:205)
    at org.apache.spark.rpc.RpcTimeout.awaitResult(RpcTimeout.scala:75)
    at org.apache.spark.rpc.RpcEnv.setupEndpointRefByURI(RpcEnv.scala:100)
    at org.apache.spark.executor.CoarseGrainedExecutorBackend$$anonfun$run$1.apply$mcV$sp(CoarseGrainedExecutorBackend.scala:202)
    at org.apache.spark.deploy.SparkHadoopUtil$$anon$2.run(SparkHadoopUtil.scala:67)
    at org.apache.spark.deploy.SparkHadoopUtil$$anon$2.run(SparkHadoopUtil.scala:66)
    at java.security.AccessController.doPrivileged(Native Method)
    at javax.security.auth.Subject.doAs(Subject.java:422)
    at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1698)
    ... 4 more
Caused by: java.io.IOException: Failed to connect to spark-pi-1510312707340-driver-svc.default.svc.cluster.local/10.244.194.72:7078
    at org.apache.spark.network.client.TransportClientFactory.createClient(TransportClientFactory.java:232)
    at org.apache.spark.network.client.TransportClientFactory.createClient(TransportClientFactory.java:182)
    at org.apache.spark.rpc.netty.NettyRpcEnv.createClient(NettyRpcEnv.scala:197)
    at org.apache.spark.rpc.netty.Outbox$$anon$1.call(Outbox.scala:194)
    at org.apache.spark.rpc.netty.Outbox$$anon$1.call(Outbox.scala:190)
    at java.util.concurrent.FutureTask.run(FutureTask.java:266)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
    at java.lang.Thread.run(Thread.java:748)
Caused by: io.netty.channel.AbstractChannel$AnnotatedConnectException: Connection refused: spark-pi-1510312707340-driver-svc.default.svc.cluster.local/10.244.194.72:7078
    at sun.nio.ch.SocketChannelImpl.checkConnect(Native Method)
    at sun.nio.ch.SocketChannelImpl.finishConnect(SocketChannelImpl.java:717)
    at io.netty.channel.socket.nio.NioSocketChannel.doFinishConnect(NioSocketChannel.java:257)
    at io.netty.channel.nio.AbstractNioChannel$AbstractNioUnsafe.finishConnect(AbstractNioChannel.java:291)
    at io.netty.channel.nio.NioEventLoop.processSelectedKey(NioEventLoop.java:631)
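
The DNS name resolves (to 10.244.194.72), so the failure is the TCP connect itself, which is consistent with the loopback-only bind shown above. To double-check that the headless driver service points at the driver pod, something like this should work (service name taken from this run):

    kubectl get endpoints spark-pi-1510312707340-driver-svc
    kubectl describe svc spark-pi-1510312707340-driver-svc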

The driver logs:

2017-11-10 11:18:31 INFO  SparkContext:54 - Running Spark version 2.2.0-k8s-0.5.0
2017-11-10 11:18:32 WARN  NativeCodeLoader:62 - Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
2017-11-10 11:18:32 INFO  SparkContext:54 - Submitted application: Spark Pi
2017-11-10 11:18:32 INFO  SecurityManager:54 - Changing view acls to: root
2017-11-10 11:18:32 INFO  SecurityManager:54 - Changing modify acls to: root
2017-11-10 11:18:32 INFO  SecurityManager:54 - Changing view acls groups to:
2017-11-10 11:18:32 INFO  SecurityManager:54 - Changing modify acls groups to:
2017-11-10 11:18:32 INFO  SecurityManager:54 - SecurityManager: authentication disabled; ui acls disabled; users  with view permissions: Set(root); groups with view permissions: Set(); users  with modify permissions: Set(root); groups with modify permissions: Set()
2017-11-10 11:18:32 INFO  Utils:54 - Successfully started service 'sparkDriver' on port 7078.
2017-11-10 11:18:32 INFO  SparkEnv:54 - Registering MapOutputTracker
2017-11-10 11:18:32 INFO  SparkEnv:54 - Registering BlockManagerMaster
2017-11-10 11:18:32 INFO  BlockManagerMasterEndpoint:54 - Using org.apache.spark.storage.DefaultTopologyMapper for getting topology information
2017-11-10 11:18:32 INFO  BlockManagerMasterEndpoint:54 - BlockManagerMasterEndpoint up
2017-11-10 11:18:33 INFO  DiskBlockManager:54 - Created local directory at /tmp/blockmgr-1562fc45-52a1-4eba-81f9-01a11221533d
2017-11-10 11:18:33 INFO  MemoryStore:54 - MemoryStore started with capacity 408.9 MB
2017-11-10 11:18:33 INFO  SparkEnv:54 - Registering OutputCommitCoordinator
2017-11-10 11:18:33 INFO  log:192 - Logging initialized @2963ms
2017-11-10 11:18:33 INFO  Server:345 - jetty-9.3.z-SNAPSHOT
2017-11-10 11:18:33 INFO  Server:403 - Started @3102ms
2017-11-10 11:18:33 INFO  AbstractConnector:270 - Started ServerConnector@510c453{HTTP/1.1,[http/1.1]}{0.0.0.0:4040}
2017-11-10 11:18:33 INFO  Utils:54 - Successfully started service 'SparkUI' on port 4040.
2017-11-10 11:18:33 INFO  ContextHandler:781 - Started o.s.j.s.ServletContextHandler@3ebff828{/jobs,null,AVAILABLE,@Spark}
2017-11-10 11:18:33 INFO  ContextHandler:781 - Started o.s.j.s.ServletContextHandler@66629f63{/jobs/json,null,AVAILABLE,@Spark}
2017-11-10 11:18:33 INFO  ContextHandler:781 - Started o.s.j.s.ServletContextHandler@27a5328c{/jobs/job,null,AVAILABLE,@Spark}
2017-11-10 11:18:33 INFO  ContextHandler:781 - Started o.s.j.s.ServletContextHandler@6b5966e1{/jobs/job/json,null,AVAILABLE,@Spark}
2017-11-10 11:18:33 INFO  ContextHandler:781 - Started o.s.j.s.ServletContextHandler@1568159{/stages,null,AVAILABLE,@Spark}
2017-11-10 11:18:33 INFO  ContextHandler:781 - Started o.s.j.s.ServletContextHandler@6f80fafe{/stages/json,null,AVAILABLE,@Spark}
2017-11-10 11:18:33 INFO  ContextHandler:781 - Started o.s.j.s.ServletContextHandler@f9879ac{/stages/stage,null,AVAILABLE,@Spark}
2017-11-10 11:18:33 INFO  ContextHandler:781 - Started o.s.j.s.ServletContextHandler@224b4d61{/stages/stage/json,null,AVAILABLE,@Spark}
2017-11-10 11:18:33 INFO  ContextHandler:781 - Started o.s.j.s.ServletContextHandler@303e3593{/stages/pool,null,AVAILABLE,@Spark}
2017-11-10 11:18:33 INFO  ContextHandler:781 - Started o.s.j.s.ServletContextHandler@362a019c{/stages/pool/json,null,AVAILABLE,@Spark}
2017-11-10 11:18:33 INFO  ContextHandler:781 - Started o.s.j.s.ServletContextHandler@5c48c0c0{/storage,null,AVAILABLE,@Spark}
2017-11-10 11:18:33 INFO  ContextHandler:781 - Started o.s.j.s.ServletContextHandler@674c583e{/storage/json,null,AVAILABLE,@Spark}
2017-11-10 11:18:33 INFO  ContextHandler:781 - Started o.s.j.s.ServletContextHandler@3f23a3a0{/storage/rdd,null,AVAILABLE,@Spark}
2017-11-10 11:18:33 INFO  ContextHandler:781 - Started o.s.j.s.ServletContextHandler@5fb97279{/storage/rdd/json,null,AVAILABLE,@Spark}
2017-11-10 11:18:33 INFO  ContextHandler:781 - Started o.s.j.s.ServletContextHandler@61861a29{/environment,null,AVAILABLE,@Spark}
2017-11-10 11:18:33 INFO  ContextHandler:781 - Started o.s.j.s.ServletContextHandler@25bcd0c7{/environment/json,null,AVAILABLE,@Spark}
2017-11-10 11:18:33 INFO  ContextHandler:781 - Started o.s.j.s.ServletContextHandler@63cd604c{/executors,null,AVAILABLE,@Spark}
2017-11-10 11:18:33 INFO  ContextHandler:781 - Started o.s.j.s.ServletContextHandler@3a4e343{/executors/json,null,AVAILABLE,@Spark}
2017-11-10 11:18:33 INFO  ContextHandler:781 - Started o.s.j.s.ServletContextHandler@62dae245{/executors/threadDump,null,AVAILABLE,@Spark}
2017-11-10 11:18:33 INFO  ContextHandler:781 - Started o.s.j.s.ServletContextHandler@6fff253c{/executors/threadDump/json,null,AVAILABLE,@Spark}
2017-11-10 11:18:33 INFO  ContextHandler:781 - Started o.s.j.s.ServletContextHandler@591e58fa{/static,null,AVAILABLE,@Spark}
2017-11-10 11:18:33 INFO  ContextHandler:781 - Started o.s.j.s.ServletContextHandler@2f7a7219{/,null,AVAILABLE,@Spark}
2017-11-10 11:18:33 INFO  ContextHandler:781 - Started o.s.j.s.ServletContextHandler@3a1d593e{/api,null,AVAILABLE,@Spark}
2017-11-10 11:18:33 INFO  ContextHandler:781 - Started o.s.j.s.ServletContextHandler@664a9613{/jobs/job/kill,null,AVAILABLE,@Spark}
2017-11-10 11:18:33 INFO  ContextHandler:781 - Started o.s.j.s.ServletContextHandler@15a902e7{/stages/stage/kill,null,AVAILABLE,@Spark}
2017-11-10 11:18:33 INFO  SparkUI:54 - Bound SparkUI to 0.0.0.0, and started at http://spark-pi-1510312707340-driver-svc.default.svc.cluster.local:4040
2017-11-10 11:18:33 INFO  SparkContext:54 - Added JAR /opt/spark/examples/jars/spark-examples_2.11-2.2.0-k8s-0.5.0.jar at spark://spark-pi-1510312707340-driver-svc.default.svc.cluster.local:7078/jars/spark-examples_2.11-2.2.0-k8s-0.5.0.jar with timestamp 1510312713625
2017-11-10 11:18:33 WARN  KubernetesClusterManager:66 - The executor's init-container config map was not specified. Executors will therefore not attempt to fetch remote or submitted dependencies.
2017-11-10 11:18:33 WARN  KubernetesClusterManager:66 - The executor's init-container config map key was not specified. Executors will therefore not attempt to fetch remote or submitted dependencies.
2017-11-10 11:18:35 INFO  Utils:54 - Successfully started service 'org.apache.spark.network.netty.NettyBlockTransferService' on port 7079.
2017-11-10 11:18:35 INFO  NettyBlockTransferService:54 - Server created on spark-pi-1510312707340-driver-svc.default.svc.cluster.local:7079
2017-11-10 11:18:35 INFO  BlockManager:54 - Using org.apache.spark.storage.RandomBlockReplicationPolicy for block replication policy
2017-11-10 11:18:35 INFO  BlockManagerMaster:54 - Registering BlockManager BlockManagerId(driver, spark-pi-1510312707340-driver-svc.default.svc.cluster.local, 7079, None)
2017-11-10 11:18:35 INFO  BlockManagerMasterEndpoint:54 - Registering block manager spark-pi-1510312707340-driver-svc.default.svc.cluster.local:7079 with 408.9 MB RAM, BlockManagerId(driver, spark-pi-1510312707340-driver-svc.default.svc.cluster.local, 7079, None)
2017-11-10 11:18:35 INFO  BlockManagerMaster:54 - Registered BlockManager BlockManagerId(driver, spark-pi-1510312707340-driver-svc.default.svc.cluster.local, 7079, None)
2017-11-10 11:18:35 INFO  BlockManager:54 - Initialized BlockManager: BlockManagerId(driver, spark-pi-1510312707340-driver-svc.default.svc.cluster.local, 7079, None)
2017-11-10 11:18:35 INFO  ContextHandler:781 - Started o.s.j.s.ServletContextHandler@537c8c7e{/metrics/json,null,AVAILABLE,@Spark}
2017-11-10 11:18:37 INFO  KubernetesClusterSchedulerBackend:54 - Requesting a new executor, total executors is now 1
2017-11-10 11:18:37 INFO  KubernetesClusterSchedulerBackend:54 - Requesting a new executor, total executors is now 2
2017-11-10 11:18:37 INFO  KubernetesClusterSchedulerBackend:54 - Requesting a new executor, total executors is now 3
2017-11-10 11:18:37 INFO  KubernetesClusterSchedulerBackend:54 - Requesting a new executor, total executors is now 4
2017-11-10 11:18:37 INFO  KubernetesClusterSchedulerBackend:54 - Requesting a new executor, total executors is now 5
2017-11-10 11:19:04 INFO  KubernetesClusterSchedulerBackend:54 - SchedulerBackend is ready for scheduling beginning after waiting maxRegisteredResourcesWaitingTime: 30000(ms)
2017-11-10 11:19:04 INFO  SharedState:54 - Setting hive.metastore.warehouse.dir ('null') to the value of spark.sql.warehouse.dir ('file:/opt/spark/work-dir/spark-warehouse').
2017-11-10 11:19:04 INFO  SharedState:54 - Warehouse path is 'file:/opt/spark/work-dir/spark-warehouse'.
2017-11-10 11:19:04 INFO  ContextHandler:781 - Started o.s.j.s.ServletContextHandler@4e73b552{/SQL,null,AVAILABLE,@Spark}
2017-11-10 11:19:04 INFO  ContextHandler:781 - Started o.s.j.s.ServletContextHandler@2cec704c{/SQL/json,null,AVAILABLE,@Spark}
2017-11-10 11:19:04 INFO  ContextHandler:781 - Started o.s.j.s.ServletContextHandler@3659d7b1{/SQL/execution,null,AVAILABLE,@Spark}
2017-11-10 11:19:04 INFO  ContextHandler:781 - Started o.s.j.s.ServletContextHandler@3bdb2c78{/SQL/execution/json,null,AVAILABLE,@Spark}
2017-11-10 11:19:04 INFO  ContextHandler:781 - Started o.s.j.s.ServletContextHandler@6e1b9411{/static/sql,null,AVAILABLE,@Spark}
2017-11-10 11:19:06 INFO  StateStoreCoordinatorRef:54 - Registered StateStoreCoordinator endpoint
2017-11-10 11:19:07 INFO  SparkContext:54 - Starting job: reduce at SparkPi.scala:38
2017-11-10 11:19:07 INFO  DAGScheduler:54 - Got job 0 (reduce at SparkPi.scala:38) with 5 output partitions
2017-11-10 11:19:07 INFO  DAGScheduler:54 - Final stage: ResultStage 0 (reduce at SparkPi.scala:38)
2017-11-10 11:19:07 INFO  DAGScheduler:54 - Parents of final stage: List()
2017-11-10 11:19:07 INFO  DAGScheduler:54 - Missing parents: List()
2017-11-10 11:19:07 INFO  DAGScheduler:54 - Submitting ResultStage 0 (MapPartitionsRDD[1] at map at SparkPi.scala:34), which has no missing parents
2017-11-10 11:19:07 INFO  MemoryStore:54 - Block broadcast_0 stored as values in memory (estimated size 1832.0 B, free 408.9 MB)
2017-11-10 11:19:07 INFO  MemoryStore:54 - Block broadcast_0_piece0 stored as bytes in memory (estimated size 1172.0 B, free 408.9 MB)
2017-11-10 11:19:07 INFO  BlockManagerInfo:54 - Added broadcast_0_piece0 in memory on spark-pi-1510312707340-driver-svc.default.svc.cluster.local:7079 (size: 1172.0 B, free: 408.9 MB)
2017-11-10 11:19:07 INFO  SparkContext:54 - Created broadcast 0 from broadcast at DAGScheduler.scala:1006
2017-11-10 11:19:07 INFO  DAGScheduler:54 - Submitting 5 missing tasks from ResultStage 0 (MapPartitionsRDD[1] at map at SparkPi.scala:34) (first 15 tasks are for partitions Vector(0, 1, 2, 3, 4))
2017-11-10 11:19:07 INFO  KubernetesTaskSchedulerImpl:54 - Adding task set 0.0 with 5 tasks
2017-11-10 11:19:23 WARN  KubernetesTaskSchedulerImpl:66 - Initial job has not accepted any resources; check your cluster UI to ensure that workers are registered and have sufficient resources
2017-11-10 11:19:38 WARN  KubernetesTaskSchedulerImpl:66 - Initial job has not accepted any resources; check your cluster UI to ensure that workers are registered and have sufficient resources
2017-11-10 11:19:53 WARN  KubernetesTaskSchedulerImpl:66 - Initial job has not accepted any resources; check your cluster UI to ensure that workers are registered and have sufficient resources
2017-11-10 11:20:08 WARN  KubernetesTaskSchedulerImpl:66 - Initial job has not accepted any resources; check your cluster UI to ensure that workers are registered and have sufficient resources
2017-11-10 11:20:23 WARN  KubernetesTaskSchedulerImpl:66 - Initial job has not accepted any resources; check your cluster UI to ensure that workers are registered and have sufficient resources

The submit command is:

 bin/spark-submit \
    --deploy-mode cluster \
    --class org.apache.spark.examples.SparkPi \
    --master k8s://https://xxxxxxx:6443 \
    --kubernetes-namespace default \
    --conf spark.executor.instances=6 \
    --conf spark.app.name=spark-pi \
    --conf spark.kubernetes.driverEnv.SPARK_LOCAL_IP=0.0.0.0 \
    --conf spark.kubernetes.driver.docker.image=kubespark/spark-driver:v2.2.0-kubernetes-0.5.0 \
    --conf spark.kubernetes.executor.docker.image=kubespark/spark-executor:v2.2.0-kubernetes-0.5.0 \
    local:///opt/spark/examples/jars/spark-examples_2.11-2.2.0-k8s-0.5.0.jar 5
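
Since the driver keeps logging "Initial job has not accepted any resources", it is also worth checking whether the executor pods came up and what they log. Something like this should work (executor pod names share the driver's prefix; the exact -exec-1 suffix here is an assumption):

    kubectl get pods | grep spark-pi-1510312707340
    kubectl logs spark-pi-1510312707340-exec-1
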
liyinan926 commented 6 years ago

Have you tried running it without --conf spark.kubernetes.driverEnv.SPARK_LOCAL_IP=0.0.0.0?

liyinan926 commented 6 years ago

BTW: can you check the value of environment variable SPARK_DRIVER_BIND_ADDRESS in the driver container?
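
For example, something along these lines should show it (pod name from the run above; assumes printenv exists in the image):

    kubectl exec -it spark-pi-1510312707340-driver -- printenv SPARK_DRIVER_BIND_ADDRESS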

benni82 commented 6 years ago

I also ran the task with --conf spark.kubernetes.driverEnv.SPARK_LOCAL_IP=0.0.0.0, and ran ps -ef | more in the driver, but it still does not work correctly. The tail of the driver's java command line shows an empty bind address:

-Dspark.driver.bindAddress= org.apache.spark.examples.SparkPi 5
benni82 commented 6 years ago

I ran the task with --conf spark.kubernetes.driverEnv.SPARK_DRIVER_BIND_ADDRESS=0.0.0.0, and it works correctly!

liyinan926 commented 6 years ago

OK, so if you don't set SPARK_DRIVER_BIND_ADDRESS explicitly, what's its value in the driver container? And have you run it without setting SPARK_DRIVER_BIND_ADDRESS and SPARK_LOCAL_IP? The Kubernetes downward API is used to set SPARK_DRIVER_BIND_ADDRESS automatically to the driver pod IP, so you shouldn't need to set it explicitly.
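
If that wiring is in place, the driver pod spec should contain an env entry whose value comes from the pod IP. One rough way to check (pod name from the run above):

    kubectl get pod spark-pi-1510312707340-driver -o yaml | grep -A 4 SPARK_DRIVER_BIND_ADDRESS
    # the entry should reference fieldRef with fieldPath: status.podIP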

benni82 commented 6 years ago

I have to set it explicitly; if I run without setting SPARK_DRIVER_BIND_ADDRESS, the driver's spark.driver.bindAddress value is empty.

benni82 commented 6 years ago

When run without --conf spark.kubernetes.driverEnv.SPARK_LOCAL_IP=0.0.0.0, the output of ps -ef | more in the driver is:

14 root       0:16 /usr/lib/jvm/java-1.8-openjdk/bin/java -Dspark.driver.host=spark-pi-1510572734487-driver-svc.default.svc.cluster.local -Dspark.kubernetes.executor.podNamePrefix=spark-pi-1510572734487 -Dspark.driver.port=7078 -Dspark.driver.bindAddress=spark-pi-1510572734487-driver-svc.default.svc.cluster.local -Dspark.master=k8s://https://10.218.13age=kubespark/spark-driver:v2.2.0-kubernetes-0.5.0 -Dspark.executor.instances=3 -Dspark.driver.blockManager.port=7079 -Dspark.submit.deployMode=cluster -Dspark.app.id=spark-4d60858edcfe47189501d25962c6676b -Dspark.kubernetes.namespace=default -Dspark.app.name=spark-pi -Dspark.kubernetes.executor.docker.image=kubespark/spark-executor:v2.2.0-kubernetes-0.5.0 -Dspark.kubernetes.driver.pod.name=spark-pi-1510572734487-driver -Dspark.jars=/opt/spark/examples/jars/spark-examples_2.11-2.2.0-k8s-0.5.0.jar -cp /opt/spark/examples/jars/spark-examples_2.11-2.2.0-k8s-0.5.0.jar:/opt/spark/jars/* -Xms1g -Xmx1g -Dspark.driver.bindAddress= org.apache.spark.examples.SparkPi 100
liyinan926 commented 6 years ago

when run without --conf spark.kubernetes.driverEnv.SPARK_LOCAL_IP=0.0.0.0

I saw both -Dspark.driver.bindAddress=spark-pi-1510572734487-driver-svc.default.svc.cluster.local and -Dspark.driver.bindAddress= in the ps output, which is unexpected. Just to confirm that in this run, you didn't set SPARK_DRIVER_BIND_ADDRESS, right?

Which version of Spark on K8s are you running, and are you using the latest release distribution to submit your job? What is the Kubernetes version of your cluster?

benni82 commented 6 years ago

Yes, I didn't set SPARK_DRIVER_BIND_ADDRESS.

I used spark-2.2.0-k8s-0.4.0-bin-2.7.3 to submit, but used the 0.5.0 images (kubespark/spark-driver:v2.2.0-kubernetes-0.5.0):

bin/spark-submit \
  --deploy-mode cluster \
  --class org.apache.spark.examples.SparkPi \
  --master k8s://https://xxxxxxx:6443 \
  --kubernetes-namespace default \
  --conf spark.executor.instances=3 \
  --conf spark.app.name=spark-pi \
  --conf spark.kubernetes.driver.docker.image=kubespark/spark-driver:v2.2.0-kubernetes-0.5.0 \
  --conf spark.kubernetes.executor.docker.image=kubespark/spark-executor:v2.2.0-kubernetes-0.5.0 \
  local:///opt/spark/examples/jars/spark-examples_2.11-2.2.0-k8s-0.5.0.jar 100
benni82 commented 6 years ago

I downloaded spark-2.2.0-k8s-0.5.0-bin-2.7.3, and it now runs correctly. It seems I was using the 0.4.0 distribution to submit against the newest (0.5.0) images, which caused the problem.
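
A quick sanity check before submitting is to confirm that the client distribution matches the driver/executor image tags, for example:

    cd spark-2.2.0-k8s-0.5.0-bin-2.7.3
    bin/spark-submit --version
    # should report 2.2.0-k8s-0.5.0, matching the v2.2.0-kubernetes-0.5.0 images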

foxish commented 6 years ago

Is this OK to close, @benni82?