apache-spark-on-k8s / spark

Apache Spark enhanced with a native Kubernetes scheduler back-end. NOTE: this repository is being ARCHIVED, as all new development for the Kubernetes scheduler back-end now happens at https://github.com/apache/spark/
https://spark.apache.org/
Apache License 2.0

Failed to run the sample spark-pi test using spark-submit on the doc #478

Closed by rootsongjc 7 years ago

rootsongjc commented 7 years ago

Environment

Command

./bin/spark-submit \
  --deploy-mode cluster \
  --class org.apache.spark.examples.SparkPi \
  --master k8s://https://172.20.0.113:6443 \
  --kubernetes-namespace spark-cluster \
  --conf spark.executor.instances=5 \
  --conf spark.app.name=spark-pi \
  --conf spark.kubernetes.driver.docker.image=sz-pg-oam-docker-hub-001.tendcloud.com/library/kubespark-spark-driver:v2.1.0-kubernetes-0.3.1 \
  --conf spark.kubernetes.executor.docker.image=sz-pg-oam-docker-hub-001.tendcloud.com/library/kubespark-spark-executor:v2.1.0-kubernetes-0.3.1 \
  --conf spark.kubernetes.initcontainer.docker.image=sz-pg-oam-docker-hub-001.tendcloud.com/library/kubespark-spark-init:v2.1.0-kubernetes-0.3.1 \
local:///opt/spark/examples/jars/spark-examples_2.11-2.1.0-k8s-0.3.1-SNAPSHOT.jar

I pulled the Docker images and pushed them to my registry.

Logs

Spark-submit logs

2017-09-05 14:45:56 INFO  LoggingPodStatusWatcherImpl:54 - State changed, new state:
     pod name: spark-pi-1504593950039-driver
     namespace: spark-cluster
     labels: spark-app-selector -> spark-81cd1d33adbd4f728f7c609356b54c43, spark-role -> driver
     pod uid: dbf66ecf-9205-11e7-970c-f4e9d49f8ed0
     creation time: 2017-09-05T06:45:52Z
     service account name: default
     volumes: default-token-klxp8
     node name: 172.20.0.115
     start time: 2017-09-05T06:45:52Z
     container images: sz-pg-oam-docker-hub-001.tendcloud.com/library/kubespark-spark-driver:v2.1.0-kubernetes-0.3.1
     phase: Failed
     status: [ContainerStatus(containerID=docker://53de39eb83435a344ef780aae83139229d4d6d78fa4e1655f9f81da95d89f439, image=sz-pg-oam-docker-hub-001.tendcloud.com/library/kubespark-spark-driver:v2.1.0-kubernetes-0.3.1, imageID=docker-pullable://sz-pg-oam-docker-hub-001.tendcloud.com/library/kubespark-spark-driver@sha256:19c3b76a34fee02104de0d859a60d79608ebd0b7ebae33ec3b86a71af777c833, lastState=ContainerState(running=null, terminated=null, waiting=null, additionalProperties={}), name=spark-kubernetes-driver, ready=false, restartCount=0, state=ContainerState(running=null, terminated=ContainerStateTerminated(containerID=docker://53de39eb83435a344ef780aae83139229d4d6d78fa4e1655f9f81da95d89f439, exitCode=1, finishedAt=2017-09-05T06:45:55Z, message=null, reason=Error, signal=null, startedAt=null, additionalProperties={}), waiting=null, additionalProperties={}), additionalProperties={})]
2017-09-05 14:45:56 INFO  LoggingPodStatusWatcherImpl:54 - Container final statuses:

     Container name: spark-kubernetes-driver
     Container image: sz-pg-oam-docker-hub-001.tendcloud.com/library/kubespark-spark-driver:v2.1.0-kubernetes-0.3.1
     Container state: Terminated
     Exit code: 1
2017-09-05 14:45:56 INFO  Client:54 - Application spark-pi finished.
foxish commented 7 years ago

@rootsongjc, can you post the logs from the driver/executor pods?

rootsongjc commented 7 years ago

@foxish No pod was created. I can't figure out what happened from the spark-submit logs.

mccheah commented 7 years ago

I believe we might delete the driver pod entirely if we fail to submit the job. But at some point a driver pod should be created, and eventually it will fail. Does a driver pod ever appear if you run:

kubectl get pods -n <namespace> -w

The -w flag watches the pods so they can be followed as they are created and terminated.
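For reference, a short sketch of this workflow: watch the namespace and then grab the logs of the short-lived driver pod before it is cleaned up (the namespace and pod name below are taken from the submission log above):

```shell
# Watch pods in the job namespace as they are created and terminated
kubectl get pods -n spark-cluster -w

# Once a driver pod name appears, fetch its logs in another terminal;
# add --previous if the container has already restarted
kubectl logs spark-pi-1504593950039-driver -n spark-cluster
```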

rootsongjc commented 7 years ago

@mccheah @foxish I found a few failed pods by running the command kubectl get pods --namespace spark-cluster. Here are the error pod logs.

Error Pod Logs

2017-09-05 08:54:41 INFO  SparkContext:54 - Running Spark version 2.1.0-k8s-0.3.1-SNAPSHOT
2017-09-05 08:54:41 WARN  NativeCodeLoader:62 - Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
2017-09-05 08:54:41 INFO  SecurityManager:54 - Changing view acls to: root
2017-09-05 08:54:41 INFO  SecurityManager:54 - Changing modify acls to: root
2017-09-05 08:54:41 INFO  SecurityManager:54 - Changing view acls groups to:
2017-09-05 08:54:41 INFO  SecurityManager:54 - Changing modify acls groups to:
2017-09-05 08:54:41 INFO  SecurityManager:54 - SecurityManager: authentication disabled; ui acls disabled; users  with view permissions: Set(root); groups with view permissions: Set(); users  with modify permissions: Set(root); groups with modify permissions: Set()
2017-09-05 08:54:42 INFO  Utils:54 - Successfully started service 'sparkDriver' on port 36433.
2017-09-05 08:54:42 INFO  SparkEnv:54 - Registering MapOutputTracker
2017-09-05 08:54:42 INFO  SparkEnv:54 - Registering BlockManagerMaster
2017-09-05 08:54:42 INFO  BlockManagerMasterEndpoint:54 - Using org.apache.spark.storage.DefaultTopologyMapper for getting topology information
2017-09-05 08:54:42 INFO  BlockManagerMasterEndpoint:54 - BlockManagerMasterEndpoint up
2017-09-05 08:54:42 INFO  DiskBlockManager:54 - Created local directory at /tmp/blockmgr-c13482cf-dde9-4e2b-a185-0bce58575e43
2017-09-05 08:54:42 INFO  MemoryStore:54 - MemoryStore started with capacity 629.7 MB
2017-09-05 08:54:42 INFO  SparkEnv:54 - Registering OutputCommitCoordinator
2017-09-05 08:54:42 INFO  log:186 - Logging initialized @1622ms
2017-09-05 08:54:42 INFO  Server:327 - jetty-9.2.z-SNAPSHOT
2017-09-05 08:54:42 INFO  ContextHandler:744 - Started o.s.j.s.ServletContextHandler@25ddbbbb{/jobs,null,AVAILABLE}
2017-09-05 08:54:42 INFO  ContextHandler:744 - Started o.s.j.s.ServletContextHandler@1536602f{/jobs/json,null,AVAILABLE}
2017-09-05 08:54:42 INFO  ContextHandler:744 - Started o.s.j.s.ServletContextHandler@4ebea12c{/jobs/job,null,AVAILABLE}
2017-09-05 08:54:42 INFO  ContextHandler:744 - Started o.s.j.s.ServletContextHandler@2a1edad4{/jobs/job/json,null,AVAILABLE}
2017-09-05 08:54:42 INFO  ContextHandler:744 - Started o.s.j.s.ServletContextHandler@6256ac4f{/stages,null,AVAILABLE}
2017-09-05 08:54:42 INFO  ContextHandler:744 - Started o.s.j.s.ServletContextHandler@44c79f32{/stages/json,null,AVAILABLE}
2017-09-05 08:54:42 INFO  ContextHandler:744 - Started o.s.j.s.ServletContextHandler@7fcbe147{/stages/stage,null,AVAILABLE}
2017-09-05 08:54:42 INFO  ContextHandler:744 - Started o.s.j.s.ServletContextHandler@235f4c10{/stages/stage/json,null,AVAILABLE}
2017-09-05 08:54:42 INFO  ContextHandler:744 - Started o.s.j.s.ServletContextHandler@743cb8e0{/stages/pool,null,AVAILABLE}
2017-09-05 08:54:42 INFO  ContextHandler:744 - Started o.s.j.s.ServletContextHandler@c7a975a{/stages/pool/json,null,AVAILABLE}
2017-09-05 08:54:42 INFO  ContextHandler:744 - Started o.s.j.s.ServletContextHandler@2c1b9e4b{/storage,null,AVAILABLE}
2017-09-05 08:54:42 INFO  ContextHandler:744 - Started o.s.j.s.ServletContextHandler@757d6814{/storage/json,null,AVAILABLE}
2017-09-05 08:54:42 INFO  ContextHandler:744 - Started o.s.j.s.ServletContextHandler@649725e3{/storage/rdd,null,AVAILABLE}
2017-09-05 08:54:42 INFO  ContextHandler:744 - Started o.s.j.s.ServletContextHandler@3c0fae6c{/storage/rdd/json,null,AVAILABLE}
2017-09-05 08:54:42 INFO  ContextHandler:744 - Started o.s.j.s.ServletContextHandler@4c168660{/environment,null,AVAILABLE}
2017-09-05 08:54:42 INFO  ContextHandler:744 - Started o.s.j.s.ServletContextHandler@52b56a3e{/environment/json,null,AVAILABLE}
2017-09-05 08:54:42 INFO  ContextHandler:744 - Started o.s.j.s.ServletContextHandler@fd0e5b6{/executors,null,AVAILABLE}
2017-09-05 08:54:42 INFO  ContextHandler:744 - Started o.s.j.s.ServletContextHandler@4eed46ee{/executors/json,null,AVAILABLE}
2017-09-05 08:54:42 INFO  ContextHandler:744 - Started o.s.j.s.ServletContextHandler@36b0fcd5{/executors/threadDump,null,AVAILABLE}
2017-09-05 08:54:42 INFO  ContextHandler:744 - Started o.s.j.s.ServletContextHandler@4fad94a7{/executors/threadDump/json,null,AVAILABLE}
2017-09-05 08:54:42 INFO  ContextHandler:744 - Started o.s.j.s.ServletContextHandler@475835b1{/static,null,AVAILABLE}
2017-09-05 08:54:42 INFO  ContextHandler:744 - Started o.s.j.s.ServletContextHandler@6326d182{/,null,AVAILABLE}
2017-09-05 08:54:42 INFO  ContextHandler:744 - Started o.s.j.s.ServletContextHandler@5241cf67{/api,null,AVAILABLE}
2017-09-05 08:54:42 INFO  ContextHandler:744 - Started o.s.j.s.ServletContextHandler@716a7124{/jobs/job/kill,null,AVAILABLE}
2017-09-05 08:54:42 INFO  ContextHandler:744 - Started o.s.j.s.ServletContextHandler@77192705{/stages/stage/kill,null,AVAILABLE}
2017-09-05 08:54:42 INFO  ServerConnector:266 - Started ServerConnector@2b58f754{HTTP/1.1}{0.0.0.0:4040}
2017-09-05 08:54:42 INFO  Server:379 - Started @1741ms
2017-09-05 08:54:42 INFO  Utils:54 - Successfully started service 'SparkUI' on port 4040.
2017-09-05 08:54:42 INFO  SparkUI:54 - Bound SparkUI to 0.0.0.0, and started at http://172.30.60.4:4040
2017-09-05 08:54:42 INFO  SparkContext:54 - Added JAR /opt/spark/examples/jars/spark-examples_2.11-2.1.0-k8s-0.3.1-SNAPSHOT.jar at spark://172.30.60.4:36433/jars/spark-examples_2.11-2.1.0-k8s-0.3.1-SNAPSHOT.jar with timestamp 1504601682607
2017-09-05 08:54:42 WARN  KubernetesClusterManager:66 - The executor's init-container config map was not specified. Executors will therefore not attempt to fetch remote or submitted dependencies.
2017-09-05 08:54:42 WARN  KubernetesClusterManager:66 - The executor's init-container config map key was not specified. Executors will therefore not attempt to fetch remote or submitted dependencies.
2017-09-05 08:54:43 ERROR KubernetesClusterSchedulerBackend:91 - Executor cannot find driver pod.
io.fabric8.kubernetes.client.KubernetesClientException: Failure executing: GET at: https://kubernetes.default.svc/api/v1/namespaces/spark-cluster/pods/spark-pi-1504601675797-driver. Message: Forbidden!Configured service account doesn't have access. Service account may have been revoked. User "system:serviceaccount:spark-cluster:default" cannot get pods in the namespace "spark-cluster"..
    at io.fabric8.kubernetes.client.dsl.base.OperationSupport.requestFailure(OperationSupport.java:332)
    at io.fabric8.kubernetes.client.dsl.base.OperationSupport.assertResponseCode(OperationSupport.java:269)
    at io.fabric8.kubernetes.client.dsl.base.OperationSupport.handleResponse(OperationSupport.java:241)
    at io.fabric8.kubernetes.client.dsl.base.OperationSupport.handleResponse(OperationSupport.java:234)
    at io.fabric8.kubernetes.client.dsl.base.OperationSupport.handleGet(OperationSupport.java:230)
    at io.fabric8.kubernetes.client.dsl.base.BaseOperation.handleGet(BaseOperation.java:745)
    at io.fabric8.kubernetes.client.dsl.base.BaseOperation.get(BaseOperation.java:194)
    at org.apache.spark.scheduler.cluster.kubernetes.KubernetesClusterSchedulerBackend.liftedTree1$1(KubernetesClusterSchedulerBackend.scala:135)
    at org.apache.spark.scheduler.cluster.kubernetes.KubernetesClusterSchedulerBackend.<init>(KubernetesClusterSchedulerBackend.scala:133)
    at org.apache.spark.scheduler.cluster.kubernetes.KubernetesClusterManager.createSchedulerBackend(KubernetesClusterManager.scala:90)
    at org.apache.spark.SparkContext$.org$apache$spark$SparkContext$$createTaskScheduler(SparkContext.scala:2554)
    at org.apache.spark.SparkContext.<init>(SparkContext.scala:501)
    at org.apache.spark.SparkContext$.getOrCreate(SparkContext.scala:2313)
    at org.apache.spark.sql.SparkSession$Builder$$anonfun$6.apply(SparkSession.scala:868)
    at org.apache.spark.sql.SparkSession$Builder$$anonfun$6.apply(SparkSession.scala:860)
    at scala.Option.getOrElse(Option.scala:121)
    at org.apache.spark.sql.SparkSession$Builder.getOrCreate(SparkSession.scala:860)
    at org.apache.spark.examples.SparkPi$.main(SparkPi.scala:31)
    at org.apache.spark.examples.SparkPi.main(SparkPi.scala)
2017-09-05 08:54:43 ERROR SparkContext:91 - Error initializing SparkContext.
org.apache.spark.SparkException: Executor cannot find driver pod
    at org.apache.spark.scheduler.cluster.kubernetes.KubernetesClusterSchedulerBackend.liftedTree1$1(KubernetesClusterSchedulerBackend.scala:139)
    at org.apache.spark.scheduler.cluster.kubernetes.KubernetesClusterSchedulerBackend.<init>(KubernetesClusterSchedulerBackend.scala:133)
    at org.apache.spark.scheduler.cluster.kubernetes.KubernetesClusterManager.createSchedulerBackend(KubernetesClusterManager.scala:90)
    at org.apache.spark.SparkContext$.org$apache$spark$SparkContext$$createTaskScheduler(SparkContext.scala:2554)
    at org.apache.spark.SparkContext.<init>(SparkContext.scala:501)
    at org.apache.spark.SparkContext$.getOrCreate(SparkContext.scala:2313)
    at org.apache.spark.sql.SparkSession$Builder$$anonfun$6.apply(SparkSession.scala:868)
    at org.apache.spark.sql.SparkSession$Builder$$anonfun$6.apply(SparkSession.scala:860)
    at scala.Option.getOrElse(Option.scala:121)
    at org.apache.spark.sql.SparkSession$Builder.getOrCreate(SparkSession.scala:860)
    at org.apache.spark.examples.SparkPi$.main(SparkPi.scala:31)
    at org.apache.spark.examples.SparkPi.main(SparkPi.scala)
Caused by: io.fabric8.kubernetes.client.KubernetesClientException: Failure executing: GET at: https://kubernetes.default.svc/api/v1/namespaces/spark-cluster/pods/spark-pi-1504601675797-driver. Message: Forbidden!Configured service account doesn't have access. Service account may have been revoked. User "system:serviceaccount:spark-cluster:default" cannot get pods in the namespace "spark-cluster"..
    at io.fabric8.kubernetes.client.dsl.base.OperationSupport.requestFailure(OperationSupport.java:332)
    at io.fabric8.kubernetes.client.dsl.base.OperationSupport.assertResponseCode(OperationSupport.java:269)
    at io.fabric8.kubernetes.client.dsl.base.OperationSupport.handleResponse(OperationSupport.java:241)
    at io.fabric8.kubernetes.client.dsl.base.OperationSupport.handleResponse(OperationSupport.java:234)
    at io.fabric8.kubernetes.client.dsl.base.OperationSupport.handleGet(OperationSupport.java:230)
    at io.fabric8.kubernetes.client.dsl.base.BaseOperation.handleGet(BaseOperation.java:745)
    at io.fabric8.kubernetes.client.dsl.base.BaseOperation.get(BaseOperation.java:194)
    at org.apache.spark.scheduler.cluster.kubernetes.KubernetesClusterSchedulerBackend.liftedTree1$1(KubernetesClusterSchedulerBackend.scala:135)
    ... 11 more
2017-09-05 08:54:43 INFO  ServerConnector:306 - Stopped ServerConnector@2b58f754{HTTP/1.1}{0.0.0.0:4040}
2017-09-05 08:54:43 INFO  ContextHandler:865 - Stopped o.s.j.s.ServletContextHandler@77192705{/stages/stage/kill,null,UNAVAILABLE}
2017-09-05 08:54:43 INFO  ContextHandler:865 - Stopped o.s.j.s.ServletContextHandler@716a7124{/jobs/job/kill,null,UNAVAILABLE}
2017-09-05 08:54:43 INFO  ContextHandler:865 - Stopped o.s.j.s.ServletContextHandler@5241cf67{/api,null,UNAVAILABLE}
2017-09-05 08:54:43 INFO  ContextHandler:865 - Stopped o.s.j.s.ServletContextHandler@6326d182{/,null,UNAVAILABLE}
2017-09-05 08:54:43 INFO  ContextHandler:865 - Stopped o.s.j.s.ServletContextHandler@475835b1{/static,null,UNAVAILABLE}
2017-09-05 08:54:43 INFO  ContextHandler:865 - Stopped o.s.j.s.ServletContextHandler@4fad94a7{/executors/threadDump/json,null,UNAVAILABLE}
2017-09-05 08:54:43 INFO  ContextHandler:865 - Stopped o.s.j.s.ServletContextHandler@36b0fcd5{/executors/threadDump,null,UNAVAILABLE}
2017-09-05 08:54:43 INFO  ContextHandler:865 - Stopped o.s.j.s.ServletContextHandler@4eed46ee{/executors/json,null,UNAVAILABLE}
2017-09-05 08:54:43 INFO  ContextHandler:865 - Stopped o.s.j.s.ServletContextHandler@fd0e5b6{/executors,null,UNAVAILABLE}
2017-09-05 08:54:43 INFO  ContextHandler:865 - Stopped o.s.j.s.ServletContextHandler@52b56a3e{/environment/json,null,UNAVAILABLE}
2017-09-05 08:54:43 INFO  ContextHandler:865 - Stopped o.s.j.s.ServletContextHandler@4c168660{/environment,null,UNAVAILABLE}
2017-09-05 08:54:43 INFO  ContextHandler:865 - Stopped o.s.j.s.ServletContextHandler@3c0fae6c{/storage/rdd/json,null,UNAVAILABLE}
2017-09-05 08:54:43 INFO  ContextHandler:865 - Stopped o.s.j.s.ServletContextHandler@649725e3{/storage/rdd,null,UNAVAILABLE}
2017-09-05 08:54:43 INFO  ContextHandler:865 - Stopped o.s.j.s.ServletContextHandler@757d6814{/storage/json,null,UNAVAILABLE}
2017-09-05 08:54:43 INFO  ContextHandler:865 - Stopped o.s.j.s.ServletContextHandler@2c1b9e4b{/storage,null,UNAVAILABLE}
2017-09-05 08:54:43 INFO  ContextHandler:865 - Stopped o.s.j.s.ServletContextHandler@c7a975a{/stages/pool/json,null,UNAVAILABLE}
2017-09-05 08:54:43 INFO  ContextHandler:865 - Stopped o.s.j.s.ServletContextHandler@743cb8e0{/stages/pool,null,UNAVAILABLE}
2017-09-05 08:54:43 INFO  ContextHandler:865 - Stopped o.s.j.s.ServletContextHandler@235f4c10{/stages/stage/json,null,UNAVAILABLE}
2017-09-05 08:54:43 INFO  ContextHandler:865 - Stopped o.s.j.s.ServletContextHandler@7fcbe147{/stages/stage,null,UNAVAILABLE}
2017-09-05 08:54:43 INFO  ContextHandler:865 - Stopped o.s.j.s.ServletContextHandler@44c79f32{/stages/json,null,UNAVAILABLE}
2017-09-05 08:54:43 INFO  ContextHandler:865 - Stopped o.s.j.s.ServletContextHandler@6256ac4f{/stages,null,UNAVAILABLE}
2017-09-05 08:54:43 INFO  ContextHandler:865 - Stopped o.s.j.s.ServletContextHandler@2a1edad4{/jobs/job/json,null,UNAVAILABLE}
2017-09-05 08:54:43 INFO  ContextHandler:865 - Stopped o.s.j.s.ServletContextHandler@4ebea12c{/jobs/job,null,UNAVAILABLE}
2017-09-05 08:54:43 INFO  ContextHandler:865 - Stopped o.s.j.s.ServletContextHandler@1536602f{/jobs/json,null,UNAVAILABLE}
2017-09-05 08:54:43 INFO  ContextHandler:865 - Stopped o.s.j.s.ServletContextHandler@25ddbbbb{/jobs,null,UNAVAILABLE}
2017-09-05 08:54:43 INFO  SparkUI:54 - Stopped Spark web UI at http://172.30.60.4:4040
2017-09-05 08:54:43 INFO  MapOutputTrackerMasterEndpoint:54 - MapOutputTrackerMasterEndpoint stopped!
2017-09-05 08:54:43 INFO  MemoryStore:54 - MemoryStore cleared
2017-09-05 08:54:43 INFO  BlockManager:54 - BlockManager stopped
2017-09-05 08:54:43 INFO  BlockManagerMaster:54 - BlockManagerMaster stopped
2017-09-05 08:54:43 WARN  MetricsSystem:66 - Stopping a MetricsSystem that is not running
2017-09-05 08:54:43 INFO  OutputCommitCoordinator$OutputCommitCoordinatorEndpoint:54 - OutputCommitCoordinator stopped!
2017-09-05 08:54:43 INFO  SparkContext:54 - Successfully stopped SparkContext
Exception in thread "main" org.apache.spark.SparkException: Executor cannot find driver pod
    at org.apache.spark.scheduler.cluster.kubernetes.KubernetesClusterSchedulerBackend.liftedTree1$1(KubernetesClusterSchedulerBackend.scala:139)
    at org.apache.spark.scheduler.cluster.kubernetes.KubernetesClusterSchedulerBackend.<init>(KubernetesClusterSchedulerBackend.scala:133)
    at org.apache.spark.scheduler.cluster.kubernetes.KubernetesClusterManager.createSchedulerBackend(KubernetesClusterManager.scala:90)
    at org.apache.spark.SparkContext$.org$apache$spark$SparkContext$$createTaskScheduler(SparkContext.scala:2554)
    at org.apache.spark.SparkContext.<init>(SparkContext.scala:501)
    at org.apache.spark.SparkContext$.getOrCreate(SparkContext.scala:2313)
    at org.apache.spark.sql.SparkSession$Builder$$anonfun$6.apply(SparkSession.scala:868)
    at org.apache.spark.sql.SparkSession$Builder$$anonfun$6.apply(SparkSession.scala:860)
    at scala.Option.getOrElse(Option.scala:121)
    at org.apache.spark.sql.SparkSession$Builder.getOrCreate(SparkSession.scala:860)
    at org.apache.spark.examples.SparkPi$.main(SparkPi.scala:31)
    at org.apache.spark.examples.SparkPi.main(SparkPi.scala)
Caused by: io.fabric8.kubernetes.client.KubernetesClientException: Failure executing: GET at: https://kubernetes.default.svc/api/v1/namespaces/spark-cluster/pods/spark-pi-1504601675797-driver. Message: Forbidden!Configured service account doesn't have access. Service account may have been revoked. User "system:serviceaccount:spark-cluster:default" cannot get pods in the namespace "spark-cluster"..
    at io.fabric8.kubernetes.client.dsl.base.OperationSupport.requestFailure(OperationSupport.java:332)
    at io.fabric8.kubernetes.client.dsl.base.OperationSupport.assertResponseCode(OperationSupport.java:269)
    at io.fabric8.kubernetes.client.dsl.base.OperationSupport.handleResponse(OperationSupport.java:241)
    at io.fabric8.kubernetes.client.dsl.base.OperationSupport.handleResponse(OperationSupport.java:234)
    at io.fabric8.kubernetes.client.dsl.base.OperationSupport.handleGet(OperationSupport.java:230)
    at io.fabric8.kubernetes.client.dsl.base.BaseOperation.handleGet(BaseOperation.java:745)
    at io.fabric8.kubernetes.client.dsl.base.BaseOperation.get(BaseOperation.java:194)
    at org.apache.spark.scheduler.cluster.kubernetes.KubernetesClusterSchedulerBackend.liftedTree1$1(KubernetesClusterSchedulerBackend.scala:135)
    ... 11 more
2017-09-05 08:54:43 INFO  ShutdownHookManager:54 - Shutdown hook called
2017-09-05 08:54:43 INFO  ShutdownHookManager:54 - Deleting directory /tmp/spark-582de7e3-49eb-43d8-818a-f1536a10031f

Problems

From this log, we can see two problems:

kimoonkim commented 7 years ago

io.fabric8.kubernetes.client.KubernetesClientException: Failure executing: GET at: https://kubernetes.default.svc/api/v1/namespaces/spark-cluster/pods/spark-pi-1504601675797-driver. Message: Forbidden!Configured service account doesn't have access. Service account may have been revoked. User "system:serviceaccount:spark-cluster:default" cannot get pods in the namespace "spark-cluster"..

Looks like this is a service account permission issue. Kubernetes v1.6 enables RBAC by default, and the default service account does not have the necessary permissions for the driver. The driver needs the "edit" privilege to construct the executor pod spec.
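On an RBAC-enabled cluster, one way to grant that privilege is to create a dedicated service account and bind it to the built-in edit ClusterRole, scoped to the job namespace. A minimal sketch; the names spark and spark-edit are illustrative, not from this thread:

```shell
# Create a dedicated service account for the Spark driver
kubectl create serviceaccount spark -n spark-cluster

# Bind the built-in "edit" ClusterRole to it, scoped to the namespace,
# so the driver can create and inspect executor pods
kubectl create rolebinding spark-edit \
  --clusterrole=edit \
  --serviceaccount=spark-cluster:spark \
  -n spark-cluster
```

Using a RoleBinding (rather than a ClusterRoleBinding) keeps the grant confined to the spark-cluster namespace.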

foxish commented 7 years ago

You're right. We should update our docs to include the step to create the right RBAC permissions.

kimoonkim commented 7 years ago

A related fix was done in #451, which supports overriding the service account. The documentation was also updated to show how a new service account can be created and used. But we have not yet cut a release that includes this fix; we may want to do that soon.
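With that override in place, the submission from the top of this issue can point the driver at a non-default service account via a conf flag. A hedged sketch, assuming a service account named spark already exists in the namespace and that the config key matches the one documented for this back-end:

```shell
# Same submission as above, plus the service-account override
./bin/spark-submit \
  --deploy-mode cluster \
  --class org.apache.spark.examples.SparkPi \
  --master k8s://https://172.20.0.113:6443 \
  --kubernetes-namespace spark-cluster \
  --conf spark.kubernetes.authenticate.driver.serviceAccountName=spark \
  --conf spark.executor.instances=5 \
  local:///opt/spark/examples/jars/spark-examples_2.11-2.1.0-k8s-0.3.1-SNAPSHOT.jar
```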

kimoonkim commented 7 years ago

@foxish Should we have new releases? I'll be happy to try out the release process if you need help.

foxish commented 7 years ago

A bugfix release would certainly be a good idea. It would need to be for 2.1 (which our docs still point to) and 2.2. It would also be good to add a couple of statements about this to the documentation.


foxish commented 7 years ago

Which kubectl version are you using? The flag may be available in a newer version.

On Sep 6, 2017 9:03 PM, "Jimmy Song" (@rootsongjc) wrote:

@kimoonkim On Kubernetes 1.6.0:

kubectl -n spark-cluster create clusterrolebinding spark-edit --clusterrole=edit --serviceaccount=spark-cluster:spark
Error: unknown flag: --clusterrole

No such flag --clusterrole.

rootsongjc commented 7 years ago
kubectl version
Client Version: version.Info{Major:"1", Minor:"5", GitVersion:"v1.5.5", GitCommit:"894ff23729bbc0055907dd3a496afb725396adda", GitTreeState:"clean", BuildDate:"2017-03-23T16:14:43Z", GoVersion:"go1.8", Compiler:"gc", Platform:"darwin/amd64"}
Server Version: version.Info{Major:"1", Minor:"6", GitVersion:"v1.6.0", GitCommit:"fff5156092b56e6bd60fff75aad4dc9de6b6ef37", GitTreeState:"clean", BuildDate:"2017-03-28T16:24:30Z", GoVersion:"go1.7.5", Compiler:"gc", Platform:"linux/amd64"}

I deleted the 1.5.5 version and reinstalled kubectl 1.6.0; after that it works.

rootsongjc commented 7 years ago

@foxish @kimoonkim I built a new release on my local machine and submitted a new job with it.

2017-09-08 10:02:19 INFO  SparkContext:54 - Running Spark version 2.1.0-k8s-0.3.1-SNAPSHOT
2017-09-08 10:02:20 WARN  NativeCodeLoader:62 - Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
2017-09-08 10:02:20 ERROR SparkContext:91 - Error initializing SparkContext.
org.apache.spark.SparkException: A master URL must be set in your configuration
    at org.apache.spark.SparkContext.<init>(SparkContext.scala:379)
    at org.apache.spark.SparkContext$.getOrCreate(SparkContext.scala:2313)
    at org.apache.spark.sql.SparkSession$Builder$$anonfun$6.apply(SparkSession.scala:868)
    at org.apache.spark.sql.SparkSession$Builder$$anonfun$6.apply(SparkSession.scala:860)
    at scala.Option.getOrElse(Option.scala:121)
    at org.apache.spark.sql.SparkSession$Builder.getOrCreate(SparkSession.scala:860)
    at org.apache.spark.examples.SparkPi$.main(SparkPi.scala:31)
    at org.apache.spark.examples.SparkPi.main(SparkPi.scala)
2017-09-08 10:02:20 INFO  SparkContext:54 - Successfully stopped SparkContext
Exception in thread "main" org.apache.spark.SparkException: A master URL must be set in your configuration
    at org.apache.spark.SparkContext.<init>(SparkContext.scala:379)
    at org.apache.spark.SparkContext$.getOrCreate(SparkContext.scala:2313)
    at org.apache.spark.sql.SparkSession$Builder$$anonfun$6.apply(SparkSession.scala:868)
    at org.apache.spark.sql.SparkSession$Builder$$anonfun$6.apply(SparkSession.scala:860)
    at scala.Option.getOrElse(Option.scala:121)
    at org.apache.spark.sql.SparkSession$Builder.getOrCreate(SparkSession.scala:860)
    at org.apache.spark.examples.SparkPi$.main(SparkPi.scala:31)
    at org.apache.spark.examples.SparkPi.main(SparkPi.scala)

But I didn't build new images. Is there a doc about how to build the new Docker images?

foxish commented 7 years ago

Only this: https://apache-spark-on-k8s.github.io/userdocs/running-on-kubernetes.html#driver--executor-images. We are working on making it easier to build and push the images. Tracked in https://github.com/apache-spark-on-k8s/spark/issues/485.

rootsongjc commented 7 years ago

@foxish Every time I submit a Spark job to Kubernetes, I need to build a new Docker image with the jar file inside. That's cumbersome; we should figure out an easier way to simplify the process.

foxish commented 7 years ago

@rootsongjc, have you tried using the resource staging server?

rootsongjc commented 7 years ago

@foxish No, I haven't used it.

foxish commented 7 years ago

https://apache-spark-on-k8s.github.io/userdocs/running-on-kubernetes.html#dependency-management
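Per that dependency-management doc, the resource staging server lets spark-submit upload a jar from the submitting machine instead of baking it into the image. A rough sketch; the staging-server URI is a placeholder, and the exact config key should be checked against the linked doc:

```shell
# Submit a jar from the local filesystem; spark-submit uploads it to the
# resource staging server, and the driver/executors fetch it from there,
# so no custom image rebuild is needed per job
./bin/spark-submit \
  --deploy-mode cluster \
  --class org.apache.spark.examples.SparkPi \
  --master k8s://https://172.20.0.113:6443 \
  --kubernetes-namespace spark-cluster \
  --conf spark.kubernetes.resourceStagingServer.uri=http://<staging-server-host>:10000 \
  examples/jars/spark-examples_2.11-2.1.0-k8s-0.3.1-SNAPSHOT.jar
```

Note the jar path is a plain local path here, not a local:// URI inside the container image.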

rootsongjc commented 7 years ago

@foxish This problem is solved. I think we can close the issue.