apache-spark-on-k8s / spark

Apache Spark enhanced with native Kubernetes scheduler back-end: NOTE this repository is being ARCHIVED as all new development for the kubernetes scheduler back-end is now on https://github.com/apache/spark/
https://spark.apache.org/
Apache License 2.0

Verify various error conditions are handled correctly #119

Open · foxish opened this issue 7 years ago

foxish commented 7 years ago

We should have tests (or, for now, manual tests) that verify the following states are handled in an intuitive way, that stray resources are cleaned up, etc., and we should document the common errors.
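
As a starting point for the manual checks, here is a minimal sketch (not code from this repo) of how the driver pod phase and leftover pods could be inspected after a run. It assumes the fabric8 Kubernetes client and relies on the spark-driver and spark-app-id labels that appear in the client status output later in this thread; the object and argument handling are hypothetical.

import io.fabric8.kubernetes.client.DefaultKubernetesClient

import scala.collection.JavaConverters._

// Hypothetical check: report the driver pod phase and any pods that were not
// cleaned up for a given application id.
object ErrorStateCheck {
  def main(args: Array[String]): Unit = {
    val appId = args(0)                       // e.g. spark-pi-1487613731008
    val namespace = if (args.length > 1) args(1) else "default"
    val client = new DefaultKubernetesClient()
    try {
      val driver = client.pods().inNamespace(namespace)
        .withLabel("spark-driver", appId)
        .list().getItems.asScala.headOption
      driver match {
        case Some(pod) =>
          // A driver that died with an exception should surface as Failed, not Succeeded.
          println(s"driver pod ${pod.getMetadata.getName} phase: ${pod.getStatus.getPhase}")
        case None =>
          println(s"no driver pod found for $appId")
      }
      // Anything else still labeled with the app id counts as a stray resource.
      val stray = client.pods().inNamespace(namespace)
        .withLabel("spark-app-id", appId)
        .list().getItems.asScala
        .filterNot(p => driver.exists(_.getMetadata.getName == p.getMetadata.getName))
      if (stray.nonEmpty) {
        println(s"stray pods not cleaned up: ${stray.map(_.getMetadata.getName).mkString(", ")}")
      }
    } finally {
      client.close()
    }
  }
}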

ssuchter commented 7 years ago

Another one that we were testing: executor pod death during execution. Will share a doc about the experiment.

kimoonkim commented 7 years ago

I just encountered this case:

The current behavior is odd: the client reports the "Succeeded" phase, but I was expecting it to report a failure:

2017-02-20 10:02:59 INFO  LoggingPodStatusWatcher:54 - Application status for spark-pi-1487613731008 (phase: Succeeded)
2017-02-20 10:02:59 INFO  LoggingPodStatusWatcher:54 - Phase changed, new state:
     pod name: spark-pi-1487613731008
     namespace: default
     labels: spark-app-id -> spark-pi-1487613731008, spark-app-name -> spark-pi, spark-driver -> spark-pi-1487613731008
     pod uid: b5ead952-f796-11e6-80cf-02f2c310e88c
     creation time: 2017-02-20T18:02:15Z
     service account name: default
         ...
     phase: Succeeded
2017-02-20 10:02:59 INFO  Client:54 - Application spark-pi-1487613731008 finished.

Here's the driver pod log:

2017-02-20 18:02:58 INFO  SparkContext:54 - Successfully stopped SparkContext
Exception in thread "main" org.apache.spark.SparkException: Executor cannot find driver pod
    at org.apache.spark.scheduler.cluster.kubernetes.KubernetesClusterSchedulerBackend.liftedTree1$1(KubernetesClusterSchedulerBackend.scala:88)
    at org.apache.spark.scheduler.cluster.kubernetes.KubernetesClusterSchedulerBackend.<init>(KubernetesClusterSchedulerBackend.scala:82)
    at org.apache.spark.scheduler.cluster.kubernetes.KubernetesClusterManager.createSchedulerBackend(KubernetesClusterManager.scala:34)
    at org.apache.spark.SparkContext$.org$apache$spark$SparkContext$$createTaskScheduler(SparkContext.scala:2710)
    at org.apache.spark.SparkContext.<init>(SparkContext.scala:504)
    at org.apache.spark.SparkContext$.getOrCreate(SparkContext.scala:2462)
    at org.apache.spark.sql.SparkSession$Builder$$anonfun$6.apply(SparkSession.scala:868)
    at org.apache.spark.sql.SparkSession$Builder$$anonfun$6.apply(SparkSession.scala:860)
    at scala.Option.getOrElse(Option.scala:121)
    at org.apache.spark.sql.SparkSession$Builder.getOrCreate(SparkSession.scala:860)
    at org.apache.spark.examples.SparkPi$.main(SparkPi.scala:31)
    at org.apache.spark.examples.SparkPi.main(SparkPi.scala)
Caused by: io.fabric8.kubernetes.client.KubernetesClientException: An error has occurred.
    at io.fabric8.kubernetes.client.KubernetesClientException.launderThrowable(KubernetesClientException.java:61)
    at io.fabric8.kubernetes.client.KubernetesClientException.launderThrowable(KubernetesClientException.java:52)
    at io.fabric8.kubernetes.client.dsl.base.BaseOperation.get(BaseOperation.java:195)
    at org.apache.spark.scheduler.cluster.kubernetes.KubernetesClusterSchedulerBackend.liftedTree1$1(KubernetesClusterSchedulerBackend.scala:84)
    ... 11 more

...

2017-02-20 18:02:58 INFO  ShutdownHookManager:54 - Shutdown hook called
2017-02-20 18:02:58 INFO  ShutdownHookManager:54 - Deleting directory /tmp/spark-a9d44da5-24d6-48ee-9f5e-7a32b3a6263d
2017-02-20 18:02:58 INFO  KubernetesSparkRestServer$KubernetesSubmitRequestServlet:54 - Spark application complete. Shutting down submission server...
2017-02-20 18:02:58 INFO  ServerConnector:306 - Stopped ServerConnector@2ca923bb{HTTP/1.1}{spark-pi-1487613731008:7077}
2017-02-20 18:02:58 INFO  ContextHandler:865 - Stopped o.s.j.s.ServletContextHandler@53f3bdbd{/,null,UNAVAILABLE}
2017-02-20 18:02:58 INFO  KubernetesSparkRestServer$KubernetesSubmitRequestServlet:54 - Received stop command, shutting down the running Spark application...
2017-02-20 18:02:58 INFO  ShutdownHookManager:54 - Shutdown hook called

Looking at the rest server code, I see that the driver exit code (the return value of process.waitFor below) is being ignored. Maybe we should have the rest server exit with the driver's exit code when it is non-zero?

waitForProcessCompleteExecutor.submit(new Runnable {
  override def run(): Unit = {
    process.waitFor  // returns the driver's exit code, which is discarded here
    SERVLET_LOCK.synchronized {
      logInfo("Spark application complete. Shutting down submission server...")
      KubernetesSparkRestServer.this.stop
      shutdownLock.countDown()
    }
  }
})
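
One possible shape for that change, sketched against the snippet above. The placement of the exit call and whether System.exit is safe at that point are assumptions, not what was eventually implemented in the ticket mentioned below.

waitForProcessCompleteExecutor.submit(new Runnable {
  override def run(): Unit = {
    // Capture the driver's exit code instead of discarding it.
    val exitCode = process.waitFor
    SERVLET_LOCK.synchronized {
      logInfo(s"Spark application complete with exit code $exitCode. " +
        "Shutting down submission server...")
      KubernetesSparkRestServer.this.stop
      shutdownLock.countDown()
    }
    // If the driver failed, exit the rest server JVM with the same code so the
    // driver pod ends up in the Failed phase instead of Succeeded.
    if (exitCode != 0) {
      System.exit(exitCode)
    }
  }
})
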
ash211 commented 7 years ago

Just filed a ticket for the exit code propagation. Thanks for finding this, @kimoonkim!