apache-spark-on-k8s / spark

Apache Spark enhanced with native Kubernetes scheduler back-end: NOTE this repository is being ARCHIVED as all new development for the kubernetes scheduler back-end is now on https://github.com/apache/spark/
https://spark.apache.org/
Apache License 2.0

Verify various error conditions are handled correctly #119

Open · foxish opened this issue 7 years ago

foxish commented 7 years ago

We should have tests (or, for now, manual tests) that verify the following states are handled in an intuitive way, that stray resources are cleaned up, etc., and we should document the common errors.
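
As a starting point for the manual checks, here is a minimal sketch (not code from this repo) of how the driver pod phase and leftover pods could be inspected after a run. It assumes the fabric8 Kubernetes client and relies on the spark-driver and spark-app-id labels that appear in the client status output later in this thread; the object and argument handling are hypothetical.

import io.fabric8.kubernetes.client.DefaultKubernetesClient

import scala.collection.JavaConverters._

// Hypothetical check: report the driver pod phase and any pods that were not
// cleaned up for a given application id.
object ErrorStateCheck {
  def main(args: Array[String]): Unit = {
    val appId = args(0)                       // e.g. spark-pi-1487613731008
    val namespace = if (args.length > 1) args(1) else "default"
    val client = new DefaultKubernetesClient()
    try {
      val driver = client.pods().inNamespace(namespace)
        .withLabel("spark-driver", appId)
        .list().getItems.asScala.headOption
      driver match {
        case Some(pod) =>
          // A driver that died with an exception should surface as Failed, not Succeeded.
          println(s"driver pod ${pod.getMetadata.getName} phase: ${pod.getStatus.getPhase}")
        case None =>
          println(s"no driver pod found for $appId")
      }
      // Anything else still labeled with the app id counts as a stray resource.
      val stray = client.pods().inNamespace(namespace)
        .withLabel("spark-app-id", appId)
        .list().getItems.asScala
        .filterNot(p => driver.exists(_.getMetadata.getName == p.getMetadata.getName))
      if (stray.nonEmpty) {
        println(s"stray pods not cleaned up: ${stray.map(_.getMetadata.getName).mkString(", ")}")
      }
    } finally {
      client.close()
    }
  }
}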

ssuchter commented 7 years ago

Another one that we were testing: executor pod death during execution. Will share a doc about the experiment.

kimoonkim commented 7 years ago

I just encountered this case:

The current behavior is odd: the client reports the "Succeeded" phase, but I was expecting it to report a failure:

2017-02-20 10:02:59 INFO  LoggingPodStatusWatcher:54 - Application status for spark-pi-1487613731008 (phase: Succeeded)
2017-02-20 10:02:59 INFO  LoggingPodStatusWatcher:54 - Phase changed, new state:
     pod name: spark-pi-1487613731008
     namespace: default
     labels: spark-app-id -> spark-pi-1487613731008, spark-app-name -> spark-pi, spark-driver -> spark-pi-1487613731008
     pod uid: b5ead952-f796-11e6-80cf-02f2c310e88c
     creation time: 2017-02-20T18:02:15Z
     service account name: default
         ...
     phase: Succeeded
2017-02-20 10:02:59 INFO  Client:54 - Application spark-pi-1487613731008 finished.

Here's the driver pod log:

2017-02-20 18:02:58 INFO  SparkContext:54 - Successfully stopped SparkContext
Exception in thread "main" org.apache.spark.SparkException: Executor cannot find driver pod
    at org.apache.spark.scheduler.cluster.kubernetes.KubernetesClusterSchedulerBackend.liftedTree1$1(KubernetesClusterSchedulerBackend.scala:88)
    at org.apache.spark.scheduler.cluster.kubernetes.KubernetesClusterSchedulerBackend.<init>(KubernetesClusterSchedulerBackend.scala:82)
    at org.apache.spark.scheduler.cluster.kubernetes.KubernetesClusterManager.createSchedulerBackend(KubernetesClusterManager.scala:34)
    at org.apache.spark.SparkContext$.org$apache$spark$SparkContext$$createTaskScheduler(SparkContext.scala:2710)
    at org.apache.spark.SparkContext.<init>(SparkContext.scala:504)
    at org.apache.spark.SparkContext$.getOrCreate(SparkContext.scala:2462)
    at org.apache.spark.sql.SparkSession$Builder$$anonfun$6.apply(SparkSession.scala:868)
    at org.apache.spark.sql.SparkSession$Builder$$anonfun$6.apply(SparkSession.scala:860)
    at scala.Option.getOrElse(Option.scala:121)
    at org.apache.spark.sql.SparkSession$Builder.getOrCreate(SparkSession.scala:860)
    at org.apache.spark.examples.SparkPi$.main(SparkPi.scala:31)
    at org.apache.spark.examples.SparkPi.main(SparkPi.scala)
Caused by: io.fabric8.kubernetes.client.KubernetesClientException: An error has occurred.
    at io.fabric8.kubernetes.client.KubernetesClientException.launderThrowable(KubernetesClientException.java:61)
    at io.fabric8.kubernetes.client.KubernetesClientException.launderThrowable(KubernetesClientException.java:52)
    at io.fabric8.kubernetes.client.dsl.base.BaseOperation.get(BaseOperation.java:195)
    at org.apache.spark.scheduler.cluster.kubernetes.KubernetesClusterSchedulerBackend.liftedTree1$1(KubernetesClusterSchedulerBackend.scala:84)
    ... 11 more

...

2017-02-20 18:02:58 INFO  ShutdownHookManager:54 - Shutdown hook called
2017-02-20 18:02:58 INFO  ShutdownHookManager:54 - Deleting directory /tmp/spark-a9d44da5-24d6-48ee-9f5e-7a32b3a6263d
2017-02-20 18:02:58 INFO  KubernetesSparkRestServer$KubernetesSubmitRequestServlet:54 - Spark application complete. Shutting down submission server...
2017-02-20 18:02:58 INFO  ServerConnector:306 - Stopped ServerConnector@2ca923bb{HTTP/1.1}{spark-pi-1487613731008:7077}
2017-02-20 18:02:58 INFO  ContextHandler:865 - Stopped o.s.j.s.ServletContextHandler@53f3bdbd{/,null,UNAVAILABLE}
2017-02-20 18:02:58 INFO  KubernetesSparkRestServer$KubernetesSubmitRequestServlet:54 - Received stop command, shutting down the running Spark application...
2017-02-20 18:02:58 INFO  ShutdownHookManager:54 - Shutdown hook called

Looking at the rest server code, I see that the driver exit code (the return value of process.waitFor below) is being ignored. Maybe we should have the rest server exit with the driver's exit code when it is non-zero?

waitForProcessCompleteExecutor.submit(new Runnable {
  override def run(): Unit = {
    process.waitFor  // returns the driver's exit code, which is discarded here
    SERVLET_LOCK.synchronized {
      logInfo("Spark application complete. Shutting down submission server...")
      KubernetesSparkRestServer.this.stop
      shutdownLock.countDown()
    }
  }
})
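
One possible shape for that change, sketched against the snippet above. The placement of the exit call and whether System.exit is safe at that point are assumptions, not what was eventually implemented in the ticket mentioned below.

waitForProcessCompleteExecutor.submit(new Runnable {
  override def run(): Unit = {
    // Capture the driver's exit code instead of discarding it.
    val exitCode = process.waitFor
    SERVLET_LOCK.synchronized {
      logInfo(s"Spark application complete with exit code $exitCode. " +
        "Shutting down submission server...")
      KubernetesSparkRestServer.this.stop
      shutdownLock.countDown()
    }
    // If the driver failed, exit the rest server JVM with the same code so the
    // driver pod ends up in the Failed phase instead of Succeeded.
    if (exitCode != 0) {
      System.exit(exitCode)
    }
  }
})
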
ash211 commented 7 years ago

Just filed a ticket for the exit code propagation. Thanks for finding this, @kimoonkim!