foxish opened this issue 7 years ago
Another one that we were testing: executor pod death during execution. Will share a doc about the experiment.
I just encountered this case.
The current behavior is odd: the client reports a "Succeeded" phase where I was expecting it to report a failure:
2017-02-20 10:02:59 INFO LoggingPodStatusWatcher:54 - Application status for spark-pi-1487613731008 (phase: Succeeded)
2017-02-20 10:02:59 INFO LoggingPodStatusWatcher:54 - Phase changed, new state:
pod name: spark-pi-1487613731008
namespace: default
labels: spark-app-id -> spark-pi-1487613731008, spark-app-name -> spark-pi, spark-driver -> spark-pi-1487613731008
pod uid: b5ead952-f796-11e6-80cf-02f2c310e88c
creation time: 2017-02-20T18:02:15Z
service account name: default
...
phase: Succeeded
2017-02-20 10:02:59 INFO Client:54 - Application spark-pi-1487613731008 finished.
Here's the driver pod log:
2017-02-20 18:02:58 INFO SparkContext:54 - Successfully stopped SparkContext
Exception in thread "main" org.apache.spark.SparkException: Executor cannot find driver pod
at org.apache.spark.scheduler.cluster.kubernetes.KubernetesClusterSchedulerBackend.liftedTree1$1(KubernetesClusterSchedulerBackend.scala:88)
at org.apache.spark.scheduler.cluster.kubernetes.KubernetesClusterSchedulerBackend.<init>(KubernetesClusterSchedulerBackend.scala:82)
at org.apache.spark.scheduler.cluster.kubernetes.KubernetesClusterManager.createSchedulerBackend(KubernetesClusterManager.scala:34)
at org.apache.spark.SparkContext$.org$apache$spark$SparkContext$$createTaskScheduler(SparkContext.scala:2710)
at org.apache.spark.SparkContext.<init>(SparkContext.scala:504)
at org.apache.spark.SparkContext$.getOrCreate(SparkContext.scala:2462)
at org.apache.spark.sql.SparkSession$Builder$$anonfun$6.apply(SparkSession.scala:868)
at org.apache.spark.sql.SparkSession$Builder$$anonfun$6.apply(SparkSession.scala:860)
at scala.Option.getOrElse(Option.scala:121)
at org.apache.spark.sql.SparkSession$Builder.getOrCreate(SparkSession.scala:860)
at org.apache.spark.examples.SparkPi$.main(SparkPi.scala:31)
at org.apache.spark.examples.SparkPi.main(SparkPi.scala)
Caused by: io.fabric8.kubernetes.client.KubernetesClientException: An error has occurred.
at io.fabric8.kubernetes.client.KubernetesClientException.launderThrowable(KubernetesClientException.java:61)
at io.fabric8.kubernetes.client.KubernetesClientException.launderThrowable(KubernetesClientException.java:52)
at io.fabric8.kubernetes.client.dsl.base.BaseOperation.get(BaseOperation.java:195)
at org.apache.spark.scheduler.cluster.kubernetes.KubernetesClusterSchedulerBackend.liftedTree1$1(KubernetesClusterSchedulerBackend.scala:84)
... 11 more
...
2017-02-20 18:02:58 INFO ShutdownHookManager:54 - Shutdown hook called
2017-02-20 18:02:58 INFO ShutdownHookManager:54 - Deleting directory /tmp/spark-a9d44da5-24d6-48ee-9f5e-7a32b3a6263d
2017-02-20 18:02:58 INFO KubernetesSparkRestServer$KubernetesSubmitRequestServlet:54 - Spark application complete. Shutting down submission server...
2017-02-20 18:02:58 INFO ServerConnector:306 - Stopped ServerConnector@2ca923bb{HTTP/1.1}{spark-pi-1487613731008:7077}
2017-02-20 18:02:58 INFO ContextHandler:865 - Stopped o.s.j.s.ServletContextHandler@53f3bdbd{/,null,UNAVAILABLE}
2017-02-20 18:02:58 INFO KubernetesSparkRestServer$KubernetesSubmitRequestServlet:54 - Received stop command, shutting down the running Spark application...
2017-02-20 18:02:58 INFO ShutdownHookManager:54 - Shutdown hook called
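The client-side watcher above is just echoing the pod phase, and Kubernetes derives that phase from the containers' exit codes: for a pod that will not be restarted, all containers exiting 0 yields Succeeded, while a non-zero exit yields Failed. A quick way to cross-check is to read the terminated exit code straight from the pod status. Here is a hypothetical Scala snippet using the fabric8 client that the scheduler backend already depends on (pod name and namespace taken from the logs above; it assumes the container has already terminated):

import io.fabric8.kubernetes.client.DefaultKubernetesClient

object CheckDriverExitCode {
  def main(args: Array[String]): Unit = {
    val client = new DefaultKubernetesClient()
    try {
      val pod = client.pods()
        .inNamespace("default")
        .withName("spark-pi-1487613731008")
        .get()
      // The container here is the submission server, which exits 0 even
      // though the driver subprocess failed, hence the "Succeeded" phase.
      val exitCode = pod.getStatus.getContainerStatuses.get(0)
        .getState.getTerminated.getExitCode
      println(s"phase=${pod.getStatus.getPhase}, containerExitCode=$exitCode")
    } finally {
      client.close()
    }
  }
}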
Looking at the rest server code, I see that the driver exit code (the return value of process.waitFor below) is being ignored. Maybe we should have the rest server exit with the driver's exit code when it is non-zero?
waitForProcessCompleteExecutor.submit(new Runnable {
  override def run(): Unit = {
    process.waitFor
    SERVLET_LOCK.synchronized {
      logInfo("Spark application complete. Shutting down submission server...")
      KubernetesSparkRestServer.this.stop
      shutdownLock.countDown()
    }
  }
})
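For concreteness, here is a minimal sketch of what that could look like, assuming the surrounding fields (process, SERVLET_LOCK, shutdownLock, waitForProcessCompleteExecutor) keep their current shape; where exactly the server should exit is just an illustration:

waitForProcessCompleteExecutor.submit(new Runnable {
  override def run(): Unit = {
    // Keep the driver's exit code instead of discarding it.
    val exitCode = process.waitFor
    SERVLET_LOCK.synchronized {
      logInfo(s"Spark application complete with exit code $exitCode. Shutting down submission server...")
      KubernetesSparkRestServer.this.stop
      shutdownLock.countDown()
    }
    // Exiting non-zero makes the driver pod (and hence the client-side
    // watcher) report Failed instead of Succeeded when the driver fails.
    if (exitCode != 0) {
      System.exit(exitCode)
    }
  }
})

With something along these lines, the pod phase in the first log would come out as Failed, matching the driver log.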
Just filed a ticket for the exit code propagation -- thanks for finding this, @kimoonkim!
We should have tests (or, for now, manual tests) that verify the following states are handled in an intuitive way and that stray resources are cleaned up, and we should document the common errors.