apache-spark-on-k8s / spark

Apache Spark enhanced with native Kubernetes scheduler back-end: NOTE this repository is being ARCHIVED as all new development for the kubernetes scheduler back-end is now on https://github.com/apache/spark/
https://spark.apache.org/
Apache License 2.0

Better logging in driver for OOMKilled pods #480

Open ash211 opened 7 years ago

ash211 commented 7 years ago

See this in driver logs:

INFO  [2017-09-06T02:48:34.399Z] org.apache.spark.scheduler.cluster.kubernetes.KubernetesClusterSchedulerBackend$KubernetesDriverEndpoint: Disabling executor 2.
INFO  [2017-09-06T02:48:34.404Z] org.apache.spark.scheduler.DAGScheduler: Executor lost: 2 (epoch 1)
INFO  [2017-09-06T02:48:34.404Z] org.apache.spark.storage.BlockManagerMasterEndpoint: Trying to remove executor 2 from BlockManagerMaster.
INFO  [2017-09-06T02:48:34.405Z] org.apache.spark.storage.BlockManagerMasterEndpoint: Removing block manager BlockManagerId(2, 10.255.73.13, 38341, None)
INFO  [2017-09-06T02:48:34.405Z] org.apache.spark.storage.BlockManagerMaster: Removed 2 successfully in removeExecutor
INFO  [2017-09-06T02:48:34.406Z] org.apache.spark.scheduler.DAGScheduler: Shuffle files lost for executor: 2 (epoch 1)
ERROR [2017-09-06T02:48:45.917Z] org.apache.spark.scheduler.cluster.kubernetes.KubernetesTaskSchedulerImpl: Lost executor 2 on 10.255.73.13: Executor lost for unknown reasons

and this from kubectl get pods (redacted a bit):

[aash@ip-10-0-5-201 ~]$ kubectl get pods ...
NAMESPACE                                               NAME                          READY     STATUS      RESTARTS   AGE
my-app-namespace                                        my-app-1504664813055-driver   1/1       Running     0          22m
my-app-namespace                                        my-app-1504664813055-exec-1   1/1       Running     0          21m
my-app-namespace                                        my-app-1504664813055-exec-2   0/1       OOMKilled   0          21m
my-app-namespace                                        my-app-1504664813055-exec-3   1/1       Running     0          21m
my-resource-staging-server-namespace                    resource-staging-server       1/1       Running     0          4d
[aash@ip-10-0-5-201 ~]$

The line in the driver logs that says "Executor lost for unknown reasons" should instead say something about the pod being OOMKilled.
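
A minimal sketch of what a more informative lookup could look like, assuming the fabric8 Kubernetes client this backend already uses (the helper name and wiring are illustrative, not actual driver code):

import io.fabric8.kubernetes.client.KubernetesClient
import scala.collection.JavaConverters._

// Hypothetical helper: given an executor pod, return a human-readable loss
// reason such as "OOMKilled" if any of its containers terminated abnormally.
def podLossReason(client: KubernetesClient, namespace: String, podName: String): Option[String] = {
  Option(client.pods().inNamespace(namespace).withName(podName).get())
    .toSeq
    .flatMap(pod => pod.getStatus.getContainerStatuses.asScala)
    .flatMap(cs => Option(cs.getState.getTerminated))
    .flatMap(t => Option(t.getReason))   // e.g. "OOMKilled", "Error"
    .headOption
}

With something along these lines, the KubernetesTaskSchedulerImpl message could read "Lost executor 2 on 10.255.73.13: OOMKilled" instead of "Executor lost for unknown reasons".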

kimoonkim commented 7 years ago

Hmm. Maybe @varunkatta knows? I thought the driver watches the executor pods and tries to figure out loss reasons.
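
For reference, that watch could look roughly like the sketch below (fabric8-style, with an illustrative label selector and an onClose signature from the 3.x/4.x Watcher interface; not the actual backend code):

import io.fabric8.kubernetes.api.model.Pod
import io.fabric8.kubernetes.client.{KubernetesClient, KubernetesClientException, Watcher}
import scala.collection.JavaConverters._

def watchExecutorPods(client: KubernetesClient, namespace: String): Unit = {
  client.pods().inNamespace(namespace)
    .withLabel("spark-role", "executor")   // label selector is illustrative
    .watch(new Watcher[Pod] {
      override def eventReceived(action: Watcher.Action, pod: Pod): Unit = {
        // On MODIFIED/DELETED events, record any terminated-container reason
        // (e.g. "OOMKilled") so it can be reported when the executor is disabled.
        val reasons = pod.getStatus.getContainerStatuses.asScala
          .flatMap(cs => Option(cs.getState.getTerminated))
          .flatMap(t => Option(t.getReason))
        if (reasons.nonEmpty) {
          println(s"${pod.getMetadata.getName} lost: ${reasons.mkString(", ")}")
        }
      }
      override def onClose(cause: KubernetesClientException): Unit = ()
    })
}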

ash211 commented 7 years ago

This should be reproducible by setting spark.kubernetes.executor.memoryOverhead=0 and running a shuffle.
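
A rough repro sketch of that setup (app name, partition count, and record sizes are arbitrary):

import org.apache.spark.{SparkConf, SparkContext}

// With no memory overhead headroom, executor containers are likely to exceed
// their pod memory limit during the shuffle and get OOMKilled.
val conf = new SparkConf()
  .setAppName("oomkill-repro")
  .set("spark.kubernetes.executor.memoryOverhead", "0")
val sc = new SparkContext(conf)

sc.parallelize(1 to 10000000, 20)
  .map(i => (i % 1000, Array.fill(100)(i.toLong)))
  .groupByKey()   // forces a shuffle
  .count()

sc.stop()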

varunkatta commented 7 years ago

The driver couldn't learn the loss reason in this case (maybe because of missed K8s events?). I will try to repro this and share what we discover in the process.