apache-spark-on-k8s / spark

Apache Spark enhanced with native Kubernetes scheduler back-end: NOTE this repository is being ARCHIVED as all new development for the kubernetes scheduler back-end is now on https://github.com/apache/spark/
https://spark.apache.org/
Apache License 2.0

Possible incorrect behavior of KubernetesClusterSchedulerBackend.doKillExecutors #429

Open satybald opened 7 years ago

satybald commented 7 years ago

I'm new to the project and might be wrong, but according to CoarseGrainedSchedulerBackend, doKillExecutors should return whether the kill request is acknowledged. However, KubernetesClusterSchedulerBackend.doKillExecutors always returns true, even if the pods do not exist. This leads me to think that doKillExecutors is not adhering to the parent class's behavior.
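
For reference, here is a heavily simplified sketch of the two methods I mean. The class and map names mirror the Spark code, but the bodies and signatures are illustrative only (the real signature may be Boolean or Future[Boolean] depending on the Spark version):

```scala
import scala.collection.mutable
import scala.concurrent.Future

// Hypothetical, heavily simplified rendering of the two methods under discussion.
abstract class CoarseGrainedSchedulerBackendSketch {
  // Contract: return whether the kill request is acknowledged by the cluster manager.
  protected def doKillExecutors(executorIds: Seq[String]): Future[Boolean]
}

class KubernetesBackendSketch(
    runningExecutorsToPods: mutable.Map[String, String], // executorId -> pod name
    deletePod: String => Unit                            // stand-in for the k8s delete call
) extends CoarseGrainedSchedulerBackendSketch {

  // Reported behavior: unknown executor ids are silently skipped, yet the method
  // still acknowledges the request by returning true for any input.
  override protected def doKillExecutors(executorIds: Seq[String]): Future[Boolean] = {
    executorIds.foreach { id =>
      runningExecutorsToPods.remove(id).foreach(deletePod) // no-op if the pod is unknown
    }
    Future.successful(true)
  }
}
```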

satybald commented 7 years ago

What do you folks think? cc @varunkatta @kimoonkim @mccheah

erikerlandson commented 7 years ago

As I understand it, the kill operation is a request to the cluster API, which "succeeds" immediately. Knowledge of whether the corresponding pod eventually gets taken down is delayed, and not something we'd want to block on in doKillExecutors.
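
For illustration, a sketch of what that request looks like against the fabric8 client; the object and method names below are assumptions for this example, not code from the backend:

```scala
import io.fabric8.kubernetes.client.KubernetesClient

object PodDeletionSketch {
  // Sketch only: submits the pod deletion to the API server and returns right away.
  // It does not block until the pod is actually gone; that outcome shows up later
  // via watches/pod status, outside of doKillExecutors.
  def requestPodDeletion(client: KubernetesClient, namespace: String, podName: String): Unit = {
    client.pods().inNamespace(namespace).withName(podName).delete()
  }
}
```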

satybald commented 7 years ago

What I see in the code is that doKillExecutors is used by killExecutors to adjust the number of executors.

I'm thinking of the case of a list like List(nonExistingPod1, nonExistingPod2, ..., nonExistingPodN): doKillExecutors always ACKs the kill, which leads to a wrong number of executors.

satybald commented 7 years ago

But if folks are sure that this is the proper behavior, feel free to close the issue.

varunkatta commented 7 years ago

Per my understanding, KubernetesClusterSchedulerBackend.doKillExecutors returns true if the kill request has been acked and performed, and in this implementation that is always the case, so returning true seems to be the right behavior. Whether the request is legitimate, or how many pods actually get killed, is not of concern and is not captured in the return value, per the method contract.

CoarseGrainedSchedulerBackend.killExecutors first adjusts the number of executors and then attempts to kill them. The adjustment needs to happen before the kill attempts; that is the higher-order bit. The method returns the list of executors it tried to kill if it gets an ack from the cluster backend that it attempted to kill the passed-in executors; otherwise it returns an empty list.
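
In sketch form (a hypothetical condensation; the real CoarseGrainedSchedulerBackend.killExecutors also deals with pending executors, replacement, and locality):

```scala
import scala.concurrent.{ExecutionContext, Future}

object KillExecutorsFlowSketch {
  // Condensed flow: the target executor count is adjusted first, then the backend is
  // asked to kill; the ack from doKillExecutors only decides whether the caller gets
  // back the list of executors a kill was attempted for, or an empty list.
  def killExecutors(
      executorIds: Seq[String],
      adjustTargetNumExecutors: Seq[String] => Unit,
      doKillExecutors: Seq[String] => Future[Boolean]
  )(implicit ec: ExecutionContext): Future[Seq[String]] = {
    adjustTargetNumExecutors(executorIds)           // 1. bookkeeping before the kill attempt
    doKillExecutors(executorIds).map { acknowledged =>
      if (acknowledged) executorIds else Seq.empty  // 2. empty list when the backend did not ack
    }
  }
}
```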

KubernetesClusterSchedulerBackend.doKillExecutors return behavior seems OK to me.

@mccheah is the original author; please chime in if I am totally off the mark here.

satybald commented 7 years ago

Thanks, @varunkatta and @erikerlandson, for the valuable comments. I still think that if the function returns a boolean, it should not always return true; otherwise the proper return type is void. Of course, there's always the case of bad API design :)

I've looked into the implementations of the other cluster backends, Mesos and Standalone. I found that doKillExecutors in those implementations can return false, for example when the backing client or scheduler driver is not initialized.

I think it would be better to return false if the runningExecutorsToPods HashMap is empty, or if kubernetesClient is closed or null, to properly adhere to the CoarseGrainedSchedulerBackend contract.
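
Roughly, something like this hypothetical variant (the kubernetesClientUsable flag below stands in for whatever null/closed check is appropriate; I'm not sure what the right check on the client is):

```scala
import scala.collection.mutable
import scala.concurrent.Future

object DefensiveDoKillExecutorsSketch {
  // Hypothetical defensive variant: refuse to ack when there is nothing to kill or
  // the client reference is unusable, instead of returning true unconditionally.
  def doKillExecutors(
      executorIds: Seq[String],
      runningExecutorsToPods: mutable.Map[String, String],
      kubernetesClientUsable: Boolean,              // stand-in for a null/closed check
      deletePod: String => Unit
  ): Future[Boolean] = {
    if (!kubernetesClientUsable || runningExecutorsToPods.isEmpty) {
      Future.successful(false)
    } else {
      executorIds.foreach { id =>
        runningExecutorsToPods.remove(id).foreach(deletePod)
      }
      Future.successful(true)
    }
  }
}
```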

erikerlandson commented 7 years ago

I think returning false in the case of closed/missing kubernetesClient looks reasonable.

An empty runningExecutorsToPods seems like just a special case of one name missing from the table (i.e., all of the names are missing), and that is already handled in the code.

varunkatta commented 7 years ago

In the case of the Mesos and Standalone backends, it is theoretically possible that the schedulerDriver object in the former and the client object in the latter may not be initialized when doKillExecutors is called, as they are initialized in the start method. Practically, though, it is not clear to me what control flow would trigger such a behavior, i.e. doKillExecutors being called before start (I only took a quick look). In the case of KubernetesClusterSchedulerBackend, kubernetesClient is passed as a constructor argument and is expected to be initialized before the backend gets hold of it (or the SparkContext/Scheduler doesn't even start).
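
The pattern being referenced is roughly the following (a paraphrase of the idea, not the literal Standalone/Mesos source):

```scala
import scala.concurrent.Future

object NotYetStartedGuardSketch {
  // Paraphrased guard in the spirit of the Standalone/Mesos backends: the handle
  // that performs the kill is set in start(), so refuse to acknowledge the request
  // while it is still null.
  def doKillExecutors(
      executorIds: Seq[String],
      clientOrDriver: AnyRef,                  // initialized in start(); may be null before that
      kill: Seq[String] => Boolean
  ): Future[Boolean] = {
    Option(clientOrDriver) match {
      case Some(_) => Future.successful(kill(executorIds))
      case None    => Future.successful(false) // not started yet: do not ack
    }
  }
}
```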

KubernetesClusterSchedulerBackend#stop() closes the kubernetesClient. So for doKillExecutors to operate on a closed kubernetesClient, either something external to KubernetesClusterSchedulerBackend is closing the client before doKillExecutors is called, or stop() is being called before doKillExecutors.

We could have defensive code blocks to address the control flow of calling doKillExecutors with an uninitialized or closed kubernetesClient. I don't feel strongly about it, though. It is not clear to me whether such control flows are possible; my hunch is we could leave the method as-is. It is also not clear to me which API methods on kubernetesClient one would use to check whether the client is initialized/valid/closed. I might have overlooked a few things, and it is entirely possible that having a defensive code block here is the right thing to do. If there are valid reasons, I am all for it.

Regarding returning false in the case of an empty runningExecutorsToPods, I concur with @erikerlandson.