apache-spark-on-k8s / spark

Apache Spark enhanced with native Kubernetes scheduler back-end. NOTE: this repository is being ARCHIVED, as all new development for the Kubernetes scheduler back-end now happens on https://github.com/apache/spark/
https://spark.apache.org/
Apache License 2.0

Filter out unschedulable nodes for the driver endpoint more rigorously #73

Open mccheah opened 7 years ago

mccheah commented 7 years ago

#69 filters the nodes to contact for submitting the application down to those which are not marked as "unschedulable". However, we have found that in a few cases this condition alone may not filter out all the nodes that are not running the kube-proxy. We should look into the full set of conditions to check there.
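For illustration only (the project's actual filter is the Scala code referenced later in this thread), a stricter filter could combine the `spec.unschedulable` flag with the node's `Ready` status condition. This sketch operates on plain dicts shaped like the Kubernetes v1 Node API; the function name and dict-based representation are assumptions for the example:

```python
def schedulable_nodes(nodes):
    """Keep only nodes that are not marked unschedulable AND report Ready=True.

    `nodes` is a list of dicts shaped like Kubernetes v1 Node objects.
    This is a sketch of the stricter filtering discussed above, not the
    project's implementation.
    """
    result = []
    for node in nodes:
        # Drop nodes explicitly cordoned via spec.unschedulable.
        if node.get("spec", {}).get("unschedulable", False):
            continue
        # Drop nodes whose Ready condition is absent or not "True".
        conditions = node.get("status", {}).get("conditions", [])
        ready = any(c.get("type") == "Ready" and c.get("status") == "True"
                    for c in conditions)
        if ready:
            result.append(node)
    return result
```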

foxish commented 7 years ago

kubeadm is still alpha. Using the annotation to mark master nodes as unschedulable is subject to change. I'll wait for it to stabilize before adding any additional filtering.

ash211 commented 7 years ago

@foxish is there harm in filtering out the various draft versions of "unschedulable" that are being used -- at least the one kubeadm uses at present? I'd like to aim for Spark to work well out of the box on almost all k8s clusters regardless of the various differences in how master node tainting works at present.

foxish commented 7 years ago

> I'd like to aim for Spark to work well out of the box on almost all k8s clusters regardless of the various differences in how master node tainting works at present.

Makes sense. I'm waiting for one discussion (https://github.com/kubernetes/kubernetes/pull/39112) to complete. Following that, and before our alpha, I'll capture all the various ways to detect a master node.

mccheah commented 7 years ago

I think this is taken care of? Anything else to add here @foxish ?

foxish commented 7 years ago

@mccheah Still some work to be done here, around finding the right annotations allowing us to filter out master nodes for clusters brought up using kops, minikube, kubeadm, etc. There's a lot of discussion around this area. I'll have a PR out to fix this in a day or two.

mccheah commented 7 years ago

Ah yes, mixed it up with the filtering of IP types from the nodes. Yes, this does seem beneficial to sort out for the first pass.

ash211 commented 7 years ago

@foxish maybe for the alpha we can provide docs for how to set the right taint on the master node that the fabric8 client looks for?

Even if the value isn't standardized right now, we're relying on one in particular so should make that clear.

I ask this also with self-interest because I've been waiting through failovers in my testing ever since we introduced this change and I'd love to set the taint right on my dev cluster.

foxish commented 7 years ago

Based on what I'm seeing in https://github.com/kubernetes/kubernetes/pull/31647, the right annotation to set on the node is:

  annotations:
    scheduler.alpha.kubernetes.io/taints: '[{"key":"node.alpha.kubernetes.io/ismaster","effect":"NoSchedule"}]'

It is still experimental and subject to change, but will likely unify in the future.
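Assuming a node named `my-master-node` (a hypothetical name for the example), the annotation above could be applied with `kubectl annotate`; this is a sketch of the manual workaround, not project-endorsed tooling:

```shell
# Hypothetical node name; the annotation value is the experimental
# alpha taints annotation quoted above, and is subject to change.
kubectl annotate node my-master-node \
  'scheduler.alpha.kubernetes.io/taints=[{"key":"node.alpha.kubernetes.io/ismaster","effect":"NoSchedule"}]'
```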

ash211 commented 7 years ago

Is that the key that fabric8 looks for in this call? https://github.com/apache-spark-on-k8s/spark/blob/k8s-support-alternate-incremental/resource-managers/kubernetes/core/src/main/scala/org/apache/spark/deploy/rest/kubernetes/NodePortUrisDriverServiceManager.scala#L52

That's the unschedulable check I'm hoping to provide docs for.

foxish commented 7 years ago

The old way of marking a node as unschedulable was:

(1)

apiVersion: v1
kind: Node
...
spec:
  unschedulable: true

The new method of achieving the above is:

(2)

apiVersion: v1
kind: Node
metadata:
  annotations:
    scheduler.alpha.kubernetes.io/taints: '[{"key":"node.alpha.kubernetes.io/ismaster","effect":"NoSchedule","timeAdded":null}]'

The fabric8 client supports (1) through https://github.com/apache-spark-on-k8s/spark/blob/k8s-support-alternate-incremental/resource-managers/kubernetes/core/src/main/scala/org/apache/spark/deploy/rest/kubernetes/NodePortUrisDriverServiceManager.scala#L52. It has no notion of, or support for, (2).

What I think we should do is:

  1. I'll send a PR which makes our spark-on-k8s code respect (2) which should cover almost all kinds of cluster setups.
  2. In the documentation, we can suggest that IF people face issues because the correct taint doesn't exist on their master node, and spec.unschedulable is unset, users can manually mark their nodes as unschedulable by editing the node spec to match (1).

Does that make sense?
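The proposed check, respecting both (1) `spec.unschedulable` and (2) the alpha taints annotation with a `NoSchedule` effect, could look roughly like this. The function name and dict-based node representation are assumptions for illustration; the annotation key is the one quoted above:

```python
import json

# The experimental annotation key quoted earlier in this thread.
TAINTS_ANNOTATION = "scheduler.alpha.kubernetes.io/taints"

def is_schedulable(node):
    """Sketch of the proposed combined check, not the eventual PR.

    A node (a dict shaped like a Kubernetes v1 Node) is unschedulable if
    either (1) spec.unschedulable is set, or (2) its taints annotation
    contains a taint with effect NoSchedule.
    """
    # (1) The old-style field.
    if node.get("spec", {}).get("unschedulable", False):
        return False
    # (2) The annotation-based taints.
    raw = node.get("metadata", {}).get("annotations", {}).get(TAINTS_ANNOTATION)
    if raw:
        taints = json.loads(raw)
        if any(t.get("effect") == "NoSchedule" for t in taints):
            return False
    return True
```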