dask / dask-kubernetes

Native Kubernetes integration for Dask
https://kubernetes.dask.org
BSD 3-Clause "New" or "Revised" License

Add default toleration to worker pod #109

Closed dsludwig closed 5 years ago

dsludwig commented 5 years ago

I have set up an instance of dask-kubernetes that uses taints to prevent kube-system components from being scheduled onto the nodes. This really helps when using Google's autoscaler, since empty nodes are easily batch-deleted.

The configuration that I needed to add is here: https://github.com/pangeo-data/atmos.pangeo.io-deploy/blob/0ddab66e1a37f9f2a9ac85c8fdad7301b9e3882d/deployments/atmos.pangeo.io/image/.dask/config.yaml#L36-L44

      tolerations:
      - key: "k8s.dask.org_dedicated"
        operator: "Equal"
        value: "worker"
        effect: "NoSchedule"
      - key: "k8s.dask.org/dedicated"
        operator: "Equal"
        value: "worker"
        effect: "NoSchedule"

It might be nice to add a default toleration, so that we standardize the taints recommended for use with dask-kubernetes.
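
For reference, the matching taint on the node side would look something like the sketch below on the Node object (illustrative only; in practice the taint is usually applied through the cloud provider's node-pool settings or kubectl taint rather than by editing the Node directly):

    # Illustrative node-side taint that the tolerations above match.
    # Usually set via the cloud provider's node-pool configuration or
    # kubectl taint, not by editing the Node object by hand.
    spec:
      taints:
      - key: "k8s.dask.org/dedicated"
        value: "worker"
        effect: "NoSchedule"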

mrocklin commented 5 years ago

cc @yuvipanda @jacobtomlinson

jacobtomlinson commented 5 years ago

This is interesting. I've not encountered problems with kube-system components preventing scale-down. Which component specifically doesn't move?

In terms of dedicated nodes for Dask workers, I think that is a good idea, and these toleration names look sensible. Please feel free to raise a PR.

mrocklin commented 5 years ago

If we eventually have both schedulers and workers deployed as pods, would that change our toleration/taint names? I can imagine wanting to place schedulers and workers on different node pools.

jacobtomlinson commented 5 years ago

Yes, that sounds sensible. We currently place notebooks and workers on different node pools in Pangeo, and I can imagine schedulers should be separate from workers too.

We may want to consider affinities and anti-affinities instead of taints and tolerations, though, as they are more flexible.
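
To illustrate the affinity alternative, a worker pod spec could steer itself onto dedicated nodes with something like the sketch below (the label key and value, k8s.dask.org/node-purpose: worker, are an assumption for illustration, not an agreed convention):

    # Sketch of node affinity as an alternative to taints/tolerations.
    # The node label key/value below is hypothetical; nodes would need to
    # be labelled accordingly for this to have any effect.
    spec:
      affinity:
        nodeAffinity:
          requiredDuringSchedulingIgnoredDuringExecution:
            nodeSelectorTerms:
            - matchExpressions:
              - key: k8s.dask.org/node-purpose
                operator: In
                values:
                - worker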

dsludwig commented 5 years ago

If we eventually have both schedulers and workers deployed as pods, would that change our toleration/taint names?

Not the name, but the value. For example, we would have another taint k8s.dask.org/dedicated=scheduler.
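
As a sketch of what that could look like, a scheduler pod would then carry a toleration matching a k8s.dask.org/dedicated=scheduler:NoSchedule taint (illustrative only; nothing here is an agreed default yet):

    # Hypothetical toleration for a dedicated scheduler node pool,
    # mirroring the worker toleration above but with value "scheduler".
    tolerations:
    - key: "k8s.dask.org/dedicated"
      operator: "Equal"
      value: "scheduler"
      effect: "NoSchedule"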

This is interesting. I've not encountered problems with kube-system components stopping scale down. Which component specifically doesn't move?

These are some of the ones we've seen: tiller, heapster, metrics-server. I think some of those are GKE-specific, which might be why you haven't seen it on your cluster, @jacobtomlinson.

We had a longer discussion about this here: https://github.com/pangeo-data/pangeo/issues/322#issuecomment-427449658, and one of the cluster autoscaler developers endorsed the idea of using taints/tolerations to solve the problem.

consideRatio commented 5 years ago

@jacobtomlinson I've had trouble with kube-dns pods, for example: they come from a Deployment rather than a DaemonSet and can prevent scale-down. @dsludwig I think heapster comes from a DaemonSet, so that would not prevent scale-down, while tiller would; I'm not sure about metrics-server.

@mrocklin I'm making a PR to add these tolerations to the worker-deployment of the stable/dask helm chart: https://github.com/helm/charts/pull/9529

jacobtomlinson commented 5 years ago

The kube-dns pods shouldn't cause an issue as long as there is enough slack in the pod disruption budget, which I would imagine there is out of the box; if there isn't, then I would recommend updating it yourself.
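
For reference, a pod disruption budget along these lines would give the autoscaler room to evict kube-dns replicas during scale-down (an illustrative sketch; check the labels and any existing budget in your cluster before relying on it):

    # Illustrative PodDisruptionBudget for kube-dns: keep at least one
    # replica available so the others can be evicted when the autoscaler
    # drains a node. The k8s-app label matches the usual kube-dns
    # deployment, but verify against your cluster before applying.
    apiVersion: policy/v1beta1
    kind: PodDisruptionBudget
    metadata:
      name: kube-dns-pdb
      namespace: kube-system
    spec:
      minAvailable: 1
      selector:
        matchLabels:
          k8s-app: kube-dns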