ChameleonCloud / chi-in-a-box

Packaging the systems and operations of the Chameleon testbed
Apache License 2.0

Soufianej taint k3s #291

Closed JOUNAIDSoufiane closed 2 months ago

JOUNAIDSoufiane commented 3 months ago

Added k8s worker taint configuration options

The options to add to site-config are:

Added taint tolerations to core deployments in k3s

The k3s defaults.yaml takes the value of k3s_worker_taint from worker_taint, which is defined in kolla/defaults.yml. That file sets a default for the option and picks up any site-specific value from the k8s_worker_taint option, which can be set in the site config.
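A rough sketch of how that variable chain could look, for illustration only. The fallback taint string and its `key=value:effect` format are assumptions, not values taken from this PR:

```yaml
# kolla/defaults.yml (sketch): use the site value if set, else a fallback.
# "example.org/worker=true:NoSchedule" is a hypothetical placeholder.
worker_taint: "{{ k8s_worker_taint | default('example.org/worker=true:NoSchedule') }}"

# k3s defaults.yaml (sketch): the k3s role reads the resolved value.
k3s_worker_taint: "{{ worker_taint }}"
```

With this layout, a site that never sets k8s_worker_taint still gets a defined value, and a site that does set it overrides the fallback without touching the role defaults.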

Added worker node taint toleration to smarter devices manager daemonsets.

Also templated the NVIDIA device plugin daemonset and added the toleration there as well.
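For reference, a toleration in a daemonset pod template generally takes the shape below. This is a sketch using standard Kubernetes fields; the taint key, value, and effect are hypothetical and would need to match whatever taint is actually configured:

```yaml
# Fragment of a DaemonSet manifest (sketch, not the PR's actual template).
spec:
  template:
    spec:
      tolerations:
        - key: "example.org/worker"   # hypothetical taint key
          operator: "Equal"
          value: "true"               # hypothetical taint value
          effect: "NoSchedule"        # assumed effect
```

In a templated manifest, the key, value, and effect would presumably be derived from the configured taint rather than hard-coded.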

Taint deployment strategy on a running testbed:

  1. Redeploy the k3s playbook to apply the tolerations and relaunch the core daemonsets running on the worker nodes
  2. Set zun_tolerate_worker_taint to True and redeploy Zun
  3. Finally, set doni_enable_worker_taint to True and redeploy

The above sequence ensures that no existing or concurrently launched user pods get evicted, and it causes minimal downtime for the core daemonsets.
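As a sketch, the site-config changes behind steps 2 and 3 might look like this. The option names come from the steps above; where exactly they live in the site config is an assumption:

```yaml
# Site config fragment (sketch). Apply one change at a time, in order.

# Step 2: enable, then redeploy Zun.
zun_tolerate_worker_taint: true

# Step 3: enable only after step 2 has been deployed, then redeploy.
doni_enable_worker_taint: true
```

Enabling the Zun toleration before the taint itself means pods launched in between already tolerate the taint by the time it is applied.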

JOUNAIDSoufiane commented 3 months ago

When deploying K3s, taint device nodes so that they only accept non-control-plane services, and add tolerations for device plugins and other device-specific components.
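On the node side, a taint appears in the Node object roughly as follows. This is a sketch with standard Kubernetes fields; the node name, taint key, and effect are hypothetical:

```yaml
# Fragment of a Node object (sketch).
apiVersion: v1
kind: Node
metadata:
  name: worker-device-01          # hypothetical node name
spec:
  taints:
    - key: "example.org/worker"   # hypothetical taint key
      value: "true"
      effect: "NoSchedule"        # assumed effect
```

With a NoSchedule effect, pods without a matching toleration are not scheduled onto the node, while pods already running there are left alone.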

msherman64 commented 2 months ago

Requires code from:

Zun: https://github.com/ChameleonCloud/zun/pull/20
Doni: https://github.com/ChameleonCloud/doni/pull/138

Order to apply is:

  1. pull and deploy new zun
  2. pull and deploy new doni
  3. deploy this PR

msherman64 commented 2 months ago

note: add worker_node_taint with a | default(something) to kolla/defaults.yml