dataiku / dss-plugin-eks-clusters

Apache License 2.0
4 stars 6 forks source link

Adding labels/taints support, allow for adding more than one node pool at cluster creation and use latest Nvidia driver #62

Closed vrutz closed 5 months ago

vrutz commented 6 months ago

Re-introducing labels and taints support for EKS node groups, updating Nvidia driver to latest and allowing for definition of multiple node groups at cluster creation (while keeping current initial node group definition in config)

When supporting taints on node groups, we have to take into account GPU enabled groups. The Nvidia driver daemonset contains a toleration for the nvidia.com/gpu taint to exist on the nodes it can be applied to, but it needs to be patched for other custom taints.

Regarding taints and the autoscaler, we have to patch the autoscaler configuration with the taints of the autoscaled node pools as tolerations AND we have to add the propagateASGTags to the node pool configuration. This flag will add the labels and taints of the node group to the AutoScaling Group (ASG) as tags. These tags are used to label and taint new nodes on scaling up (new nodes created).

[sc-174884] [sc-165108] [sc-167294] [sc-177800]

shortcut-integration[bot] commented 6 months ago

This pull request has been linked to:

vrutz commented 5 months ago

OK for me but version bump missing in plugin.json

Tested

  • Deploy DSS 12.5.2 via FM in elastic fleet
  • Update plugin with archive built from PR (commit 127439e98aa2cb8310c3f46a718c8f0f43a5b9b2)
  • Create cluster with: 1 node pool simple, 1 node pool with GPU and label, 1 node pool with autoscaler and taints

    • ✅ autoscaler is deployed
    • ✅ nvidia daemon set is present for each node pool
    • ✅ nvidia daemon set version is correct
    • ✅ label is populated correctly
    • ✅ taint is populated correctly
    • ✅ all 3 node pools are created in the cluster
  • ✅ release notes and version bump in changelog
  • ❌ version bump in plugin.json?

We are not going to release the plugin right away and probably push it back to 13.0 and bundle it with more changes. Not sure if we should bump the version then

thtrunck commented 5 months ago

We have the question of supporting also annotation, so let's clear that up with FEs and add it if needed

vrutz commented 5 months ago

We have the question of supporting also annotation, so let's clear that up with FEs and add it if needed

I think annotations could be in their own PR. We are fixing the taints issue here because it'd be a bug, but annotations deserve a new Story card and probably a new PR since this one is already doing 4 different SCs

shortcut-integration[bot] commented 5 months ago

This pull request has been linked to Shortcut Story #177800: [AKS] Allow Nvidia daemon to be installed on GPU enabled machines with taints.

vrutz commented 5 months ago

Also tested that the autoscaler works by forcing load on a lot of pods onto the cluster and I did see the autoscaler updating the capacity

image
amandineslx commented 5 months ago

all good for me