
AKS issue for taint removal #2934

Closed: pedroamaralm closed this issue 1 month ago

pedroamaralm commented 2 years ago

What happened:

Hi, until last month I was able to remove the taint from my nodes, but since then I can't anymore. I haven't made any changes to the nodes or to the policies; I just can't remove it.

The following problem occurs:

p@p:~$ kubectl taint nodes aks-lxnode1-xxxxxxx-vmss00000y kubernetes.azure.com/scalesetpriority=spot:NoSchedule-
Error from server: admission webhook "aks-node-validating-webhook.azmk8s.io" denied the request: (UID: XXXXXX-XXXXX-XXXX-XXXX) Taint delete request "kubernetes.azure.com/scalesetpriority=spot:NoSchedule" refused. User is attempting to delete a taint configured on aks node pool "lxnode1".

What you expected to happen:

If I use the same command on a cluster I have elsewhere, it works normally:

k8s@node1:~$ kubectl taint nodes node2 kubernetes.azure.com/scalesetpriority=spot:NoSchedule
node/node2 tainted
k8s@node1:~$ kubectl taint nodes node2 kubernetes.azure.com/scalesetpriority=spot:NoSchedule-
node/node2 untainted

How to reproduce it (as minimally and precisely as possible):

kubectl taint nodes aks-lxnode1-xxxxxxxx-vmss00000y kubernetes.azure.com/scalesetpriority=spot:NoSchedule-

Anything else we need to know?:

I use spot.io to manage spot instances, and when an instance is created it starts out with the taint. We have a scheduled job that used to remove this taint, but now it doesn't remove it anymore, and as users we can't remove it either. I also have Rancher installed on this cluster, where I'm an administrator, but that still won't let me remove the taint. This problem is not related to spot.io or to the machines being spot instances: I brought up a new cluster from scratch with on-demand machines, and after adding a taint I couldn't remove it there either.

davem-git commented 1 year ago

Has this been provided yet? I haven't seen anything in the release notes about it.

Vipersoft-01 commented 1 year ago

As far as I'm aware, this issue is still present. They have addressed it as intentional behaviour, meant to ensure taints aren't removed by default or too easily. I find it hard to believe this is about 'normal' behaviour rather than about limiting users, which is the opposite of what Kubernetes is made for. Users managing everything themselves, even the resources they use, should have the freedom to use the platform as intended, even on managed offerings like AKS.

7wingfly commented 1 year ago

I've opened a case with Microsoft about it, but based on what I've read in this thread I'm not really expecting a resolution at this point. For now (or more likely forever) I will get around it by adding a default toleration at the namespace level, for every namespace:

apiVersion: v1
kind: Namespace
metadata:
  name: thenamespacename
  annotations:
    scheduler.alpha.kubernetes.io/defaultTolerations: '[{"Key": "kubernetes.azure.com/scalesetpriority", "Operator": "Equal", "Value": "spot", "Effect": "NoSchedule"}]'
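(For context: this annotation is read by the PodTolerationRestriction admission plugin, so the workaround only takes effect on clusters where that alpha plugin is enabled; pods created in the namespace then receive the toleration automatically.)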
ghost commented 1 year ago

This issue has been automatically marked as stale because it has not had any activity for 60 days. It will be closed if no further activity occurs within 15 days of this comment.

davem-git commented 1 year ago

Maybe stale, but not something that should be closed. This is an issue that needs to be resolved.

megakid commented 1 year ago

I would recommend that those coming to this issue resolve it via a Pod mutation (adding the spot toleration) using Kyverno: https://kyverno.io/policies/other/add-tolerations/add-tolerations/. It works much better, and the biggest plus is that AKS cluster autoscaling now works (since the pods requesting resources now tolerate spot).
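For reference, a minimal sketch of that approach for the AKS spot taint, adapted from the linked add-tolerations policy (the policy name is arbitrary; verify the precondition syntax against your Kyverno version):

apiVersion: kyverno.io/v1
kind: ClusterPolicy
metadata:
  name: add-spot-toleration
spec:
  rules:
    - name: add-spot-toleration
      match:
        any:
          - resources:
              kinds:
                - Pod
      # skip pods that already carry the toleration
      preconditions:
        all:
          - key: "{{ request.object.spec.tolerations[?key=='kubernetes.azure.com/scalesetpriority'] || `[]` | length(@) }}"
            operator: Equals
            value: 0
      mutate:
        patchesJson6902: |-
          - op: add
            path: "/spec/tolerations/-"
            value:
              key: kubernetes.azure.com/scalesetpriority
              operator: Equal
              value: "spot"
              effect: NoSchedule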

aservedio commented 1 year ago

I like the kubelet startup-taint idea, but I wish it were a Kubernetes solution, not an AKS-specific one.

Adding node taints via the AKS API and removing them via the k8s API is supported on AWS, GCP, Rancher, Linode, Tencent and, I suspect, more.

It is very useful for initializing nodes before pods land on them, and it's the only method I found that works across all providers, until I tried it here. On those providers it doesn't cause reconciliation issues; the removed taints are most likely just ignored.

Any chance we could follow that flow in AKS as well, or is it too late?

Regards

aservedio commented 1 year ago

Also, I noticed when using Terraform to create an AKS cluster and node pools that, with priority=Spot, a spot-related taint is automatically added to the node pool. That's pretty weird, since labels already exist to let you pick which nodes pods use, and you can add taints yourself when you need them; now I'm forced to deal with a taint that other providers don't add.

Plus, when I terraform apply such a spot node pool twice, TF gets into a conflict with the taint that appeared.

I'd say AKS is creating reconciliation issues for users more than avoiding them 😄

aservedio commented 1 year ago

To summarize, and to be more constructive about this, since it's a major blocker for an important project I've been working on at work:

My project has the interesting and fun requirement of working on all the top providers. I have four so far, which all work the same way. Recently this project and other projects using it gained much more incentive to support AKS, so we started working on it, and so far it looks promising and is 95% working. But we are blocked on one small part.

Basically, we package our solution in a Helm chart, and it has DaemonSets that configure any current and new nodes joining the cluster that match certain labels, so they are ready to pull a variety of large gameserver images in an optimized way. The only super-reliable, cross-provider and very simple approach we found, after trying many different things in live environments, was to add a temporary taint ignore-taint.cluster-autoscaler.kubernetes.io/whatever-you-like to the node pools and have our DaemonSet remove it once the node is ready, as sketched below.
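For illustration, the removal step inside such a DaemonSet boils down to a single kubectl call (a sketch; the taint key is whatever you chose above, NODE_NAME is injected via the downward API from spec.nodeName, and the pod's ServiceAccount needs RBAC permission to patch nodes):

# remove our temporary startup taint from the node this DaemonSet pod runs on
kubectl taint nodes "$NODE_NAME" ignore-taint.cluster-autoscaler.kubernetes.io/whatever-you-like:NoSchedule-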

This works very well on AWS, GCP, Rancher, Linode and Tencent so far. But on AKS we get the error above: we cannot remove our own taint using the k8s API. (Not sure whether the AKS autoscaler has the ignore-taint logic, though.)

I'm pretty sure any project that wants to set up special nodes is using this method too, since those providers allow it precisely for such node-setup purposes.


So yeah, here's my rough suggestion, at the AKS node pool API level:

MarkDeckert commented 1 year ago

This is actually very simple, but it requires a certain respect for the customer.

Microsoft: don't put arbitrary taints on node pools, ever. If you think a node pool should have a taint as a best practice, note it in the documentation; or, if it's truly critical, put a warning somewhere during provisioning.

andreiZi commented 10 months ago

Are there any updates on this? I'm looking to use spot instances for my customers' dev and test environments. Unfortunately, I rely heavily on tools like Consul, which perform numerous side tasks; to configure them I currently have to fork the original Helm chart, which isn't ideal.

A big thank-you to @7wingfly: your workaround was a lifesaver! However, it would be great if Microsoft could address this issue, as it presents a significant limitation for some of us.

aservedio-perso commented 10 months ago

Looking at the next "obvious" alternative to get around this behavior, which differs between AKS and all the other major and mid-size k8s cluster providers:

Does anyone know whether we would be able to create a mutating webhook on core Node objects to achieve this, or is there another mechanism in either k8s or AKS that would block that too?

Regards,

allyford commented 10 months ago

Listening to the feedback on this thread, it seems all the use cases that relied on the prior behavior are along the lines of node initialization. We can add startup taints, passed to the kubelet, to meet this use case. They will not be enforced by AKS or passed on the AKS API; they will live in the kubelet custom settings, and you can then remove them via a k8s API call because they won't be reconciled by the API. Does anyone have a use case that this wouldn't meet?

Otherwise we can provide this ASAP.

See #3464 for the startup taints feature currently in development. I'll post release timelines on that issue and am happy to answer additional questions there.
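(For background, the kubelet primitive behind this is its --register-with-taints flag, which applies taints only at node registration; how AKS will surface it is being defined in #3464. A sketch with a hypothetical taint key:)

# applied once at registration and never reconciled afterwards, so it can later
# be removed with: kubectl taint nodes <node> example.com/initializing:NoSchedule-
kubelet --register-with-taints="example.com/initializing=true:NoSchedule" ...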

aservedio commented 10 months ago

Startup taints passed to the kubelet?

As in: create a cluster, then edit or patch some config map somewhere, and somehow target a node pool by what, labels set using the API?

While everywhere else you just create a cluster and pass the relevant options to achieve this?

It sounds like an AKS-specific workaround, to be honest :/

aservedio commented 10 months ago

In the end, what would be absolutely amazing is doing this through the API and the various toolings around it, like Terraform or Pulumi: create clusters with taints and labels that can be fully managed, deleted or modified using the k8s API later, without this AKS-specific piece of logic that blocks k8s actions on them, contrary to the other six providers I'm working with every week.

Doing otherwise means, at least for our project and its users, that we and they would need to start modifying their cluster-creation flow with something far outside the usual boxes, a new extra flow specifically for AKS.

Also, from what I'm seeing in recent GCP updates on this subject and their API, they might be going the route of two sets of options for both labels and taints: one that can't be changed later and one that can.

Not an unreasonable approach, IMO.

aservedio commented 9 months ago

Positive update: a mutating webhook on Node objects is, on AKS and AWS so far, able to intercept the node create event and mutate/patch its labels and/or taints.

It sets the bar high for something that is so trivial elsewhere, but at least it seems to work.
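For anyone attempting the same, a rough sketch of the webhook registration (name, namespace and path are hypothetical; the webhook server behind the Service must return a JSONPatch that adjusts the node's labels/taints):

apiVersion: admissionregistration.k8s.io/v1
kind: MutatingWebhookConfiguration
metadata:
  name: node-mutator                  # hypothetical
webhooks:
  - name: node-mutator.example.com
    admissionReviewVersions: ["v1"]
    sideEffects: None
    failurePolicy: Ignore             # don't block node registration if the hook is down
    rules:
      - apiGroups: [""]
        apiVersions: ["v1"]
        operations: ["CREATE"]
        resources: ["nodes"]
    clientConfig:
      service:
        name: node-mutator            # hypothetical Service fronting the webhook server
        namespace: kube-system
        path: /mutate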

hacst commented 6 months ago

This issue is neither stale nor solved. @allyford, maybe you can give an update on the implementation progress? We are still waiting for a fix for this regression, and some of the workarounds have ongoing costs for us.

redaER7 commented 6 months ago

az aks nodepool update -g $RESOURCE_GROUP -n $NODEPOOL --cluster-name $CLUSTER --node-taints ""

worked for me, per https://learn.microsoft.com/es-es/cli/azure/aks/nodepool?view=azure-cli-latest. Pass the empty string "" to remove all taints.

ritesh-makerble commented 6 months ago

Any update on this issue? How can the taint be removed at the node pool level?

mattiasb commented 5 months ago

> az aks nodepool update -g $RESOURCE_GROUP -n $NODEPOOL --cluster-name $CLUSTER --node-taints "" worked for me, per https://learn.microsoft.com/es-es/cli/azure/aks/nodepool?view=azure-cli-latest

That solves a subset of the issues discussed in this thread.

What I, @hacst and many others want is the ability to assign a taint to a node pool such that new nodes start with that taint, and to be able to remove the taint from individual nodes at a later point (for example as part of a DaemonSet that does some preparation on the node, like warming up a cache).

hterik commented 4 months ago

The ability to set taints at the node pool level has the potential to create scale-up loops: if the taints are not removed quickly enough by the user, the autoscaler keeps trying to start new nodes because pods could not be scheduled.

The autoscaler provides a feature to prevent this scenario, the --ignore-taint argument, but this option is not exposed by AKS today; opening it up needs to be part of any feature for setting node pool taints. See https://github.com/Azure/AKS/issues/3276
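For context, in the upstream cluster-autoscaler this is the (repeatable) flag sketched below; taints prefixed with ignore-taint.cluster-autoscaler.kubernetes.io/ are ignored by the autoscaler even without it. On AKS the autoscaler is managed, so there is currently no way to pass it:

# tell the autoscaler to ignore a given taint when simulating scheduling
cluster-autoscaler --ignore-taint=example.com/initializing ...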

charlienewey-odin commented 3 months ago

My application is slightly different from others in this thread, but I have had success with @megakid's approach of using Kyverno and suspect it can generalise quite well.

I am using NVIDIA's GPU Operator to configure GPU drivers and MIGs (Multi-Instance GPUs), but the GPU Operator requires a node reboot before MIG can be used. The problem arose because the scheduler wouldn't notice that MIG was not yet configured and would schedule jobs on the node just before it rebooted, which caused an elevated rate of job failure.

I worked around this by adding two Kyverno ClusterPolicy definitions that apply or remove the Unschedulable property based on the presence or absence of labels indicating the node's MIG state. This should be possible to generalise to other situations.

Note: I had to configure the Kyverno deployment to be able to manage Node objects by removing them from resourceFilters in the Kyverno config map (this can be done via values.yaml in the Helm chart); see the sketch after the policies below.

apiVersion: kyverno.io/v1
kind: ClusterPolicy
metadata:
  name: aks-mig-apply-unready-taint
  annotations:
    policies.kyverno.io/title: Cordon Unready MIG Nodes
    policies.kyverno.io/category: Other
    policies.kyverno.io/subject: Node
    kyverno.io/kyverno-version: 1.7.2
    policies.kyverno.io/minversion: 1.7.0
    policies.kyverno.io/description: >-
      AKS nodes that are configured with the NVIDIA GPU Operator and have a MIG profile
      defined need to be rebooted in order to use the feature. This policy cordons the
      node (sets `spec/unschedulable` to true) and labels it `mig=notReady` if the node
      is configured with a MIG profile and the label `nvidia.com/mig.config.state=success`
      is not present.
spec:
  rules:
    - name: aks-mig-apply-unready-taint
      match:
        resources:
          kinds:
            - Node
          operations:
            - CREATE
            - UPDATE
          selector:
            matchLabels:
              nvidia.com/mig.config: "*"
      exclude:
        resources:
          kinds:
            - Node
          selector:
            matchLabels:
              nvidia.com/mig.config.state: "success"
      mutate:
        patchesJson6902: |-
          - path: /spec/unschedulable
            op: add
            value: true
          - path: /metadata/labels/mig
            op: add
            value: notReady
---
apiVersion: kyverno.io/v1
kind: ClusterPolicy
metadata:
  name: aks-mig-remove-unready-taint
  annotations:
    policies.kyverno.io/title: Uncordon Ready MIG Nodes
    policies.kyverno.io/category: Other
    policies.kyverno.io/subject: Node
    kyverno.io/kyverno-version: 1.7.2
    policies.kyverno.io/minversion: 1.7.0
    policies.kyverno.io/description: >-
      AKS nodes that are configured with the NVIDIA GPU Operator and have a MIG profile
      defined need to be rebooted in order to use the feature. This policy uncordons the
      node (sets `spec/unschedulable` to false) and sets the label `mig=ready` once the
      label `nvidia.com/mig.config.state=success` is present.
spec:
  rules:
    - name: aks-mig-remove-unready-taint
      match:
        resources:
          kinds:
            - Node
          operations:
            - CREATE
            - UPDATE
          selector:
            matchLabels:
              nvidia.com/mig.config: "*"
              nvidia.com/mig.config.state: "success"
      mutate:
        patchesJson6902: |-
          - path: /spec/unschedulable
            op: add
            value: false
          - path: /metadata/labels/mig
            op: replace
            value: ready
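Assuming a default install (namespace and config map both named kyverno), the resourceFilters change mentioned above amounts to:

kubectl -n kyverno edit configmap kyverno
# in the resourceFilters data entry, delete "[Node,*,*]" so Kyverno is
# allowed to process (and mutate) Node objects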
microsoft-github-policy-service[bot] commented 2 months ago

This issue has been automatically marked as stale because it has not had any activity for 21 days. It will be closed if no further activity occurs within 7 days of this comment.

microsoft-github-policy-service[bot] commented 1 month ago

This issue has been automatically marked as stale because it has not had any activity for 21 days. It will be closed if no further activity occurs within 7 days of this comment.

microsoft-github-policy-service[bot] commented 1 month ago

This issue will now be closed because it hasn't had any activity for 7 days after being marked stale. pedroamaralm, feel free to comment again within the next 7 days to reopen, or open a new issue after that time if you still have a question/issue or suggestion.