Azure / AKS

Azure Kubernetes Service

[BUG] AKS Istio Addon does not support tolerations #3882

Open amolvgaikwad opened 8 months ago

amolvgaikwad commented 8 months ago

Describe the bug We have an AKS cluster made up of multiple node pools, and all nodes are tainted. Because of the taints, the AKS Istio add-on cannot schedule its pods on any node.

To Reproduce Enable the Istio service mesh add-on on such a cluster; the add-on pods get stuck in the Pending state.

Expected behavior After enabling the AKS Istio add-on, its pods should be up and running. Consider providing a way to add custom tolerations for the pods when enabling the service mesh.


Environment (please complete the following information): Can occur on any cluster with tainted nodes.

SatyKrish commented 7 months ago

The Istio add-on must tolerate the taint CriticalAddonsOnly=true:NoSchedule so that the Istio control plane components can be deployed to the system node pool. https://learn.microsoft.com/en-us/azure/aks/use-system-pools?tabs=azure-powershell#add-a-dedicated-system-node-pool-to-an-existing-aks-cluster

Vegoo89 commented 3 months ago

Hi, is there any update on this issue? This is blocking us from using this feature and migrating from OSM to managed Istio.

amolvgaikwad commented 3 months ago

> Hi, is there any update on this issue? This is blocking us from using this feature and migrating from OSM to managed Istio.

No fix is available yet; you need to patch the deployment.

kubectl patch deploy -n aks-istio-system istiod-asm-1-17 --type='json' -p='[{"op": "add", "path": "/spec/template/spec/tolerations/-", "value": {"key": "CriticalAddonsOnly", "operator": "Exists"}}]'
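
For anyone trying this workaround, here is a minimal sketch of the same patch with the toleration written as a complete object (a toleration needs at least a `key` and an `operator` field). The namespace and deployment name are the ones from the command above; the `APPLY` switch is a hypothetical convenience added here so the command can be reviewed before it touches a cluster. Note that a JSON Patch `add` to `/tolerations/-` only appends to an existing list, so this assumes the deployment already defines `tolerations`.

```shell
#!/bin/sh
# Names taken from the comment above; adjust to the istiod revision in your cluster.
NAMESPACE="aks-istio-system"
DEPLOYMENT="istiod-asm-1-17"

# Toleration that matches any taint with the key CriticalAddonsOnly.
TOLERATION='{"key": "CriticalAddonsOnly", "operator": "Exists"}'

# JSON Patch document: append the toleration to the existing tolerations list.
PATCH="[{\"op\": \"add\", \"path\": \"/spec/template/spec/tolerations/-\", \"value\": ${TOLERATION}}]"

# Print the command for review. APPLY is a hypothetical switch, not a kubectl feature.
printf '%s\n' "kubectl patch deploy -n ${NAMESPACE} ${DEPLOYMENT} --type=json -p='${PATCH}'"

if [ "${APPLY:-0}" = "1" ]; then
  kubectl patch deploy -n "${NAMESPACE}" "${DEPLOYMENT}" --type=json -p="${PATCH}"
fi
```

Run it once without `APPLY` to check the printed command, then with `APPLY=1` to apply it to the current kubectl context.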

Vegoo89 commented 3 months ago

> > Hi, is there any update on this issue? This is blocking us from using this feature and migrating from OSM to managed Istio.
>
> No fix is available yet; you need to patch the deployment.
>
> kubectl patch deploy -n aks-istio-system istiod-asm-1-17 --type='json' -p='[{"op": "add", "path": "/spec/template/spec/tolerations/-", "value": {"key": "CriticalAddonsOnly", "operator": "Exists"}}]'

The add-on keeps deleting and recreating the deployment at random intervals, so the patch doesn't look like a valid option for production.

shashankbarsin commented 3 months ago

@amolvgaikwad - affinity is already specified for istiod pods to prefer scheduling on system node pools.

  affinity:
    nodeAffinity:
      preferredDuringSchedulingIgnoredDuringExecution:
      - preference:
          matchExpressions:
          - key: kubernetes.azure.com/mode
            operator: In
            values:
            - system

As with all other addons, the recommendation is to create system node pools for your cluster rather than to support custom taints/tolerations in the addon enable API.
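
A minimal sketch of that recommendation, assuming the Azure CLI is installed and logged in. The resource group, cluster, and pool names are hypothetical placeholders, and `APPLY` is a hypothetical switch added here for a dry run. The pool is created in System mode without the CriticalAddonsOnly taint, since commenters in this thread report that istiod does not tolerate it.

```shell
#!/bin/sh
# Hypothetical names; replace them with your own resource group, cluster, and pool.
RESOURCE_GROUP="myResourceGroup"
CLUSTER_NAME="myAKSCluster"
POOL_NAME="systempool"
NODE_COUNT=3

# System-mode pools carry the label kubernetes.azure.com/mode=system,
# which is what the istiod node affinity shown above prefers.
POOL_MODE="System"

# Print the command for review. APPLY is a hypothetical switch, not an az feature.
printf '%s\n' "az aks nodepool add --resource-group ${RESOURCE_GROUP} --cluster-name ${CLUSTER_NAME} --name ${POOL_NAME} --node-count ${NODE_COUNT} --mode ${POOL_MODE}"

if [ "${APPLY:-0}" = "1" ]; then
  az aks nodepool add \
    --resource-group "${RESOURCE_GROUP}" \
    --cluster-name "${CLUSTER_NAME}" \
    --name "${POOL_NAME}" \
    --node-count "${NODE_COUNT}" \
    --mode "${POOL_MODE}"
fi
```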

Vegoo89 commented 3 months ago

@shashankbarsin

The AKS documentation explicitly states that the mentioned taint should be used for system node pools so that Microsoft's critical cluster components are scheduled there: https://learn.microsoft.com/en-us/azure/aks/use-system-pools?tabs=azure-cli#add-a-dedicated-system-node-pool-to-an-existing-aks-cluster

The Terraform azurerm_kubernetes_cluster resource has this as a parameter as well: https://registry.terraform.io/providers/hashicorp/azurerm/latest/docs/resources/kubernetes_cluster#only_critical_addons_enabled

I understand that affinity is more flexible, but this is an AKS addon. Adding this specific toleration doesn't break anything if the nodes have no taints.

ToniA commented 1 week ago

Yep, it is somewhat frustrating to see that this hasn't changed since I last tried it in October. I have a test cluster with spot-only node pools, and by default Istio cannot be scheduled on any node:


```
NAME                               READY   STATUS    RESTARTS   AGE
istiod-asm-1-20-596c74f449-62pkc   0/1     Pending   0          2m42s
istiod-asm-1-20-596c74f449-6sckd   0/1     Pending   0          2m53s
```

Vegoo89 commented 1 week ago

After waiting months for any reaction, my only recommendation to anyone who would like to use managed Istio and is missing critical features is this: don't wait, and just install Istio yourself from the Helm chart.

Adding tolerations to the YAML takes one minute, yet this issue has been open for 6 months. Imagine what would happen if there were a bug that needed an immediate fix.

biefy commented 1 week ago

Hi @Vegoo89 and @ToniA, I am sorry for the delay and the frustration you feel. This issue only came to my attention this morning. We are looking into it now. In the meantime, I am assigning this issue to myself to track it.