Azure / karpenter-provider-azure

AKS Karpenter Provider

Karpenter stuck with pod `Pod should schedule on: nodeclaim/...` #432

Open · darklight147 opened this issue 4 months ago

darklight147 commented 4 months ago

Version

Karpenter Version: v0.0.0

Kubernetes Version: v1.0.0

Expected Behavior

Karpenter creates a new node for the pending pods.

Actual Behavior

[screenshot: `kubectl describe pod` events showing `Pod should schedule on: nodeclaim/...`]

The message above is shown when describing the Pod, but no new nodes are created.

Steps to Reproduce the Problem

1. Create an AKS cluster with node auto provisioning enabled.
2. Scale a deployment (nginx, for example) to 20 replicas with a memory request of 8Gi (sketched below).
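
A minimal sketch of that reproduction, assuming a stock nginx image and the requests described above:

```sh
# Create a deployment whose replicas each request 8Gi of memory,
# then scale it out so the pending pods exceed existing node capacity.
kubectl create deployment nginx --image=nginx
kubectl set resources deployment nginx --requests=memory=8Gi
kubectl scale deployment nginx --replicas=20

# Pods stay Pending while Karpenter decides where they should schedule
kubectl get pods -l app=nginx --watch
```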

Resource Specs and Logs

```yaml
apiVersion: karpenter.sh/v1beta1
kind: NodePool
metadata:
  annotations:
    karpenter.sh/nodepool-hash: "12393960163388511505"
    karpenter.sh/nodepool-hash-version: v2
    kubernetes.io/description: General purpose NodePool for generic workloads
    meta.helm.sh/release-name: aks-managed-karpenter-overlay
    meta.helm.sh/release-namespace: kube-system
  creationTimestamp: "2024-07-11T00:34:44Z"
  generation: 1
  labels:
    app.kubernetes.io/managed-by: Helm
    helm.toolkit.fluxcd.io/name: karpenter-overlay-main-adapter-helmrelease
    helm.toolkit.fluxcd.io/namespace: 668f23a48709cf00012ccf73
  name: default
  resourceVersion: "485239"
  uid: 19ab7243-b704-4a1c-b9d8-279243f12865
spec:
  disruption:
    budgets:
    - nodes: 100%
    consolidationPolicy: WhenUnderutilized
    expireAfter: Never
  template:
    spec:
      nodeClassRef:
        name: default
      requirements:
      - key: kubernetes.io/arch
        operator: In
        values:
        - amd64
      - key: kubernetes.io/os
        operator: In
        values:
        - linux
      - key: karpenter.sh/capacity-type
        operator: In
        values:
        - on-demand
      - key: karpenter.azure.com/sku-family
        operator: In
        values:
        - D
status:
  resources:
    cpu: "16"
    ephemeral-storage: 128G
    memory: 106086Mi
    pods: "110"

Bryce-Soghigian commented 4 months ago

Looking at the cluster ID you shared, I see logs like: "creating instance, insufficient capacity, regional on-demand vCPU quota limit for subscription has been reached. To scale beyond this limit, please review the quota increase process here: https://learn.microsoft.com/en-us/azure/quotas/regional-quota-requests"

If you run `kubectl get events | grep karp`, do you see any events like this?
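
A slightly broader variant of that check, in case the events landed in another namespace:

```sh
# Look for Karpenter-related events across all namespaces, newest last
kubectl get events -A --sort-by=.lastTimestamp | grep -i karpenter
```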

Bryce-Soghigian commented 4 months ago

You can unblock your scale-up by requesting additional quota for that region on the subscription, following the steps in this link: https://learn.microsoft.com/en-us/azure/quotas/regional-quota-requests
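
To see how close the subscription is to the regional limit before filing a request, something like the following should work (`<region>` is a placeholder):

```sh
# Show per-family vCPU usage against the regional quota
az vm list-usage --location <region> --output table
```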

darklight147 commented 4 months ago

@Bryce-Soghigian The command returns no events:

[screenshot: empty output from `kubectl get events | grep karp`]

Also, here is the current quota, sorted by current usage:

[screenshot: Azure quota page sorted by current usage]

darklight147 commented 4 months ago

@Bryce-Soghigian hey any update on this? thank you 🚀

maulik13 commented 3 months ago

We are seeing similar behaviour in our cluster. NodeClaims are created, but they never get into the Ready state. We do not see anything special in the events; they only say "Pod should schedule on: nodeclaim/app-g5gcw" for the pod and "Cannot disrupt NodeClaim" for existing nodes.

We have also checked that we have not reached our quota.
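
For anyone debugging the same symptom, the NodeClaim status conditions are a useful first stop (the NodeClaim name is the one quoted above):

```sh
# List NodeClaims and whether they ever became Ready
kubectl get nodeclaims

# The status conditions usually record where provisioning stalled
kubectl describe nodeclaim app-g5gcw
```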

maulik13 commented 3 months ago

Ref: #438. Running `az aks update -n cluster -g rg` fixed the issue for us.
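
The same command with long-form flags, where the cluster and resource group names are placeholders:

```sh
# An update that changes no settings still forces AKS to reconcile
# its managed components back to the desired state.
az aks update --name <cluster-name> --resource-group <resource-group>
```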

maulik13 commented 4 days ago

We are still seeing this behavior from time to time: NodeClaims are created, resulting in new VMs, but the VMs never manage to join the cluster and reach a `Ready=True` state. How do we debug this or provide you with logs? The only workaround is to reconcile the state by running an empty update against the cluster.
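
One way to capture state while a NodeClaim is stuck, before the reconciling update clears it (names in angle brackets are placeholders):

```sh
# NodeClaim status conditions often record why a node never became Ready
kubectl get nodeclaims -o wide
kubectl describe nodeclaim <stuck-nodeclaim>

# Confirm whether the backing VMs were created and check their power state
az vm list --resource-group <managed-resource-group> --show-details --output table
```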