aws / karpenter-provider-aws

Karpenter is a Kubernetes Node Autoscaler built for flexibility, performance, and simplicity.
https://karpenter.sh
Apache License 2.0

Pods are not scheduled on p-type instances, and p-type instances are not terminated after the grace period #2296

Closed · go4real closed this issue 2 years ago

go4real commented 2 years ago

Version

Karpenter: v0.14.0

Kubernetes: v1.21.14

Expected Behavior

Pods requesting GPUs are scheduled onto a provisioned p-type instance.

Actual Behavior

The p-type instances were provisioned correctly. However, the pods stayed in Pending status. After I deleted the pending pods, the provisioned instances were not terminated even after the termination grace period.
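For reference, this state can be checked with something like the following (assuming the default namespace; Karpenter labels the nodes it provisions with karpenter.sh/provisioner-name):

$ kubectl get pods --field-selector=status.phase=Pending
$ kubectl get nodes -l karpenter.sh/provisioner-name=gpu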

Steps to Reproduce the Problem

gpu_provisioner.yaml

apiVersion: karpenter.sh/v1alpha5
kind: Provisioner
metadata:
  name: gpu
spec:
  limits:
    resources:
      nvidia.com/gpu: 100 # fully-qualified extended resource name; a bare "gpu" key would not match any resource
  provider:
    securityGroupSelector:
      kubernetes.io/cluster/dev-blueprint: owned
    subnetSelector:
      aws-cdk:subnet-type: Private
  ttlSecondsAfterEmpty: 10
  requirements:
  - key: karpenter.k8s.aws/instance-family
    operator: In
    values:
    - p3
  - key: karpenter.sh/capacity-type
    operator: In
    values:
    - spot
    - on-demand
  - key: topology.kubernetes.io/zone
    operator: In
    values:
    - us-west-2a
    - us-west-2b
  - key: kubernetes.io/arch
    operator: In
    values:
    - amd64
  taints:
  - key: nvidia.com/gpu
    value: "true"
    effect: NoSchedule

gpu_deployment.yaml

apiVersion: apps/v1
kind: Deployment
metadata:
  name: inflate-gpu
spec:
  replicas: 2
  selector:
    matchLabels:
      app: inflate-gpu
  template:
    metadata:
      labels:
        app: inflate-gpu
    spec:
      terminationGracePeriodSeconds: 0
      tolerations:
      - key: "nvidia.com/gpu"
        operator: "Exists"
        effect: "NoSchedule"
      containers:
        - name: inflate-gpu
          image: public.ecr.aws/eks-distro/kubernetes/pause:3.2
          resources:
            limits:
              nvidia.com/gpu: "1"
            requests:
              nvidia.com/gpu: "1"

Resource Specs and Logs

$ k describe po inflate-gpu-9c89f599b-bwc7q                                                                     
Name:           inflate-gpu-9c89f599b-bwc7q
Namespace:      default
Priority:       0
Node:           <none>
Labels:         app=inflate-gpu
                pod-template-hash=9c89f599b
Annotations:    kubernetes.io/psp: eks.privileged
Status:         Pending
IP:             
IPs:            <none>
Controlled By:  ReplicaSet/inflate-gpu-9c89f599b
Containers:
  inflate-gpu:
    Image:      public.ecr.aws/eks-distro/kubernetes/pause:3.2
    Port:       <none>
    Host Port:  <none>
    Limits:
      nvidia.com/gpu:  1
    Requests:
      nvidia.com/gpu:  1
    Environment:       <none>
    Mounts:
      /var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-9rsjg (ro)
Conditions:
  Type           Status
  PodScheduled   False 
Volumes:
  kube-api-access-9rsjg:
    Type:                    Projected (a volume that contains injected data from multiple sources)
    TokenExpirationSeconds:  3607
    ConfigMapName:           kube-root-ca.crt
    ConfigMapOptional:       <nil>
    DownwardAPI:             true
QoS Class:                   BestEffort
Node-Selectors:              <none>
Tolerations:                 node.kubernetes.io/not-ready:NoExecute op=Exists for 300s
                             node.kubernetes.io/unreachable:NoExecute op=Exists for 300s
                             nvidia.com/gpu:NoSchedule op=Exists
Events:
  Type     Reason            Age                   From               Message
  ----     ------            ----                  ----               -------
  Warning  FailedScheduling  30m (x2 over 30m)     default-scheduler  0/1 nodes are available: 1 Insufficient nvidia.com/gpu.
  Warning  FailedScheduling  28m (x2 over 29m)     default-scheduler  0/2 nodes are available: 1 Too many pods, 2 Insufficient nvidia.com/gpu.
  Warning  FailedScheduling  28m                   default-scheduler  0/2 nodes are available: 1 Insufficient nvidia.com/gpu, 1 node(s) had taint {node.kubernetes.io/unreachable: }, that the pod didn't tolerate.
  Warning  FailedScheduling  28m (x3 over 29m)     default-scheduler  0/2 nodes are available: 1 Insufficient nvidia.com/gpu, 1 node(s) had taint {node.kubernetes.io/not-ready: }, that the pod didn't tolerate.
  Warning  FailedScheduling  25m (x3 over 27m)     default-scheduler  0/2 nodes are available: 2 Insufficient nvidia.com/gpu.
  Warning  FailedScheduling  4m16s (x25 over 25m)  default-scheduler  0/3 nodes are available: 1 node(s) had taint {workload: service}, that the pod didn't tolerate, 2 Insufficient nvidia.com/gpu.
  Normal   Nominate          98s (x15 over 29m)    karpenter          Pod should schedule on ip-10-0-155-67.us-west-2.compute.internal
tzneal commented 2 years ago

You'll need to install the NVIDIA device plugin, which is responsible for registering the GPU resources on the node. See the note here for more information.
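For example, the plugin can be deployed as a DaemonSet straight from the NVIDIA/k8s-device-plugin repository (the v0.12.2 tag is only an example; check the project's releases for a current manifest). Once its pod is running on the GPU node, nvidia.com/gpu should appear under the node's Capacity/Allocatable:

$ kubectl create -f https://raw.githubusercontent.com/NVIDIA/k8s-device-plugin/v0.12.2/nvidia-device-plugin.yml
$ kubectl describe node ip-10-0-155-67.us-west-2.compute.internal | grep -A6 Allocatable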

go4real commented 2 years ago

@tzneal It works fine now. Thank you for the guide!