aws / karpenter-provider-aws

Karpenter is a Kubernetes Node Autoscaler built for flexibility, performance, and simplicity.
https://karpenter.sh
Apache License 2.0

karpenter cannot start g5 type instance in cn-north-1 #7272

Open Bill-Tao-Yang opened 2 days ago

Bill-Tao-Yang commented 2 days ago

Description

Observed Behavior: In the cn-north-1 region, Karpenter cannot provision a g5 instance. The configuration appears to be correct, but the following error message is logged:

incompatible with provisioner \"coder-china-dev1-cn-coder-mepy-dev\", daemonset overhead={\"cpu\":\"331m\",\"memory\":\"275Mi\",\"pods\":\"6\"}, did not tolerate coder-mepy-dev=true:NoSchedule; incompatible with provisioner \"coder-algo-train\", daemonset overhead={\"cpu\":\"331m\",\"memory\":\"275Mi\",\"pods\":\"6\"}, no instance type satisfied resources {\"cpu\":\"3331m\",\"memory\":\"11539Mi\",\"nvidia.com/gpu\":\"1\",\"pods\":\"7\"} and requirements karpenter.k8s.aws/instance-family In [g5], karpenter.k8s.aws/instance-gpu-count In [1], karpenter.sh/capacity-type In [on-demand spot], karpenter.sh/provisioner-name In [coder-algo-train], kubernetes.io/arch In [amd64], kubernetes.io/os In [linux], noderole In [coder-algo-train], topology.kubernetes.io/zone In [cn-north-1a] (no instance type met all requirements)
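
As a sanity check, the resource totals in the error appear to be the pod's requests plus the reported daemonset overhead. A minimal Go sketch of that arithmetic follows; the `Resources` type is hypothetical and is not a Karpenter type:

```go
package main

import "fmt"

// Resources holds quantities in the units used by the error message:
// CPU in millicores, memory in MiB, and a pod count.
// This type is illustrative only.
type Resources struct {
	MilliCPU  int64
	MemoryMiB int64
	Pods      int64
}

// Add returns the element-wise sum of two resource sets.
func (r Resources) Add(o Resources) Resources {
	return Resources{
		MilliCPU:  r.MilliCPU + o.MilliCPU,
		MemoryMiB: r.MemoryMiB + o.MemoryMiB,
		Pods:      r.Pods + o.Pods,
	}
}

func main() {
	// Daemonset overhead as reported in the error message.
	overhead := Resources{MilliCPU: 331, MemoryMiB: 275, Pods: 6}
	// Pod requests from the pod spec: cpu "3", memory 11Gi (11 * 1024 MiB), one pod.
	pod := Resources{MilliCPU: 3000, MemoryMiB: 11 * 1024, Pods: 1}

	total := overhead.Add(pod)
	fmt.Printf("cpu=%dm memory=%dMi pods=%d\n", total.MilliCPU, total.MemoryMiB, total.Pods)
	// prints: cpu=3331m memory=11539Mi pods=7
}
```

The totals match the `cpu`, `memory`, and `pods` values in the error, so the request itself seems to have been aggregated as expected; the failure is in matching an instance type to the requirements.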

Expected Behavior: In the cn-north-1 region, Karpenter can provision a g5 instance.

Reproduction Steps (Please include YAML):

This is my karpenter configuration:

apiVersion: karpenter.k8s.aws/v1alpha1
kind: AWSNodeTemplate
metadata:
  name: coder-china-dev1-cn-coder-algo-train
spec:
  amiFamily: Bottlerocket
  blockDeviceMappings:
  - deviceName: /dev/xvdb
    ebs:
      deleteOnTermination: true
      encrypted: true
      volumeSize: 50Gi
      volumeType: gp3
  securityGroupSelector:
    aws-ids: sg-xxxxxx
  subnetSelector:
    Name: dev1-xxxxx

---
apiVersion: karpenter.sh/v1alpha5
kind: Provisioner
metadata:
  name: coder-algo-train
spec:
  kubeletConfiguration:
    maxPods: 30
  labels:
    noderole: coder-algo-train
  limits:
    resources:
      cpu: "150"
  providerRef:
    name: coder-china-dev1-cn-coder-algo-train
  requirements:
  - key: karpenter.k8s.aws/instance-family
    operator: In
    values:
    - g5
  - key: topology.kubernetes.io/zone
    operator: In
    values:
    - cn-north-1a
  - key: karpenter.k8s.aws/instance-gpu-count
    operator: In
    values:
    - "1"
  - key: kubernetes.io/os
    operator: In
    values:
    - linux
  - key: kubernetes.io/arch
    operator: In
    values:
    - amd64
  - key: karpenter.sh/capacity-type
    operator: In
    values:
    - on-demand
    - spot
  taints:
  - effect: NoSchedule
    key: coder-algo-train
    value: "true"
  ttlSecondsAfterEmpty: 30
  ttlSecondsUntilExpired: 315360000

This is my pod config:

apiVersion: v1
kind: Pod
metadata:
  name: algo-test-7fdbf696f4-rvv5d
  namespace: china-dev1-cn
  ownerReferences:
  - apiVersion: apps/v1
    blockOwnerDeletion: true
    controller: true
    kind: ReplicaSet
    name: coder-admin-algo-test-7fdbf696f4
    uid: 1f09e543-c694-461e-a5ea-eb178f4071e9
  resourceVersion: "53629327"
  uid: 652f06e5-1607-41cb-bda2-e3adf3685dd2
spec:
  affinity:
    podAntiAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
      - labelSelector:
          matchExpressions:
          - key: app.kubernetes.io/name
            operator: In
            values:
            - coder-workspace
        topologyKey: kubernetes.io/hostname
  automountServiceAccountToken: true
  containers:
  - command:
    - sh
    - -c
    image: tf2.7_general2:v24.42.0.rc.3
    imagePullPolicy: Always
    name: dev
    resources:
      limits:
        cpu: "3"
        memory: 11Gi
        nvidia.com/gpu: "1"
      requests:
        cpu: "3"
        memory: 11Gi
        nvidia.com/gpu: "1"
    terminationMessagePath: /dev/termination-log
    terminationMessagePolicy: File
  dnsPolicy: ClusterFirst
  enableServiceLinks: true
  nodeSelector:
    noderole: coder-algo-train
  preemptionPolicy: PreemptLowerPriority
  priority: 0
  restartPolicy: Always
  schedulerName: default-scheduler
  securityContext: {}
  serviceAccount: sa-coder-algo-train
  serviceAccountName: sa-coder-algo-train
  shareProcessNamespace: false
  terminationGracePeriodSeconds: 30
  tolerations:
  - effect: NoSchedule
    key: coder-algo-train
    operator: Equal
    value: "true"
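
The first half of the error ("did not tolerate coder-mepy-dev=true:NoSchedule") is expected for this pod, since it only tolerates the `coder-algo-train` taint. The sketch below is a simplified version of the Kubernetes toleration-matching rule, written with hypothetical `Taint` and `Toleration` types rather than the real `k8s.io/api` ones:

```go
package main

import "fmt"

// Taint and Toleration are simplified, hypothetical stand-ins for the
// Kubernetes core/v1 types of the same name.
type Taint struct{ Key, Value, Effect string }
type Toleration struct{ Key, Operator, Value, Effect string }

// Tolerates reports whether the toleration matches the taint.
// Simplified rule: effect and key must match (an empty field on the
// toleration matches anything), then "Exists" always matches and
// "Equal" (the default) compares values.
func (t Toleration) Tolerates(taint Taint) bool {
	if t.Effect != "" && t.Effect != taint.Effect {
		return false
	}
	if t.Key != "" && t.Key != taint.Key {
		return false
	}
	switch t.Operator {
	case "Exists":
		return true
	case "", "Equal":
		return t.Value == taint.Value
	}
	return false
}

func main() {
	// Toleration from the pod spec above.
	podToleration := Toleration{Key: "coder-algo-train", Operator: "Equal", Value: "true", Effect: "NoSchedule"}

	// Taints of the two provisioners named in the error message.
	algoTrain := Taint{Key: "coder-algo-train", Value: "true", Effect: "NoSchedule"}
	mepyDev := Taint{Key: "coder-mepy-dev", Value: "true", Effect: "NoSchedule"}

	fmt.Println(podToleration.Tolerates(algoTrain)) // true
	fmt.Println(podToleration.Tolerates(mepyDev))   // false
}
```

Under this rule the pod is compatible with the `coder-algo-train` taint, so the taint is not what blocks that provisioner; the remaining failure is the "no instance type satisfied resources ... and requirements" part of the message.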

Versions: v0.32.10

Bill-Tao-Yang commented 1 day ago

I think this may be related to the instance type cache. Our AWS account does not allow access to the public network, so Karpenter cannot query the latest EC2 instance types through the EC2 API and can only use the instance types it was initialized with.

I checked pkg/providers/pricing/zz_generated.pricing_aws_cn.go and found that it contains no g5 instance types. That file was last updated on September 18, 2023, but g5 instances only became usable in China in April 2024.
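
To illustrate the suspicion above, here is a minimal Go sketch that checks whether a static price table has any entry for an instance family. The `hasFamily` helper, the table contents, and the prices are all hypothetical and are not taken from the generated file; whether a missing price entry actually causes Karpenter to drop an instance type is an assumption that would need to be confirmed against the v0.32.10 source.

```go
package main

import (
	"fmt"
	"strings"
)

// hasFamily reports whether any instance type in the table belongs to
// the given family. The trailing dot keeps "g5" from matching "g5g.*".
func hasFamily(prices map[string]float64, family string) bool {
	prefix := family + "."
	for instanceType := range prices {
		if strings.HasPrefix(instanceType, prefix) {
			return true
		}
	}
	return false
}

func main() {
	// Hypothetical stand-in for a static fallback table generated before
	// g5 was available in the region; the prices are made up.
	staticPrices := map[string]float64{
		"g4dn.xlarge": 1.0,
		"p3.2xlarge":  2.0,
	}

	fmt.Println(hasFamily(staticPrices, "g5"))   // false
	fmt.Println(hasFamily(staticPrices, "g4dn")) // true
}
```

If the controller can only fall back to such a table, a family that is absent from it would have no known price, which is consistent with the "no instance type met all requirements" error.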