aws / karpenter-provider-aws

Karpenter is a Kubernetes Node Autoscaler built for flexibility, performance, and simplicity.
https://karpenter.sh
Apache License 2.0

EC2 Spot instance using generation 1-3 only #2568

Closed yuriydzobak closed 2 years ago

yuriydzobak commented 2 years ago

Version

Karpenter Version: v0.16.1

Kubernetes Version: v1.23.9 region: us-east-1

Expected Behavior

Create spot instances generation 4-6

Actual Behavior

When I use spot instances, Karpenter finds only gen 3 instance types. When I hardcode other generations, it shows an error like this:

2022-09-28T13:31:17.842Z    ERROR   controller.provisioning Could not schedule pod, incompatible with provisioner "spot", no instance type satisfied resources {"cpu":"1","pods":"1"} and requirements karpenter.k8s.aws/instance-family In [c4 m4 m5 m5a m6a and 3 others], karpenter.sh/capacity-type In [spot], kubernetes.io/arch In [amd64], karpenter.sh/provisioner-name In [spot]   {"commit": "5d4ae35-dirty", "pod": "kube-system/inflate-69f485d6c8-6nw8q"

Steps to Reproduce the Problem

---
apiVersion: karpenter.sh/v1alpha5
kind: Provisioner
metadata:
  name: spot
spec:
  requirements:
    - key: node.kubernetes.io/instance-type
      operator: In
      values:
      - m5.large
      - m5.xlarge
      - m5.2xlarge
      - c5.large
      - c5.xlarge
      - c5a.large
      - c5a.xlarge
      - r5.large
      - r5.xlarge    
    - key: "karpenter.sh/capacity-type"
      operator: In
      values: ["spot"]
  limits:
    resources:
      cpu: 1000
  providerRef:
    name: spot
  ttlSecondsAfterEmpty: 120
---
apiVersion: karpenter.k8s.aws/v1alpha1
kind: AWSNodeTemplate
metadata:
  name: spot
spec:
  amiFamily: Custom
  amiSelector:
    aws-ids: my-custom
  subnetSelector:                          
    karpenter.sh/discovery: e2e
  securityGroupSelector:                      
    karpenter.sh/discovery: e2e
  instanceProfile: ""

deployment

apiVersion: apps/v1
kind: Deployment
metadata:
  name: inflate
spec:
  replicas: 0
  selector:
    matchLabels:
      app: inflate
  template:
    metadata:
      labels:
        app: inflate
    spec:
      terminationGracePeriodSeconds: 0
      containers:
        - name: inflate
          image: public.ecr.aws/eks-distro/kubernetes/pause:3.2
          resources:
            requests:
              cpu: 1
      nodeSelector:              
        karpenter.sh/capacity-type: spot

kubectl scale deployment --replicas 2 inflate

Resource Specs and Logs

With version 0.16.0, everything works:

2022-09-29T11:20:48.441Z    INFO    controller.provisioning Found 1 provisionable pod(s)    {"commit": "639756a"}
2022-09-29T11:20:48.441Z    INFO    controller.provisioning Computed 1 new node(s) will fit 1 pod(s)    {"commit": "639756a"}
2022-09-29T11:20:48.441Z    INFO    controller.provisioning Launching node with 1 pods requesting {"cpu":"1445m","memory":"651Mi","pods":"7"} from types c5a.large, c5.large, m5.large, r5.large, c5a.xlarge and 4 other(s) {"commit": "639756a", "provisioner": "spot"}
2022-09-29T11:20:48.660Z    DEBUG   controller.provisioning.cloudprovider   Discovered security groups: [sg-00dbdc80a8747f4d0]  {"commit": "639756a", "provisioner": "spot"}
2022-09-29T11:20:48.670Z    DEBUG   controller.provisioning.cloudprovider   Discovered kubernetes version 1.23  {"commit": "639756a", "provisioner": "spot"}
2022-09-29T11:20:48.715Z    DEBUG   controller.provisioning.cloudprovider   Discovered images: [ami-03722da6062481565]  {"commit": "639756a", "provisioner": "spot"}
2022-09-29T11:20:51.868Z    INFO    controller.provisioning.cloudprovider   Launched instance: i-0d6205de49a9a93da, hostname: ip-10-255-255-38.ec2.internal, type: r5.large, zone: us-east-1b, capacityType: spot   {"commit": "639756a", "provisioner": "spot"}

With versions 0.16.1 through 0.16.3:

2022-09-29T11:25:01.349Z    ERROR   controller.provisioning Could not schedule pod, incompatible with provisioner "spot", no instance type satisfied resources {"cpu":"1","pods":"1"} and requirements karpenter.sh/capacity-type In [spot], kubernetes.io/arch In [amd64], karpenter.sh/provisioner-name In [spot], node.kubernetes.io/instance-type In [c5.large c5.xlarge c5a.large c5a.xlarge m5.2xlarge and 4 others]  {"commit": "b157d45", "pod": "kube-system/inflate-69f485d6c8-9pqlh"}
2022-09-29T11:25:06.311Z    ERROR   controller.provisioning Could not schedule pod, incompatible with provisioner "spot", no instance type satisfied resources {"cpu":"1","pods":"1"} and requirements karpenter.sh/provisioner-name In [spot], node.kubernetes.io/instance-type In [c5.large c5.xlarge c5a.large c5a.xlarge m5.2xlarge and 4 others], karpenter.sh/capacity-type In [spot], kubernetes.io/arch In [amd64]  {"commit": "b157d45", "pod": "kube-system/inflate-69f485d6c8-9pqlh"}

No other errors are shown.

njtran commented 2 years ago

Did you change your provisioner at all? It looks like your log lines show that you have karpenter.k8s.aws/instance-family In [c4 m4 m5 m5a m6a and 3 others] for your instance-family requirements, but I don't see that in your provisioner.

Additionally in your AWSNodeTemplate, it looks like you're using a custom AMI but not also specifying your UserData. Without this, Karpenter does not know how to bootstrap your node. More info here

yuriydzobak commented 2 years ago

> Did you change your provisioner at all? It looks like your log lines show that you have karpenter.k8s.aws/instance-family In [c4 m4 m5 m5a m6a and 3 others] for your instance-family requirements, but I don't see that in your provisioner.

Hi, sorry for the confusion. I also tried restricting by instance family and by generation, and the issue is the same. The first log line comes from a different provisioner configuration.

> Additionally in your AWSNodeTemplate, it looks like you're using a custom AMI but not also specifying your UserData. Without this, Karpenter does not know how to bootstrap your node. More info here

I know, thanks

njtran commented 2 years ago

I was able to provision a node on v0.16.3 with this PodSpec, which includes a nodeSelector:

apiVersion: apps/v1
kind: Deployment
metadata:
  name: inflate
spec:
  replicas: 0
  selector:
    matchLabels:
      app: inflate
  template:
    metadata:
      labels:
        app: inflate
    spec:
      terminationGracePeriodSeconds: 0
      containers:
        - name: inflate
          image: public.ecr.aws/eks-distro/kubernetes/pause:3.2
          resources:
            requests:
              cpu: 1
      nodeSelector:
        karpenter.k8s.aws/instance-generation: "5"
        karpenter.sh/capacity-type: spot

and ProvisionerRequirements:

    requirements:
    - key: karpenter.k8s.aws/instance-generation
      operator: In
      values:
      - "4"
      - "5"
      - "6"
    - key: karpenter.sh/capacity-type
      operator: In
      values:
      - spot
    - key: kubernetes.io/arch
      operator: In
      values:
      - amd64

yuriydzobak commented 2 years ago

I can reproduce the issue only with KIAM (kube2iam). When I use the host network with an IAM role on the instance, Karpenter can find spot gen 4-6.

njtran commented 2 years ago

Do you have more steps on how you ran into this? Are you using v0.16.3? Are your provisioning requirements the same as the ones I ran above? If so, I'm interested how Karpenter's provisioning logic might have changed for you.

yuriydzobak commented 2 years ago

> Do you have more steps on how you ran into this? Are you using v0.16.3? Are your provisioning requirements the same as the ones I ran above? If so, I'm interested how Karpenter's provisioning logic might have changed for you.

The steps are the same as in https://karpenter.sh/v0.16.3/getting-started/migrating-from-cas/. The only difference is that I don't use IRSA; I'm using KIAM. Yes, I'm using 0.16.3. I saw that a new version was published but haven't checked it yet.

yuriydzobak commented 2 years ago

I've checked on another AWS account and I can't reproduce the issue there. I'm still trying to find out what's wrong with my AWS account.

yuriydzobak commented 2 years ago

Hi, I tested versions 0.17.0 and 0.18.0, but the issue still exists. I think #2283 is what broke Karpenter.

If I run the following command in the AWS account where Karpenter can't find gen 4-6, the result is empty. NOTE: It works if product-description is Linux/UNIX (Amazon VPC).

aws ec2 describe-spot-price-history --availability-zone us-east-1b --filters Name=instance-type,Values=m5a.large Name=product-description,Values=Linux/UNIX --region us-east-1
{
    "SpotPriceHistory": []
}

For comparison, if I run it in an account where Karpenter works correctly:

aws ec2 describe-spot-price-history --availability-zone us-east-1b --filters Name=instance-type,Values=m5a.large Name=product-description,Values=Linux/UNIX --region us-east-1
{
    "SpotPriceHistory": [
        {
            "AvailabilityZone": "us-east-1b",
            "InstanceType": "m5a.large",
            "ProductDescription": "Linux/UNIX",
            "SpotPrice": "0.046600",
            "Timestamp": "2022-10-15T23:59:06.000Z"
        },
        {
            "AvailabilityZone": "us-east-1b",
            "InstanceType": "m5a.large",
            "ProductDescription": "Linux/UNIX",
            "SpotPrice": "0.046500",
            "Timestamp": "2022-10-15T18:45:02.000Z"
        }, 
        ..........
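
The two queries above can be compared side by side for both product descriptions. The following is a minimal sketch, not part of the original report: the helper names (`build_query`, `count_entries`, `compare_descriptions`) are hypothetical, and it assumes the `aws` CLI is installed and configured. It uses the CLI's `--instance-types` and `--product-descriptions` options instead of `--filters`, so the description containing spaces and parentheses is passed as a single argument.

```python
import json
import subprocess

# The two product descriptions discussed in this thread.
PRODUCT_DESCRIPTIONS = ["Linux/UNIX", "Linux/UNIX (Amazon VPC)"]


def build_query(instance_type, zone, region, product_description):
    """Return the argv list for one describe-spot-price-history call."""
    return [
        "aws", "ec2", "describe-spot-price-history",
        "--availability-zone", zone,
        "--instance-types", instance_type,
        "--product-descriptions", product_description,
        "--region", region,
        "--output", "json",
    ]


def count_entries(raw_json):
    """Count SpotPriceHistory entries in the CLI's JSON output."""
    return len(json.loads(raw_json).get("SpotPriceHistory", []))


def compare_descriptions(instance_type, zone, region, run=subprocess.check_output):
    """Run the query once per product description and report entry counts.

    `run` is injectable so the logic can be exercised without calling AWS.
    """
    return {
        desc: count_entries(run(build_query(instance_type, zone, region, desc)))
        for desc in PRODUCT_DESCRIPTIONS
    }

# Example call (requires AWS credentials):
#   compare_descriptions("m5a.large", "us-east-1b", "us-east-1")
```

In an affected account, one would expect a count of zero for `Linux/UNIX` and a non-zero count for `Linux/UNIX (Amazon VPC)`, matching the NOTE above.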

After I enabled

controller:
  env:
  - name: AWS_ISOLATED_VPC
    value: "true"

it's working now 😎. But why doesn't Karpenter show any error? And why did it work on 0.16.0 without this setting? The VPC has NAT. See https://karpenter.sh/v0.18.0/troubleshooting/#stale-pricing-data-on-isolated-subnet

tzneal commented 2 years ago

Using AWS_ISOLATED_VPC causes Karpenter to use its static, fixed price list. It will never try to pull spot or on-demand prices, so that's why it's working for you.

From what I can tell, if your account still supports EC2 Classic, then the instance type descriptions are Linux/UNIX (Amazon VPC) for the non-EC2 Classic types, to differentiate them from the EC2 Classic ones.

If your account does not support EC2 Classic, then the instance type descriptions are always Linux/UNIX.

The code currently only filters for spot prices with a description of Linux/UNIX, which for accounts that support EC2 Classic matches the Classic instance types only. There may not be any of those types available via spot, as EC2 Classic is being retired.

We'll need to make a change to identify the non-Classic types in Classic-supporting accounts, as the query is different.
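
One way to handle both kinds of account is to accept either description and prefer the VPC-labelled entry when both exist. The sketch below is a hypothetical illustration only, not Karpenter's actual code or the eventual fix. The function name `latest_linux_spot_prices` is invented here, and it assumes entries shaped like the `SpotPriceHistory` items shown earlier in this thread, with timestamps in the same UTC ISO-8601 format so that string comparison orders them correctly.

```python
from decimal import Decimal

CLASSIC_DESCRIPTION = "Linux/UNIX"
VPC_DESCRIPTION = "Linux/UNIX (Amazon VPC)"


def latest_linux_spot_prices(entries):
    """Map (instance type, zone) to the most relevant Linux spot price.

    Hypothetical helper: accepts both descriptions, ranks VPC-labelled
    entries above plain "Linux/UNIX" ones, then newer above older.
    """
    best = {}
    for entry in entries:
        desc = entry["ProductDescription"]
        if desc not in (CLASSIC_DESCRIPTION, VPC_DESCRIPTION):
            continue  # skip Windows, RHEL, SUSE, and so on
        key = (entry["InstanceType"], entry["AvailabilityZone"])
        # Assumption: timestamps share one UTC ISO-8601 format, so comparing
        # them as strings matches chronological order.
        rank = (desc == VPC_DESCRIPTION, entry["Timestamp"])
        if key not in best or rank > best[key][0]:
            best[key] = (rank, Decimal(entry["SpotPrice"]))
    return {key: price for key, (_, price) in best.items()}
```

In an account without EC2 Classic only `Linux/UNIX` entries appear, so the preference for VPC-labelled entries has no effect there.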