aws / karpenter-provider-aws

Karpenter is a Kubernetes Node Autoscaler built for flexibility, performance, and simplicity.
https://karpenter.sh
Apache License 2.0
6.15k stars 849 forks source link

bug: deprecated AMIs should be able to be listed when specifying ami id on the amiSelectorTerms #6393

Open shabbskagalwala opened 1 week ago

shabbskagalwala commented 1 week ago

Description

Observed Behavior: We pin amis using ami id on our ec2 node class to be able to turn on drift and avoid untested ami's from reaching production. With this we ran into an issue where an deprecated AMI that is not owned by our AWS account was not discovered by Karpeneter when creating a new nodeclaim from an existing nodepool/ec2nodeclass.

An AMI that is deprecated is still discoverable and can be used https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/ami-deprecate.html this is evident from using the aws cli (see the example below)

aws cli describe-images ✗ aws ec2 describe-images --region us-west-2 --image-ids ami-002c17c43a2d082b5 { "Images": [ { "Architecture": "x86_64", "CreationDate": "2024-03-17T07:09:41.000Z", "ImageId": "ami-002c17c43a2d082b5", "ImageLocation": "*******794/***-amazon-eks-node-1.27-v20240307-yellow-MTcxMDY1NTcyMC0y", "ImageType": "machine", "Public": false, "OwnerId": "*******794", "PlatformDetails": "Linux/UNIX", "UsageOperation": "RunInstances", "State": "available", "BlockDeviceMappings": [ { "DeviceName": "/dev/xvda", "Ebs": { "DeleteOnTermination": true, "Iops": 3000, "SnapshotId": "snap-0cfc13de6d2d8d426", "VolumeSize": 10, "VolumeType": "gp3", "Throughput": 125, "Encrypted": true } } ], "Description": "*** EKS Kubernetes Worker AMI with AmazonLinux2 image, (k8s: 1.27.9, containerd: 1.7.*)", "EnaSupport": true, "Hypervisor": "xen", "Name": "***-amazon-eks-node-1.27-v20240307-yellow-MTcxMDY1NTcyMC0y", "RootDeviceName": "/dev/xvda", "RootDeviceType": "ebs", "SriovNetSupport": "simple", "Tags": [ .... .... .... ], "VirtualizationType": "hvm", "DeprecationTime": "2024-06-19T06:09:00.000Z" } ] }

The above ami as can be seen on the DeprecationTime field was deprecated and Karpenter failed to find the ami when provisioning a new node for an existing nodepool/ec2nodeclass.

When scale up event happened on the node pool we saw the following message

no ami matching constraint found 

An example our ec2nodeclass

ec2 node class ✗ k get ec2nodeclass sandbox -o yaml apiVersion: karpenter.k8s.aws/v1beta1 kind: EC2NodeClass metadata: annotations: karpenter.k8s.aws/ec2nodeclass-hash: "2915896952981468854" kubernetes.io/description: NodePool provisioned for ASL sandbox in dev-cluster meta.helm.sh/release-name: sandbox-workers meta.helm.sh/release-namespace: sandbox creationTimestamp: "2024-04-04T16:56:25Z" finalizers: - karpenter.k8s.aws/termination generation: 2 labels: app.kubernetes.io/managed-by: Helm name: sandbox resourceVersion: "981834955" uid: e17e714f-9b90-4119-aee8-225b66f6574f spec: amiFamily: AL2 amiSelectorTerms: - id: ami-002c17c43a2d082b5

Digging further into the code we saw that there is a field on the sdk https://github.com/aws/aws-sdk-go-v2/blob/main/service/ec2/api_op_DescribeImages.go#L183 which will enable listing deprecated amis for when an ami is not owned by the current aws account. This is by default set to false but it will list deprecated ami which are owned by the current aws account itself but not when the owner is different.

The amiFamily controller might need to add this flag when the owner of the ami is different than the current aws account id https://github.com/aws/karpenter-provider-aws/blob/main/pkg/providers/amifamily/ami.go

NOTE: These are private AMIs not owned by AWS

Expected Behavior:

Karpenter should be able to detect deprecated amis and not fail provisioning new node claims and instances

Reproduction Steps (Please include YAML):

  1. Use an AMI that is deprecated and not owned by the AWS account where Karpenter is running.
  2. Reference that AMI in the ec2nodeclass using amiSelectorTerms and id
  3. Wait for Karpenter to provision the nodeclaim and fail with an message on the status no ami matching constraint found

Versions:

✗ kubectl version
WARNING: This version information is deprecated and will be replaced with the output from kubectl version --short.  Use --output=yaml|json to get the full version.
Client Version: version.Info{Major:"1", Minor:"26", GitVersion:"v1.26.2", GitCommit:"fc04e732bb3e7198d2fa44efa5457c7c6f8c0f5b", GitTreeState:"clean", BuildDate:"2023-02-22T13:39:03Z", GoVersion:"go1.19.6", Compiler:"gc", Platform:"darwin/arm64"}
Kustomize Version: v4.5.7
Server Version: version.Info{Major:"1", Minor:"27+", GitVersion:"v1.27.13-eks-3af4770", GitCommit:"4873544ec1ec7d3713084677caa6cf51f3b1ca6f", GitTreeState:"clean", BuildDate:"2024-04-30T03:31:44Z", GoVersion:"go1.21.9", Compiler:"gc", Platform:"linux/amd64"}
jigisha620 commented 1 week ago

Wondering if there is a reason for using a deprecated AMI?

shabbskagalwala commented 1 week ago

Wondering if there is a reason for using a deprecated AMI?

When an AMI is deprecated, it affects the entire node pool. So any scale up/scheduling events for pods are stuck when Karpenter can't find the AMI.

We have to pin AMI's using ids due to issues for example and also to be able to use drift so Karpenter doesn't always get unvetted amis to production workloads. By default existing ASGs and LaunchTemplates are able to detect deprecated AMIs and continue functioning without affecting workloads but Karpenter fails in this scenario.

Edit:

To provide more context, let's assume there are 5 different node pools per EKS cluster across 10 clusters, each in multiple AWS accounts. Keeping track of AMI deprecations and updating every AMI for each node pool and architecture to ensure workloads are uninterrupted can become a significant operational burden. Existing node pools should continue to function and avoid causing pod scheduling failures, even if a node pool declares an AMIs that has been deprecated.

jonathan-innis commented 1 week ago

By default existing ASGs and LaunchTemplates are able to detect deprecated AMIs and continue functioning without affecting workloads but Karpenter fails in this scenario

What about prioritizing the non-deprecated AMIs in front of the deprecated ones? By nature of what the word "deprecated" means, this seems like the right behavior to me -- discover them but just don't take them if you non-deprecated ones available

jonathan-innis commented 1 week ago

We have to pin AMI's using ids due to issues for example and also to be able to use drift so Karpenter doesn't always get unvetted amis to production workloads

I assume that a deprecation for your organization happens without a lot of coordination if the rug can get pulled out from under you like it did here. How do you expect to orchestrate the upgrade after the AMI has been deprecated? e.g. Something like continuing to launch instances on the deprecated AMI until you change the ID term to get to a non-deprecated one.

shabbskagalwala commented 1 week ago

I assume that a deprecation for your organization happens without a lot of coordination if the rug can get pulled out from under you like it did here

Not really there are set time periods for this to happen for different AMIs across different architectures. For karpenter to be able to detect deprecated AMIs means that the workloads are unaffected and continue functioning for whenever there is a burst in traffic and the HPA kicks in for example.

How do you expect to orchestrate the upgrade after the AMI has been deprecated?

Ideally, yes like you suggested. Vet the AMIs in non production environment and promote them to production by updating the ec2nodeclasses. With ASGs and cluster auto scaler this is how it would work, where the ASG would scale up the nodes and not affect the existing workloads and I understand that Karpenter != ASG but having somewhat of the similar functionality would help where it doesn't become a race against time to upkeep AMIs along with the EKS cluster upgrades :)

I am willing to work on this, if you and the community think this is something that can be included in Karpenter.

shabbskagalwala commented 1 week ago

What about prioritizing the non-deprecated AMIs in front of the deprecated ones? By nature of what the word "deprecated" means, this seems like the right behavior to me -- discover them but just don't take them if you non-deprecated ones available

Thats one option but Karpenter currently cannot even discover AMIs that have been deprecated. Another probable solution would be to have a field includeDeprecated on the amiSelectorTerms when an id is specified - which by default can be set to false.

Edit: somewhat similar to how the terraform ami datasource does it https://registry.terraform.io/providers/hashicorp/aws/latest/docs/data-sources/ami#include_deprecated