Closed: VladFCarsDevops closed this issue 7 months ago.
Can you share the status from your EC2NodeClass? Typically, you will see this error when Karpenter isn't able to discover your subnets and you don't have any zones that Karpenter can leverage for scheduling pods to instance types.
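For reference, when the subnets are being discovered correctly, the EC2NodeClass status lists them along with their zones; below is a minimal sketch of what to look for (resource IDs and names are made up, not from this issue), e.g. in the output of kubectl get ec2nodeclass default -o yaml:

status:
  subnets:
    - id: subnet-0abc1234def567890   # placeholder ID
      zone: us-east-1a
    - id: subnet-0fed9876cba543210   # placeholder ID
      zone: us-east-1b
  securityGroups:
    - id: sg-0123456789abcdef0       # placeholder ID
      name: np-244-node-sg           # placeholder name

If status.subnets is empty, there are no zones for Karpenter to schedule into, which is the situation described above.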
Hi @jonathan-innis Thanks for responding! I figured out the issue. The core problem was that the initial error message:
Could not schedule pod, incompatible with nodepool "np-244-nodepool", daemonset overhead={"cpu":"780m","memory":"1120Mi","pods":"6"}, no instance type satisfied resources
was misleading! The problem was that the EC2NodeClass manifest was referring to a role:
role: "${var.workspace}-karpenter-controller"
rather than the InstanceProfile. The official documentation (https://karpenter.sh/v0.33/concepts/nodeclasses/) mentions that InstanceProfile is optional and that you can specify either role or InstanceProfile. In my case it did not work with role but did work with InstanceProfile, so the error message gave me no useful information for debugging and instead led me in the other direction; it was resolved purely by experimenting. I guess the log output should mention the necessity of having InstanceProfile, and maybe the documentation should be updated?
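For anyone hitting the same thing, here is a minimal sketch of the two alternatives in the v1beta1 EC2NodeClass spec (names and tag values are placeholders, not taken from this issue); per the docs you set one or the other, not both:

apiVersion: karpenter.k8s.aws/v1beta1
kind: EC2NodeClass
metadata:
  name: default
spec:
  amiFamily: AL2
  # Option A: give Karpenter a node role and let it manage the instance profile
  # (the controller then needs iam:CreateInstanceProfile and related permissions)
  # role: "my-karpenter-node-role"                      # placeholder name
  # Option B: reference an instance profile you created yourself
  instanceProfile: "my-KarpenterNodeInstanceProfile"    # placeholder name
  subnetSelectorTerms:
    - tags:
        karpenter.sh/discovery: "my-cluster"            # placeholder tag value
  securityGroupSelectorTerms:
    - tags:
        karpenter.sh/discovery: "my-cluster"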
@jonathan-innis Another issue that I noticed: when I create the NodePool and EC2NodeClass from plain Kubernetes YAML manifests, everything works as expected. However, if I create the NodePool and EC2NodeClass from the kubernetes_resource Terraform resource, it results in the same error even though the configuration is identical. It works with kubectl_manifest, though.
In my case, it did not work with role, but worked with InstanceProfile
@VladFCarsDevops are you using a private cluster?
However, if I create NodePool and EC2NodeClass from the kubernetes_resource terraform resource, it will result in the same error even though the configuration is identical
Is this when you are using role for the EC2NodeClass?
@engedaam My EKS cluster endpoints are both Public and Private.
Yes! When I switched to using InstanceProfile instead of role in the EC2NodeClass, it fixed that misleading error. The official documentation gives you the option to set one of them, but not both, since setting both results in an error when applying. My role had almost full access permissions and was correctly attached, but it still produced the errors I posted above until I switched to InstanceProfile.
@VladFCarsDevops would you be willing to make a PR for the documentation update?
@engedaam Sure, can you point me to the right location?
It would be here https://github.com/aws/karpenter-provider-aws/tree/main/website/content/en. You will need to make the same changes to v0.32, v0.33, and v0.34
was misleading! The problem was that the EC2NodeClass manifest was referring to a role

@VladFCarsDevops This is surprising to me. From what I know about the current state of the code, we shouldn't return a different response during scheduling when using an instance profile vs. using a role. It's a bit hard to parse the Terraform manifests that you pasted above (also, unfortunately, none of the maintainers on the Karpenter team are TF experts). Do you have direct access to the cluster, and if so, could you post the YAML version of the EC2NodeClass and NodePool when you use the instance profile vs. when you use the role?
Also, as for surfacing this information better: we're currently talking about how we can improve observability for Karpenter using status conditions across all of our resources. This is discussed here: https://github.com/kubernetes-sigs/karpenter/issues/493. I'd imagine that surfacing a condition like InstanceProfileReady directly would have helped debug here.
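Purely as an illustration (InstanceProfileReady is a hypothetical name from the comment above, not an existing condition), such a status condition might surface like this:

status:
  conditions:
    - type: InstanceProfileReady        # hypothetical condition type
      status: "False"
      reason: InstanceProfileNotFound   # hypothetical reason
      message: instance profile could not be resolved for the configured role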
@jonathan-innis Oh, I tried creating the EC2NodeClass and NodePool with plain YAMLs and had the same errors up until I changed from role to an InstanceProfile; a friend of mine working at another company faced the same issue. I think updating the instructions in the docs will save people a ton of time debugging a misleading log output when it has nothing to do with resources.
Oh, I tried creating the EC2NodeClass and NodePool with plain YAMLs and had the same errors up until I changed from role to an InstanceProfile
I don't disagree if this is really what is happening, but what I am trying to say is that these issues seem potentially unrelated to me. From looking over the code and reasoning about where we evaluate the instance profile and the role when it comes to making scheduling decisions, we don't have them affect scheduling decisions at all, which is why I'm thinking that it's odd that you are seeing "Could not schedule pod, incompatible with nodepool" and pointing back to the fact that you were using a role vs. an instance profile as the reason. Do you know if the EC2NodeClass that you were referencing was properly resolving the subnets or security groups that you were specifying by checking the status?
One common problem we see is that subnets don't get resolved; the instance types then can't produce any zones, so you will see the error you pasted here when scheduling pods.
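As a side note, discovery is driven entirely by the selector tags, so the subnets themselves must carry the matching tag. A minimal Terraform sketch of tagging existing subnets (var.private_subnet_ids is an assumed variable, not from this issue) could look like:

# Tag each cluster subnet so that subnetSelectorTerms can discover it.
resource "aws_ec2_tag" "karpenter_subnet_discovery" {
  for_each    = toset(var.private_subnet_ids)
  resource_id = each.value
  key         = "karpenter.sh/discovery"
  value       = var.workspace
}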
@VladFCarsDevops can you please share your final ec2nc and where you got the right instance profile from?
@ahoehma You have to create the instance profile separately, give it permissions, and reference it in your ec2nc; there is a sketch of this right after the manifests below.
resource "kubectl_manifest" "nodepool" {
yaml_body = <<-YAML
apiVersion: karpenter.sh/v1beta1
kind: NodePool
metadata:
name: default
spec:
template:
spec:
requirements:
- key: kubernetes.io/arch
operator: In
values: ["amd64"]
- key: kubernetes.io/os
operator: In
values: ["linux"]
- key: karpenter.sh/capacity-type
operator: In
values: ["on-demand"]
- key: karpenter.k8s.aws/instance-category
operator: In
values: ["c", "m", "r"]
- key: karpenter.k8s.aws/instance-generation
operator: Gt
values: ["2"]
nodeClassRef:
name: default
limits:
cpu: 1000
disruption:
consolidationPolicy: WhenUnderutilized
expireAfter: 24h
YAML
depends_on = [
helm_release.karpenter
]
}
resource "kubectl_manifest" "ec2nodeclass" {
yaml_body = <<-YAML
apiVersion: karpenter.k8s.aws/v1beta1
kind: EC2NodeClass
metadata:
name: default
spec:
amiFamily: AL2
instanceProfile: ${var.workspace}-KarpenterNodeInstanceProfile
subnetSelectorTerms:
- tags:
karpenter.sh/discovery: "${var.workspace}"
securityGroupSelectorTerms:
- tags:
karpenter.sh/discovery: "${var.workspace}"
amiSelectorTerms:
- id: "${var.default_ami_id}"
blockDeviceMappings:
- deviceName: ${var.ebs_device_name}
ebs:
volumeSize: ${var.ebs_volume_size}
volumeType: ${var.ebs_volume_type}
encrypted: true
deleteOnTermination: true
tags: ${jsonencode(merge(data.aws_default_tags.current.tags, {"Name" = "${var.workspace}-Karpenter-autoscaled-node"}))}
YAML
depends_on = [
helm_release.karpenter
]
}
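For completeness, a minimal sketch of how the instance profile referenced above could be created separately (the node role name and the exact set of managed policies are assumptions, not taken verbatim from this thread):

# Node role assumed by the EC2 instances that Karpenter launches
# (distinct from the Karpenter controller role).
resource "aws_iam_role" "karpenter_node" {
  name = "${var.workspace}-karpenter-node"   # assumed name
  assume_role_policy = jsonencode({
    Version = "2012-10-17"
    Statement = [{
      Effect    = "Allow"
      Action    = "sts:AssumeRole"
      Principal = { Service = "ec2.amazonaws.com" }
    }]
  })
}

# Managed policies commonly attached to EKS worker nodes.
resource "aws_iam_role_policy_attachment" "karpenter_node" {
  for_each = toset([
    "arn:aws:iam::aws:policy/AmazonEKSWorkerNodePolicy",
    "arn:aws:iam::aws:policy/AmazonEKS_CNI_Policy",
    "arn:aws:iam::aws:policy/AmazonEC2ContainerRegistryReadOnly",
    "arn:aws:iam::aws:policy/AmazonSSMManagedInstanceCore",
  ])
  role       = aws_iam_role.karpenter_node.name
  policy_arn = each.value
}

# The instance profile referenced by spec.instanceProfile in the EC2NodeClass above.
resource "aws_iam_instance_profile" "karpenter" {
  name = "${var.workspace}-KarpenterNodeInstanceProfile"
  role = aws_iam_role.karpenter_node.name
}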
Do you know if the EC2NodeClass that you were referencing was properly resolving the subnets or security groups that you were specifying by checking the status?
Yes, the SGs and subnets were properly set up.
Description
Observed Behavior: Karpenter pod logs:
Could not schedule pod, incompatible with nodepool "np-244-nodepool", daemonset overhead={"cpu":"780m","memory":"1120Mi","pods":"6"}, no instance type satisfied resources {"cpu":"6780m","memory":"11360Mi","pods":"7"} and requirements karpenter.k8s.aws/instance-category In [c m r], karpenter.k8s.aws/instance-cpu In [16 32 36 4 48 and 1 others], karpenter.sh/capacity-type In [on-demand], karpenter.sh/nodepool In [np-244-nodepool], kubernetes.io/arch In [amd64], topology.kubernetes.io/zone In [us-east-1a us-east-1b us-east-1c] (no instance type has enough resources)
I tried to remove everything from the requirements to make the NodePool as flexible as possible, but I got the same error:
Could not schedule pod, incompatible with nodepool "np-244-nodepool", daemonset overhead={"cpu":"780m","memory":"1120Mi","pods":"6"}, no instance type satisfied resources (no instance type has enough resources)
Expected Behavior: Karpenter scales nodes dynamically regardless of the workload.
Reproduction Steps (Please include YAML):
`resource "helm_release" "karpenter" { namespace = "karpenter" create_namespace = true
name = "karpenter" repository = "oci://public.ecr.aws/karpenter" chart = "karpenter" version = "v0.33.2"
wait = true
set { name = "serviceAccount.annotations.eks\.amazonaws\.com/role-arn" value = var.karpenter_controller_arn }
set { name = "settings.clusterName" value = var.eks_name }
set { name = "settings.clusterEndpoint" value = var.cluster_endpoint }
set { name = "settings.defaultInstanceProfile" value = "np-244-KarpenterNodeInstanceProfile" }
set { name = "logLevel" value = "debug" } }
resource "kubectl_manifest" "nodepool" { yaml_body = <<-YAML apiVersion: karpenter.sh/v1beta1 kind: NodePool metadata: name: ${var.workspace}-nodepool spec: template: spec: requirements:
resource "kubectl_manifest" "ec2nodeclass" { yaml_body = <<-YAML apiVersion: karpenter.k8s.aws/v1beta1 kind: EC2NodeClass metadata: name: ${var.workspace}-node-class spec: amiFamily: "AL2" role: "${var.workspace}-karpenter-controller" subnetSelectorTerms:
IAM configurations:
data "aws_iam_policy_document" "karpenter_controller_assume_role_policy" { statement { actions = ["sts:AssumeRoleWithWebIdentity"] effect = "Allow"
} }
resource "aws_iam_policy" "karpenter_policy" { name = "${var.workspace}-KarpenterPolicy" path = "/" description = "Policy for Karpenter"
policy = <<EOF { "Version": "2012-10-17", "Statement": [ { "Sid": "KarpenterInstanceProfileManagement", "Effect": "Allow", "Action": [ "iam:CreateInstanceProfile", "iam:AddRoleToInstanceProfile", "iam:RemoveRoleFromInstanceProfile", "iam:PassRole", "iam:GetInstanceProfile", "iam:TagInstanceProfile" ], "Resource": "" }, { "Sid": "KarpenterEC2Actions", "Effect": "Allow", "Action": [ "ec2:RunInstances", "ec2:DescribeSubnets", "ec2:DescribeSpotPriceHistory", "ec2:DescribeSecurityGroups", "ec2:DescribeLaunchTemplates", "ec2:DescribeInstances", "ec2:DescribeInstanceTypes", "ec2:DescribeInstanceTypeOfferings", "ec2:DescribeAvailabilityZones", "ec2:DescribeImages", "ec2:DeleteLaunchTemplate", "ec2:CreateTags", "ec2:CreateLaunchTemplate", "ec2:CreateFleet", "ssm:GetParameter", "pricing:GetProducts" ], "Resource": "" }, { "Sid": "ConditionalEC2Termination", "Effect": "Allow", "Action": "ec2:TerminateInstances", "Resource": "", "Condition": { "StringLike": { "ec2:ResourceTag/Name": "karpenter*" } } } ] } EOF }
resource "aws_iam_role" "karpenter_controller" { assume_role_policy = data.aws_iam_policy_document.karpenter_controller_assume_role_policy.json name = "${var.workspace}-karpenter-controller" }
resource "aws_iam_policy" "karpenter_controller" { policy = aws_iam_policy.karpenter_policy.policy name = "${var.workspace}-karpenter-controller" }
resource "aws_iam_role_policy_attachment" "karpenter_controller_attach" { role = aws_iam_role.karpenter_controller.name policy_arn = aws_iam_policy.karpenter_controller.arn }
resource "aws_iam_instance_profile" "karpenter" { name = "${var.workspace}-KarpenterNodeInstanceProfile" role = aws_iam_role.kubernetes-worker-role.name }`
Versions:
Karpenter: v0.33.2
Kubernetes Version: 1.28