Open sandeepkp1175 opened 2 months ago
We try to discover the CLUSTER_ENDPOINT, if the value is not set, by making an eks:DescribeCluster call. You can set the value explicitly so that it points to your API server endpoint.
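For example, if you deploy via the Karpenter Helm chart, the endpoint can be passed through values. This is a sketch only; the key names assume a recent chart version's `settings` block, and the cluster name/endpoint values are illustrative placeholders:

```yaml
# values.yaml sketch — verify key names against your chart version.
settings:
  clusterName: my-cluster  # hypothetical cluster name
  clusterEndpoint: https://api.my-cluster.example.com:6443  # your API server URL
```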
Hi @jigisha620 Thanks for figuring out the issue. I added CLUSTER_ENDPOINT as an environment variable. After adding the value, when the Karpenter pods try to come up they throw the error below. It looks like the application is looking for an AWS SQS queue. Please see the error below.
```
panic: fetching queue url, AWS.SimpleQueueService.NonExistentQueue: The specified queue does not exist.

goroutine 1 [running]:
github.com/samber/lo.must({0x2656020, 0xc0005e90e0}, {0x0, 0x0, 0x0})
	github.com/samber/lo@v1.39.0/errors.go:53 +0x1e9
github.com/samber/lo.Must...
	github.com/samber/lo@v1.39.0/errors.go:65
github.com/aws/karpenter-provider-aws/pkg/controllers.NewControllers({0x317b8f8, 0xc000770e10}, 0x7fa366877228?, {0x317f728, 0x48ef8e0}, {0x318b680?, 0xc000b56750}, {0x314e820?, 0xc0007ca990?}, 0xc0009e2f90, ...)
	github.com/aws/karpenter-provider-aws/pkg/controllers/controllers.go:60 +0x525
main.main()
	github.com/aws/karpenter-provider-aws/cmd/controller/main.go:55 +0x63e
```
You may want to set the value for INTERRUPTION_QUEUE. You can find more details here. An important thing to note: Karpenter watches an SQS queue that receives critical events from AWS services which may affect your nodes. Karpenter requires that an SQS queue be provisioned, and that EventBridge rules and targets be added to forward interruption events from AWS services to that queue.
If you haven't created one already, you can follow the steps in the getting-started guide. You may not need everything that is created as part of the CloudFormation stack, but you can follow the steps there to create the queue.
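For reference, the queue-and-rule portion of that stack can be sketched roughly as follows. This is loosely modeled on the getting-started CloudFormation template; the resource names and retention period here are illustrative, and the real stack wires up several more EventBridge rules:

```yaml
# Sketch: SQS queue plus one EventBridge rule forwarding Spot interruption
# warnings. Names are illustrative; see the getting-started stack for the
# complete set of rules.
Resources:
  KarpenterInterruptionQueue:
    Type: AWS::SQS::Queue
    Properties:
      QueueName: my-cluster  # Karpenter expects this *name* in INTERRUPTION_QUEUE
      MessageRetentionPeriod: 300
  SpotInterruptionRule:
    Type: AWS::Events::Rule
    Properties:
      EventPattern:
        source: ["aws.ec2"]
        detail-type: ["EC2 Spot Instance Interruption Warning"]
      Targets:
        - Id: KarpenterInterruptionQueueTarget
          Arn: !GetAtt KarpenterInterruptionQueue.Arn
```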
Thank you @jigisha620 for the response. I'll give it a try and let you know.
Hello @jigisha620
We still see the same error: `panic: fetching queue url, AWS.SimpleQueueService.NonExistentQueue: The specified queue does not exist.` We added INTERRUPTION_QUEUE as an environment variable. Please see the error below.

```
goroutine 1 [running]:
github.com/samber/lo.must({0x2656020, 0xc0001be9a0}, {0x0, 0x0, 0x0})
	github.com/samber/lo@v1.39.0/errors.go:53 +0x1e9
github.com/samber/lo.Must...
	github.com/samber/lo@v1.39.0/errors.go:65
github.com/aws/karpenter-provider-aws/pkg/controllers.NewControllers({0x317b8f8, 0xc000712000}, 0x7fae3d8b3228?, {0x317f728, 0x48ef8e0}, {0x318b680?, 0xc00081cab0}, {0x314e820?, 0xc000010ff0?}, 0xc0004b6c40, ...)
	github.com/aws/karpenter-provider-aws/pkg/controllers/controllers.go:60 +0x525
main.main()
	github.com/aws/karpenter-provider-aws/cmd/controller/main.go:55 +0x63e
```
Can you validate that the queue exists and that it is in the correct region? Can you share the output of `kubectl describe pod <karpenter_pod>`?
Hi @jigisha620
Please see the describe pod output below:
```
Name:                 karpenter-bd58b9c96-pt5d6
Namespace:            kube-system
Priority:             2000000000
Priority Class Name:  system-cluster-critical
Service Account:      karpenter
Node:                 node-ip.ec2.internal/node-ip
Start Time:           Mon, 15 Apr 2024 14:43:18 -0500
Labels:               app.kubernetes.io/instance=karpenter
                      app.kubernetes.io/name=karpenter
                      pod-template-hash=bd58b9c96
Annotations:          cni.projectcalico.org/containerID: e27277da87db35e00056b649ef81c2b615ed18e4382f43b554e4a1b90d526434
                      cni.projectcalico.org/podIP:

Warning  Unhealthy  3h57m                  kubelet  Readiness probe failed: Get "http://pod-ip:8081/readyz": read tcp node-ip:44196->pod-ip:8081: read: connection reset by peer
Warning  Unhealthy  82m                    kubelet  Readiness probe failed: Get "http://pod-ip:8081/readyz": read tcp node-ip:36590->pod-ip:8081: read: connection reset by peer
Normal   Pulled     77m (x211 over 19h)    kubelet  Container image "public.ecr.aws/karpenter/controller:0.35.4@sha256:27a73db80b78e523370bcca77418f6d2136eea10a99fc87d02d2df059fcf5fb7" already present on machine
Warning  BackOff    2m8s (x5295 over 19h)  kubelet  Back-off restarting failed container
```
@sandeepkp1175 From this discussion, it's unclear to me whether you want to enable the interruption queue or not. If you do, then you need to create the SQS queue and set the INTERRUPTION_QUEUE environment variable to the queue name. If you don't want to enable it, then you need to unset this value when deploying Karpenter.
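With the Helm chart, this corresponds roughly to setting or omitting a single value. A sketch, assuming a recent chart version's `settings` block and a hypothetical queue name:

```yaml
# Enable interruption handling: set the queue *name* (not the ARN).
settings:
  interruptionQueue: Karpenter-my-cluster  # hypothetical queue name
# To disable interruption handling, leave settings.interruptionQueue unset.
```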
@jonathan-innis I have already created the queue and passed the name of the queue in the environment variable. Even after creating the queue, we still see that the queue is not being detected in the Karpenter pod error logs. @jigisha620 asked me to provide the pod output, which I've provided in the response above. Attaching it again.
@jigisha620 / @jonathan-innis please let me know if you got a chance to look at the issue.
From looking at your configuration above, it looks like the interruption queue that you are specifying is a full ARN, but Karpenter expects just the name of the interruption queue for this setting. I'll admit that this isn't clear in https://karpenter.sh/docs/reference/settings/#:~:text=health%20(default%20%3D%208081)-,INTERRUPTION_QUEUE,-%2D%2Dinterruption%2Dqueue so regardless, we should improve this description so it's clearer in our reference docs.
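To illustrate the difference: an SQS ARN has the form `arn:aws:sqs:<region>:<account-id>:<queue-name>`, and the setting wants only the last segment. A hypothetical helper (not part of Karpenter) that normalizes either form:

```python
def queue_name_from_arn(value: str) -> str:
    """Return the SQS queue name, whether given a full ARN or already a name.

    SQS ARNs look like arn:aws:sqs:<region>:<account-id>:<queue-name>;
    Karpenter's INTERRUPTION_QUEUE setting wants only <queue-name>.
    """
    if value.startswith("arn:"):
        # The queue name is everything after the last colon.
        return value.rsplit(":", 1)[-1]
    return value

print(queue_name_from_arn("arn:aws:sqs:us-east-1:123456789012:Karpenter-my-cluster"))
# → Karpenter-my-cluster
```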
Hi @jonathan-innis @jigisha620
We are able to bring up the Karpenter pods by providing only the queue name, instead of the ARN, in the INTERRUPTION_QUEUE environment variable. However, the Karpenter application fails when it calls pricing:GetProducts. Please see the error below.
```
{"level":"INFO","time":"2024-04-18T20:25:20.422Z","logger":"controller","message":"Starting workers","commit":"17dd42b","controller":"state.nodeclaim","controllerGroup":"karpenter.sh","controllerKind":"NodeClaim","worker count":10}
{"level":"INFO","time":"2024-04-18T20:25:20.423Z","logger":"controller","message":"Starting workers","commit":"17dd42b","controller":"nodeclaim.lifecycle","controllerGroup":"karpenter.sh","controllerKind":"NodeClaim","worker count":1000}
{"level":"INFO","time":"2024-04-18T20:25:20.427Z","logger":"controller","message":"Starting workers","commit":"17dd42b","controller":"nodeclaim.tagging","controllerGroup":"karpenter.sh","controllerKind":"NodeClaim","worker count":1}
{"level":"INFO","time":"2024-04-18T20:25:20.433Z","logger":"controller","message":"Starting workers","commit":"17dd42b","controller":"nodeclaim.consistency","controllerGroup":"karpenter.sh","controllerKind":"NodeClaim","worker count":10}
{"level":"INFO","time":"2024-04-18T20:25:20.433Z","logger":"controller","message":"Starting workers","commit":"17dd42b","controller":"nodeclaim.termination","controllerGroup":"karpenter.sh","controllerKind":"NodeClaim","worker count":100}
{"level":"INFO","time":"2024-04-18T20:25:20.434Z","logger":"controller","message":"Starting workers","commit":"17dd42b","controller":"nodeclaim.disruption","controllerGroup":"karpenter.sh","controllerKind":"NodeClaim","worker count":10}
{"level":"INFO","time":"2024-04-18T20:25:20.434Z","logger":"controller","message":"Starting workers","commit":"17dd42b","controller":"nodepool.counter","controllerGroup":"karpenter.sh","controllerKind":"NodePool","worker count":10}
{"level":"INFO","time":"2024-04-18T20:25:20.445Z","logger":"controller","message":"Starting workers","commit":"17dd42b","controller":"nodeclass","controllerGroup":"karpenter.k8s.aws","controllerKind":"EC2NodeClass","worker count":10}
{"level":"INFO","time":"2024-04-18T20:25:20.458Z","logger":"controller","message":"Starting workers","commit":"17dd42b","controller":"nodepool.hash","controllerGroup":"karpenter.sh","controllerKind":"NodePool","worker count":10}
{"level":"ERROR","time":"2024-04-18T20:25:21.094Z","logger":"controller.pricing","message":"retreiving on-demand pricing data, AccessDeniedException: User: arn:aws:sts::acct-id:assumed-role/worker-iam-role/instance-id is not authorized to perform: pricing:GetProducts because no identity-based policy allows the pricing:GetProducts action; AccessDeniedException: User: arn:aws:sts::acct-id:assumed-role/worker-iam-role/instance-id is not authorized to perform: pricing:GetProducts because no identity-based policy allows the pricing:GetProducts action","commit":"17dd42b"}
```
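A minimal identity-based policy statement covering this call might look like the following sketch (the getting-started controller policy includes this action among many others; pricing:GetProducts does not support resource-level restrictions, hence `"Resource": "*"`):

```json
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Sid": "KarpenterPricing",
      "Effect": "Allow",
      "Action": "pricing:GetProducts",
      "Resource": "*"
    }
  ]
}
```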
Our workers did not have access to this action before. After adding a policy containing the required action, we are now seeing the error below in the Karpenter pod.
```
{"level":"INFO","time":"2024-04-19T18:59:48.901Z","logger":"controller","message":"Starting workers","commit":"17dd42b","controller":"state.nodepool","controllerGroup":"karpenter.sh","controllerKind":"NodePool","worker count":10}
{"level":"INFO","time":"2024-04-19T18:59:48.901Z","logger":"controller","message":"Starting workers","commit":"17dd42b","controller":"state.daemonset","controllerGroup":"apps","controllerKind":"DaemonSet","worker count":10}
{"level":"ERROR","time":"2024-04-19T18:59:49.466Z","logger":"controller.pricing","message":"retreiving on-demand pricing data, AccessDeniedException: User: arn:aws:sts::aws-account-id:assumed-role/worker-iam-role/instance-id is not authorized to perform: pricing:GetProducts because no service control policy allows the pricing:GetProducts action; AccessDeniedException: User: arn:aws:sts::aws-account-id:assumed-role/worker-iam-role/instance-id is not authorized to perform: pricing:GetProducts because no service control policy allows the pricing:GetProducts action","commit":"17dd42b"}
```
We are planning to modify the SCP. Could you please let us know if there are any other permissions we need to grant to have it running successfully? Thank you for the guidance.
@sandeepkp1175 Hard to know, since permissions tend to be a trial-and-error game, assuming that you have configured your role consistently with what's in the "Getting Started" guide. Try getting the SCP unblocked and then come back if you are still having issues getting Karpenter up and running with permissions.
Thank you @jonathan-innis. I've requested the SCP unblock and will let you know if I hit another blocker.
This issue has been inactive for 14 days. StaleBot will close this stale issue after 14 more days of inactivity.
Can we please consider reopening this one?
At my current company we'd really appreciate OpenShift support for Karpenter, as we feel this combination of two great projects might be fruitful and may close some of our crucial infrastructure needs.
I'm also for reopening this issue.
OpenShift offers support for `kind: "ClusterAutoscaler"` since at least v4.9.
Would it not be easier for Karpenter to dynamically generate that resource and apply it via the apiserver?
@jigisha620 @jonathan-innis
We are seeing the below error while trying to run Karpenter in an OpenShift cluster. Could you please advise? Also, please help re-open the GitHub issue.
```
{"level":"ERROR","time":"2024-05-28T20:49:19.963Z","logger":"controller.disruption","message":"listing instance types for default, resolving node class, EC2NodeClass.karpenter.k8s.aws \"default\" not found","commit":"17dd42b"}
{"level":"ERROR","time":"2024-05-28T20:49:25.665Z","logger":"controller.provisioner","message":"skipping, unable to resolve instance types, resolving node class, EC2NodeClass.karpenter.k8s.aws \"default\" not found","commit":"17dd42b","nodepool":"default"}
{"level":"INFO","time":"2024-05-28T20:49:25.812Z","logger":"controller.provisioner","message":"found provisionable pod(s)","commit":"17dd42b","pods":"podname/model-d4dnm","duration":"162.449739ms"}
{"level":"ERROR","time":"2024-05-28T20:49:25.812Z","logger":"controller.provisioner","message":"Could not schedule pod, all available instance types exceed limits for nodepool: \"default\"","commit":"17dd42b","pod":"podname/model-d4dnm"}
{"level":"ERROR","time":"2024-05-28T20:49:27.384Z","logger":"controller.provisioner","message":"skipping, unable to resolve instance types, resolving node class, EC2NodeClass.karpenter.k8s.aws \"default\" not found","commit":"17dd42b","nodepool":"default"}
```
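For context, an `EC2NodeClass ... "default" not found` error generally means the NodePool's `nodeClassRef` points at an EC2NodeClass the controller cannot find. A minimal matching pair, sketched with v1beta1 field names and illustrative values (role, tags, and AMI family are placeholders):

```yaml
apiVersion: karpenter.sh/v1beta1
kind: NodePool
metadata:
  name: default
spec:
  template:
    spec:
      nodeClassRef:
        apiVersion: karpenter.k8s.aws/v1beta1
        kind: EC2NodeClass
        name: default  # must match an existing EC2NodeClass
---
apiVersion: karpenter.k8s.aws/v1beta1
kind: EC2NodeClass
metadata:
  name: default
spec:
  amiFamily: AL2  # illustrative
  role: KarpenterNodeRole-my-cluster  # illustrative
  subnetSelectorTerms:
    - tags:
        karpenter.sh/discovery: my-cluster
  securityGroupSelectorTerms:
    - tags:
        karpenter.sh/discovery: my-cluster
```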
Also, we want to know what will happen to the existing workloads running on the nodes provisioned by the OpenShift autoscaler.
@sandeepkp1175 Do you have any EC2NodeClasses defined in the cluster?
Hello @engedaam
We have the following EC2NodeClass defined in the cluster.

```yaml
apiVersion: karpenter.k8s.aws/v1beta1
kind: EC2NodeClass
metadata:
  annotations:
    karpenter.k8s.aws/ec2nodeclass-hash: '12705821474257842030'
    karpenter.k8s.aws/ec2nodeclass-hash-version: v1
  creationTimestamp: '2024-05-29T21:02:26Z'
  finalizers:
```
The NodePool YAML is below. Also, kindly let us know what will happen to the existing nodes once Karpenter starts. Will they continue to run?

```yaml
apiVersion: karpenter.sh/v1beta1
kind: NodePool
metadata:
  annotations:
    karpenter.sh/nodepool-hash: '10536667676019604368'
    karpenter.sh/nodepool-hash-version: v1
```
@sandeepkp1175 Are you able to see a status section for the EC2NodeClass? It does not seem to be included
Yes, we are able to see the status section. Please see below:

```yaml
status:
  amis:
```
Hi @jonathan-innis @jigisha620 @engedaam
Any thoughts on the above issue?
@jonathan-innis @jigisha620 Any findings on this issue?
This issue has been inactive for 14 days. StaleBot will close this stale issue after 14 more days of inactivity.
Can we remove the stale lifecycle tag?
Description
What problem are you trying to solve? We are trying to adopt Karpenter on Red Hat OpenShift version 4.12.27. The backend master and worker nodes are EC2 instances. But while doing so, we are getting the below error:

```
{"level":"FATAL","time":"2024-04-11T15:27:17.626Z","logger":"controller","message":"unable to detect the cluster endpoint, failed to resolve cluster endpoint, AccessDeniedException: User: arn:aws:sts:::assumed-role//i- is not authorized to perform: eks:DescribeCluster on resource: arn:aws:eks:us-east-1::cluster/","commit":"17dd42b"}
```
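On a non-EKS cluster there is no EKS cluster for this lookup to describe, so the practical route is to set the cluster name and endpoint explicitly, which makes Karpenter skip the eks:DescribeCluster call. A sketch of the container environment on the Karpenter deployment, with illustrative values:

```yaml
# Deployment env sketch: point Karpenter at the API server directly.
# Values are placeholders for your cluster.
env:
  - name: CLUSTER_NAME
    value: my-openshift-cluster
  - name: CLUSTER_ENDPOINT
    value: https://api.my-openshift-cluster.example.com:6443
```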
How important is this feature to you?
We run a variety of workloads in the cluster, and for each type of workload we have to specify the machine type. By adopting Karpenter, we will let Karpenter choose the node type during autoscaling based on the workloads. It will also reduce the turnaround time of pods awaiting scheduling by allocating enough compute.