bribroder closed this issue 7 months ago
Are any NodeClaims created? If so, could you check the providerID in the NodeClaim status, check that a matching instance exists in the EC2 console, and share the UserData for the EC2 instance? Could you also share Karpenter logs and your pod / nodepool specs?
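If it helps, here's roughly how to pull those details. This is just a sketch: the instance ID is a placeholder, and it assumes the AWS CLI is pointed at the right account and region.
# List NodeClaims and their provider IDs (the instance ID is the last path segment)
kubectl get nodeclaims -o jsonpath='{range .items[*]}{.metadata.name}{"\t"}{.status.providerID}{"\n"}{end}'

# Fetch and decode the UserData Karpenter rendered for that instance
aws ec2 describe-instance-attribute \
  --instance-id i-0123456789abcdef0 \
  --attribute userData \
  --query 'UserData.Value' --output text | base64 -d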
I do get a nodeclaim and the instance starts up with this userdata:
MIME-Version: 1.0
Content-Type: multipart/mixed; boundary="//"

--//
Content-Type: application/node.eks.aws

# Karpenter Generated NodeConfig
apiVersion: node.eks.aws/v1alpha1
kind: NodeConfig
metadata:
  creationTimestamp: null
spec:
  cluster:
    apiServerEndpoint: https://asdf.gr1.us-west-1.eks.amazonaws.com
    certificateAuthority: a1s2d3f4g5h6j7k8l9...
    cidr: 10.100.0.0/16
    name: demo-1
  containerd: {}
  instance:
    localStorage: {}
  kubelet:
    config:
      clusterDNS:
        - 10.100.0.10
      maxPods: 58
    flags:
      - --node-labels="karpenter.sh/capacity-type=on-demand,karpenter.sh/nodepool=demo-1-default,project=test"
--//--
nodepool spec:
apiVersion: karpenter.sh/v1beta1
kind: NodePool
metadata:
  name: demo-1-default
spec:
  template:
    metadata:
      labels:
        project: demo
    spec:
      nodeClassRef:
        name: default
      requirements:
        - key: "karpenter.k8s.aws/instance-category"
          operator: In
          values: ["c", "m", "r"]
        - key: "karpenter.k8s.aws/instance-cpu"
          operator: In
          values: ["4", "8", "16", "32", "48", "64"]
        - key: "karpenter.k8s.aws/instance-memory"
          operator: Gt
          values: ["4096"]
        - key: "karpenter.k8s.aws/instance-hypervisor"
          operator: In
          values: ["nitro"]
        - key: "karpenter.k8s.aws/instance-generation"
          operator: Gt
          values: ["4"]
        - key: "kubernetes.io/arch"
          operator: In
          values: ["amd64"]
        - key: "karpenter.sh/capacity-type"
          operator: In
          values: ["on-demand"]
logs:
{"level":"DEBUG","time":"2024-03-06T18:51:18.110Z","logger":"controller.provisioner","message":"discovered instance types","commit":"2c8f2a5","count":761}
{"level":"DEBUG","time":"2024-03-06T18:51:18.436Z","logger":"controller.provisioner","message":"adding requirements derived from pod volumes, [{topology.ebs.csi.aws.com/zone In [us-west-1f]}]","commit":"2c8f2a5","pod":"prometheus/prometheus-server-5f49566bd8-hppgh"}
{"level":"DEBUG","time":"2024-03-06T18:51:18.436Z","logger":"controller.provisioner","message":"adding requirements derived from pod volumes, [{topology.ebs.csi.aws.com/zone In [us-west-1f]}]","commit":"2c8f2a5","pod":"prometheus/prometheus-alertmanager-0"}
{"level":"DEBUG","time":"2024-03-06T18:51:18.438Z","logger":"controller.provisioner","message":"27 out of 761 instance types were excluded because they would breach limits","commit":"2c8f2a5","nodepool":"demo-default"}
{"level":"INFO","time":"2024-03-06T18:51:18.508Z","logger":"controller.provisioner","message":"found provisionable pod(s)","commit":"2c8f2a5","pods":"kube-system/metrics-server-5dc9dbbd5b-vd9h5, opencost/opencost-784779fc7b-hrwnp, prometheus/prometheus-server-5f49566bd8-hppgh, prometheus/prometheus-prometheus-pushgateway-8647d94cf6-tthtm, prometheus/prometheus-kube-state-metrics-7687f6ddbd-2kpdh and 3 other(s)","duration":"1.566892916s"}
{"level":"INFO","time":"2024-03-06T18:51:18.508Z","logger":"controller.provisioner","message":"computed new nodeclaim(s) to fit pod(s)","commit":"2c8f2a5","nodeclaims":1,"pods":8}
{"level":"INFO","time":"2024-03-06T18:51:18.522Z","logger":"controller.provisioner","message":"created nodeclaim","commit":"2c8f2a5","nodepool":"demo-default","nodeclaim":"demo-default-hjxsb","requests":{"cpu":"330m","memory":"490Mi","pods":"15"},"instance-types":"c5.2xlarge, c5.4xlarge, c5.xlarge, c5a.2xlarge, c5a.4xlarge and 95 other(s)"}
{"level":"DEBUG","time":"2024-03-06T18:51:18.723Z","logger":"controller.nodeclaim.lifecycle","message":"created launch template","commit":"2c8f2a5","nodeclaim":"demo-default-hjxsb","launch-template-name":"karpenter.k8s.aws/18177513142468916090","id":"lt-053c20e545891b82f"}
{"level":"DEBUG","time":"2024-03-06T18:51:18.884Z","logger":"controller.nodeclaim.lifecycle","message":"created launch template","commit":"2c8f2a5","nodeclaim":"demo-default-hjxsb","launch-template-name":"karpenter.k8s.aws/5054466405883801158","id":"lt-0f9430618b98eaa61"}
{"level":"INFO","time":"2024-03-06T18:51:21.497Z","logger":"controller.nodeclaim.lifecycle","message":"launched nodeclaim","commit":"2c8f2a5","nodeclaim":"demo-default-hjxsb","provider-id":"aws:///us-west-1f/i-057c9a55ef","instance-type":"c5a.xlarge","zone":"us-west-1f","capacity-type":"on-demand","allocatable":{"cpu":"3920m","ephemeral-storage":"493706490675", "memory":"6584Mi","pods":"58","vpc.amazonaws.com/pod-eni":"18"}}
{"level":"DEBUG","time":"2024-03-06T18:51:26.957Z","logger":"controller.provisioner","message":"adding requirements derived from pod volumes, [{topology.ebs.csi.aws.com/zone In [us-west-1f]}]","commit":"2c8f2a5","pod":"prometheus/prometheus-server-5f49566bd8-hppgh"}
{"level":"INFO","time":"2024-03-06T18:51:26.958Z","logger":"controller.provisioner","message":"found provisionable pod(s)","commit":"2c8f2a5","pods":"kube-system/metrics-server-5dc9dbbd5b-vd9h5, opencost/opencost-784779fc7b-hrwnp, prometheus/prometheus-server-5f49566bd8-hppgh, prometheus/prometheus-prometheus-pushgateway-8647d94cf6-tthtm, prometheus/prometheus-kube-state-metrics-7687f6ddbd-2kpdh and 3 other(s)","duration":"15.687819ms"}
My EC2 Node Class shows status:
amis:
  - id: ami-0552b3e5085247f36
    name: amazon-eks-node-al2023-x86_64-standard-1.29-v20240227
    requirements:
      - key: kubernetes.io/arch
        operator: In
        values:
          - amd64
  - id: ami-0e3a77bda9dcc7add
    name: amazon-eks-node-al2023-arm64-standard-1.29-v20240227
    requirements:
      - key: kubernetes.io/arch
        operator: In
        values:
          - arm64
(the instance in the nodeclaim uses the first one, ami-0552b3e5085247f36)
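For completeness, one way to confirm which AMI the instance actually launched with (a sketch; the instance ID is a placeholder for the one in the nodeclaim):
aws ec2 describe-instances \
  --instance-ids i-0123456789abcdef0 \
  --query 'Reservations[].Instances[].ImageId' --output text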
Everything here looks correct to me. Are you able to ssh / ssm into the EC2 instance? Would you be able to check the output from the nodeadm and kubelet logs?
sudo journalctl -u nodeadm-run
and
sudo journalctl -u kubelet
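If SSH isn't set up, an SSM session is usually the easiest way in. A sketch, assuming the node's instance profile includes the AmazonSSMManagedInstanceCore policy and the SSM agent is running (the instance ID is a placeholder):
aws ssm start-session --target i-0123456789abcdef0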
Some interesting stuff in the kubelet logs!
"Attempting to register node" node="ip-172-16-0-100.domain.com"
"Unable to register node with API server" err="nodes \"ip-172-16-0-100.domain.com\" is forbidden: node \"ip-172-16-0-100.ec2.internal\" is not allowed to modify node \"ip-172-16-0-100.domain.com\"" node="ip-172-16-0-100.domain.com"
"Eviction manager: failed to get summary stats" err="failed to get node info: node \"ip-172-16-0-100.domain.com\" not found"
Failed to contact API server when waiting for CSINode publishing: csinodes.storage.k8s.io "ip-172-16-0-100.domain.com" is forbidden: User "system:node:ip-172-16-0-100.ec2.internal" cannot get resource "csinodes" in API group "storage.k8s.io" at the cluster scope: can only access CSINode with the same name as the requesting node
We have a DHCP options set that specifies a custom domain name, which is where domain.com comes from.
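That lines up with the kubelet errors above: the OS hostname resolves under the custom domain, while the node's identity ("system:node:ip-172-16-0-100.ec2.internal") appears to follow the EC2 private DNS name. A rough way to see the three values side by side (a sketch; the instance and DHCP options IDs are placeholders):
# On the node (via SSH/SSM): the name kubelet tries to register with
hostname -f

# From the CLI: the private DNS name the node identity is based on
aws ec2 describe-instances --instance-ids i-0123456789abcdef0 \
  --query 'Reservations[].Instances[].PrivateDnsName' --output text

# The custom domain coming from the DHCP options set
aws ec2 describe-dhcp-options --dhcp-options-ids dopt-0123456789abcdef0 \
  --query 'DhcpOptions[].DhcpConfigurations'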
Similar reports: https://github.com/awslabs/amazon-eks-ami/issues/1263 https://github.com/awslabs/amazon-eks-ami/issues/1376 https://github.com/awslabs/amazon-eks-ami/issues/1457
Not sure why it only just came up; we have been on 1.28 and 1.29 for a while with this DHCP config... For comparison, the AL2 userdata looks like:
MIME-Version: 1.0
Content-Type: multipart/mixed; boundary="//"
--//
Content-Type: text/x-shellscript; charset="us-ascii"
#!/bin/bash -xe
exec > >(tee /var/log/user-data.log|logger -t user-data -s 2>/dev/console) 2>&1
/etc/eks/bootstrap.sh 'demo' --apiserver-endpoint 'https://asdf.gr1.us-west-1.eks.amazonaws.com' --b64-cluster-ca 'a1s2d3f4g5h6j7k8l9' \
--container-runtime containerd \
--dns-cluster-ip '10.100.0.10' \
--use-max-pods false \
--kubelet-extra-args '--node-labels="karpenter.sh/capacity-type=on-demand,karpenter.sh/nodepool=demo-default,project=demo" --max-pods=58'
--//--
Interesting that the same domain name works with AL2 and Ubuntu without problems. This seems like a better issue for the folks over at the EKS AMI repo if you want to open something there; it looks like there might be a regression between the AL2 and AL2023 AMIs.
Will close in favor of the other issue in amazon-eks-ami. Feel free to re-open if I've misunderstood.
Sorry guys, I know this issue is closed.. but a quick one: do we really need to specify a NodeConfig to use AL2023 with Karpenter? The docs say a CIDR must be specified:
An EC2NodeClass that uses AL2023 requires the cluster CIDR for launching nodes. Cluster CIDR will not be resolved for EC2NodeClass that doesn't use AL2023.
Cluster CIDR is resolved automatically for AL2023 NodeClasses. You don't need to supply this via UserData. That note just indicates cluster CIDR is only resolved as part of NodeClass readiness for AL2023 NodeClasses.
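So a minimal AL2023 EC2NodeClass, with no NodeConfig or CIDR in the UserData, should be enough. A sketch, where the role name and discovery tags are placeholders for whatever your cluster uses:
cat <<'EOF' | kubectl apply -f -
apiVersion: karpenter.k8s.aws/v1beta1
kind: EC2NodeClass
metadata:
  name: al2023-example
spec:
  amiFamily: AL2023
  role: KarpenterNodeRole-demo-1            # placeholder node IAM role
  subnetSelectorTerms:
    - tags:
        karpenter.sh/discovery: demo-1      # placeholder discovery tag
  securityGroupSelectorTerms:
    - tags:
        karpenter.sh/discovery: demo-1      # placeholder discovery tag
EOF
Karpenter fills in the cluster endpoint, CA, and CIDR itself when it renders the NodeConfig at launch, as in the generated UserData shown earlier in this thread.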
Description
Observed Behavior:
When my node class specifies amiFamily: AL2023, nodes never launch. I see demand on the provisioner and schedulable pods waiting, but I never get any new nodes. If I change the amiFamily to Ubuntu, or remove it so it defaults to AL2, nodes start up as usual. Changing it to a nonsense value gets the edit rejected by the cluster, and I see AL2023 in the error message's list of valid options.
Reproduction Steps (Please include YAML):
Versions:
Chart Version: karpenter-0.35.0
Kubernetes Version (kubectl version): 1.29
Please vote on this issue by adding a 👍 reaction to the original issue to help the community and maintainers prioritize this request
Please do not leave "+1" or "me too" comments, they generate extra noise for issue followers and do not help prioritize the request
If you are interested in working on this issue or have submitted a pull request, please leave a comment