Open runningman84 opened 7 months ago
@runningman84 the node should not be marked as "Ready" until the aws-node pod copies the CNI to /etc/cni/net.d/, which it does after it finishes initialization. So the scheduler should not schedule application pods on a "Not Ready" node (unless those pods tolerate the node being not ready for some reason).
The VPC CNI plugin is not able to modify node taints, as that would be a sizable security risk.
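For context on the "unless those pods tolerate the node being not ready" point above: DaemonSet pods get such tolerations by default, and any pod carrying one can land on a node before the CNI is installed. A minimal sketch (hypothetical pod name and placeholder image, not from this thread):

```yaml
# Hypothetical pod: the broad toleration lets the scheduler place it on a
# node that is still tainted node.kubernetes.io/not-ready, i.e. before the
# aws-node pod has copied the CNI config to /etc/cni/net.d/.
apiVersion: v1
kind: Pod
metadata:
  name: tolerant-example
spec:
  containers:
    - name: app
      image: nginx   # placeholder image
  tolerations:
    - key: node.kubernetes.io/not-ready
      operator: Exists   # no effect given, so it tolerates all effects
```

A pod like this can end up stuck in ContainerCreating with "failed to assign an IP address to container" until aws-node finishes initializing.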
okay what could be the reason for us seeing this behaviour?
okay what could be the reason for us seeing this behaviour?
Do you see the nodes as "Not Ready" during this time window? Do these application pods have their own tolerations?
I have double checked that ...
t = point in time... there can be seconds or even minutes between two numbers

t0 node appears as not ready
t1 daemon set pods are scheduled to it
t2 aws-node pod is in initializing
t2 kube-proxy pod is in initializing
t3 kube-proxy is ready
t4 node becomes ready (aws-node still not in running state)
t5 additional pods are scheduled to it
t6 all additional pods stay in state container creating (warnings due to failed to assign IP address to container)
t7 aws-node pod is ready
t8 new pods get ready within seconds
t9 the additional pods from t6 start to get ready
It looks like the node becomes ready too fast, without waiting for the aws-node pod...
@runningman84 can you please share the node logs during this timeline? Mainly we would need to look at the CNI and IPAMD logs in /var/log/aws-routed-eni/. You can email them to k8s-awscni-triage@amazon.com and we can take a look.
I just sent the logs and we also have an open case id: 170893944201879
Thanks @runningman84, let's work through the support case, as the support team will triage and then bring in the service team if needed.
Hi, any news on this one? We have the same issue.
The AWS support case did not really solve that; we got the suggestion to try prefix delegation mode or similar to speed up the IP allocation. The general question is: should a node be unready until aws-node is up and running? I could imagine that that behaviour would also have downsides, because images would no longer be pulled before the node is ready. The current situation is that pods might not start due to the IP address issue, but at least all images are already pulled before they eventually start fine...
I am facing the same issue on EKS 1.28.
kubelet shows ready status on the node
pods start being scheduled onto that node
a few seconds later the node goes into NetworkNotReady state; the pods above get stuck forever
a few seconds later kubelet switches back to ready state and new pods work, but the earlier ones don't
It seems I will have to make my own workaround by monitoring new nodes myself and assigning a label aged=y after they have been there for a minute, then making all my pods use a nodeAffinity looking for that label. Ideally the aws-node pods would add a label to the node themselves.
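The workaround described above could look roughly like this. The label name aged=y is taken from the comment; the labeling step itself would have to be done by your own automation (e.g. `kubectl label node <node-name> aged=y` once the node is about a minute old), and the pod below is only an illustrative sketch:

```yaml
# Hypothetical pod pinned to "aged" nodes: it will only be scheduled onto
# nodes that external automation has already labeled aged=y, which gives
# aws-node time to become ready first.
apiVersion: v1
kind: Pod
metadata:
  name: affinity-example
spec:
  affinity:
    nodeAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
        nodeSelectorTerms:
          - matchExpressions:
              - key: aged
                operator: In
                values: ["y"]
  containers:
    - name: app
      image: nginx   # placeholder image
```

The trade-off is that every workload needs this affinity, and a node sits idle for the waiting period even if aws-node becomes ready sooner.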
any ideas @jdn5126 ?
The general question is should a node be unready until aws-node is up and running?
The node should be marked as Not Ready until the aws-node pod copies the CNI configuration files to /etc/cni/net.d/, which it does after it finishes initialization.
Marking the node as schedulable (Ready or Not Ready) is done by the kubelet process. The CNI should not be tainting or untainting the node.
@tooptoop4 - did you manage to resolve this issue at your end?
@orsenthil not resolved; the node goes 1. NotReady > 2. Ready > 3. NotReady > 4. Ready, and pods are getting scheduled and stuck between 2. and 3.
This issue is stale because it has been open 60 days with no activity. Remove stale label or comment or this will be closed in 14 days
still an issue in EKS v1.29.7 with amazon-k8s-cni:v1.18.2
kubectl events shows this timeline:
default  40m  Normal   Starting                 node/redactip  kubelet, redactip              Starting kubelet.
default  40m  Warning  InvalidDiskCapacity      node/redactip  kubelet, redactip              invalid capacity 0 on image filesystem
default  40m  Normal   NodeHasSufficientMemory  node/redactip  kubelet, redactip              Node redactip status is now: NodeHasSufficientMemory (x2)
default  40m  Normal   NodeHasNoDiskPressure    node/redactip  kubelet, redactip              Node redactip status is now: NodeHasNoDiskPressure (x2)
default  40m  Normal   NodeHasSufficientPID     node/redactip  kubelet, redactip              Node redactip status is now: NodeHasSufficientPID (x2)
default  40m  Normal   NodeAllocatableEnforced  node/redactip  kubelet, redactip              Updated Node Allocatable limit across pods
default  39m  Normal   Synced                   node/redactip  cloud-node-controller          Node synced successfully
default  39m  Normal   RegisteredNode           node/redactip  node-controller                Node redactip event: Registered Node redactip in Controller
default  39m  Normal   Starting                 node/redactip  kube-proxy, kube-proxy-redactip
default  39m  Normal   NodeReady                node/redactip  kubelet, redactip              Node redactip status is now: NodeReady
default  39m  Normal   Starting                 node/redactip  kubelet, redactip              Starting kubelet.
default  39m  Warning  InvalidDiskCapacity      node/redactip  kubelet, redactip              invalid capacity 0 on image filesystem
default  39m  Normal   NodeHasSufficientMemory  node/redactip  kubelet, redactip              Node redactip status is now: NodeHasSufficientMemory (x2)
default  39m  Normal   NodeHasNoDiskPressure    node/redactip  kubelet, redactip              Node redactip status is now: NodeHasNoDiskPressure (x2)
default  39m  Normal   NodeHasSufficientPID     node/redactip  kubelet, redactip              Node redactip status is now: NodeHasSufficientPID (x2)
default  39m  Normal   NodeAllocatableEnforced  node/redactip  kubelet, redactip              Updated Node Allocatable limit across pods
I have two, somewhat opposing points:
@runningman84 @tooptoop4 Have you tried the following type of debugging to narrow down the timing of the issue?
Example 1
fields @timestamp, @message
| filter @message like /ip-<redacted>.<redacted AWS region>.compute.internal/
| sort @timestamp asc
Example 2 (remove the # to view only messages for userAgent kubelet):
fields @timestamp, @message
| filter @message like /<K8s node name>/
#| filter userAgent like /kubelet/
| filter @message like /kubelet is posting ready status/
| display @timestamp, userAgent, verb, requestURI
| sort @timestamp asc
Note: There are some CloudWatch costs associated with this.
Another ask just to be sure: have you modified kubelet maxPods in a way that it is larger than the number of IPs available for this particular instance type? See "Maximum IP addresses per network interface".
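For reference, maxPods lives in the kubelet configuration file (on Bottlerocket, /etc/kubernetes/kubelet/config, as seen in the kubelet command line later in this thread). A minimal fragment; the value 58 is only an example, the documented maximum for an m5.xlarge without prefix delegation (4 ENIs x (15 IPs - 1) + 2):

```yaml
# Kubelet config fragment. If maxPods exceeds the instance type's IP
# capacity, the scheduler can place pods that IPAMD will never be able
# to assign an IP to.
apiVersion: kubelet.config.k8s.io/v1beta1
kind: KubeletConfiguration
maxPods: 58   # example value for m5.xlarge; check your instance type
```

Comparing this value against the ENI/IP limits for your instance type rules out one whole class of "failed to assign IP address" errors.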
1. NotReady > 2. Ready > 3. NotReady > 4. Ready, and pods are getting scheduled and stuck between 2. and 3.
This is strange. Do you have any logs that indicate why the node is getting marked as NotReady from the Ready state? This is key to understanding the behavior here. The pods were probably scheduled when the node was marked Ready, but then it went NotReady for some reason.
@runningman84 @tooptoop4 if you have additional logs, please send them to k8s-awscni-triage@amazon.com.
To collect the logs, use the admin container and either chroot or sheltie to get (based on this issue here):
/var/log/aws-routed-eni/ipamd.log
/var/log/aws-routed-eni/plugin.log
@orsenthil Anything else? Please advise!
Thank you!
@youwalther65
kubelet logs for the affected node:
I0904 14:03:22.313659 2015 log.go:245] http: TLS handshake error from 127.0.0.1:57056: no serving certificate available for the kubelet
I0904 14:03:23.058421 2015 pod_startup_latency_tracker.go:102] "Observed pod startup duration" pod="auth/fluent-bit-n4cvm" podStartSLOduration=2.7230901039999997 podStartE2EDuration="6.058375092s" podCreationTimestamp="2024-09-04 14:03:17 +0000 UTC" firstStartedPulling="2024-09-04 14:03:18.740862166 +0000 UTC m=+13.605813629" lastFinishedPulling="2024-09-04 14:03:22.076147154 +0000 UTC m=+16.941098617" observedRunningTime="2024-09-04 14:03:23.056421468 +0000 UTC m=+17.921372941" watchObservedRunningTime="2024-09-04 14:03:23.058375092 +0000 UTC m=+17.923326545"
I0904 14:03:26.198922 2015 csi_plugin.go:99] kubernetes.io/csi: Trying to validate a new CSI Driver with name: ebs.csi.aws.com endpoint: /var/lib/kubelet/plugins/ebs.csi.aws.com/csi.sock versions: 1.0.0
I0904 14:03:26.199009 2015 csi_plugin.go:112] kubernetes.io/csi: Register new plugin with name: ebs.csi.aws.com at endpoint: /var/lib/kubelet/plugins/ebs.csi.aws.com/csi.sock
I0904 14:03:27.072469 2015 pod_startup_latency_tracker.go:102] "Observed pod startup duration" pod="kube-system/ebs-csi-node-lcm6l" podStartSLOduration=14.045209864 podStartE2EDuration="21.072403703s" podCreationTimestamp="2024-09-04 14:03:06 +0000 UTC" firstStartedPulling="2024-09-04 14:03:19.39328969 +0000 UTC m=+14.258241143" lastFinishedPulling="2024-09-04 14:03:26.420483529 +0000 UTC m=+21.285434982" observedRunningTime="2024-09-04 14:03:27.070570631 +0000 UTC m=+21.935522104" watchObservedRunningTime="2024-09-04 14:03:27.072403703 +0000 UTC m=+21.937355156"
I0904 14:03:27.081623 2015 topology_manager.go:215] "Topology Admit Handler" podUID="6b9f76aa-3164-4a7c-87be-25091855c916" podNamespace="auth" podName="redactmypodtooearly"
I0904 14:03:27.172891 2015 reconciler_common.go:258] "operationExecutor.VerifyControllerAttachedVolume started for volume \"var-run-argo\" (UniqueName: \"kubernetes.io/empty-dir/6b9f76aa-3164-4a7c-87be-25091855c916-var-run-argo\") pod \"redactmypodtooearly\" (UID: \"6b9f76aa-3164-4a7c-87be-25091855c916\") " pod="auth/redactmypodtooearly"
I0904 14:03:27.172985 2015 reconciler_common.go:258] "operationExecutor.VerifyControllerAttachedVolume started for volume \"redactacm-sh\" (UniqueName: \"kubernetes.io/configmap/6b9f76aa-3164-4a7c-87be-25091855c916-redactacm-sh\") pod \"redactmypodtooearly\" (UID: \"6b9f76aa-3164-4a7c-87be-25091855c916\") " pod="auth/redactmypodtooearly"
E0904 14:03:27.395878 2015 kubelet_pods.go:417] "Hostname for pod was too long, truncated it" podName="redactmypodtooearly" hostnameMaxLen=63 truncatedHostname="redactmypodtooearly"
E0904 14:03:36.349734 2015 cadvisor_stats_provider.go:501] "Partial failure issuing cadvisor.ContainerInfoV2" err="partial failures: [\"/kubepods.slice/kubepods-burstable.slice/kubepods-burstable-podfb3b1b0e_c2b1_43a9_9fb4_61f90fccbe2e.slice/cri-containerd-ec9066265bcb1a93f47607cabf6defe2be7b46dccd9620fec39caf57efe0707e.scope\": RecentStats: unable to find data in memory cache]"
I0904 14:03:37.104227 2015 pod_startup_latency_tracker.go:102] "Observed pod startup duration" pod="kube-system/node-configuration-dqmn7" podStartSLOduration=18.792128877 podStartE2EDuration="31.104161659s" podCreationTimestamp="2024-09-04 14:03:06 +0000 UTC" firstStartedPulling="2024-09-04 14:03:23.705144657 +0000 UTC m=+18.570096110" lastFinishedPulling="2024-09-04 14:03:36.017177439 +0000 UTC m=+30.882128892" observedRunningTime="2024-09-04 14:03:37.097340999 +0000 UTC m=+31.962292482" watchObservedRunningTime="2024-09-04 14:03:37.104161659 +0000 UTC m=+31.969113122"
time="2024-09-04T14:03:38Z" level=info msg="Image exists, fetching cached image from image store" ref="ecr.aws/arn:aws:ecr:ap-southeast-2:602401143452:repository/eks/pause:3.1-eksbuild.1"
time="2024-09-04T14:03:38Z" level=info msg="tagging image" img="602401143452.dkr.ecr.ap-southeast-2.amazonaws.com/eks/pause:3.1-eksbuild.1"
Flag --container-runtime-endpoint has been deprecated, This parameter should be set via the config file specified by the Kubelet's --config flag. See https://kubernetes.io/docs/tasks/administer-cluster/kubelet-config-file/ for more information.
(kubelet command line: /usr/bin/kubelet --cloud-provider external --kubeconfig /etc/kubernetes/kubelet/kubeconfig --config /etc/kubernetes/kubelet/config --container-runtime-endpoint=unix:///run/containerd/containerd.sock --containerd=/run/containerd/containerd.sock --root-dir /var/lib/kubelet --cert-dir /var/lib/kubelet/pki --image-credential-provider-bin-dir /usr/libexec/kubernetes/kubelet/plugins --image-credential-provider-config /etc/kubernetes/kubelet/credential-provider-config.yaml --hostname-override redactmyip.ap-southeast-2.compute.internal --node-ip redactmyipfmt --node-labels eks.amazonaws.com/capacityType=ON_DEMAND,eks.amazonaws.com/nodegroup=redactng,eks.amazonaws.com/nodegroup-image=redactami,eks.amazonaws.com/sourceLaunchTemplateId=redactlt,eks.amazonaws.com/sourceLaunchTemplateVersion=1,type=large --register-with-taints "" --pod-infra-container-image 602401143452.dkr.ecr.ap-southeast-2.amazonaws.com/eks/pause:3.1-eksbuild.1 --runtime-cgroups=/runtime.slice/containerd.service)
Flag --containerd has been deprecated, This is a cadvisor flag that was mistakenly registered with the Kubelet. Due to legacy concerns, it will follow the standard CLI deprecation timeline before being removed.
Flag --register-with-taints has been deprecated, This parameter should be set via the config file specified by the Kubelet's --config flag. See https://kubernetes.io/docs/tasks/administer-cluster/kubelet-config-file/ for more information.
Flag --pod-infra-container-image has been deprecated, will be removed in a future release. Image garbage collector will get sandbox image information from CRI.
I0904 14:03:38.957155 5413 server.go:204] "--pod-infra-container-image will not be pruned by the image garbage collector in kubelet and should also be set in the remote runtime"
I0904 14:03:38.962790 5413 server.go:487] "Kubelet version" kubeletVersion="v1.29.5-eks-1109419"
I0904 14:03:38.962860 5413 server.go:489] "Golang settings" GOGC="" GOMAXPROCS="" GOTRACEBACK=""
I0904 14:03:38.966945 5413 dynamic_cafile_content.go:157] "Starting controller" name="client-ca-bundle::/etc/kubernetes/pki/ca.crt"
I0904 14:03:38.983684 5413 container_manager_linux.go:265] "Container manager verified user specified cgroup-root exists" cgroupRoot=[]
I0904 14:03:38.983990 5413 container_manager_linux.go:270] "Creating Container Manager object based on Node Config" nodeConfig={"RuntimeCgroupsName":"/runtime.slice/containerd.service","SystemCgroupsName":"","KubeletCgroupsName":"","KubeletOOMScoreAdj":-999,"ContainerRuntime":"","CgroupsPerQOS":true,"CgroupRoot":"/","CgroupDriver":"systemd","KubeletRootDir":"/var/lib/kubelet","ProtectKernelDefaults":true,"KubeReservedCgroupName":"/runtime","SystemReservedCgroupName":"","ReservedSystemCPUs":{},"EnforceNodeAllocatable":{"pods":{}},"KubeReserved":{"cpu":"90m","ephemeral-storage":"1Gi","memory":"739Mi"},"SystemReserved":null,"HardEvictionThresholds":[{"Signal":"memory.available","Operator":"LessThan","Value":{"Quantity":"100Mi","Percentage":0},"GracePeriod":0,"MinReclaim":null},{"Signal":"nodefs.available","Operator":"LessThan","Value":{"Quantity":null,"Percentage":0.1},"GracePeriod":0,"MinReclaim":null},{"Signal":"nodefs.inodesFree","Operator":"LessThan","Value":{"Quantity":null,"Percentage":0.05},"GracePeriod":0,"MinReclaim":null},{"Signal":"imagefs.available","Operator":"LessThan","Value":{"Quantity":null,"Percentage":0.15},"GracePeriod":0,"MinReclaim":null}],"QOSReserved":{},"CPUManagerPolicy":"none","CPUManagerPolicyOptions":null,"TopologyManagerScope":"container","CPUManagerReconcilePeriod":10000000000,"ExperimentalMemoryManagerPolicy":"None","ExperimentalMemoryManagerReservedMemory":null,"PodPidsLimit":1048576,"EnforceCPULimits":true,"CPUCFSQuotaPeriod":100000000,"TopologyManagerPolicy":"none","TopologyManagerPolicyOptions":null}
I0904 14:03:38.984045 5413 topology_manager.go:138] "Creating topology manager with none policy"
I0904 14:03:38.984060 5413 container_manager_linux.go:301] "Creating device plugin manager"
I0904 14:03:38.984104 5413 state_mem.go:36] "Initialized new in-memory state store"
I0904 14:03:38.984248 5413 kubelet.go:396] "Attempting to sync node with API server"
I0904 14:03:38.984285 5413 kubelet.go:301] "Adding static pod path" path="/etc/kubernetes/static-pods/"
I0904 14:03:38.984350 5413 kubelet.go:312] "Adding apiserver pod source"
I0904 14:03:38.984370 5413 apiserver.go:42] "Waiting for node sync before watching apiserver pods"
I0904 14:03:38.985350 5413 kuberuntime_manager.go:258] "Container runtime initialized" containerRuntime="containerd" version="1.6.34+bottlerocket" apiVersion="v1"
I0904 14:03:38.986115 5413 certificate_store.go:130] Loading cert/key pair from "/var/lib/kubelet/pki/kubelet-server-current.pem".
I0904 14:03:38.986545 5413 kubelet.go:809] "Not starting ClusterTrustBundle informer because we are in static kubelet mode"
I0904 14:03:38.987073 5413 server.go:1256] "Started kubelet"
I0904 14:03:38.987222 5413 server.go:162] "Starting to listen" address="0.0.0.0" port=10250
I0904 14:03:38.988733 5413 ratelimit.go:55] "Setting rate limiting for endpoint" service="podresources" qps=100 burstTokens=10
I0904 14:03:38.989829 5413 server.go:233] "Starting to serve the podresources API" endpoint="unix:/var/lib/kubelet/pod-resources/kubelet.sock"
I0904 14:03:38.991602 5413 fs_resource_analyzer.go:67] "Starting FS ResourceAnalyzer"
I0904 14:03:38.992840 5413 volume_manager.go:291] "Starting Kubelet Volume Manager"
I0904 14:03:38.993006 5413 server.go:461] "Adding debug handlers to kubelet server"
I0904 14:03:38.993240 5413 desired_state_of_world_populator.go:151] "Desired state populator starts to run"
I0904 14:03:38.996540 5413 factory.go:221] Registration of the systemd container factory successfully
I0904 14:03:38.997429 5413 factory.go:219] Registration of the crio container factory failed: Get "http://%2Fvar%2Frun%2Fcrio%2Fcrio.sock/info": dial unix /var/run/crio/crio.sock: connect: no such file or directory
I0904 14:03:38.998756 5413 dynamic_cafile_content.go:171] "Shutting down controller" name="client-ca-bundle::/etc/kubernetes/pki/ca.crt"
One of my pods was scheduled onto this node by Argo Workflows at 14:03:26 and got stuck.
@tooptoop4 Thank you, I see. But we need the other logs mentioned here for this timeframe as well!
What would you like to be added: In our EKS Bottlerocket use case we see Karpenter provisioning nodes which get the aws-node pod and some application pods upon start. Unfortunately the aws-node pod takes several dozen seconds to become ready. The application pods try to start in the meantime and fail because they do not get an IP address.
Should we use karpenter to taint the nodes until the aws-node is ready? Is the cni plugin able to remove a startup taint once it is ready?
Why is this needed: Pods constantly fail during startup due to missing IP addresses because the aws-node pod is not ready.
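The startup-taint idea raised above can be sketched with Karpenter. Note the caveat from the maintainers in this thread: the VPC CNI will not remove taints itself, so the taint below (hypothetical key example.com/cni-not-ready) would need to be removed by your own controller or automation once aws-node is Running on the node:

```yaml
# Sketch of a Karpenter v1 NodePool with a startup taint. Karpenter applies
# the taint when it provisions the node; application pods without a matching
# toleration stay Pending until the taint is removed by external automation.
apiVersion: karpenter.sh/v1
kind: NodePool
metadata:
  name: default
spec:
  template:
    spec:
      startupTaints:
        - key: example.com/cni-not-ready   # hypothetical taint key
          effect: NoSchedule
```

This trades the stuck-in-ContainerCreating failure mode for pods simply waiting to be scheduled, at the cost of losing the early image pulls mentioned earlier in the thread.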