aws / amazon-vpc-cni-k8s

Networking plugin repository for pod networking in Kubernetes using Elastic Network Interfaces on AWS
Apache License 2.0

aws-node should untaint node #2808

Open runningman84 opened 7 months ago

runningman84 commented 7 months ago

What would you like to be added: In our EKS Bottlerocket use case we see Karpenter provisioning nodes which get the aws-node pod and some application pods upon start. Unfortunately, the aws-node pod takes several dozen seconds to become ready. The application pods try to start in the meantime and get errors because they do not get an IP address.

Should we use Karpenter to taint the nodes until aws-node is ready? Is the CNI plugin able to remove a startup taint once it is ready?

Why is this needed: Pods constantly fail during startup due to missing IP addresses because the aws-node pod is not ready.

jdn5126 commented 7 months ago

@runningman84 the node should not be marked as "Ready" until the aws-node pod copies the CNI to /etc/cni/net.d/, which it does after it finishes initialization. So the scheduler should not schedule application pods on a "Not Ready" node (unless those pods tolerate the node being not ready for some reason)
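
For illustration, tolerations along these lines (a hypothetical pod spec excerpt, not something the VPC CNI adds) are what would let a pod be scheduled onto a node that is still not ready:

# Hypothetical pod spec excerpt: tolerating the built-in not-ready /
# network-unavailable taints lets the scheduler place the pod before the
# CNI is up (DaemonSet pods get similar tolerations added automatically).
tolerations:
  - key: node.kubernetes.io/not-ready
    operator: Exists
  - key: node.kubernetes.io/network-unavailable
    operator: Exists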

jdn5126 commented 7 months ago

The VPC CNI plugin is not able to modify node taints, as that would be a sizable security risk

runningman84 commented 7 months ago

Okay, what could be the reason for us seeing this behaviour?

jdn5126 commented 7 months ago

> Okay, what could be the reason for us seeing this behaviour?

Do you see the nodes as "Not Ready" during this time window? Do these application pods have their own tolerations?

runningman84 commented 7 months ago

I have double-checked that ...

t = point in time... there can be seconds or even minutes between two numbers

t0: node appears as not ready
t1: daemon set pods are scheduled to it
t2: aws-node pod is initializing
t2: kube-proxy pod is initializing
t3: kube-proxy is ready
t4: node becomes ready (aws-node still not in running state)
t5: additional pods are scheduled to it
t6: all additional pods stay in state ContainerCreating (warnings due to "failed to assign IP address to container")
t7: aws-node pod is ready
t8: new pods get ready within seconds
t9: the additional pods from t6 start to get ready

It looks like the node gets ready too fast, without waiting for the aws-node pod...

jdn5126 commented 7 months ago

@runningman84 can you please share the node logs during this timeline? Mainly we would need to look at the CNI and IPAMD logs in /var/log/aws-routed-eni/. You can email them to k8s-awscni-triage@amazon.com and we can take a look.

runningman84 commented 6 months ago

I just sent the logs and we also have an open case id: 170893944201879

jdn5126 commented 6 months ago

Thanks @runningman84, let's work through the support case, as the support team will triage and then bring in the service team if needed.

mathieuherbert commented 5 months ago

Hi, any news on this one? We have the same issue.

runningman84 commented 5 months ago

The AWS support case did not really solve that; we got the suggestion to try prefix delegation mode or things like that to speed up the IP allocation. The general question is: should a node be unready until aws-node is up and running? I could imagine that behaviour would also have downsides, because images would be pulled before the node is ready. The current situation is that pods might not start due to the IP address issue, but at least all images are already pulled before they eventually start fine...
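
For context, the prefix delegation suggestion boils down to an environment variable on the aws-node DaemonSet, roughly like this excerpt (not a full manifest):

# Excerpt from the aws-node DaemonSet container spec: with prefix delegation
# the CNI attaches /28 prefixes instead of individual secondary IPs, so a
# larger pool of pod IPs becomes available sooner after the ENI is attached.
env:
  - name: ENABLE_PREFIX_DELEGATION
    value: "true"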

tooptoop4 commented 4 months ago

I am facing the same issue on EKS 1.28.

- kubelet shows Ready status on the node
- pods start being scheduled onto that node
- a few seconds later the node goes into NetworkNotReady state
- the pods above get stuck forever
- a few seconds later kubelet switches back to Ready state and new pods work, but the earlier ones don't

Seems I will have to make my own workaround by monitoring new nodes myself and assigning a label aged=y after they have been there for a minute,

then make all my pods have a nodeAffinity looking for that label.
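
Roughly what I have in mind (a sketch; the aged=y label is made up and would be applied by my own monitoring job):

# Sketch of the workaround: an external job labels nodes that have existed
# for a minute (e.g. kubectl label node <node-name> aged=y), and application
# pods require that label via node affinity so they never land too early.
affinity:
  nodeAffinity:
    requiredDuringSchedulingIgnoredDuringExecution:
      nodeSelectorTerms:
        - matchExpressions:
            - key: aged
              operator: In
              values: ["y"]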

Ideally the aws-node pods would add the label to the node themselves.

any ideas @jdn5126 ?

orsenthil commented 2 months ago

> The general question is: should a node be unready until aws-node is up and running?

The node should be marked as Not Ready until the aws-node pod copies the CNI configuration files to /etc/cni/net.d/, which it does after it finishes initialization.

Marking the node as schedulable (Ready or Not Ready) is done by the kubelet process. The CNI should not be tainting or untainting the node.

@tooptoop4 - did you manage to resolve this issue at your end?

tooptoop4 commented 2 months ago

@orsenthil not resolved. The node goes 1. NotReady -> 2. Ready -> 3. NotReady -> 4. Ready; pods are getting scheduled and stuck between 2. and 3.

github-actions[bot] commented 4 weeks ago

This issue is stale because it has been open 60 days with no activity. Remove stale label or comment or this will be closed in 14 days

tooptoop4 commented 2 weeks ago

Still an issue in EKS v1.29.7 with amazon-k8s-cni:v1.18.2.

kubectl events shows this timeline:

default 40m Normal Starting node/redactip kubelet, redactip Starting kubelet. 40m 1 redactip
default 40m Warning InvalidDiskCapacity node/redactip kubelet, redactip invalid capacity 0 on image filesystem 40m 1 redactip
default 40m Normal NodeHasSufficientMemory node/redactip kubelet, redactip Node redactip status is now: NodeHasSufficientMemory 40m 2 redactip
default 40m Normal NodeHasNoDiskPressure node/redactip kubelet, redactip Node redactip status is now: NodeHasNoDiskPressure 40m 2 redactip
default 40m Normal NodeHasSufficientPID node/redactip kubelet, redactip Node redactip status is now: NodeHasSufficientPID 40m 2 redactip
default 40m Normal NodeAllocatableEnforced node/redactip kubelet, redactip Updated Node Allocatable limit across pods 40m 1 redactip
default 39m Normal Synced node/redactip cloud-node-controller Node synced successfully 39m 1 redactip
default 39m Normal RegisteredNode node/redactip node-controller Node redactip event: Registered Node redactip in Controller 39m 1 redactip
default 39m Normal Starting node/redactip kube-proxy, kube-proxy-redactip 39m 1 redactip
default 39m Normal NodeReady node/redactip kubelet, redactip Node redactip status is now: NodeReady 39m 1 redactip
default 39m Normal Starting node/redactip kubelet, redactip Starting kubelet. 39m 1 redactip
default 39m Warning InvalidDiskCapacity node/redactip kubelet, redactip invalid capacity 0 on image filesystem 39m 1 redactip
default 39m Normal NodeHasSufficientMemory node/redactip kubelet, redactip Node redactip status is now: NodeHasSufficientMemory 39m 2 redactip
default 39m Normal NodeHasNoDiskPressure node/redactip kubelet, redactip Node redactip status is now: NodeHasNoDiskPressure 39m 2 redactip
default 39m Normal NodeHasSufficientPID node/redactip kubelet, redactip Node redactip status is now: NodeHasSufficientPID 39m 2 redactip
default 39m Normal NodeAllocatableEnforced node/redactip kubelet, redactip Updated Node Allocatable limit across pods 39m 1 redactip

youwalther65 commented 2 weeks ago

I have two, somewhat opposite, points:

  1. Other CNIs like Cilium together with Karpenter work just fine - the Karpenter NodePool adds the Cilium startupTaint and the Cilium CNI removes it after a successful start (see the sketch after this list). So why can't the AWS VPC CNI be implemented that way too? Sure, it requires some more permissions to patch the K8s Node object, so it might be a security concern.
  2. On the other side, we need to figure out what is happening here that causes pods to be scheduled / kubelet to show the node as Ready even though the AWS VPC CNI is not ready. Because this works in almost all cases, we must be hitting some kind of corner case here.
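
For reference, the Cilium pattern in a Karpenter NodePool looks roughly like this (a sketch only; the taint key is the one Cilium documents, and a hypothetical AWS VPC CNI equivalent would need the CNI to remove its own startup taint):

apiVersion: karpenter.sh/v1beta1
kind: NodePool
metadata:
  name: default
spec:
  template:
    spec:
      # Karpenter applies this taint when the node is created; the Cilium
      # agent removes it once its CNI is ready, so pods cannot land too early.
      startupTaints:
        - key: node.cilium.io/agent-not-ready
          value: "true"
          effect: NoExecute
      # requirements, nodeClassRef, etc. omitted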

youwalther65 commented 2 weeks ago

@runningman84 @tooptoop4 Have you tried the following type of debugging to narrow down the timing of the issue?

  1. Enable all EKS control plane logs
  2. Use CloudWatch Logs Insights queries against the control plane CloudWatch log group, with an appropriate timeframe, to watch the node and AWS VPC CNI pod lifecycle or other things like the kubelet posting ready status

Example 1

fields @timestamp, @message
| filter @message like /ip-<redacted>.<redacted AWS region>.compute.internal/
| sort @timestamp asc

Example 2 (remove the # to view only messages for userAgent kubelet):

fields @timestamp, @message
| filter @message like /<K8s node name>/
#| filter userAgent like /kubelet/
| filter @message like /kubelet is posting ready status/
| display @timestamp, userAgent, verb, requestURI
| sort @timestamp asc

Note: There are some CloudWatch costs associated with this.

Another ask just to be sure: Have you modified kubelet maxPods in a way that it is larger than the number of IPs for this particular instance type? See Maximum IP addresses per network interface.

orsenthil commented 2 weeks ago

> 1. NotReady -> 2. Ready -> 3. NotReady -> 4. Ready; pods are getting scheduled and stuck between 2. and 3.

This is strange. Do you have any logs that indicate why the node is getting marked as NotReady from the Ready state? This is key to understanding the behavior here. The pods were probably scheduled when the node was marked in the Ready state, but then it went to NotReady for some reason.

youwalther65 commented 2 weeks ago

@runningman84 @tooptoop4 if you have additional logs, please send them to k8s-awscni-triage@amazon.com. For collecting the logs, use the admin container and either chroot or sheltie to get them (based on this issue here).

@orsenthil Anything else? Please advise!

Thank you!

tooptoop4 commented 2 weeks ago

@youwalther65

kubelet logs for the affected node:

I0904 14:03:22.313659    2015 log.go:245] http: TLS handshake error from 127.0.0.1:57056: no serving certificate available for the kubelet"}
I0904 14:03:23.058421    2015 pod_startup_latency_tracker.go:102] \"Observed pod startup duration\" pod=\"auth/fluent-bit-n4cvm\" podStartSLOduration=2.7230901039999997 podStartE2EDuration=\"6.058375092s\" podCreationTimestamp=\"2024-09-04 14:03:17 +0000 UTC\" firstStartedPulling=\"2024-09-04 14:03:18.740862166 +0000 UTC m=+13.605813629\" lastFinishedPulling=\"2024-09-04 14:03:22.076147154 +0000 UTC m=+16.941098617\" observedRunningTime=\"2024-09-04 14:03:23.056421468 +0000 UTC m=+17.921372941\" watchObservedRunningTime=\"2024-09-04 14:03:23.058375092 +0000 UTC m=+17.923326545\""}
I0904 14:03:26.198922    2015 csi_plugin.go:99] kubernetes.io/csi: Trying to validate a new CSI Driver with name: ebs.csi.aws.com endpoint: /var/lib/kubelet/plugins/ebs.csi.aws.com/csi.sock versions: 1.0.0"}
I0904 14:03:26.199009    2015 csi_plugin.go:112] kubernetes.io/csi: Register new plugin with name: ebs.csi.aws.com at endpoint: /var/lib/kubelet/plugins/ebs.csi.aws.com/csi.sock"}
I0904 14:03:27.072469    2015 pod_startup_latency_tracker.go:102] \"Observed pod startup duration\" pod=\"kube-system/ebs-csi-node-lcm6l\" podStartSLOduration=14.045209864 podStartE2EDuration=\"21.072403703s\" podCreationTimestamp=\"2024-09-04 14:03:06 +0000 UTC\" firstStartedPulling=\"2024-09-04 14:03:19.39328969 +0000 UTC m=+14.258241143\" lastFinishedPulling=\"2024-09-04 14:03:26.420483529 +0000 UTC m=+21.285434982\" observedRunningTime=\"2024-09-04 14:03:27.070570631 +0000 UTC m=+21.935522104\" watchObservedRunningTime=\"2024-09-04 14:03:27.072403703 +0000 UTC m=+21.937355156\""}
I0904 14:03:27.081623    2015 topology_manager.go:215] \"Topology Admit Handler\" podUID=\"6b9f76aa-3164-4a7c-87be-25091855c916\" podNamespace=\"auth\" podName=\"redactmypodtooearly\""}
I0904 14:03:27.172891    2015 reconciler_common.go:258] \"operationExecutor.VerifyControllerAttachedVolume started for volume \\\"var-run-argo\\\" (UniqueName: \\\"kubernetes.io/empty-dir/6b9f76aa-3164-4a7c-87be-25091855c916-var-run-argo\\\") pod \\\"redactmypodtooearly\\\" (UID: \\\"6b9f76aa-3164-4a7c-87be-25091855c916\\\") \" pod=\"auth/redactmypodtooearly\""}
I0904 14:03:27.172985    2015 reconciler_common.go:258] \"operationExecutor.VerifyControllerAttachedVolume started for volume \\\"redactacm-sh\\\" (UniqueName: \\\"kubernetes.io/configmap/6b9f76aa-3164-4a7c-87be-25091855c916-redactacm-sh\\\") pod \\\"redactmypodtooearly\\\" (UID: \\\"6b9f76aa-3164-4a7c-87be-25091855c916\\\") \" pod=\"auth/redactmypodtooearly\""}
E0904 14:03:27.395878    2015 kubelet_pods.go:417] \"Hostname for pod was too long, truncated it\" podName=\"redactmypodtooearly\" hostnameMaxLen=63 truncatedHostname=\"redactmypodtooearly\""}
E0904 14:03:36.349734    2015 cadvisor_stats_provider.go:501] \"Partial failure issuing cadvisor.ContainerInfoV2\" err=\"partial failures: [\\\"/kubepods.slice/kubepods-burstable.slice/kubepods-burstable-podfb3b1b0e_c2b1_43a9_9fb4_61f90fccbe2e.slice/cri-containerd-ec9066265bcb1a93f47607cabf6defe2be7b46dccd9620fec39caf57efe0707e.scope\\\": RecentStats: unable to find data in memory cache]\""}
I0904 14:03:37.104227    2015 pod_startup_latency_tracker.go:102] \"Observed pod startup duration\" pod=\"kube-system/node-configuration-dqmn7\" podStartSLOduration=18.792128877 podStartE2EDuration=\"31.104161659s\" podCreationTimestamp=\"2024-09-04 14:03:06 +0000 UTC\" firstStartedPulling=\"2024-09-04 14:03:23.705144657 +0000 UTC m=+18.570096110\" lastFinishedPulling=\"2024-09-04 14:03:36.017177439 +0000 UTC m=+30.882128892\" observedRunningTime=\"2024-09-04 14:03:37.097340999 +0000 UTC m=+31.962292482\" watchObservedRunningTime=\"2024-09-04 14:03:37.104161659 +0000 UTC m=+31.969113122\""}
time=\"2024-09-04T14:03:38Z\" level=info msg=\"Image exists, fetching cached image from image store\" ref=\"ecr.aws/arn:aws:ecr:ap-southeast-2:602401143452:repository/eks/pause:3.1-eksbuild.1\"","_PID":"5377","_SYSTEMD_INVOCATION_ID":"314655562d714198a878006d98ee77b4"}
time=\"2024-09-04T14:03:38Z\" level=info msg=\"tagging image\" img=\"602401143452.dkr.ecr.ap-southeast-2.amazonaws.com/eks/pause:3.1-eksbuild.1\""}
Flag --container-runtime-endpoint has been deprecated, This parameter should be set via the config file specified by the Kubelet's --config flag. See https://kubernetes.io/docs/tasks/administer-cluster/kubelet-config-file/ for more information.","_COMM":"kubelet","_EXE":"/x86_64-bottlerocket-linux-gnu/sys-root/usr/bin/kubelet","_CMDLINE":"/usr/bin/kubelet --cloud-provider external --kubeconfig /etc/kubernetes/kubelet/kubeconfig --config /etc/kubernetes/kubelet/config --container-runtime-endpoint=unix:///run/containerd/containerd.sock --containerd=/run/containerd/containerd.sock --root-dir /var/lib/kubelet --cert-dir /var/lib/kubelet/pki --image-credential-provider-bin-dir /usr/libexec/kubernetes/kubelet/plugins --image-credential-provider-config /etc/kubernetes/kubelet/credential-provider-config.yaml --hostname-override redactmyip.ap-southeast-2.compute.internal --node-ip redactmyipfmt --node-labels eks.amazonaws.com/capacityType=ON_DEMAND,eks.amazonaws.com/nodegroup=redactng,eks.amazonaws.com/nodegroup-image=redactami,eks.amazonaws.com/sourceLaunchTemplateId=redactlt,eks.amazonaws.com/sourceLaunchTemplateVersion=1,type=large --register-with-taints \"\" --pod-infra-container-image 602401143452.dkr.ecr.ap-southeast-2.amazonaws.com/eks/pause:3.1-eksbuild.1 --runtime-cgroups=/runtime.slice/containerd.service","_SYSTEMD_INVOCATION_ID":"314655562d714198a878006d98ee77b4","_STREAM_ID":"897e8d904fab461fa30b20d6e0876866","_PID":"5413"}
Flag --containerd has been deprecated, This is a cadvisor flag that was mistakenly registered with the Kubelet. Due to legacy concerns, it will follow the standard CLI deprecation timeline before being removed.","_SYSTEMD_INVOCATION_ID":"314655562d714198a878006d98ee77b4","_STREAM_ID":"897e8d904fab461fa30b20d6e0876866","_PID":"5413"}
Flag --register-with-taints has been deprecated, This parameter should be set via the config file specified by the Kubelet's --config flag. See https://kubernetes.io/docs/tasks/administer-cluster/kubelet-config-file/ for more information.","_SYSTEMD_INVOCATION_ID":"314655562d714198a878006d98ee77b4","_STREAM_ID":"897e8d904fab461fa30b20d6e0876866","_PID":"5413"}
Flag --pod-infra-container-image has been deprecated, will be removed in a future release. Image garbage collector will get sandbox image information from CRI.","_SYSTEMD_INVOCATION_ID":"314655562d714198a878006d98ee77b4","_STREAM_ID":"897e8d904fab461fa30b20d6e0876866","_PID":"5413"}
I0904 14:03:38.957155    5413 server.go:204] \"--pod-infra-container-image will not be pruned by the image garbage collector in kubelet and should also be set in the remote runtime\""}
I0904 14:03:38.962790    5413 server.go:487] \"Kubelet version\" kubeletVersion=\"v1.29.5-eks-1109419\""}
I0904 14:03:38.962860    5413 server.go:489] \"Golang settings\" GOGC=\"\" GOMAXPROCS=\"\" GOTRACEBACK=\"\""}
I0904 14:03:38.966945    5413 dynamic_cafile_content.go:157] \"Starting controller\" name=\"client-ca-bundle::/etc/kubernetes/pki/ca.crt\""}
I0904 14:03:38.983684    5413 container_manager_linux.go:265] \"Container manager verified user specified cgroup-root exists\" cgroupRoot=[]"}
I0904 14:03:38.983990    5413 container_manager_linux.go:270] \"Creating Container Manager object based on Node Config\" nodeConfig={\"RuntimeCgroupsName\":\"/runtime.slice/containerd.service\",\"SystemCgroupsName\":\"\",\"KubeletCgroupsName\":\"\",\"KubeletOOMScoreAdj\":-999,\"ContainerRuntime\":\"\",\"CgroupsPerQOS\":true,\"CgroupRoot\":\"/\",\"CgroupDriver\":\"systemd\",\"KubeletRootDir\":\"/var/lib/kubelet\",\"ProtectKernelDefaults\":true,\"KubeReservedCgroupName\":\"/runtime\",\"SystemReservedCgroupName\":\"\",\"ReservedSystemCPUs\":{},\"EnforceNodeAllocatable\":{\"pods\":{}},\"KubeReserved\":{\"cpu\":\"90m\",\"ephemeral-storage\":\"1Gi\",\"memory\":\"739Mi\"},\"SystemReserved\":null,\"HardEvictionThresholds\":[{\"Signal\":\"memory.available\",\"Operator\":\"LessThan\",\"Value\":{\"Quantity\":\"100Mi\",\"Percentage\":0},\"GracePeriod\":0,\"MinReclaim\":null},{\"Signal\":\"nodefs.available\",\"Operator\":\"LessThan\",\"Value\":{\"Quantity\":null,\"Percentage\":0.1},\"GracePeriod\":0,\"MinReclaim\":null},{\"Signal\":\"nodefs.inodesFree\",\"Operator\":\"LessThan\",\"Value\":{\"Quantity\":null,\"Percentage\":0.05},\"GracePeriod\":0,\"MinReclaim\":null},{\"Signal\":\"imagefs.available\",\"Operator\":\"LessThan\",\"Value\":{\"Quantity\":null,\"Percentage\":0.15},\"GracePeriod\":0,\"MinReclaim\":null}],\"QOSReserved\":{},\"CPUManagerPolicy\":\"none\",\"CPUManagerPolicyOptions\":null,\"TopologyManagerScope\":\"container\",\"CPUManagerReconcilePeriod\":10000000000,\"ExperimentalMemoryManagerPolicy\":\"None\",\"ExperimentalMemoryManagerReservedMemory\":null,\"PodPidsLimit\":1048576,\"EnforceCPULimits\":true,\"CPUCFSQuotaPeriod\":100000000,\"TopologyManagerPolicy\":\"none\",\"TopologyManagerPolicyOptions\":null}"}
I0904 14:03:38.984045    5413 topology_manager.go:138] \"Creating topology manager with none policy\""}
I0904 14:03:38.984060    5413 container_manager_linux.go:301] \"Creating device plugin manager\""}
I0904 14:03:38.984104    5413 state_mem.go:36] \"Initialized new in-memory state store\""}
I0904 14:03:38.984248    5413 kubelet.go:396] \"Attempting to sync node with API server\""}
I0904 14:03:38.984285    5413 kubelet.go:301] \"Adding static pod path\" path=\"/etc/kubernetes/static-pods/\""}
I0904 14:03:38.984350    5413 kubelet.go:312] \"Adding apiserver pod source\""}
I0904 14:03:38.984370    5413 apiserver.go:42] \"Waiting for node sync before watching apiserver pods\""}
I0904 14:03:38.985350    5413 kuberuntime_manager.go:258] \"Container runtime initialized\" containerRuntime=\"containerd\" version=\"1.6.34+bottlerocket\" apiVersion=\"v1\""}
I0904 14:03:38.986115    5413 certificate_store.go:130] Loading cert/key pair from \"/var/lib/kubelet/pki/kubelet-server-current.pem\"."}
I0904 14:03:38.986545    5413 kubelet.go:809] \"Not starting ClusterTrustBundle informer because we are in static kubelet mode\""}
I0904 14:03:38.987073    5413 server.go:1256] \"Started kubelet\""}
I0904 14:03:38.987222    5413 server.go:162] \"Starting to listen\" address=\"0.0.0.0\" port=10250"}
I0904 14:03:38.988733    5413 ratelimit.go:55] \"Setting rate limiting for endpoint\" service=\"podresources\" qps=100 burstTokens=10"}
I0904 14:03:38.989829    5413 server.go:233] \"Starting to serve the podresources API\" endpoint=\"unix:/var/lib/kubelet/pod-resources/kubelet.sock\""}
I0904 14:03:38.991602    5413 fs_resource_analyzer.go:67] \"Starting FS ResourceAnalyzer\""}
I0904 14:03:38.992840    5413 volume_manager.go:291] \"Starting Kubelet Volume Manager\""}
I0904 14:03:38.993006    5413 server.go:461] \"Adding debug handlers to kubelet server\""}
I0904 14:03:38.993240    5413 desired_state_of_world_populator.go:151] \"Desired state populator starts to run\""}
I0904 14:03:38.996540    5413 factory.go:221] Registration of the systemd container factory successfully"}
I0904 14:03:38.997429    5413 factory.go:219] Registration of the crio container factory failed: Get \"http://%2Fvar%2Frun%2Fcrio%2Fcrio.sock/info\": dial unix /var/run/crio/crio.sock: connect: no such file or directory"}
I0904 14:03:38.998756    5413 dynamic_cafile_content.go:171] \"Shutting down controller\" name=\"client-ca-bundle::/etc/kubernetes/pki/ca.crt\""}

One of my pods got allocated to this node by Argo Workflows at 14:03:26 and got stuck.

youwalther65 commented 2 weeks ago

@tooptoop4 Thank you, I see. But we need the other logs mentioned here for this timeframe as well!