aws / amazon-vpc-cni-k8s

Networking plugin repository for pod networking in Kubernetes using Elastic Network Interfaces on AWS
Apache License 2.0

aws-node is restarting sporadically (crashing, exiting with code 137), which causes all pods on that node to get stuck in the ContainerCreating state. #1425

Closed daganida88 closed 1 year ago

daganida88 commented 3 years ago

What happened: aws-node is restarting sporadically (crashing, exiting with code 137), which causes all pods on that node to get stuck in the ContainerCreating state.

Attach logs:

aws-node {"level":"info","ts":"2021-04-13T12:24:31.015Z","caller":"entrypoint.sh","msg":"Install CNI binary.."}
aws-node {"level":"info","ts":"2021-04-13T12:24:31.031Z","caller":"entrypoint.sh","msg":"Starting IPAM daemon in the background ... "}
aws-node {"level":"info","ts":"2021-04-13T12:24:31.032Z","caller":"entrypoint.sh","msg":"Checking for IPAM connectivity ... "}
aws-vpc-cni-init + PLUGIN_BINS='loopback portmap bandwidth aws-cni-support.sh'
aws-vpc-cni-init + for b in '$PLUGIN_BINS'
aws-vpc-cni-init + '[' '!' -f loopback ']'
aws-vpc-cni-init + for b in '$PLUGIN_BINS'
aws-vpc-cni-init + '[' '!' -f portmap ']'
aws-vpc-cni-init + for b in '$PLUGIN_BINS'
aws-vpc-cni-init + '[' '!' -f bandwidth ']'
aws-vpc-cni-init + for b in '$PLUGIN_BINS'
aws-vpc-cni-init + '[' '!' -f aws-cni-support.sh ']'
aws-vpc-cni-init Copying CNI plugin binaries ...
aws-vpc-cni-init + HOST_CNI_BIN_PATH=/host/opt/cni/bin
aws-vpc-cni-init + echo 'Copying CNI plugin binaries ... '
aws-vpc-cni-init + for b in '$PLUGIN_BINS'
aws-vpc-cni-init + install loopback /host/opt/cni/bin
aws-vpc-cni-init + for b in '$PLUGIN_BINS'
aws-vpc-cni-init + install portmap /host/opt/cni/bin
aws-vpc-cni-init + for b in '$PLUGIN_BINS'
aws-vpc-cni-init + install bandwidth /host/opt/cni/bin
aws-vpc-cni-init + for b in '$PLUGIN_BINS'
aws-vpc-cni-init + install aws-cni-support.sh /host/opt/cni/bin
aws-vpc-cni-init + echo 'Configure rp_filter loose... '
aws-vpc-cni-init Configure rp_filter loose...
aws-vpc-cni-init ++ curl -X PUT http://169.254.169.254/latest/api/token -H 'X-aws-ec2-metadata-token-ttl-seconds: 60'
aws-vpc-cni-init   % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
aws-vpc-cni-init                                  Dload  Upload   Total   Spent    Left  Speed
aws-vpc-cni-init   0     0    0     0    0     0      0      0 --:--:-- --:--:-- --:--:--     0100    56  100    56    0     0  56000      0 --:--:-- --:--:-- --:--:-- 56000
aws-vpc-cni-init + TOKEN=AQAEABqkV8qO_waLfVT_6TITDHDBvmAy3jkblGe9YXSpR-irRxvwJQ==
aws-vpc-cni-init ++ curl -H 'X-aws-ec2-metadata-token: AQAEABqkV8qO_waLfVT_6TITDHDBvmAy3jkblGe9YXSpR-irRxvwJQ==' http://169.254.169.254/latest/meta-data/local-ipv4
aws-vpc-cni-init   % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
aws-vpc-cni-init                                  Dload  Upload   Total   Spent    Left  Speed
aws-vpc-cni-init   0     0    0     0    0     0      0      0 --:--:-- --:--:-- --:--:--     0100    13  100    13    0     0  13000      0 --:--:-- --:--:-- --:--:-- 13000
aws-vpc-cni-init + HOST_IP=172.19.224.26
aws-vpc-cni-init ++ ip -4 -o a
aws-vpc-cni-init ++ awk '{print $2}'
aws-vpc-cni-init ++ grep 172.19.224.26/
aws-vpc-cni-init + PRIMARY_IF=eth0
aws-vpc-cni-init + sysctl -w net.ipv4.conf.eth0.rp_filter=2
aws-vpc-cni-init net.ipv4.conf.eth0.rp_filter = 2
aws-vpc-cni-init + cat /proc/sys/net/ipv4/conf/eth0/rp_filter
aws-vpc-cni-init 2
aws-vpc-cni-init + '[' false == true ']'
aws-vpc-cni-init + sysctl -e -w net.ipv4.tcp_early_demux=1
aws-vpc-cni-init net.ipv4.tcp_early_demux = 1
aws-vpc-cni-init CNI init container done
aws-vpc-cni-init + echo 'CNI init container done'

What you expected to happen: I expected the pod not to crash.

How to reproduce it (as minimally and precisely as possible): It only happens sometimes, but when it does, the node stays in a zombie state with all pods stuck in ContainerCreating until we kill the aws-node pod.

Anything else we need to know?:

Running on EKS 1.16.15 (Linux). CNI version: 1.7.5

jayanthvn commented 3 years ago

Hi @daganida88

Exit code 137 usually indicates an OOM kill (out of memory). Does it happen on startup, or after aws-node has been running for several hours? Can you also check whether any other process is consuming a lot of memory? The memory available to Docker containers is limited by the total memory on the host machine, so as usage increases there may not be enough free memory for all the containers, and containers may be killed.
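For reference, a quick way to confirm whether a 137 exit was actually an OOM kill (a rough sketch; the pod name is a placeholder and node access is assumed):

  # What kubelet recorded for the last termination (Reason: OOMKilled vs Error)
  kubectl -n kube-system describe pod <aws-node-pod> | grep -A 6 'Last State'

  # On the node itself: kernel OOM-killer activity and current memory headroom
  dmesg -T | grep -i -E 'out of memory|oom-killer'
  free -m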

jayanthvn commented 3 years ago

Hi @daganida88

Can you please confirm if any other process is consuming high memory?

jayanthvn commented 3 years ago

Hi @daganida88

If you could share kubelet logs when the issue happens, it would help us debug.

Thanks!

trucnguyenlam commented 3 years ago

Please address this issue; we're experiencing the same error (it is sporadic) and have no way to reproduce it.

jayanthvn commented 3 years ago

I understand this issue happens sporadically, but if you could share the CNI version, the node and pod scale, and whether there is any churn in pod scale, we can try to reproduce the issue. Also, next time you see the issue, if you could run sudo bash /opt/cni/bin/aws-cni-support.sh on the node, we can review the kubelet logs and memory consumption. If you have any tooling to monitor the aws-node pod's memory consumption, verifying that it is not growing over time would help rule out an aws-node issue.
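For example, something along these lines (a sketch; the tarball name and location can differ between CNI versions, and kubectl top requires metrics-server):

  # On the affected node: collect ipamd/CNI/kubelet diagnostics into a tarball (typically written under /var/log)
  sudo bash /opt/cni/bin/aws-cni-support.sh

  # From the cluster: spot-check aws-node memory usage over time
  kubectl -n kube-system top pod -l k8s-app=aws-node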

Thanks!

trucnguyenlam commented 3 years ago

@jayanthvn CNI version is 1.7.5; nodes are managed by the autoscaler as recommended by the EKS documentation (node type is m5.large). Unfortunately, our nodes don't have SSM installed by default, so we can't log in to run the command.

I'll have a look at monitoring aws-node memory consumption to check if that is the case.

Cheers!

brettplarson commented 3 years ago

I spoke to our AWS support rep regarding this issue and he seemed to believe that this was expected and due to the CNI container starting before the kube-proxy container became healthy. Just my 2 cents.

trucnguyenlam commented 3 years ago

I'm still searching for an effective automatic approach to handle this issue; currently the only way to fix it is to manually terminate the stuck node.

Legion2 commented 3 years ago

We upgraded our cluster from 1.19 to 1.20 and moved from the unmanaged vpc-cni addon to the vpc-cni 1.8.0 managed EKS add-on. Now we also experience this issue. Sometimes the container also gets killed because of failing liveness and readiness probes.

  Normal   Killing    85m                    kubelet  Container aws-node failed liveness probe, will be restarted
  Normal   Pulling    85m (x2 over 106m)     kubelet  Pulling image "602401143452.dkr.ecr.eu-central-1.amazonaws.com/amazon-k8s-cni:v1.8.0-eksbuild.1"
  Normal   Created    85m (x2 over 106m)     kubelet  Created container aws-node
  Normal   Pulled     85m                    kubelet  Successfully pulled image "602401143452.dkr.ecr.eu-central-1.amazonaws.com/amazon-k8s-cni:v1.8.0-eksbuild.1" in 158.21763ms
  Normal   Started    85m (x2 over 106m)     kubelet  Started container aws-node
  Warning  Unhealthy  11m (x31 over 106m)    kubelet  Readiness probe failed:
  Warning  Unhealthy  4m57s (x28 over 100m)  kubelet  Liveness probe failed:

The node is a m5.large and it hosts 29 (of 29 possible) Pods. At the time of the last crash the node had a memory usage of 63.3% (332 MiB free memory). The pod has a constant memory usage of around 46.3 MiB as shown in the graph below:

[graph: aws-node memory usage over time]

The restart at 10 AM was triggered by a failed health probe, and the restart at 11 AM was triggered by a 137 exit code.

In our cluster we use resource requests and limits, but the aws-node pods do not define a memory request, so they will be killed first if there is not enough memory on the node.
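If anyone wants to experiment with this, a minimal sketch of adding a memory request to the aws-node container (the 64Mi value is only illustrative, and manual DaemonSet edits may be overwritten by the managed add-on):

  kubectl -n kube-system patch daemonset aws-node --patch '
  spec:
    template:
      spec:
        containers:
        - name: aws-node          # merged by container name (strategic merge patch)
          resources:
            requests:
              memory: 64Mi'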

Note: The node was initially started with a different cni-plugin version, because of the Kubernetes version upgrade

Legion2 commented 3 years ago

The strange thing is that the health probes fail without any reason message after ... probe failed:

mickael-ange commented 3 years ago

For the past month, after upgrading 4 of our EKS clusters from 1.17 to 1.20 and moving to AWS-managed EKS node groups, we have been seeing Readiness and Liveness probe failures for aws-node DaemonSet pods after we deploy/update our application deployments running in another namespace.

The error does not occur at each deployment update, only sporadically.

In the new clusters, I use the default add-ons; I have not upgraded them.

kubectl -n kube-system describe pod aws-node-cc26m

Normal   Scheduled  49m                default-scheduler                                         Successfully assigned kube-system/aws-node-cc26m to ip-172-26-9-103.ap-southeast-1.compute.internal
  Normal   Pulling    49m                kubelet, ip-172-26-9-103.ap-southeast-1.compute.internal  Pulling image "602401143452.dkr.ecr.ap-southeast-1.amazonaws.com/amazon-k8s-cni-init:v1.7.5-eksbuild.1"
  Normal   Pulled     49m                kubelet, ip-172-26-9-103.ap-southeast-1.compute.internal  Successfully pulled image "602401143452.dkr.ecr.ap-southeast-1.amazonaws.com/amazon-k8s-cni-init:v1.7.5-eksbuild.1" in 6.284508527s
  Normal   Created    49m                kubelet, ip-172-26-9-103.ap-southeast-1.compute.internal  Created container aws-vpc-cni-init
  Normal   Started    49m                kubelet, ip-172-26-9-103.ap-southeast-1.compute.internal  Started container aws-vpc-cni-init
  Normal   Pulling    49m                kubelet, ip-172-26-9-103.ap-southeast-1.compute.internal  Pulling image "602401143452.dkr.ecr.ap-southeast-1.amazonaws.com/amazon-k8s-cni:v1.7.5-eksbuild.1"
  Normal   Pulled     48m                kubelet, ip-172-26-9-103.ap-southeast-1.compute.internal  Successfully pulled image "602401143452.dkr.ecr.ap-southeast-1.amazonaws.com/amazon-k8s-cni:v1.7.5-eksbuild.1" in 2.246459067s
  Normal   Created    48m                kubelet, ip-172-26-9-103.ap-southeast-1.compute.internal  Created container aws-node
  Normal   Started    48m                kubelet, ip-172-26-9-103.ap-southeast-1.compute.internal  Started container aws-node
  Warning  Unhealthy  17m (x2 over 17m)  kubelet, ip-172-26-9-103.ap-southeast-1.compute.internal  Readiness probe failed:
  Warning  Unhealthy  7m52s              kubelet, ip-172-26-9-103.ap-southeast-1.compute.internal  Liveness probe failed:

I also think it is strange to not have the reason message after Liveness probe failed: and Readiness probe failed:

Regards

Legion2 commented 3 years ago

I changed the timeoutSeconds of the aws-node livenessProbe to 5s. Since then (2 days), the pods have not restarted, though the probes still fail occasionally.

I also found that all previous container restarts can be correlated with very high node CPU utilization (100%) lasting 30 seconds or longer. That is enough time for three health probe timeouts, which trigger a container restart.

I think that because the pod only has a CPU resource request of 10m, it does not get enough CPU time to respond to the health probe in under 1 second. To guarantee such low response times, the pod would need more guaranteed CPU time. However, such a low response time is not required all the time, and restarting the pod does not resolve slow response times when the node is under very high CPU utilization. So increasing the health probe timeout is the only solution for this problem.
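A sketch of that change as a kubectl patch (5s is just the value that worked here, whether to also adjust the readinessProbe is a judgment call, and managed add-on updates may revert manual edits):

  kubectl -n kube-system patch daemonset aws-node --patch '
  spec:
    template:
      spec:
        containers:
        - name: aws-node
          livenessProbe:
            timeoutSeconds: 5
          readinessProbe:
            timeoutSeconds: 5'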

sarbajitdutta commented 3 years ago

@jayanthvn We have the same issue. We updated from 1.19 to 1.20 and are also using the CNI add-on v1.8.0. aws-node keeps restarting when cluster-autoscaler scales up. I have tried increasing the liveness probe timeout, but it does not help.

jayanthvn commented 3 years ago

Sorry for the delayed response.

@Legion2 - Thanks for letting us know, I will look into it.

@sarbajitdutta - Do you mean that aws-node keeps restarting on the new nodes? Can you please share the log lines from /var/log/aws-routed-eni/ipamd.log from before a restart? Each restart will start with a log line like "Starting L-..."

mikkotimoharju commented 3 years ago

We are also encountering these random livenessProbe failures after upgrading to EKS 1.21. Increasing the timeout helps with the restarts. The reason this has started happening now is a bug fix in Kubernetes 1.20: Fixing Kubelet Exec Probe Timeouts and KEP-1972: kubelet exec probe timeouts. Previously, the liveness and readiness probes did not respect the timeout for exec probes and the probe ran indefinitely. Now the default timeout of 1s is respected, but it is sometimes too short, causing the probe to fail and the pod to restart.
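To see what a given cluster is actually enforcing, the effective probe settings can be read straight off the DaemonSet (a sketch; it assumes aws-node is the first container in the pod spec):

  kubectl -n kube-system get daemonset aws-node \
    -o jsonpath='{.spec.template.spec.containers[0].livenessProbe}{"\n"}{.spec.template.spec.containers[0].readinessProbe}{"\n"}'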

jayanthvn commented 3 years ago

Thanks @mikkotimoharju for sharing this. We will look into it on adjusting the timeouts.

laupow commented 3 years ago

Another version of this story:

When aws-cluster-autoscaler triggers a scale-up that results in 10-12 new nodes, we see the CNI plugin crash (exit 137) at about the same time.

CNI restarts (a 1d rate graph)

CNI pods are crashing at the beginning of the scale event (not at scale down)

[screenshot: CNI restart rate graph, 2021-08-25 9:22 AM]

HPA scale (showing pods)

These scale events resulted in 10-12 new nodes coming online together.

[screenshot: HPA pod scale, 2021-08-25 9:23 AM]

Containers:
  aws-node:
    Container ID:   docker://5f77d4ef41eb097df676032c48e76745662a0c679c9018b34cf75d9ed67ee446
    Image:          602401143452.dkr.ecr.us-east-1.amazonaws.com/amazon-k8s-cni:v1.8.0-eksbuild.1
    Image ID:       docker-pullable://602401143452.dkr.ecr.us-east-1.amazonaws.com/amazon-k8s-cni@sha256:bbc7bd93b5c71ef6525d6d91883442a89e3b1d0e82781750f2cecc8bddc9e1a1
    Port:           61678/TCP
    Host Port:      61678/TCP
    State:          Running
      Started:      Tue, 24 Aug 2021 13:00:55 -0500
    Last State:     Terminated
      Reason:       Error
      Exit Code:    137
      Started:      Tue, 24 Aug 2021 12:59:18 -0500
      Finished:     Tue, 24 Aug 2021 13:00:54 -0500
    Ready:          True
    Restart Count:  1
    Requests:
      cpu:      10m
    Liveness:   exec [/app/grpc-health-probe -addr=:50051] delay=60s timeout=1s period=10s #success=1 #failure=3
    Readiness:  exec [/app/grpc-health-probe -addr=:50051] delay=1s timeout=1s period=10s #success=1 #failure=3

We aren't seeing stuck pods in ContainerCreating, though.

jayanthvn commented 3 years ago

Hi,

There are 2 issues mentioned here -

  1. With EKS 1.21, the timeout seems to be honored because of the bug fix in 1.20, as mentioned here - https://github.com/aws/amazon-vpc-cni-k8s/issues/1425#issuecomment-901796963. This can be configured based on the load on the system. We can document this, similar to Ref: Configure Probes [https://kubernetes.io/docs/tasks/configure-pod-container/configure-liveness-readiness-startup-probes/]. /cc @haouc
  2. During an autoscaling event, aws-node crashed with exit code 137 (OOM). If you can open a support case and share the CNI/kubelet logs with us, it would help confirm the behavior. /cc @sarbajitdutta and @laupow

Thanks.

achevuru commented 3 years ago

In an ideal scenario, the default timeoutSeconds value (1) should be good enough as long as the aws-node pod is not starved of CPU resources. So even with the exec probe timeout being enforced from 1.20+, we should be OK, and these values can be adjusted based on the use case; i.e., if you expect higher CPU load on the system, the CPU resource request and timeout values can be adjusted accordingly. We can document this.

laupow commented 3 years ago

@jayanthvn thanks, case opened (8817394681)

From the logs it looks like the container can't communicate with the cluster at boot, and therefore fails health checks:

{"level":"info","ts":"2021-09-01T18:06:51.513Z","caller":"aws-k8s-agent/main.go:28","msg":"Starting L-IPAMD v1.8.0-eksbuild.1  ..."}
{"level":"info","ts":"2021-09-01T18:06:51.514Z","caller":"aws-k8s-agent/main.go:42","msg":"Testing communication with server"}
{"level":"info","ts":"2021-09-01T18:08:24.001Z","caller":"aws-k8s-agent/main.go:28","msg":"Starting L-IPAMD v1.8.0-eksbuild.1  ..."}
{"level":"info","ts":"2021-09-01T18:08:24.002Z","caller":"aws-k8s-agent/main.go:42","msg":"Testing communication with server"}
{"level":"info","ts":"2021-09-01T18:08:24.010Z","caller":"aws-k8s-agent/main.go:42","msg":"Successful communication with the Cluster! Cluster Version is: v1.19+. git version: v1.19.13-eks-8df270. git tree state: clean. commit: 8df2700a72a2598fa3a67c05126fa158fd839620. platform: linux/amd64"}
{"level":"warn","ts":"2021-09-01T18:08:27.030Z","caller":"awssession/session.go:64","msg":"HTTP_TIMEOUT env is not set or set to less than 10 seconds, defaulting to httpTimeout to 10sec"}
{"level":"debug","ts":"2021-09-01T18:08:27.031Z","caller":"ipamd/ipamd.go:298","msg":"Discovered region: us-east-1"}
{"level":"info","ts":"2021-09-01T18:08:27.031Z","caller":"ipamd/ipamd.go:298","msg":"Custom networking false"}
{"level":"debug","ts":"2021-09-01T18:08:27.032Z","caller":"awsutils/awsutils.go:357","msg":"Found availability zone: us-east-1f "}

jayanthvn commented 3 years ago

Thanks @laupow

Can you please check kube-proxy pod logs and confirm if you are seeing logs similar to below -

E0905 23:12:40.388938       7 node.go:124] Failed to retrieve node info: nodes "ip-*" not found
E0905 23:12:41.516857       7 node.go:124] Failed to retrieve node info: nodes "ip-*" not found
E0905 23:12:43.567271       7 node.go:124] Failed to retrieve node info: nodes "ip-*" not found
E0905 23:12:48.167166       7 node.go:124] Failed to retrieve node info: nodes "ip-*" not found
E0905 23:12:56.325941       7 node.go:124] Failed to retrieve node info: nodes "ip-*" not found
E0905 23:13:14.684106       7 node.go:124] Failed to retrieve node info: nodes "ip-*" not found
laupow commented 3 years ago

Can you please check kube-proxy pod logs and confirm if you are seeing logs similar to below -

Unfortunately I don't see those logs on any nodes with crashed CNI pods. I only see Successfully retrieved node IP. I attached full logs to the ticket.

deva1987m commented 3 years ago

Hi, do we see a similar issue on AWS EKS 1.21 as well?

vioan commented 3 years ago

Any update on this issue? We see the same problem. Will updating to 1.21 fix it?

sstoyanovucsd commented 3 years ago

@achevuru, the default liveness probe timeout of 1s is not enough on any cluster that runs burstable workloads allowed to scale up beyond the resources of a node. Maybe the default timeout value should be adjusted? It is also possible that bumping up resources.requests.cpu to more than 10m cores will be sufficient to allow the probe not to be starved of CPU resources and to always finish within the default 1 second.

jayanthvn commented 3 years ago

@laupow - sure we will check in the ticket if kube-proxy is taking time to start leading to aws-node to restart.

@deva1987m and @vioan - Sorry, this thread has deviated into two issues, as mentioned above. Are you referring to OOM kills (137) of aws-node pods? If so, can you please check the CPU utilization on the instance? Also, I feel we need to set CPU and memory limits for aws-node.

sarbajitdutta commented 3 years ago

@jayanthvn When cluster-autoscaler scales up, the Readiness probe fails for aws-node. It's stuck here -

I0907 16:40:46.644131      10 request.go:621] Throttling request took 1.045715446s, request: GET:https://172.20.0.1:443/apis/apiextensions.k8s.io/v1beta1?timeout=32s

The successful ones have these log messages -

{"level":"info","ts":"2021-09-07T16:40:49.507Z","caller":"entrypoint.sh","msg":"Copying config file ... "}
{"level":"info","ts":"2021-09-07T16:40:49.511Z","caller":"entrypoint.sh","msg":"Successfully copied CNI plugin binary and config file."}
{"level":"info","ts":"2021-09-07T16:40:49.512Z","caller":"entrypoint.sh","msg":"Foregrounding IPAM daemon ..."}

Error message - Readiness probe failed: {"level":"info","ts":"2021-09-07T16:46:36.839Z","caller":"/usr/local/go/src/runtime/proc.go:203","msg":"timeout: failed to connect service \":50051\" within 1s"}. This causes the node to stay in NotReady status forever. I am using the latest version of the CNI, and the Kubernetes version is 1.20.
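One way to dig further when a node is stuck like this is to re-run the same checks the kubelet performs (a sketch; the pod name is a placeholder, and 61679 is assumed to be the default ipamd introspection port described in the project's troubleshooting doc):

  # Same gRPC check the liveness/readiness probes run, executed inside the stuck pod
  kubectl -n kube-system exec <aws-node-pod> -c aws-node -- /app/grpc-health-probe -addr=:50051

  # On the node: if ipamd is actually up, its introspection endpoint should answer
  curl http://localhost:61679/v1/enis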

AlKapkone commented 3 years ago

Hello, any updates on this issue? Please, fix it ASAP! Thanks!

ilcapatina commented 3 years ago

Hi,

We had this issue when we upgraded to EKS 1.20 and I will explain our use case and how we were able to mitigate the problem.

As stated by other people in this thread, the issue started because exec probe timeouts have been honored since 1.20. We increased the probe timeouts to 5s, 10s, or 30s, and the number of restarts dropped, but there were still a few.

After further investigation, we found that the node root EBS volumes suffer from micro-bursting (see https://aws.amazon.com/premiumsupport/knowledge-center/ebs-identify-micro-bursting/ ) during large Docker image pulls, causing all running processes to be stuck for a few seconds. To mitigate this, we mounted another drive on the node to use it only for Docker image pulls. Since then, the VPC CNI containers have not restarted anymore.

As I said in the beginning, this is specific to our use case.
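For anyone wanting to try the same mitigation, a rough bootstrap-time sketch (the device name, filesystem, and mount point are assumptions and depend on the AMI and instance type; the idea is simply to give the Docker image/layer store its own volume):

  # Hypothetical user-data fragment, run before docker/kubelet start
  mkfs -t xfs /dev/nvme1n1                                   # second EBS volume attached to the node
  mkdir -p /var/lib/docker
  echo '/dev/nvme1n1 /var/lib/docker xfs defaults 0 2' >> /etc/fstab
  mount -a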

cam3ron2 commented 3 years ago

Any update on this issue? We see the same problem. Will updating to 1.21 fix it?

We are on 1.21 and see the same issue.

achevuru commented 3 years ago

@sstoyanovucsd Yes, as I said, 1s will not be enough when there are workloads running on the node(s) that can (periodically) use up all the CPU resources, essentially starving other processes on the node. CPU resource request and timeout values can be adjusted accordingly for those use cases. 1s is the default value upstream K8s picks for this probe, and we don't set any specific default value for the liveness probe timeout in the aws-node manifest. As you know, this is a node-local probe (no network latency involved), and increasing the default value for all clusters to address this use case might not be the best thing.

@cam3ron2 @AlKapkone Can you elaborate on the exact issue/behavior you're seeing? There are 2 issues being discussed in this thread. If it is due to liveness probe failures because some burstable workloads on the node are using up all the CPU resources, you can adjust the default timeout value for the liveness probe as others have described above.

blodan commented 3 years ago

Raising timeoutSeconds to 5s on the health probes worked great for us: no more restarts.

Our average node CPU load is 70% in our cluster, and we are running v1.21.2-eks-55daa9d with amazon-k8s-cni:v1.9.0-eksbuild.1.

sstoyanovucsd commented 3 years ago

@blodan, 5s works for us as well. I'll run some tests with lower values to see what is the minimum we can go without aws-node restarting.

@achevuru, I understand you are hesitant to adjust a value to address a problem a few of us are experiencing, but maybe the problem is more widespread. Would AWS consider at least making the timeout setting an add-on configuration parameter (like attachPolicyARNs, for example) so that it can be customized during add-on install/update and we are not forced to patch the DaemonSet outside of the EKS add-on lifecycle?

artificial-aidan commented 3 years ago

I'm also seeing this problem. With the new IP prefixing, we are running at a much higher pod density, which is causing a ton of liveness probe issues (not just for aws-node).

I wrote a MutatingAdmissionWebhook to force all pods to have a higher liveness probe timeout.
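The mutation itself boils down to a JSONPatch along these lines (a sketch, not the actual webhook; the container index and the 5s value are assumptions):

  [
    { "op": "replace", "path": "/spec/containers/0/livenessProbe/timeoutSeconds", "value": 5 },
    { "op": "replace", "path": "/spec/containers/0/readinessProbe/timeoutSeconds", "value": 5 }
  ]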

jayanthvn commented 3 years ago

@sstoyanovucsd Yes we will surely see how we can improve the user experience for this.

sstoyanovucsd commented 3 years ago

Thank you @jayanthvn. Just an FYI, we updated our vpc-cni plugin to v1.9.1-eksbuild.1, which was released recently, and for the test reverted the liveness/readiness timeout to 1s. We haven't experienced any aws-node pod restarts since then (4 days now). Hopefully it was a bug in 1.9.0 that caused the intermittent failures and it was fixed in 1.9.1. FWIW, our cluster is running 1.20.

vioan commented 3 years ago

@deva1987m maybe as @sstoyanovucsd suggested, updating the vpc-cni plugin to v1.9.1-eksbuild.1 will solve the issue.

achevuru commented 3 years ago

@sstoyanovucsd Agree, it might be better to make this a configurable parameter. We will look in to it.

blodan commented 3 years ago

A little update from us: while running with timeoutSeconds: 5 on the probes removed most of the restarts, we were still seeing restarts 2-3 times every 24 hours.

But after upgrading to v1.9.1 we started seeing this in the output:

Readiness probe failed: {"level":"info","ts":"2021-09-27T11:40:37.577Z","caller":"/usr/local/go/src/runtime/proc.go:203","msg":"timeout: failed to connect service \":50051\" within 1s"}

Changing the actual health command's timeout made the last restarts go away entirely; we have now been running for 48 hours without restarts. We are currently running with a probe timeout of 5s and a command timeout of 3s:

          exec:
            command:
            - /app/grpc-health-probe
            - -addr=:50051
            - -connect-timeout=3s
            - -rpc-timeout=3s
          timeoutSeconds: 5

jkroepke commented 3 years ago

@artificial-aidan

Could you share your AdmissionController?

Do I need to modify an AWS-managed resource? Has anyone had success with that?

I have no idea why it gets stuck on

{"level":"info","ts":"2021-10-05T13:14:59.879Z","caller":"entrypoint.sh","msg":"Install CNI binary.."}
{"level":"info","ts":"2021-10-05T13:14:59.897Z","caller":"entrypoint.sh","msg":"Starting IPAM daemon in the background ... "}
{"level":"info","ts":"2021-10-05T13:14:59.903Z","caller":"entrypoint.sh","msg":"Checking for IPAM connectivity ... "}

Is exit code 137 an OOM in general?

From what I know, 137 means the process got killed with SIGKILL. This can also happen if the container does not exit within the terminationGracePeriodSeconds period.
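For what it's worth, the kubelet records whether the SIGKILL came from the OOM killer, so the two cases can be told apart (sketch; the pod name is a placeholder):

  # 137 = 128 + 9 (SIGKILL). The recorded reason distinguishes OOM from other kills:
  kubectl -n kube-system get pod <aws-node-pod> \
    -o jsonpath='{.status.containerStatuses[?(@.name=="aws-node")].lastState.terminated.reason}'
  # "OOMKilled" -> killed by the kernel OOM killer
  # "Error"     -> some other SIGKILL, e.g. a failed liveness probe where the container
  #                did not stop within terminationGracePeriodSeconds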

fubar commented 3 years ago

Hi again @jayanthvn @achevuru, I am experiencing many of the problems described in this thread. In my case I have 2 nodes in the cluster, a t3a.medium (AMD) and a t4g.medium (ARM) and the problems occur only on the ARM instance:

Liveness probe failed: {"level":"info","ts":"2021-10-20T19:43:27.910Z","caller":"/usr/local/go/src/runtime/proc.go:225","msg":"timeout: failed to connect service \":50051\" within 2s"}
Readiness probe failed: {"level":"info","ts":"2021-10-20T19:43:17.875Z","caller":"/usr/local/go/src/runtime/proc.go:225","msg":"timeout: failed to connect service \":50051\" within 2s"}
terminated
Reason: Error - exit code: 137

Observations:

Logs of the successfully started pod (AMD):

{"level":"info","ts":"2021-10-20T19:56:41.523Z","caller":"entrypoint.sh","msg":"Validating env variables ..."}
{"level":"info","ts":"2021-10-20T19:56:41.524Z","caller":"entrypoint.sh","msg":"Install CNI binary.."}
{"level":"info","ts":"2021-10-20T19:56:41.545Z","caller":"entrypoint.sh","msg":"Starting IPAM daemon in the background ... "}
{"level":"info","ts":"2021-10-20T19:56:41.548Z","caller":"entrypoint.sh","msg":"Checking for IPAM connectivity ... "}
{"level":"info","ts":"2021-10-20T19:56:45.580Z","caller":"entrypoint.sh","msg":"Copying config file ... "}
{"level":"info","ts":"2021-10-20T19:56:45.585Z","caller":"entrypoint.sh","msg":"Successfully copied CNI plugin binary and config file."}
{"level":"info","ts":"2021-10-20T19:56:45.586Z","caller":"entrypoint.sh","msg":"Foregrounding IPAM daemon ..."}

Logs of the failing pod (ARM):

{"level":"info","ts":"2021-10-20T20:19:12.842Z","caller":"entrypoint.sh","msg":"Validating env variables ..."}
{"level":"info","ts":"2021-10-20T20:19:12.843Z","caller":"entrypoint.sh","msg":"Install CNI binary.."}
{"level":"info","ts":"2021-10-20T20:19:12.853Z","caller":"entrypoint.sh","msg":"Starting IPAM daemon in the background ... "}
{"level":"info","ts":"2021-10-20T20:19:12.861Z","caller":"entrypoint.sh","msg":"Checking for IPAM connectivity ... "}

These lines are repeated with every restart, i.e. it seems to get stuck at the same step every time.

Let me know if I can provide more details. Thanks!

fubar commented 3 years ago

In case it helps, I have tried CNI add-on versions 1.9, 1.8, and 1.7, and they all have the same behaviour/failure. EKS is on version 1.21.4.

Eyjafjallajokull commented 3 years ago

In my case, I was running CNI 1.7.5 on EKS 1.20; after updating the CNI to 1.9.3 the problem was solved. 🤞

vikasmb commented 3 years ago

@fubar I created an EKS 1.21 cluster with the same worker node instance type ("t4g.medium") and the nodes became Ready automatically.

aws-node pod logs from my test cluster:

kubectl -n kube-system logs -f aws-node-<id>
{"level":"info","ts":"2021-10-21T00:03:13.685Z","caller":"entrypoint.sh","msg":"Install CNI binary.."}
{"level":"info","ts":"2021-10-21T00:03:13.695Z","caller":"entrypoint.sh","msg":"Starting IPAM daemon in the background ... "}
{"level":"info","ts":"2021-10-21T00:03:13.696Z","caller":"entrypoint.sh","msg":"Checking for IPAM connectivity ... "}
{"level":"info","ts":"2021-10-21T00:03:15.708Z","caller":"entrypoint.sh","msg":"Copying config file ... "}
{"level":"info","ts":"2021-10-21T00:03:15.710Z","caller":"entrypoint.sh","msg":"Successfully copied CNI plugin binary and config file."}
{"level":"info","ts":"2021-10-21T00:03:15.711Z","caller":"entrypoint.sh","msg":"Foregrounding IPAM daemon ..."}

Could you share the logs mentioned in "Troubleshooting CNI/ipamD at node level" in https://github.com/aws/amazon-vpc-cni-k8s/blob/master/docs/troubleshooting.md

fubar commented 3 years ago

I figured it out. Our EKS cluster has been up and running for a while, and kube-proxy needed to be updated and configured to include the arm64 architecture selector, as described in point 3 under "Updating the kube-proxy self-managed add-on" at https://docs.aws.amazon.com/eks/latest/userguide/managing-kube-proxy.html#updating-kube-proxy-add-on. After that, aws-node started without problems.

While I was at it, CoreDNS was outdated as well, and I followed the steps at https://docs.aws.amazon.com/eks/latest/userguide/managing-coredns.html#updating-coredns-add-on. Note that this page is missing the addition of the arm64 architecture selector which I added via kubectl edit -n kube-system deployment.apps/coredns (I notified AWS about this).
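For anyone hitting the same thing, the change amounts to adding arm64 to the architecture node affinity in the kube-proxy (and coredns) manifests; roughly this shape (a sketch; older manifests may use the beta.kubernetes.io/arch label instead):

  affinity:
    nodeAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
        nodeSelectorTerms:
        - matchExpressions:
          - key: kubernetes.io/os
            operator: In
            values:
            - linux
          - key: kubernetes.io/arch
            operator: In
            values:
            - amd64
            - arm64   # the entry that was missing on the older manifest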

Thanks all!

vikasmb commented 3 years ago

Thanks for the update @fubar.

sarbajitdutta commented 3 years ago

@jayanthvn @vikasmb I have hit the problem again. When cluster-autoscaler scaled nodes today, the newly created node got stuck in NotReady state forever because aws-node kept restarting with Readiness probe failures.

I am running the latest version of the CNI add-on (build version 1.9.1). I have not yet tried increasing the timeouts of both probes in the aws-node DaemonSet as suggested by others.

Affected aws-node pod logs -

Readiness probe failed: {"level":"info","ts":"2021-10-28T14:04:33.971Z","caller":"/usr/local/go/src/runtime/proc.go:203","msg":"timeout: failed to connect service \":50051\" within 1s"}
{"level":"info","ts":"2021-10-28T13:59:58.590Z","caller":"entrypoint.sh","msg":"Validating env variables ..."}
{"level":"info","ts":"2021-10-28T13:59:58.600Z","caller":"entrypoint.sh","msg":"Install CNI binary.."}
{"level":"info","ts":"2021-10-28T13:59:58.655Z","caller":"entrypoint.sh","msg":"Starting IPAM daemon in the background ... "}
{"level":"info","ts":"2021-10-28T13:59:58.657Z","caller":"entrypoint.sh","msg":"Checking for IPAM connectivity ... "}
I1028 13:59:59.842643      10 request.go:621] Throttling request took 1.031323086s, request: GET:https://172.20.0.1:443/apis/autoscaling/v2beta1?timeout=32s


jabdoa2 commented 2 years ago

Same issue here on Kubernetes 1.20 using kops. Under moderate load (50-70% cpu load) aws-node restarts sporadically. Depending on cluster size there can be a few hundred restarts per day.

tomiszili commented 2 years ago

I have this issue on t3a.medium managed node groups on an EKS 1.21 cluster. I got the first error on CNI v1.7.5-eksbuild.1, then I tried v1.9.3, then tried increasing the timeout seconds as mentioned above; none of these methods fixed the issue.

Then I terminated all of my managed instances, and now only kube-proxy starts.

This is the describe output for aws-node after manually terminating the instances:

Events:
  Type     Reason     Age                     From               Message
  ----     ------     ----                    ----               -------
  Normal   Scheduled  5m18s                   default-scheduler  Successfully assigned kube-system/aws-node-w8dbp to ip-10-103-44-118.eu-central-1.compute.internal
  Normal   Pulling    5m15s                   kubelet            Pulling image "602401143452.dkr.ecr.eu-central-1.amazonaws.com/amazon-k8s-cni-init:v1.7.5-eksbuild.1"
  Normal   Pulled     5m4s                    kubelet            Successfully pulled image "602401143452.dkr.ecr.eu-central-1.amazonaws.com/amazon-k8s-cni-init:v1.7.5-eksbuild.1" in 11.134114629s
  Normal   Created    5m4s                    kubelet            Created container aws-vpc-cni-init
  Normal   Started    5m4s                    kubelet            Started container aws-vpc-cni-init
  Normal   Started    5m                      kubelet            Started container aws-node
  Normal   Pulled     5m                      kubelet            Successfully pulled image "602401143452.dkr.ecr.eu-central-1.amazonaws.com/amazon-k8s-cni:v1.7.5-eksbuild.1" in 3.376454897s
  Normal   Killing    3m34s                   kubelet            Container aws-node failed liveness probe, will be restarted
  Normal   Pulling    3m24s (x2 over 5m4s)    kubelet            Pulling image "602401143452.dkr.ecr.eu-central-1.amazonaws.com/amazon-k8s-cni:v1.7.5-eksbuild.1"
  Warning  Unhealthy  3m24s (x10 over 4m54s)  kubelet            Readiness probe failed:
  Normal   Pulled     3m24s                   kubelet            Successfully pulled image "602401143452.dkr.ecr.eu-central-1.amazonaws.com/amazon-k8s-cni:v1.7.5-eksbuild.1" in 145.075431ms
  Normal   Created    3m23s (x2 over 5m)      kubelet            Created container aws-node
  Warning  Unhealthy  14s (x9 over 3m54s)     kubelet            Liveness probe failed:

So now I have a useless cluster; how can I make it function properly?

jayanthvn commented 2 years ago

@tomiszili - Can you please open a support ticket so we can set up a call to debug your cluster?