Closed rtomadpg closed 10 months ago
@rtomadpg just curious, did you notice the comment with:
For Network Policy issues, please file at https://github.com/aws/aws-network-policy-agent/issues
when you opened this issue? We are trying to improve the experience here with triaging Network Policy agent issues, so I am wondering if you think there is a better way this could have been noticed.
As for this issue, it is the same as https://github.com/aws/aws-network-policy-agent/issues/103. The error log is harmless, and a fix is in progress.
Ouch, so sorry! I checked the new bug flow and indeed that comment is there. Very clearly. I guess I was too eager to file the bug (end of work day here) and I overlooked that part.
@jdn5126 maybe a suggestion: when errors are logged by a container named `aws-eks-nodeagent`, it's not immediately clear that they relate to "Network Policy issues" or `aws-network-policy-agent`. Maybe mentioning `aws-eks-nodeagent` in that comment would reduce wrongly filed issues?
Oh no worries, I was just curious if there was a better setup through GitHub. Good call, I can expand the comment
Hi everyone, sorry for jumping in on a closed thread.
I'm facing the same issue, but without the network policy error mentioned here. I'm trying to upgrade a managed worker group to 1.25, but the aws-node DaemonSet keeps failing in the `aws-eks-nodeagent` container, causing the pod to restart.
Any ideas? The VPC CNI plugin version is v1.15.1-eksbuild.1.
@lsabreu96 the error log from this issue is harmless. If you are seeing the `aws-eks-nodeagent` container crashing, please file a new issue with the logs from the crash, which you can find in `/var/log/aws-routed-eni/network-policy-agent.log` on the affected node.
For anyone reaching this thread because the `aws-eks-nodeagent` container is crashing with `UTC Logger.check error: failed to get caller`: for me, the issue was mixing EKS Kubernetes version 1.24 with `aws-network-policy-agent:v1.0.4-eksbuild.1` and `amazon-k8s-cni:v1.15.1-eksbuild.1` (these versions were automatically provisioned by EKS). Upgrading to Kubernetes version 1.25 fixes the crash loop, as mentioned in the README of this repo ("You'll need a Kubernetes cluster version 1.25+ to run against.").
I'm not commenting to reopen this issue, just to provide information in case anyone still running 1.24 lands here!
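If you want to check which versions your cluster is actually running before upgrading, something like the following works (a sketch assuming the standard add-on layout, with the `aws-node` DaemonSet in the `kube-system` namespace):

```shell
# List each container in the aws-node DaemonSet with its image, which
# shows the amazon-k8s-cni and aws-eks-nodeagent versions in use.
kubectl -n kube-system get daemonset aws-node \
  -o jsonpath='{range .spec.template.spec.containers[*]}{.name}{"\t"}{.image}{"\n"}{end}'

# Confirm the cluster's server version (the network policy agent needs 1.25+).
kubectl version
```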
What happened:
After upgrading VPC-CNI from `v1.14.1-eksbuild.1` to `v1.15.4-eksbuild.1`, all the `aws-eks-nodeagent` containers logged:
And, when I delete a random aws-node pod, I see this:
I believe these errors come from the `uber-go/zap` dependency, see https://github.com/uber-go/zap/blob/5acd569b6a5264d4c7433cbb278a8336d491715c/logger.go#L398. As I am unsure whether this error signals that something is (really) wrong, and it has not been reported in this project yet, I created this bug.
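The mechanism behind that zap message can be reproduced with the standard library alone. A minimal sketch (not zap's actual code, just the underlying behavior): zap resolves the call site of each log entry via `runtime.Caller` with a configured frame-skip, and when the skip count exceeds the real stack depth the lookup fails, which is when zap prints the `Logger.check error: failed to get caller` line to its error output.

```go
package main

import (
	"fmt"
	"runtime"
)

// callerLookupFails reports whether runtime.Caller fails when asked to
// skip the given number of stack frames. zap uses runtime.Caller the same
// way to annotate log entries with their call site; if the configured skip
// is deeper than the actual stack, the lookup fails and zap emits
// "Logger.check error: failed to get caller". The message is cosmetic:
// the log entry itself is still written, only the caller field is missing.
func callerLookupFails(skip int) bool {
	_, _, _, ok := runtime.Caller(skip)
	return !ok
}

func main() {
	fmt.Println(callerLookupFails(0))   // false: the current frame exists
	fmt.Println(callerLookupFails(100)) // true: deeper than the real stack
}
```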
Attach logs
Let me know if needed.
What you expected to happen:
No errors getting logged.
How to reproduce it (as minimally and precisely as possible):
Anything else we need to know?:
Environment:
- `kubectl version`: v1.27.7-eks-4f4795d
- `cat /etc/os-release`: Amazon Linux 2
- `uname -a`: