aws / aws-network-policy-agent

Apache License 2.0

"Check error: failed to get caller" error after VPC CNI plugin version upgrade (V1.18.0) #247

Closed madhavpersistent closed 1 month ago

madhavpersistent commented 3 months ago

Issue Description: After upgrading the VPC CNI plugin from v1.15.0 to v1.18.0 in our EKS cluster (upgraded from version 1.26 to 1.27), we are encountering an issue with one of the init containers. The container fails with the following error: “check error: failed to get caller”. This issue persists despite claims that it was addressed in a recent GitHub pull request.

Previous Interaction: This issue is similar to one we experienced previously during an update from CNI version v1.15.0 to v1.16.0 while upgrading the EKS Cluster from version 1.25 to 1.27. The problem was supposedly resolved according to AWS support referencing GitHub pull request #168.

Current Problem: Despite the resolution mentioned in the GitHub pull request, the error is recurring in the latest upgrade scenario.

Steps to Reproduce

  1. Upgrade EKS cluster from version 1.27 to 1.28.
  2. Upgrade VPC CNI plugin from v1.15.0 to v1.18.0.
  3. Observe the initiation of containers.

Expected Behavior: The init containers should start without any errors post upgrade.

Actual Behavior: One of the init containers fails to start, logging the following error: “check error: failed to get caller.”

Additional Information

  • EKS Cluster Version: 1.28
  • VPC CNI Plugin Version: v1.18.0
  • Error Logs:
    {"level":"info","ts":"2024-04-0914:08:52.66Z","caller":"metrics/metrics.go:23","msg":"Serving metrics on ","port":6160}
    2024-04-09 14:08:52.6406 +0000 UTC Logger.check error: failed to get caller

Questions

  1. Is the VPC CNI plugin version v1.18.0 fully compatible with EKS version 1.28?
  2. Was the issue resolved in PR #168 supposed to cover this scenario?
  3. What could be the reason for this error recurring?

achevuru commented 3 months ago

@madhavpersistent Can you elaborate on what you mean by the container failing with “check error: failed to get caller”? What exactly is failing? Is the CNI pod not moving to the Running state? The log message above shouldn't cause any functional issue; it is more of a false flag that we need to address.

xamroc commented 2 months ago

We get these error logs as well.

@achevuru You are correct that these log messages do not contribute to any functionality issues. However, they are a problem for us when we ship container logs to our observability platform. These error logs are printed so frequently that they cause our observability costs to explode.

We opted to keep this option turned off (not ideal) because of that.

achevuru commented 2 months ago

@xamroc You should ideally see these logs just once during bootup. If you're seeing them frequently, please check whether the aws-eks-nodeagent container is constantly restarting for some reason in your cluster.
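The restart check suggested above can be sketched as a small program that reads the JSON output of a pod listing and reports the restart count of the aws-eks-nodeagent container per pod. This is an illustrative sketch, not an official tool: the `restartCounts` helper is hypothetical, and it assumes the input is a Kubernetes pod list (with the standard `status.containerStatuses[].restartCount` field) such as the one produced by `kubectl get pods -o json`.

```go
package main

import (
	"encoding/json"
	"fmt"
	"io"
	"os"
)

// podList mirrors only the fields of a Kubernetes pod list that we need.
type podList struct {
	Items []struct {
		Metadata struct {
			Name string `json:"name"`
		} `json:"metadata"`
		Status struct {
			ContainerStatuses []struct {
				Name         string `json:"name"`
				RestartCount int    `json:"restartCount"`
			} `json:"containerStatuses"`
		} `json:"status"`
	} `json:"items"`
}

// restartCounts (hypothetical helper) maps pod name to the restart count of
// the named container, for every pod in the list that runs that container.
func restartCounts(data []byte, container string) (map[string]int, error) {
	var pods podList
	if err := json.Unmarshal(data, &pods); err != nil {
		return nil, err
	}
	counts := make(map[string]int)
	for _, p := range pods.Items {
		for _, cs := range p.Status.ContainerStatuses {
			if cs.Name == container {
				counts[p.Metadata.Name] = cs.RestartCount
			}
		}
	}
	return counts, nil
}

func main() {
	data, err := io.ReadAll(os.Stdin)
	if err != nil {
		fmt.Fprintln(os.Stderr, "read error:", err)
		os.Exit(1)
	}
	counts, err := restartCounts(data, "aws-eks-nodeagent")
	if err != nil {
		fmt.Fprintln(os.Stderr, "parse error:", err)
		os.Exit(1)
	}
	for pod, n := range counts {
		fmt.Printf("%s\t%d\n", pod, n)
	}
}
```

It could be fed with something like `kubectl get pods -n kube-system -l k8s-app=aws-node -o json | go run main.go`, assuming the aws-node pods carry that label in your cluster. A non-zero and growing count would point to the restart loop described above.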

xamroc commented 2 months ago

@achevuru It isn't. The aws-node pod is running without restarts. Both containers inside it are running fine as well. They just constantly log Logger.check error: failed to get caller.

I've already worked with AWS Support on this and they can provide those details. They've captured the logs from our nodes as well if that helps.

connorharkness95 commented 1 month ago

We are also experiencing the same issue, steps below. Do we have a fix on the way or a workaround available?

  1. Upgrade EKS cluster from version 1.27 to 1.28.
  2. Upgrade VPC CNI plugin from v1.15.0 to v1.18.0.
  3. Observe the initiation of containers.

Expected Behavior: The init containers should start without any errors post-upgrade.

Actual Behavior: One of the init containers fails to start, logging the following error: “check error: failed to get caller.”

jayanthvn commented 1 month ago

This issue is fixed by this PR: https://github.com/aws/aws-network-policy-agent/pull/254. We are working on the release and should have the released image by this week.

jayanthvn commented 1 month ago

@connorharkness95 - Are you seeing the error with the init container and not with aws-eks-nodeagent?

Siy007 commented 1 month ago

@jayanthvn Is the fix released?

jayanthvn commented 1 month ago

Yes, the fix is released with the latest network policy agent, 1.1.2: https://github.com/aws/amazon-vpc-cni-k8s/releases/tag/v1.18.2