aws / aws-network-policy-agent


`amd64` binary wrongly copied into `arm64` image, causing Pods to fall into `CrashLoopBackOff` state #244

Closed guessi closed 2 months ago

guessi commented 3 months ago

What happened:

Seeing `asm_amd64.s` in the logs of the arm64 image, where it should be `asm_arm64.s`.

Attach logs: n/a

What you expected to happen: Pods on Graviton nodes should be in the Running state, not CrashLoopBackOff.

How to reproduce it (as minimally and precisely as possible):

  1. Start the agent with `--enable-policy-event-logs=true` set.

  2. Observe `CrashLoopBackOff` and confirm the affected Pod runs on a Graviton node:

$ kubectl -n kube-system get pods -l k8s-app=aws-node
kube-system   aws-node-qxld8             1/2     CrashLoopBackOff   6 (4m4s ago) ...

  3. Log into the node and check the image:

# /usr/local/bin/nerdctl -n k8s.io image inspect 602401143452.dkr.ecr.us-east-1.amazonaws.com/amazon/aws-network-policy-agent:v1.0.7-eksbuild.1 | grep 'Architecture'
    "Architecture": "arm64", # <----------- I can see the image is "arm64".

  4. Check the error log for aws-eks-nodeagent:

$ kubectl -n kube-system logs -f aws-node-qxld8 -c aws-eks-nodeagent
{"level":"info","ts":"2024-04-07T08:13:07.999Z","caller":"runtime/asm_amd64.s:1650","msg":"version","GitVersion":"","GitCommit":"","BuildDate":""}
                                                                  ^^^^^^^^^ But here, is it normal to see "amd64"?

  5. Possibly a missing cross-arch build file copy in the Dockerfile or Makefile (see the verification sketch after this list).

  6. Removing `--enable-policy-event-logs=true` (or setting it to false) brings the Pods back to the Running state.
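
To double-check the cross-arch hypothesis from step 5, one option is to pull the published arm64 image and inspect the architecture of the binary it ships. This is only a rough sketch: the in-image binary path (`/controller`) and the local file names are assumptions, and pulling from this ECR repository may require an ECR login first.

# Rough verification sketch -- the binary path /controller is an assumption.
# If needed, log in to ECR first:
#   aws ecr get-login-password --region us-east-1 | docker login --username AWS --password-stdin 602401143452.dkr.ecr.us-east-1.amazonaws.com
IMG=602401143452.dkr.ecr.us-east-1.amazonaws.com/amazon/aws-network-policy-agent:v1.0.7-eksbuild.1
docker pull --platform linux/arm64 "$IMG"
docker create --platform linux/arm64 --name npagent-check "$IMG"
docker cp npagent-check:/controller ./np-agent-binary    # copy the agent binary out (path is an assumption)
file ./np-agent-binary                                    # a correct arm64 build should report "ARM aarch64"
docker rm npagent-check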

Anything else we need to know?:

Environment:

guessi commented 3 months ago

FYI, it was originally showing `asm_arm64.s`, but for some reason it just broke!

You can see the error log from #135 showing `asm_arm64.s` in the log lines:

{"level":"info","ts":"2023-11-09T12:32:26.065Z","caller":"runtime/asm_arm64.s:1197","msg":"version","GitVersion":"","GitCommit":"","BuildDate":""}
achevuru commented 3 months ago

@guessi Are you saying the crash only happens if you set enable-policy-event-logs to true? If the release were incorrectly using an amd64 image on arm nodes, it should always fail and shouldn't be tied to one of the custom env variables.

Are you seeing this behavior with the latest VPC CNI version? Was it working fine with prior releases on your setup?

guessi commented 2 months ago

@achevuru maybe I should provide more details.

TL;DR

It doesn't matter whether the flag is set or not; it's more about where it runs, i.e. the architecture of the node.

Full story

Tested with the following combinations

{"enableNetworkPolicy":"true","nodeAgent":{"enablePolicyEventLogs":"true","enableCloudWatchLogs":"true"}}

Just follow the doc with a Graviton node running, and you should see what I mean.

I believe you can easily reproduce the CrashLoopBackOff loop on arm64 nodes.

The configuration I provided is the minimal setup to reproduce the issue.
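
For reference, that configuration can be applied through the VPC CNI managed add-on; a rough sketch, assuming the CNI is installed as an EKS add-on (cluster name and region below are placeholders):

# Sketch only: cluster name and region are placeholders.
aws eks update-addon \
  --cluster-name my-cluster \
  --region us-east-1 \
  --addon-name vpc-cni \
  --configuration-values '{"enableNetworkPolicy":"true","nodeAgent":{"enablePolicyEventLogs":"true","enableCloudWatchLogs":"true"}}'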

achevuru commented 2 months ago

@guessi Understood, but my question was more about the statements below from you:

- Tested the latest versions of v1.15.x, v1.16.x, v1.17.x, and v1.18.x; all behave the same. When the flag is not set, everything works fine.
- With the flag set, running on arm64 nodes, Pods always get stuck in the CrashLoopBackOff state.

So it appears the NP agent is working fine for you if the enable-policy-event-logs flag is not set, even on Graviton instances. If true, then this shouldn't be tied to an incorrect arch binary being used on arm nodes. The flag you're setting just enables logs and has nothing to do with Network Policy functionality.

Anyway, we will also try it and let you know.

achevuru commented 2 months ago

Synced up internally with @guessi; the above issue is due to missing CloudWatch permissions, as the cluster also had enable-cloudwatch-logs set. The issue resolved itself once the relevant permissions were provided. We will look for better ways to surface the error message to the end user. Right now, the NP agent logs show 403s against CloudWatch APIs.
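
For anyone hitting the same symptom, a rough way to look for those 403s on an affected node (the log path below is the node agent's default location, and is an assumption if logging has been customized):

# Sketch: default NP agent log location on the node; adjust if customized.
grep -iE "AccessDenied|403" /var/log/aws-routed-eni/network-policy-agent.log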

guessi commented 2 months ago

@achevuru Thanks for the update. I could now narrow the issue down to the difference between the two setups below.

The working one:

{"enableNetworkPolicy":"true","nodeAgent":{"enablePolicyEventLogs":"true"}}`

Not working one:

{"enableNetworkPolicy":"true","nodeAgent":{"enablePolicyEventLogs":"true","enableCloudWatchLogs":"true"}}

Digging further into the issue, I found it was an IAM policy setup issue.

After adding the missing IAM policies, everything works as expected.
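
For reference, a minimal sketch of the kind of CloudWatch Logs permissions involved. The action list, policy name, and role name below are assumptions; the authoritative policy is in the official EKS network policy documentation.

# Sketch only: the actions, policy name, and role name are assumptions --
# attach the documented policy to the role actually used by the aws-node Pods.
cat > np-agent-cw-logs.json <<'EOF'
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Action": [
        "logs:DescribeLogGroups",
        "logs:CreateLogGroup",
        "logs:CreateLogStream",
        "logs:PutLogEvents"
      ],
      "Resource": "*"
    }
  ]
}
EOF
aws iam put-role-policy \
  --role-name <role-used-by-aws-node> \
  --policy-name np-agent-cloudwatch-logs \
  --policy-document file://np-agent-cw-logs.json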

Post-incident suggestions

Following the guidance HERE, the IAM policy setup in the doc currently comes "after" the step that enables enableCloudWatchLogs rather than "before" (it should be mentioned before the feature is enabled). It's really hard to identify the issue when no log is emitted.