Closed: guessi closed this issue 2 months ago
FYI, it was originally showing `asm_arm64.s`, but for some reason it just breaks!
You can see the error log from #135 showing `asm_arm64.s` in the log lines:

```
{"level":"info","ts":"2023-11-09T12:32:26.065Z","caller":"runtime/asm_arm64.s:1197","msg":"version","GitVersion":"","GitCommit":"","BuildDate":""}
```
@guessi Are you saying the crash only happens if you set `enable-policy-event-logs` to `true`? If the release is incorrectly using an amd64 image on arm nodes, it should always fail and shouldn't be tied to one of the custom env variables.
Are you seeing this behavior with the latest VPC CNI version? Was it working fine with prior releases on your setup?
@achevuru Maybe I should add more details.
It doesn't matter whether the flag is set or not; it's more about where it runs, i.e. the architecture of the node.
Tested with the same `aws-eks-nodeagent` image build:

- On `x86_64` nodes, Pods could turn to `RUNNING` state with no issue.
- On `arm64` nodes, Pods will always be stuck in `CrashLoopBackoff` state.

Tested with the following combinations:

- EKS platform versions (`eks.15`, `eks.17`). `eks.17` is the latest platform version of Amazon EKS 1.25, and it should meet the minimum requirements stated HERE.
- Addon versions: `v1.15.1-eksbuild.1`, `v1.15.5-eksbuild.1`, `v1.16.4-eksbuild.2`, ...
- Started from `v1.15.1-eksbuild.1` with no flag set, trying to upgrade one minor version at a time: `x86_64` nodes could be spawned successfully, but all `arm64` nodes failed.
- Addon configuration: `{"enableNetworkPolicy":"true","nodeAgent":{"enablePolicyEventLogs":"true","enableCloudWatchLogs":"true"}}`
- Instance types: `t3a` (x86_64) and `t4g` (arm64).
- No `CNINode` defined yet.
- No `SecurityGroupPolicy` defined yet.
- No `NetworkPolicy` defined yet.

Just follow the doc with a Graviton node running and you should see what I mean; I believe you can easily reproduce the `CrashLoopBackoff` loop on `arm64` nodes.
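For reference, a configuration like the one above can be applied to the managed addon roughly as follows. This is only a sketch: `my-cluster` is a placeholder name, and `aws eks update-addon` with `--configuration-values` requires a reasonably recent AWS CLI.

```sh
# "my-cluster" is a placeholder; substitute your own cluster name.
aws eks update-addon \
  --cluster-name my-cluster \
  --addon-name vpc-cni \
  --configuration-values '{"enableNetworkPolicy":"true","nodeAgent":{"enablePolicyEventLogs":"true","enableCloudWatchLogs":"true"}}'
```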
The message I provided was the minimum setup to reproduce the issue.
@guessi Understood, but my question was more about the below statements from you:
- Tested all latest version of v1.15.x, v1.16.x, v1.17.x, v1.18.x, all the same, when there's no flag, everything works fine.
- With flag set, running with arm64 nodes, Pods will always stuck in CrashLoopBackoff state.
So, it appears the NP agent is working fine for you if the `enable-policy-event-logs` flag is not set, even on Graviton instances. If true, then this should not be tied to an incorrect-arch binary being used on arm nodes. The flag you're setting just enables logs and has nothing to do with Network Policy functionality.
Anyways, we will also try it and let you know.
Synced up internally with @guessi; the above issue is due to missing CloudWatch permissions, as the cluster also had `enable-cloudwatch-logs` set. The issue resolved itself once the relevant permissions were provided. We will look for better ways to expose the error message to the end user. Right now, the NP agent logs will show 403s against CloudWatch APIs.
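For anyone hitting the same 403s: when `enable-cloudwatch-logs` is set, the node agent needs permission to write to CloudWatch Logs. A minimal policy sketch along these lines should cover it; the wildcard `Resource` is just for illustration, so tighten the ARN for your own setup:

```json
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Action": [
        "logs:DescribeLogGroups",
        "logs:CreateLogGroup",
        "logs:CreateLogStream",
        "logs:PutLogEvents"
      ],
      "Resource": "*"
    }
  ]
}
```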
@achevuru Thanks for the update. I could now narrow down the issue to the difference between the setups below.
The working one:

```json
{"enableNetworkPolicy":"true","nodeAgent":{"enablePolicyEventLogs":"true"}}
```

The not-working one:

```json
{"enableNetworkPolicy":"true","nodeAgent":{"enablePolicyEventLogs":"true","enableCloudWatchLogs":"true"}}
```
Digging further into the issue, I found it is an IAM policy setup issue.
After adding the missing IAM policies, everything works as expected.
Following the guidance HERE, the IAM policy setup in the doc currently comes "after" the step that enables `enableCloudWatchLogs`, not "before" (it should be mentioned before the flag is enabled). It's really hard to identify the issue when no logs are emitted.
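Adding the missing permissions can be done, for example, by attaching an inline policy to the node role before enabling the flag. The names below are placeholders for illustration, not values from this issue:

```sh
# "my-eks-node-role", the policy name, and the local policy file
# are all placeholders; substitute your own.
aws iam put-role-policy \
  --role-name my-eks-node-role \
  --policy-name node-agent-cloudwatch-logs \
  --policy-document file://cloudwatch-logs-policy.json
```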
What happened:
Seeing `asm_amd64.s` shown in the `arm64` image, which should be `asm_arm64.s`.

Attach logs: n/a

What you expected to happen:
Pods on Graviton nodes should be `RUNNING`, not `CrashLoopBackoff`.
How to reproduce it (as minimally and precisely as possible):

1. Start the agent with `--enable-policy-event-logs=true` set.
2. See `CrashLoopBackoff` and identify that it was running on a Graviton node.
3. Remove `--enable-policy-event-logs=true` from `aws-eks-nodeagent`, or set it to `false`; Pods should then return to `RUNNING` state.

Anything else we need to know?:
Environment:

- Kubernetes version (use `kubectl version`):
- OS (e.g: `cat /etc/os-release`):
- Kernel (e.g. `uname -a`):