aws / aws-network-policy-agent

Apache License 2.0

VPC CNI plugin crashing when enabling cloudwatch logs for network policy logs #141

Closed mahasiva-amazon closed 8 months ago

mahasiva-amazon commented 11 months ago

What happened:

  1. Created a cluster with the VPC CNI plugin and network policy enabled.
  2. Added permissions to the service role to enable CloudWatch logging, as defined here: https://docs.aws.amazon.com/eks/latest/userguide/cni-network-policy.html
  3. Then used the aws eks update-addon CLI command to enable CloudWatch logging:

aws eks update-addon --cluster-name ${EKS_CLUSTER_NAME} --addon-name "vpc-cni" --configuration-values '{"env":{"ENABLE_PREFIX_DELEGATION":"true", "ENABLE_POD_ENI":"true", "POD_SECURITY_GROUP_ENFORCING_MODE":"standard"},"enableNetworkPolicy": "true", "nodeAgent": { "enableCloudWatchLogs": "true", "healthProbeBindAddr": "8163", "metricsBindAddr": "8162"}}'

  4. After running this command, the aws-node daemonset pods start crashing. Further analysis suggests that the aws-node-agent containers in the pods are the ones crashing. The issue persists even after deleting the add-on and installing it again.

Attach logs

Normal Scheduled 52s default-scheduler Successfully assigned kube-system/aws-node-45nmc to ip-XXXX.us-west-2.compute.internal
Normal Pulling 52s kubelet Pulling image "XXXX.dkr.ecr.us-west-2.amazonaws.com/amazon-k8s-cni-init:v1.14.1-eksbuild.1"
Normal Pulled 49s kubelet Successfully pulled image "XXXX.dkr.ecr.us-west-2.amazonaws.com/amazon-k8s-cni-init:v1.14.1-eksbuild.1" in 2.696970025s (2.696982854s including waiting)
Normal Created 49s kubelet Created container aws-vpc-cni-init
Normal Started 49s kubelet Started container aws-vpc-cni-init
Normal Pulling 48s kubelet Pulling image "XXXX.dkr.ecr.us-west-2.amazonaws.com/amazon-k8s-cni:v1.14.1-eksbuild.1"
Normal Pulled 46s kubelet Successfully pulled image "XXXX.dkr.ecr.us-west-2.amazonaws.com/amazon-k8s-cni:v1.14.1-eksbuild.1" in 1.550764534s (1.550796824s including waiting)
Normal Created 46s kubelet Created container aws-node
Normal Started 46s kubelet Started container aws-node
Normal Pulling 46s kubelet Pulling image "602401143452.dkr.ecr.us-west-2.amazonaws.com/amazon/aws-network-policy-agent:v1.0.2-eksbuild.1"
Normal Pulled 33s kubelet Successfully pulled image "XXXX.dkr.ecr.us-west-2.amazonaws.com/amazon/aws-network-policy-agent:v1.0.2-eksbuild.1" in 13.02422571s (13.02424398s including waiting)
Normal Created 33s kubelet Created container aws-eks-nodeagent
Normal Started 33s kubelet Started container aws-eks-nodeagent
Warning Unhealthy 28s kubelet Readiness probe failed: {"level":"info","ts":"2023-11-21T18:27:18.910Z","caller":"/root/sdk/go1.20.4/src/runtime/proc.go:250","msg":"timeout: failed to connect service \":50051\" within 5s"}
Warning Unhealthy 23s kubelet Readiness probe failed: {"level":"info","ts":"2023-11-21T18:27:23.969Z","caller":"/root/sdk/go1.20.4/src/runtime/proc.go:250","msg":"timeout: failed to connect service \":50051\" within 5s"}
Warning Unhealthy 17s kubelet Readiness probe failed: {"level":"info","ts":"2023-11-21T18:27:29.021Z","caller":"/root/sdk/go1.20.4/src/runtime/proc.go:250","msg":"timeout: failed to connect service \":50051\" within 5s"}
Warning Unhealthy 12s kubelet Readiness probe failed: {"level":"info","ts":"2023-11-21T18:27:34.077Z","caller":"/root/sdk/go1.20.4/src/runtime/proc.go:250","msg":"timeout: failed to connect service \":50051\" within 5s"}
Warning Unhealthy 7s kubelet Readiness probe failed: {"level":"info","ts":"2023-11-21T18:27:39.591Z","caller":"/root/sdk/go1.20.4/src/runtime/proc.go:250","msg":"timeout: failed to connect service \":50051\" within 5s"}

What you expected to happen:

  1. The add-on to be updated with correct logging configuration.

How to reproduce it (as minimally and precisely as possible): Refer to the earlier section.

Anything else we need to know?: N/A

Environment:
    • Kubernetes version (use kubectl version): 1.27
    • CNI Version - v1.15.3-eksbuild.1
    • Network Policy Agent Version - v1.01
    • OS (e.g: cat /etc/os-release): Amazon Linux
    • Kernel (e.g. uname -a): Linux ..... 5.10.186-179.751.amzn2.x86_64 #1 SMP Tue Aug 1 20:51:38 UTC 2023 x86_64 x86_64 x86_64 GNU/Linux

ariary commented 11 months ago

I've faced the same issue, and the documentation is not clear on this point. Looking at /var/log/aws-routed-eni/ipamd.log on the node, it seems to be an authorization issue:

{"level":"error","ts":"2023-12-06T13:51:53.616Z","caller":"ipamd/ipamd.go:457","msg":"Failed to call ec2:DescribeNetworkInterfaces for [eni-03****** eni-07********]: WebIdentityErr: failed to retrieve credentials\ncaused by: AccessDenied: Not authorized to perform sts:AssumeRoleWithWebIdentity\n\tstatus code: 403, request id: df464bdf-eb18-4b85-*******"}
{"level":"error","ts":"2023-12-06T13:51:53.727Z","caller":"aws-k8s-agent/main.go:32","msg":"Initialization failure: ipamd init: failed to retrieve attached ENIs info: WebIdentityErr: failed to retrieve credentials\ncaused by: AccessDenied: Not authorized to perform sts:AssumeRoleWithWebIdentity\n\tstatus code: 403, request id: df464bdf-****"}

I resolved it by attaching the AmazonEKS_CNI_Policy permissions to my role.
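
The log check described in this comment can be sketched as a small shell helper. This is only an illustration: the function name find_auth_errors is hypothetical, the default path is the one quoted above, and the search patterns are simply the two markers visible in the error lines (AccessDenied and WebIdentityErr).

```shell
#!/usr/bin/env bash
# Hypothetical helper (not part of the VPC CNI tooling): print the ipamd log
# lines that point to an IAM/STS authorization failure.
# The default path is the one quoted in this thread; pass another path as $1.
find_auth_errors() {
  # -E enables alternation so either marker from the error lines matches.
  grep -E 'AccessDenied|WebIdentityErr' "${1:-/var/log/aws-routed-eni/ipamd.log}"
}
```

Run on the node, find_auth_errors prints any matching lines. Like grep itself, it returns a non-zero status when nothing matches, which suggests the crash has a different cause.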

jaydeokar commented 11 months ago

Hi @ariary, can you give more details on how you ended up with this issue? Which role did you end up adding the permission to (the node role or the CNI add-on role)? The issue above happened because the create/update add-on call did not pass the service account role ARN (--service-account-role-arn) to use for the CNI.

ariary commented 11 months ago

I created a dedicated role with the policy I mentioned above plus the one defined in the documentation (for CloudWatch logs). For this role, I checked that the aws-node service account can assume it (see the trust relationship in the UI). You can then update your add-on by specifying the role ARN (--service-account-role-arn).

Note also that, to get logs, your node agent configuration needs "enablePolicyEventLogs": "true".
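
A minimal sketch of the update described in this comment, assuming a role that already carries both AmazonEKS_CNI_Policy and the CloudWatch log permissions. The cluster name and role ARN defaults are placeholders, the function name update_vpc_cni_addon is hypothetical, and the policy event log key is written as enablePolicyEventLogs, the spelling the maintainers give later in this thread.

```shell
#!/usr/bin/env bash
# Placeholders (assumptions): replace with your own cluster name and role ARN.
EKS_CLUSTER_NAME="${EKS_CLUSTER_NAME:-my-cluster}"
CNI_ROLE_ARN="${CNI_ROLE_ARN:-arn:aws:iam::111122223333:role/my-vpc-cni-role}"

# Add-on configuration: network policy on, CloudWatch logging on, and policy
# event logs on (enablePolicyEventLogs is the key named by the maintainers).
CONFIG='{"enableNetworkPolicy":"true","nodeAgent":{"enableCloudWatchLogs":"true","enablePolicyEventLogs":"true"}}'

# Hypothetical wrapper: update the add-on and pass the role explicitly, so the
# aws-node service account does not fall back to the role inherited from the node.
update_vpc_cni_addon() {
  aws eks update-addon \
    --cluster-name "$EKS_CLUSTER_NAME" \
    --addon-name vpc-cni \
    --service-account-role-arn "$CNI_ROLE_ARN" \
    --configuration-values "$CONFIG"
}
```

Calling update_vpc_cni_addon requires credentials that are allowed to update the add-on. Keeping the configuration in one variable makes it easy to inspect before sending it.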

jaydeokar commented 11 months ago

I have created a specific role with permissions for the policy I mentioned above + the one which is defined in the documentation

So, if I understand this correctly: you created a new role and added the CloudWatch log policy to it for network policy logs. The CNI then complained about not having the right authorization, which is when you added the AmazonEKS_CNI_Policy?

ariary commented 11 months ago

Exactly

jaydeokar commented 11 months ago

Thanks for the details. We recommend adding the CloudWatch log policy to the existing CNI IAM role (which would already have the AmazonEKS_CNI_Policy attached). This is also called out in the prerequisites section of the docs:

https://docs.aws.amazon.com/eks/latest/userguide/cni-network-policy.html#network-policies-troubleshooting

"Add the following permissions as a stanza or separate policy to the IAM role that you are using for the VPC CNI."
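
As a rough sketch of this recommendation, the CloudWatch permissions can be added as an inline policy on the existing CNI role. The role name, policy name, and function name below are hypothetical, and the action list is an assumption about what the linked documentation specifies, so verify it against the docs before use.

```shell
#!/usr/bin/env bash
# Placeholders (assumptions): the existing VPC CNI role and an inline policy name.
CNI_ROLE_NAME="${CNI_ROLE_NAME:-my-vpc-cni-role}"
POLICY_NAME="NetworkPolicyCloudWatchLogs"

# Assumed CloudWatch Logs permissions; check the linked documentation for the
# authoritative list and tighten "Resource" where possible.
POLICY_DOCUMENT='{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Action": [
        "logs:DescribeLogGroups",
        "logs:CreateLogGroup",
        "logs:CreateLogStream",
        "logs:PutLogEvents"
      ],
      "Resource": "*"
    }
  ]
}'

# Hypothetical wrapper: add the permissions to the role that already has
# AmazonEKS_CNI_Policy attached, instead of creating a separate role.
attach_cloudwatch_policy() {
  aws iam put-role-policy \
    --role-name "$CNI_ROLE_NAME" \
    --policy-name "$POLICY_NAME" \
    --policy-document "$POLICY_DOCUMENT"
}
```

Adding the permissions to the existing role avoids the failure mode seen earlier in the thread, where a new role held only the CloudWatch permissions and the CNI lost its EC2 access.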

Let me know if this helps

ariary commented 11 months ago

@jaydeokar indeed! It might still be helpful to specify which role is meant. With the "default" configuration, the add-on shows "Service account role: Inherited from node", which leads people to create a new role with only the policy mentioned.

micolun commented 10 months ago

I am experiencing the same issue: I cannot enable CloudWatch logs. The aws-node-agent goes into a crash loop.

My VPC CNI configuration: {"enableNetworkPolicy":"true","nodeAgent":{"enableCloudWatchLogs":"true"}}
My VPC CNI version: v1.15.0-eksbuild.2
My EKS version: 1.28

I tried assigning IAM permissions directly to the add-on and inheriting them from the Kubernetes instances, with the same result. I used arn:aws:iam::aws:policy/AmazonEKS_CNI_Policy.

This is the only log message I get in the aws-eks-nodeagent container:

{"level":"info","ts":"2024-01-03T17:16:14Z","msg":"version","GitVersion":"","GitCommit":"","BuildDate":""}

When I manually disable CloudWatch by editing the aws-node daemonset and overriding the CloudWatch switch with --enable-cloudwatch-logs=false, it starts working. Here is the generated manifest for the VPC CNI driver: aws-node.yaml.txt

jaydeokar commented 10 months ago

Hi @Mihail-blip, the accept/deny logs should be available in the /aws/eks/<cluster-name>/cluster CloudWatch log group. We don't log anything to stdout for the aws-eks-nodeagent container. Also make sure you have { "nodeAgent": {"enablePolicyEventLogs": "true"} } set so that the agent starts logging the accept/deny logs.
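
One way to check whether anything is arriving in that log group is to list its most recent log streams. This is a sketch only: the cluster name default is a placeholder, the function name list_recent_log_streams is hypothetical, and the log group name follows the pattern given in this comment.

```shell
#!/usr/bin/env bash
# Placeholder (assumption): replace with your own cluster name.
EKS_CLUSTER_NAME="${EKS_CLUSTER_NAME:-my-cluster}"

# Log group pattern as stated in this comment.
LOG_GROUP="/aws/eks/${EKS_CLUSTER_NAME}/cluster"

# Hypothetical wrapper: show the five log streams with the most recent events.
list_recent_log_streams() {
  aws logs describe-log-streams \
    --log-group-name "$LOG_GROUP" \
    --order-by LastEventTime \
    --descending \
    --max-items 5
}
```

If the call fails because the log group does not exist, or no recent streams appear after policy traffic has been generated, the IAM permissions and the enablePolicyEventLogs setting are the first things to recheck.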

jdn5126 commented 8 months ago

There do not seem to be any open items on this issue, so closing as resolved

avgKol commented 8 months ago

@Mihail-blip, you need to include the CloudWatch permissions in your IAM role (https://docs.aws.amazon.com/eks/latest/userguide/cni-iam-role.html#cni-iam-role-create-role) or in the IAM role for the EKS nodes. Additionally, make sure to configure { "nodeAgent": {"enablePolicyEventLogs": "true"} } (https://github.com/aws/aws-network-policy-agent/issues/129).