Getting Failed to watch of *v1alpha1.PolicyEndpoint ended with: an error on the server after upgrading VPC CNI to v1.17.1+ version with aws-network-policy-agent v1.1.0

ArtemProskochylo commented 5 months ago

What happened: After upgrading the vpc-cni plugin to v1.17.1 and v1.18.0, I see a lot of errors from the aws-network-policy-agent container (v1.1.0). The issue occurs even on fresh EKS installations where we are not using Network Policies.

Attach logs:

W0424 08:27:34.397257 1 reflector.go:462] pkg/mod/k8s.io/client-go@v0.29.1/tools/cache/reflector.go:229: watch of *v1alpha1.PolicyEndpoint ended with: an error on the server ("unable to decode an event from the watch stream: context canceled") has prevented the request from succeeding
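When triaging many of these warnings, it can help to pull out which resource kind the watch was for and what the underlying cause was. The following is a minimal sketch, not part of the agent or of client-go: the function name `watch_failure` and the regular expression are assumptions based only on the shape of the log line above.

```python
import re

# Assumed pattern, derived from the reflector warning quoted above.
WATCH_ERROR = re.compile(
    r'watch of (?P<kind>\S+) ended with: an error on the server '
    r'\("(?P<cause>.*?)"\) has prevented the request from succeeding'
)
# Text that wraps the underlying cause in the quoted log line.
DECODE_PREFIX = "unable to decode an event from the watch stream: "


def watch_failure(line):
    """Return (kind, cause) for a reflector watch warning, or None if the line does not match."""
    match = WATCH_ERROR.search(line)
    if match is None:
        return None
    cause = match.group("cause")
    if cause.startswith(DECODE_PREFIX):
        cause = cause[len(DECODE_PREFIX):]
    return match.group("kind"), cause


line = (
    'W0424 08:27:34.397257 1 reflector.go:462] '
    'pkg/mod/k8s.io/client-go@v0.29.1/tools/cache/reflector.go:229: '
    'watch of *v1alpha1.PolicyEndpoint ended with: an error on the server '
    '("unable to decode an event from the watch stream: context canceled") '
    'has prevented the request from succeeding'
)
print(watch_failure(line))
```

Feeding the saved aws-node logs through such a function and counting the causes would show whether the warnings come from one cause or several, which narrows down whether this is a permissions problem or a connectivity problem.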

What you expected to happen: No error messages.

How to reproduce it (as minimally and precisely as possible):

1. Deploy a v1.29 EKS cluster.
2. Deploy VPC CNI Add-on version v1.17.1-eksbuild.1 or v1.18.0-eksbuild.1.
3. Run `kubectl -n kube-system logs aws-node-*`

Anything else we need to know?:

Environment:

Kubernetes version (use kubectl version): Client Version: v1.29.1 Server Version: v1.29.1-eks-b9c9ed7
CNI Version: v1.17.1 and v1.18.0
Network Policy Agent Version: v1.1.0
OS (e.g: cat /etc/os-release): Bottlerocket OS 1.19.2 (aws-k8s-1.29)
Kernel (e.g. uname -a): 6.1.77

achevuru commented 4 months ago

@ArtemProskochylo How did you upgrade the VPC CNI version? It appears that you're missing the required permissions for the aws-node pod. Did you apply the corresponding version specific manifest?

danielap-ma commented 3 months ago

Facing the same issue after upgrading to EKS 1.29 with CNI 1.18.0. @achevuru I upgraded the addon directly from AWS using Terraform. I checked the ClusterRole configuration and it has the permissions you referred to:

    apiGroups:
    - networking.k8s.aws
    resources:
    - policyendpoints
    verbs:
    - get
    - list
    - watch

Seems like a bug.

achevuru commented 3 months ago

@danielap-ma If you're seeing the same error as above - then either the permissions are missing (please check if CNI pods have correct SA in place) or there are connectivity issues with your API Server. I quickly tried it and I don't see any such issue(s) on my end.
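To rule out the "permissions are missing" case, the ClusterRole rules quoted in the comment above can be checked mechanically. The sketch below is illustrative only: `rule_grants` and `missing_verbs` are hypothetical helpers, the rules are written out as the dicts a YAML parser would produce, and the matching is a simplification of Kubernetes RBAC (it handles the `*` wildcard but not subresource patterns or `resourceNames`).

```python
def rule_grants(rule, api_group, resource, verb):
    """True if a single RBAC rule grants `verb` on `resource` in `api_group`."""
    def covers(values, wanted):
        # "*" is treated as matching everything (simplified RBAC semantics).
        return "*" in values or wanted in values

    return (
        covers(rule.get("apiGroups", []), api_group)
        and covers(rule.get("resources", []), resource)
        and covers(rule.get("verbs", []), verb)
    )


def missing_verbs(rules, api_group, resource, verbs):
    """Return the verbs that no rule in `rules` grants for the given resource."""
    return [
        verb for verb in verbs
        if not any(rule_grants(rule, api_group, resource, verb) for rule in rules)
    ]


# Rules as quoted in the thread, expressed as parsed YAML.
rules = [
    {
        "apiGroups": ["networking.k8s.aws"],
        "resources": ["policyendpoints"],
        "verbs": ["get", "list", "watch"],
    },
]

print(missing_verbs(rules, "networking.k8s.aws", "policyendpoints", ["get", "list", "watch"]))
```

An empty list means the role itself grants everything the watch needs. It does not show that the role is bound to the service account the CNI pods actually run under, which is the other half of the check requested above.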

ArtemProskochylo commented 3 months ago

> @ArtemProskochylo How did you upgrade the VPC CNI version? It appears that you're missing the required permissions for the aws-node pod. Did you apply the corresponding version specific manifest?

Hi @achevuru, sorry for the late response. It was also updated through Terraform. In my case, though, only the add-on version was set through Terraform; the ConfigMaps, DaemonSet, and other resources are managed by AWS. I have checked the RBAC rules for vpc-cni v1.17.1 and the required permissions are present there:

    - apiGroups:
      - networking.k8s.aws
      resources:
      - policyendpoints
      verbs:
      - get
      - list
      - watch
    - apiGroups:
      - networking.k8s.aws
      resources:
      - policyendpoints/status
      verbs:
      - get

But I still see the following error in the logs for v1.17.1:

W0509 03:34:41.481449 1 reflector.go:462] pkg/mod/k8s.io/client-go@v0.29.1/tools/cache/reflector.go:229: watch of *v1alpha1.PolicyEndpoint ended with: an error on the server ("unable to decode an event from the watch stream: http2: client connection lost") has prevented the request from succeeding

In another cluster running the updated version v1.18.1, I do not see those errors. I suppose it is a version-specific issue.

I hope the provided info is useful.

Thanks

omfurman-ma commented 2 months ago

> In another cluster running the updated version v1.18.1, I do not see those errors. I suppose it is a version-specific issue.

Hey @achevuru, I'm working with @danielap-ma on this issue. We still see these errors even though the CNI pods have the right SA, as Daniel wrote in the comment above. Is there anything we can do to get rid of these errors?

maiconrocha commented 2 months ago

Hi @omfurman-ma @danielap-ma, can you please ensure you have the eks:addon-cluster-admin ClusterRoleBinding deployed in your cluster? If not, please follow the solution provided at https://repost.aws/questions/QUEAwOTFmCTLG-SzJQOhkx3w/accessdenied-when-create-ebs-csi-driver

aws / aws-network-policy-agent

Getting Failed to watch of *v1alpha1.PolicyEndpoint ended with: an error on the server after upgrading VPC CNI to v1.17.1+ version with aws-network-policy-agent v1.1.0 #257