Closed aballman closed 1 month ago
https://github.com/aws/aws-network-policy-agent/issues/183 seems similar to my issue, but I'm using the release-candidate version that's referenced there and reported as having fixed that particular issue.
@aballman - v1.0.8-rc3 is the latest. We hit a similar issue where the maps got wrongly updated. Can you please try v1.0.8-rc3?
Thanks! I'll give it a shot
Unfortunately this did not resolve my issue; the same problem is present. I've confirmed that I'm on v1.0.8-rc3 on the problem node. I also rolled all nodes in my cluster ~16h ago when the Bottlerocket 1.19.1 fix was released.
Curiously, it seems to be a similar scenario, where the problem pod was on the node for ~90m.
```
NAME READY STATUS RESTARTS AGE IP NODE NOMINATED NODE READINESS GATES
argocd-application-controller-0 1/1 Running 36 (65m ago) 16h 10.146.63.42 ip-10-146-62-155.ec2.internal <none> <none>
argocd-applicationset-controller-7974ff9cf9-vjppv 1/1 Running 0 16h 10.146.63.228 ip-10-146-62-155.ec2.internal <none> <none>
argocd-dex-server-5c6dfff575-wrl7v 1/1 Running 0 16h 10.146.53.41 ip-10-146-53-188.ec2.internal <none> <none>
argocd-notifications-controller-778866f977-9nhdd 1/1 Running 0 16h 10.146.60.229 ip-10-146-62-155.ec2.internal <none> <none>
argocd-redis-5bcdf48d96-x8bqp 1/1 Running 0 16h 10.146.62.162 ip-10-146-62-155.ec2.internal <none> <none>
argocd-redis-ha-haproxy-7f84459cf-pmdfv 1/1 Running 0 16h 10.146.56.174 ip-10-146-57-151.ec2.internal <none> <none>
argocd-redis-ha-haproxy-7f84459cf-tcdsr 1/1 Running 0 19h 10.146.54.58 ip-10-146-55-43.ec2.internal <none> <none>
argocd-redis-ha-haproxy-7f84459cf-xs6dp 1/1 Running 0 16h 10.146.53.99 ip-10-146-53-188.ec2.internal <none> <none>
argocd-redis-ha-server-0 3/3 Running 0 16h 10.146.58.63 ip-10-146-57-151.ec2.internal <none> <none>
argocd-redis-ha-server-1 3/3 Running 0 16h 10.146.63.193 ip-10-146-62-155.ec2.internal <none> <none>
argocd-redis-ha-server-2 3/3 Running 0 16h 10.146.52.251 ip-10-146-55-43.ec2.internal <none> <none>
argocd-repo-server-85ccb7dbdd-mkd4k 1/1 Running 0 16h 10.146.54.42 ip-10-146-53-188.ec2.internal <none> <none>
argocd-repo-server-85ccb7dbdd-rvhlm 1/1 Running 0 86m 10.146.62.175 ip-10-146-62-155.ec2.internal <none> <none>
argocd-server-6d6cd7bc6b-mccvn 1/1 Running 0 16h 10.146.54.27 ip-10-146-53-188.ec2.internal <none> <none>
argocd-server-6d6cd7bc6b-pbbkd 1/1 Running 0 19h 10.146.53.5 ip-10-146-55-43.ec2.internal <none> <none>
```
```
apiVersion: networking.k8s.aws/v1alpha1
kind: PolicyEndpoint
metadata:
  creationTimestamp: "2024-02-02T00:46:35Z"
  generateName: argocd-repo-server-
  generation: 243
  name: argocd-repo-server-sxvj2
  namespace: argocd
  ownerReferences:
  - apiVersion: networking.k8s.io/v1
    blockOwnerDeletion: true
    controller: true
    kind: NetworkPolicy
    name: argocd-repo-server
    uid: a57fcdb4-d425-4aa4-b818-61c9168debbf
  resourceVersion: "150208318"
  uid: df6dadb8-e619-4f72-ba98-82618b9f8256
spec:
  ingress:
  - cidr: 10.146.53.5
    ports:
    - port: 8081
      protocol: TCP
  - cidr: 10.146.54.27
    ports:
    - port: 8081
      protocol: TCP
  - cidr: 10.146.60.229
    ports:
    - port: 8081
      protocol: TCP
  - cidr: 10.146.63.228
    ports:
    - port: 8081
      protocol: TCP
  - cidr: 10.146.63.42
    ports:
    - port: 8081
      protocol: TCP
  podIsolation:
  - Ingress
  podSelector:
    matchLabels:
      app.kubernetes.io/instance: argocd
      app.kubernetes.io/name: argocd-repo-server
  podSelectorEndpoints:
  - hostIP: 10.146.53.188
    name: argocd-repo-server-85ccb7dbdd-mkd4k
    namespace: argocd
    podIP: 10.146.54.42
  - hostIP: 10.146.62.155
    name: argocd-repo-server-85ccb7dbdd-rvhlm
    namespace: argocd
    podIP: 10.146.62.175
  policyRef:
    name: argocd-repo-server
    namespace: argocd
```
ip-10-146-53-188.ec2.internal / aws-node-hntbh
```
bash-4.2# /aws-eks-na-cli ebpf loaded-ebpfdata | grep -A9 "repo-server"
PinPath: /sys/fs/bpf/globals/aws/programs/argocd-repo-server-85ccb7dbdd-argocd_handle_egress
Pod Identifier : argocd-repo-server-85ccb7dbdd-argocd Direction : egress
Prog ID: 108
Associated Maps ->
Map Name: aws_conntrack_map
Map ID: 19
Map Name: egress_map
Map ID: 29
Map Name: policy_events
Map ID: 20
========================================================================================
PinPath: /sys/fs/bpf/globals/aws/programs/argocd-repo-server-85ccb7dbdd-argocd_handle_ingress
Pod Identifier : argocd-repo-server-85ccb7dbdd-argocd Direction : ingress
Prog ID: 107
Associated Maps ->
Map Name: aws_conntrack_map
Map ID: 19
Map Name: ingress_map
Map ID: 28
Map Name: policy_events
Map ID: 20
========================================================================================
bash-4.2# /aws-eks-na-cli ebpf dump-maps 28
Key : IP/Prefixlen - 10.146.53.5/32
-------------------
Value Entry : 0
Protocol - TCP
StartPort - 8081
Endport - 0
-------------------
*******************************
Key : IP/Prefixlen - 10.146.53.188/32
-------------------
Value Entry : 0
Protocol - ANY PROTOCOL
StartPort - 0
Endport - 0
-------------------
*******************************
Key : IP/Prefixlen - 10.146.54.27/32
-------------------
Value Entry : 0
Protocol - TCP
StartPort - 8081
Endport - 0
-------------------
*******************************
Key : IP/Prefixlen - 10.146.60.229/32
-------------------
Value Entry : 0
Protocol - TCP
StartPort - 8081
Endport - 0
-------------------
*******************************
Key : IP/Prefixlen - 10.146.63.42/32
-------------------
Value Entry : 0
Protocol - TCP
StartPort - 8081
Endport - 0
-------------------
*******************************
Key : IP/Prefixlen - 10.146.63.228/32
-------------------
Value Entry : 0
Protocol - TCP
StartPort - 8081
Endport - 0
-------------------
*******************************
Done reading all entries
```
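For reference, the key set of a healthy `ingress_map` is derivable from the PolicyEndpoint: one /32 entry per ingress CIDR, plus (as the dump above shows) an allow-all entry for the node's own host IP. A minimal sketch of that expectation — the helper function is mine, not part of the agent:

```python
def expected_ingress_keys(ingress_cidrs, host_ip):
    """Expected ingress_map keys: each peer pod IP as a /32 host route,
    plus the node's host IP (which the dump shows as an ANY PROTOCOL entry)."""
    return sorted(f"{ip}/32" for ip in ingress_cidrs + [host_ip])

# Values taken from the PolicyEndpoint and the dump above.
keys = expected_ingress_keys(
    ["10.146.53.5", "10.146.54.27", "10.146.60.229",
     "10.146.63.228", "10.146.63.42"],
    "10.146.53.188",  # host IP of ip-10-146-53-188.ec2.internal
)
print(keys)
```

These six keys match the `dump-maps 28` output above, which is what a correctly programmed map looks like.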
ip-10-146-62-155.ec2.internal / aws-node-2756k
```
bash-4.2# /aws-eks-na-cli ebpf loaded-ebpfdata | grep -A9 "repo-server"
PinPath: /sys/fs/bpf/globals/aws/programs/argocd-repo-server-85ccb7dbdd-argocd_handle_ingress
Pod Identifier : argocd-repo-server-85ccb7dbdd-argocd Direction : ingress
Prog ID: 4022
Associated Maps ->
Map Name: policy_events
Map ID: 31
Map Name: aws_conntrack_map
Map ID: 30
Map Name: ingress_map
Map ID: 1125
========================================================================================
--
PinPath: /sys/fs/bpf/globals/aws/programs/argocd-repo-server-85ccb7dbdd-argocd_handle_egress
Pod Identifier : argocd-repo-server-85ccb7dbdd-argocd Direction : egress
Prog ID: 4023
Associated Maps ->
Map Name: aws_conntrack_map
Map ID: 30
Map Name: egress_map
Map ID: 1126
Map Name: policy_events
Map ID: 31
========================================================================================
========================================================================================
bash-4.2# /aws-eks-na-cli ebpf dump-maps 1125
Key : IP/Prefixlen - 10.146.62.155/32
-------------------
Value Entry : 0
Protocol - ANY PROTOCOL
StartPort - 0
Endport - 0
-------------------
*******************************
Done reading all entries
```
```
❯ k images aws-node-2756k -n kube-system
[Summary]: 1 namespaces, 1 pods, 3 containers and 3 different images
+----------------+-------------------------+-----------------------------------------------------------------------------------------+
| Pod | Container | Image |
+----------------+-------------------------+-----------------------------------------------------------------------------------------+
| aws-node-2756k | aws-node | 602401143452.dkr.ecr.us-east-1.amazonaws.com/amazon-k8s-cni:v1.16.2 |
+ +-------------------------+-----------------------------------------------------------------------------------------+
| | aws-eks-nodeagent | 602401143452.dkr.ecr.us-east-1.amazonaws.com/amazon/aws-network-policy-agent:v1.0.8-rc3 |
+ +-------------------------+-----------------------------------------------------------------------------------------+
| | (init) aws-vpc-cni-init | 602401143452.dkr.ecr.us-east-1.amazonaws.com/amazon-k8s-cni-init:v1.16.2 |
+----------------+-------------------------+-----------------------------------------------------------------------------------------+
```
@aballman - Are these existing pods, or did you delete and re-create new pods?

```
podSelectorEndpoints:
- hostIP: 10.146.53.188
  name: argocd-repo-server-85ccb7dbdd-mkd4k
  namespace: argocd
  podIP: 10.146.54.42
- hostIP: 10.146.62.155
  name: argocd-repo-server-85ccb7dbdd-rvhlm
  namespace: argocd
  podIP: 10.146.62.175
```

Can you also email us the network policy agent logs - /var/log/aws-routed-eni/network-policy-agent.log? You can mail them to k8s-awscni-triage@amazon.com
They were pre-existing at the time of the fault. I'm not sure why that pod might be a little younger; the node itself is ~17h old. There is an HPA configured on it, so that could be the reason. I'll send the logs over when the issue comes up again in a few hours.
Sorry, I meant did you re-create the pods post upgrade to v1.0.8-rc3?
I think I had updated the daemonset before Karpenter rolled all my nodes for the Bottlerocket update. I will restart all the pods now just to be explicit about it.
The symptoms are still occurring with the updated rc3 image. I noticed my alerts for this triggered over the weekend, but it resolved before I had a chance to collect logs. I'll follow up again when I can do that.
@aballman - We tried the repro steps and the issue isn't happening; the pods have been running for 3 days. Do you have any pod or node churn in your cluster? Logs would be helpful.
There is pretty significant churn of both pods and nodes in the cluster. It has GitHub Actions runners in the same cluster/node pool, which scale up and down during the day to run jobs, and Karpenter also performs some consolidation.
I'll post logs as soon as I can gather them. Thanks for investigating!
Thanks @aballman. Are you on the K8s Slack? We can get on a call and understand your cluster config. If so, can you please share your Slack handle?
Given the issue that I saw two weekends ago, I'm not sure I can say this is resolved. I can say that I haven't had any more issues since that time period. So if it's not fixed, it's considerably improved.
I'm willing to work under the assumption that it is fixed with 1.0.8-rc3 and can open a new issue referencing this one if it returns.
Thanks @aballman. Please keep us updated. The v1.0.8 release is available - https://github.com/aws/amazon-vpc-cni-k8s/releases/tag/v1.16.3
@jayanthvn This is still an issue for me. It's a lot less frequent, but it still occurs. This most recent one looks like this:
```
❯ kgpo -owide | grep -E "(argocd-repo-server|argocd-server)"
argocd-repo-server-67974b6df-pnpls 1/1 Running 0 127m 10.146.18.74 ip-10-146-17-182.ec2.internal <none> <none>
argocd-repo-server-67974b6df-s4d5c 1/1 Running 0 102m 10.146.27.6 ip-10-146-27-54.ec2.internal <none> <none>
argocd-server-665597f9d8-7pff6 1/1 Running 0 127m 10.146.16.12 ip-10-146-17-182.ec2.internal <none> <none>
argocd-server-665597f9d8-wgr84 1/1 Running 0 116m 10.146.26.137 ip-10-146-27-54.ec2.internal <none> <none>
```
```
bash-4.2# /aws-eks-na-cli ebpf loaded-ebpfdata | grep -A9 "repo-server"
PinPath: /sys/fs/bpf/globals/aws/programs/argocd-repo-server-67974b6df-argocd_handle_ingress
Pod Identifier : argocd-repo-server-67974b6df-argocd Direction : ingress
Prog ID: 302
Associated Maps ->
Map Name: aws_conntrack_map
Map ID: 33
Map Name: ingress_map
Map ID: 88
Map Name: policy_events
Map ID: 34
========================================================================================
--
PinPath: /sys/fs/bpf/globals/aws/programs/argocd-repo-server-67974b6df-argocd_handle_egress
Pod Identifier : argocd-repo-server-67974b6df-argocd Direction : egress
Prog ID: 303
Associated Maps ->
Map Name: aws_conntrack_map
Map ID: 33
Map Name: egress_map
Map ID: 89
Map Name: policy_events
Map ID: 34
========================================================================================
bash-4.2# /aws-eks-na-cli ebpf dump-maps 88 | grep "Key"
Key : IP/Prefixlen - 10.146.16.5/32
Key : IP/Prefixlen - 10.146.16.12/32
Key : IP/Prefixlen - 10.146.16.21/32
Key : IP/Prefixlen - 10.146.16.22/32
Key : IP/Prefixlen - 10.146.16.43/32
Key : IP/Prefixlen - 10.146.16.47/32
Key : IP/Prefixlen - 10.146.16.99/32
Key : IP/Prefixlen - 10.146.16.116/32
Key : IP/Prefixlen - 10.146.16.157/32
Key : IP/Prefixlen - 10.146.16.191/32
Key : IP/Prefixlen - 10.146.17.52/32
Key : IP/Prefixlen - 10.146.17.149/32
Key : IP/Prefixlen - 10.146.17.182/32
Key : IP/Prefixlen - 10.146.18.149/32
Key : IP/Prefixlen - 10.146.18.150/32
Key : IP/Prefixlen - 10.146.18.157/32
Key : IP/Prefixlen - 10.146.18.162/32
Key : IP/Prefixlen - 10.146.18.166/32
Key : IP/Prefixlen - 10.146.18.199/32
Key : IP/Prefixlen - 10.146.19.19/32
Key : IP/Prefixlen - 10.146.19.54/32
Key : IP/Prefixlen - 10.146.19.123/32
Key : IP/Prefixlen - 10.146.19.180/32
Key : IP/Prefixlen - 10.146.19.193/32
Key : IP/Prefixlen - 10.146.22.225/32
Key : IP/Prefixlen - 10.146.23.126/32
Key : IP/Prefixlen - 10.146.24.84/32
Key : IP/Prefixlen - 10.146.24.98/32
Key : IP/Prefixlen - 10.146.24.125/32
Key : IP/Prefixlen - 10.146.24.173/32
Key : IP/Prefixlen - 10.146.26.56/32
Key : IP/Prefixlen - 10.146.26.112/32
Key : IP/Prefixlen - 10.146.26.137/32
Key : IP/Prefixlen - 10.146.26.146/32
Key : IP/Prefixlen - 10.146.26.226/32
Key : IP/Prefixlen - 10.146.27.54/32
Key : IP/Prefixlen - 10.146.27.209/32
Key : IP/Prefixlen - 10.146.29.209/32
Key : IP/Prefixlen - 10.146.30.250/32
```
```
bash-4.2# /aws-eks-na-cli ebpf loaded-ebpfdata | grep -A9 "repo-server"
PinPath: /sys/fs/bpf/globals/aws/programs/argocd-repo-server-67974b6df-argocd_handle_egress
Pod Identifier : argocd-repo-server-67974b6df-argocd Direction : egress
Prog ID: 399
Associated Maps ->
Map Name: policy_events
Map ID: 20
Map Name: aws_conntrack_map
Map ID: 19
Map Name: egress_map
Map ID: 119
========================================================================================
--
PinPath: /sys/fs/bpf/globals/aws/programs/argocd-repo-server-67974b6df-argocd_handle_ingress
Pod Identifier : argocd-repo-server-67974b6df-argocd Direction : ingress
Prog ID: 398
Associated Maps ->
Map Name: aws_conntrack_map
Map ID: 19
Map Name: ingress_map
Map ID: 118
Map Name: policy_events
Map ID: 20
========================================================================================
bash-4.2# /aws-eks-na-cli ebpf dump-maps 118 | grep "Key"
Key : IP/Prefixlen - 10.146.16.5/32
Key : IP/Prefixlen - 10.146.16.21/32
Key : IP/Prefixlen - 10.146.16.22/32
Key : IP/Prefixlen - 10.146.16.43/32
Key : IP/Prefixlen - 10.146.16.47/32
Key : IP/Prefixlen - 10.146.16.116/32
Key : IP/Prefixlen - 10.146.16.157/32
Key : IP/Prefixlen - 10.146.16.191/32
Key : IP/Prefixlen - 10.146.17.149/32
Key : IP/Prefixlen - 10.146.17.182/32
Key : IP/Prefixlen - 10.146.18.149/32
Key : IP/Prefixlen - 10.146.18.150/32
Key : IP/Prefixlen - 10.146.18.157/32
Key : IP/Prefixlen - 10.146.18.162/32
Key : IP/Prefixlen - 10.146.18.166/32
Key : IP/Prefixlen - 10.146.18.199/32
Key : IP/Prefixlen - 10.146.19.19/32
Key : IP/Prefixlen - 10.146.19.54/32
Key : IP/Prefixlen - 10.146.19.123/32
Key : IP/Prefixlen - 10.146.19.193/32
Key : IP/Prefixlen - 10.146.22.225/32
Key : IP/Prefixlen - 10.146.23.126/32
Key : IP/Prefixlen - 10.146.24.84/32
Key : IP/Prefixlen - 10.146.24.98/32
Key : IP/Prefixlen - 10.146.24.125/32
Key : IP/Prefixlen - 10.146.24.173/32
Key : IP/Prefixlen - 10.146.26.56/32
Key : IP/Prefixlen - 10.146.26.112/32
Key : IP/Prefixlen - 10.146.26.146/32
Key : IP/Prefixlen - 10.146.26.226/32
Key : IP/Prefixlen - 10.146.27.54/32
Key : IP/Prefixlen - 10.146.27.209/32
Key : IP/Prefixlen - 10.146.29.209/32
Key : IP/Prefixlen - 10.146.30.250/32
```
`argocd-repo-server-67974b6df-pnpls` has the rules I expected given the network policy, which include access from 10.146.16.12 and 10.146.26.137. `argocd-repo-server-67974b6df-s4d5c` has rules from other network policies, but does not include access from 10.146.16.12 or 10.146.26.137. Those pod IPs are in the PolicyEndpoint, so it seems like the map is being built wrong.
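To make this check mechanical, the dumped keys can be diffed against the PolicyEndpoint spec. A rough sketch, assuming the `Key : IP/Prefixlen - ...` line format from the CLI output above (the helper names and sample data are mine, for illustration only):

```python
import re

def map_keys(dump_output):
    """Parse IP/Prefixlen keys out of `aws-eks-na-cli ebpf dump-maps` output."""
    return set(re.findall(r"Key : IP/Prefixlen - (\S+)", dump_output))

def policy_endpoint_keys(pe):
    """Ingress CIDRs from a PolicyEndpoint spec, normalized to /32 keys."""
    return {f"{rule['cidr']}/32" for rule in pe["spec"]["ingress"]}

# Sample data: two expected peer IPs, one of which is absent from the dump.
pe = {"spec": {"ingress": [{"cidr": "10.146.16.12"}, {"cidr": "10.146.26.137"}]}}
dump = "Key : IP/Prefixlen - 10.146.16.12/32\nKey : IP/Prefixlen - 10.146.27.54/32\n"
missing = policy_endpoint_keys(pe) - map_keys(dump)
print(missing)  # entries the PolicyEndpoint expects but the map lacks
```

In practice, `pe` would come from `kubectl get policyendpoint ... -o json` and `dump` from the CLI on the affected node.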
I've emailed my network-policy-agent.log file over to k8s-awscni-triage@amazon.com.
@aballman - Thanks for checking. Wondering if this is some corner case, since none of the CIDRs in `argocd-repo-server-8nwzb` are in the ingress map... Do you have the logs for `argocd-repo-server-67974b6df-s4d5c`?
Thanks, got the logs. Will get back.
@jayanthvn Any updates on this? I believe we are still hitting the same issue, even after upgrading the VPC CNI to v1.16.4-eksbuild.2 (so the network policy agent is at v1.0.8-eksbuild.1). I must say the frequency of the issue has dropped, but it is not fully resolved.
@DomantasVar - We have identified a fix for this; we're testing the image right now. /cc @achevuru
@jayanthvn Are there any updates on the progress of this issue? Since this is blocking an important production migration for us, we'd like to know whether it's feasible to wait for a resolution or whether we need to find an alternative migration path.
Any update on this @jayanthvn :) ?
Sorry for the delay, we ran into a few corner cases and had to rework a few things. We will be running our regression suite, and if things look green we should have the RC image by next week. Thanks for waiting!
Hello @jayanthvn, any new updates on the progress towards resolving this?
The issue is resolved with network policy agent version v1.1.2 - https://github.com/aws/amazon-vpc-cni-k8s/releases/tag/v1.18.2
What happened:
I'm using ArgoCD (not super relevant to the issue) with CNI-enforced network policies. The problem I'm experiencing is that after some time the network policies seem to break, and one of the Argo components can't talk to another one that is critical for Argo to keep argo-ing.
Pods

```
NAME READY STATUS RESTARTS AGE IP NODE NOMINATED NODE READINESS GATES
argocd-application-controller-0 1/1 Running 16 (106m ago) 23h 10.146.53.40 ip-10-146-52-181.ec2.internal
```

Services
```
NAME TYPE CLUSTER-IP EXTERNAL-IP PORT(S) AGE
argocd-application-controller-metrics ClusterIP 172.20.15.149
```

Endpoints
```
NAME ENDPOINTS AGE
argocd-application-controller-metrics 10.146.53.40:8082 161d
argocd-applicationset-controller 10.146.54.253:7000 161d
argocd-dex-server 10.146.52.212:5557,10.146.52.212:5556 161d
argocd-notifications-controller-metrics 10.146.54.31:9001 145d
argocd-redis 10.146.56.115:6379 161d
argocd-redis-ha 10.146.53.217:26379,10.146.58.69:26379,10.146.61.205:26379 + 3 more... 161d
argocd-redis-ha-announce-0 10.146.53.217:26379,10.146.53.217:6379 161d
argocd-redis-ha-announce-1 10.146.61.205:26379,10.146.61.205:6379 161d
argocd-redis-ha-announce-2 10.146.58.69:26379,10.146.58.69:6379 161d
argocd-redis-ha-haproxy 10.146.55.24:6379,10.146.58.87:6379,10.146.60.227:6379 + 3 more... 161d
argocd-repo-server 10.146.54.47:8081,10.146.60.213:8081 161d
argocd-server 10.146.53.126:8080,10.146.60.14:8080,10.146.53.126:8080 + 1 more... 161d
```

Network Policy
```
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: argocd-repo-server
  namespace: argocd
spec:
  ingress:
  - from:
    - podSelector:
        matchLabels:
          app.kubernetes.io/instance: argocd
          app.kubernetes.io/name: argocd-server
    - podSelector:
        matchLabels:
          app.kubernetes.io/instance: argocd
          app.kubernetes.io/name: argocd-application-controller
    - podSelector:
        matchLabels:
          app.kubernetes.io/instance: argocd
          app.kubernetes.io/name: argocd-notifications-controller
    - podSelector:
        matchLabels:
          app.kubernetes.io/instance: argocd
          app.kubernetes.io/name: argocd-applicationset-controller
    ports:
    - port: repo-server
      protocol: TCP
  podSelector:
    matchLabels:
      app.kubernetes.io/instance: argocd
      app.kubernetes.io/name: argocd-repo-server
  policyTypes:
  - Ingress
```

Policy Endpoint
```
apiVersion: networking.k8s.aws/v1alpha1
kind: PolicyEndpoint
metadata:
  creationTimestamp: "2024-02-02T00:46:35Z"
  generateName: argocd-repo-server-
  generation: 141
  name: argocd-repo-server-sxvj2
  namespace: argocd
  ownerReferences:
  - apiVersion: networking.k8s.io/v1
    blockOwnerDeletion: true
    controller: true
    kind: NetworkPolicy
    name: argocd-repo-server
    uid: a57fcdb4-d425-4aa4-b818-61c9168debbf
  resourceVersion: "149304150"
  uid: df6dadb8-e619-4f72-ba98-82618b9f8256
spec:
  ingress:
  - cidr: 10.146.54.253
    ports:
    - port: 8081
      protocol: TCP
  - cidr: 10.146.53.126
    ports:
    - port: 8081
      protocol: TCP
  - cidr: 10.146.54.31
    ports:
    - port: 8081
      protocol: TCP
  - cidr: 10.146.53.40
    ports:
    - port: 8081
      protocol: TCP
  - cidr: 10.146.60.14
    ports:
    - port: 8081
      protocol: TCP
  podIsolation:
  - Ingress
  podSelector:
    matchLabels:
      app.kubernetes.io/instance: argocd
      app.kubernetes.io/name: argocd-repo-server
  podSelectorEndpoints:
  - hostIP: 10.146.60.223
    name: argocd-repo-server-85ccb7dbdd-8txcw
    namespace: argocd
    podIP: 10.146.60.213
  - hostIP: 10.146.52.181
    name: argocd-repo-server-85ccb7dbdd-cssn8
    namespace: argocd
    podIP: 10.146.54.47
  policyRef:
    name: argocd-repo-server
    namespace: argocd
```

The destination pods are the two listed under `podSelectorEndpoints` above.
Using `/aws-eks-na-cli ebpf loaded-ebpfdata`, I found the eBPF map corresponding to the pod on node `ip-10-146-60-223.ec2.internal`.
Here's the ebpf map dump from map `57` (good)
```
bash-4.2# /aws-eks-na-cli ebpf dump-maps 57
Key : IP/Prefixlen - 10.146.53.40/32
-------------------
Value Entry : 0
Protocol - TCP
StartPort - 8081
Endport - 0
-------------------
*******************************
Key : IP/Prefixlen - 10.146.53.126/32
-------------------
Value Entry : 0
Protocol - TCP
StartPort - 8081
Endport - 0
-------------------
*******************************
Key : IP/Prefixlen - 10.146.54.31/32
-------------------
Value Entry : 0
Protocol - TCP
StartPort - 8081
Endport - 0
-------------------
*******************************
Key : IP/Prefixlen - 10.146.54.253/32
-------------------
Value Entry : 0
Protocol - TCP
StartPort - 8081
Endport - 0
-------------------
*******************************
Key : IP/Prefixlen - 10.146.60.14/32
-------------------
Value Entry : 0
Protocol - TCP
StartPort - 8081
Endport - 0
-------------------
*******************************
Key : IP/Prefixlen - 10.146.60.223/32
-------------------
Value Entry : 0
Protocol - ANY PROTOCOL
StartPort - 0
Endport - 0
-------------------
*******************************
Done reading all entries
```

Doing the same for the other node `ip-10-146-52-181.ec2.internal` (bad):
```
bash-4.2# /aws-eks-na-cli ebpf loaded-ebpfdata | grep -A9 "repo-server"
PinPath: /sys/fs/bpf/globals/aws/programs/argocd-repo-server-85ccb7dbdd-argocd_handle_ingress
Pod Identifier : argocd-repo-server-85ccb7dbdd-argocd Direction : ingress
Prog ID: 14411
Associated Maps ->
Map Name: aws_conntrack_map
Map ID: 9
Map Name: ingress_map
Map ID: 4214
Map Name: policy_events
Map ID: 10
========================================================================================
--
PinPath: /sys/fs/bpf/globals/aws/programs/argocd-repo-server-85ccb7dbdd-argocd_handle_egress
Pod Identifier : argocd-repo-server-85ccb7dbdd-argocd Direction : egress
Prog ID: 14412
Associated Maps ->
Map Name: policy_events
Map ID: 10
Map Name: aws_conntrack_map
Map ID: 9
Map Name: egress_map
Map ID: 4215
========================================================================================
```

One of the two pods seems to have an improperly built eBPF map relative to the PolicyEndpoint. Here's a snippet of the most recent logs I could find referencing map `4214`.
I am able to resolve this issue if I restart the aws-node pod on the problem node. The timing of this is a bit odd: if I remove all the network policies and recreate them, it takes several hours for the issue to manifest. However, the problem pod here at the time of investigation was only ~90m old.
Attach logs: Log snippet attached, will provide more if requested.
What you expected to happen: Expected the eBPF map to match the rules from the PolicyEndpoint for all destination pods.
How to reproduce it (as minimally and precisely as possible):
Anything else we need to know?:
Environment:
- Kubernetes version (use `kubectl version`): v1.28.5-eks-5e0fdde
- OS (e.g: `cat /etc/os-release`): Bottlerocket 1.9.0
- Kernel (e.g. `uname -a`): 6.1.72