aws / aws-network-policy-agent

Pod eBPF Map Differs from Policy Endpoint #206

Closed aballman closed 1 month ago

aballman commented 5 months ago

What happened:

I'm using ArgoCD (not especially relevant to the issue) with CNI-enforced network policies. After some time, the network policies seem to break, and one of the Argo components can no longer talk to another one that is critical for Argo to keep argo-ing.

Pods

```
NAME                                                READY   STATUS    RESTARTS        AGE   IP              NODE                            NOMINATED NODE   READINESS GATES
argocd-application-controller-0                     1/1     Running   16 (106m ago)   23h   10.146.53.40    ip-10-146-52-181.ec2.internal   <none>           <none>
argocd-applicationset-controller-7974ff9cf9-lvzsj   1/1     Running   0               20d   10.146.54.253   ip-10-146-52-181.ec2.internal   <none>           <none>
argocd-dex-server-5c6dfff575-mhvzq                  1/1     Running   0               20d   10.146.52.212   ip-10-146-52-181.ec2.internal   <none>           <none>
argocd-notifications-controller-778866f977-sv7vh    1/1     Running   0               23h   10.146.54.31    ip-10-146-52-181.ec2.internal   <none>           <none>
argocd-redis-5bcdf48d96-7f68c                       1/1     Running   0               23h   10.146.56.115   ip-10-146-59-179.ec2.internal   <none>           <none>
argocd-redis-ha-haproxy-7f84459cf-8mmkv             1/1     Running   0               33d   10.146.55.24    ip-10-146-52-181.ec2.internal   <none>           <none>
argocd-redis-ha-haproxy-7f84459cf-jd2fj             1/1     Running   0               23h   10.146.60.227   ip-10-146-60-223.ec2.internal   <none>           <none>
argocd-redis-ha-haproxy-7f84459cf-mbkdl             1/1     Running   0               9d    10.146.58.87    ip-10-146-56-47.ec2.internal    <none>           <none>
argocd-redis-ha-server-0                            3/3     Running   0               33d   10.146.53.217   ip-10-146-52-181.ec2.internal   <none>           <none>
argocd-redis-ha-server-1                            3/3     Running   0               23h   10.146.61.205   ip-10-146-61-114.ec2.internal   <none>           <none>
argocd-redis-ha-server-2                            3/3     Running   0               9d    10.146.58.69    ip-10-146-56-47.ec2.internal    <none>           <none>
argocd-repo-server-85ccb7dbdd-8txcw                 1/1     Running   0               23h   10.146.60.213   ip-10-146-60-223.ec2.internal   <none>           <none>
argocd-repo-server-85ccb7dbdd-cssn8                 1/1     Running   0               85m   10.146.54.47    ip-10-146-52-181.ec2.internal   <none>           <none>
argocd-server-6d6cd7bc6b-mb8vl                      1/1     Running   0               25h   10.146.53.126   ip-10-146-52-181.ec2.internal   <none>           <none>
argocd-server-6d6cd7bc6b-xq7tm                      1/1     Running   0               23h   10.146.60.14    ip-10-146-60-223.ec2.internal   <none>           <none>
```

Services

```
NAME                                      TYPE        CLUSTER-IP       EXTERNAL-IP   PORT(S)              AGE
argocd-application-controller-metrics     ClusterIP   172.20.15.149    <none>        8082/TCP             161d
argocd-applicationset-controller          ClusterIP   172.20.49.132    <none>        7000/TCP             161d
argocd-dex-server                         ClusterIP   172.20.220.123   <none>        5556/TCP,5557/TCP    161d
argocd-notifications-controller-metrics   ClusterIP   172.20.176.139   <none>        9001/TCP             145d
argocd-redis                              ClusterIP   172.20.30.226    <none>        6379/TCP             161d
argocd-redis-ha                           ClusterIP   None             <none>        6379/TCP,26379/TCP   161d
argocd-redis-ha-announce-0                ClusterIP   172.20.101.17    <none>        6379/TCP,26379/TCP   161d
argocd-redis-ha-announce-1                ClusterIP   172.20.64.83     <none>        6379/TCP,26379/TCP   161d
argocd-redis-ha-announce-2                ClusterIP   172.20.114.228   <none>        6379/TCP,26379/TCP   161d
argocd-redis-ha-haproxy                   ClusterIP   172.20.187.207   <none>        6379/TCP,9101/TCP    161d
argocd-repo-server                        ClusterIP   172.20.33.160    <none>        8081/TCP             161d
argocd-server                             ClusterIP   172.20.203.179   <none>        80/TCP,443/TCP       161d
```

Endpoints

```
NAME                                      ENDPOINTS                                                                 AGE
argocd-application-controller-metrics     10.146.53.40:8082                                                         161d
argocd-applicationset-controller          10.146.54.253:7000                                                        161d
argocd-dex-server                         10.146.52.212:5557,10.146.52.212:5556                                     161d
argocd-notifications-controller-metrics   10.146.54.31:9001                                                         145d
argocd-redis                              10.146.56.115:6379                                                        161d
argocd-redis-ha                           10.146.53.217:26379,10.146.58.69:26379,10.146.61.205:26379 + 3 more...    161d
argocd-redis-ha-announce-0                10.146.53.217:26379,10.146.53.217:6379                                    161d
argocd-redis-ha-announce-1                10.146.61.205:26379,10.146.61.205:6379                                    161d
argocd-redis-ha-announce-2                10.146.58.69:26379,10.146.58.69:6379                                      161d
argocd-redis-ha-haproxy                   10.146.55.24:6379,10.146.58.87:6379,10.146.60.227:6379 + 3 more...        161d
argocd-repo-server                        10.146.54.47:8081,10.146.60.213:8081                                      161d
argocd-server                             10.146.53.126:8080,10.146.60.14:8080,10.146.53.126:8080 + 1 more...       161d
```

Network Policy

```
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: argocd-repo-server
  namespace: argocd
spec:
  ingress:
  - from:
    - podSelector:
        matchLabels:
          app.kubernetes.io/instance: argocd
          app.kubernetes.io/name: argocd-server
    - podSelector:
        matchLabels:
          app.kubernetes.io/instance: argocd
          app.kubernetes.io/name: argocd-application-controller
    - podSelector:
        matchLabels:
          app.kubernetes.io/instance: argocd
          app.kubernetes.io/name: argocd-notifications-controller
    - podSelector:
        matchLabels:
          app.kubernetes.io/instance: argocd
          app.kubernetes.io/name: argocd-applicationset-controller
    ports:
    - port: repo-server
      protocol: TCP
  podSelector:
    matchLabels:
      app.kubernetes.io/instance: argocd
      app.kubernetes.io/name: argocd-repo-server
  policyTypes:
  - Ingress
```

Policy Endpoint

```
apiVersion: networking.k8s.aws/v1alpha1
kind: PolicyEndpoint
metadata:
  creationTimestamp: "2024-02-02T00:46:35Z"
  generateName: argocd-repo-server-
  generation: 141
  name: argocd-repo-server-sxvj2
  namespace: argocd
  ownerReferences:
  - apiVersion: networking.k8s.io/v1
    blockOwnerDeletion: true
    controller: true
    kind: NetworkPolicy
    name: argocd-repo-server
    uid: a57fcdb4-d425-4aa4-b818-61c9168debbf
  resourceVersion: "149304150"
  uid: df6dadb8-e619-4f72-ba98-82618b9f8256
spec:
  ingress:
  - cidr: 10.146.54.253
    ports:
    - port: 8081
      protocol: TCP
  - cidr: 10.146.53.126
    ports:
    - port: 8081
      protocol: TCP
  - cidr: 10.146.54.31
    ports:
    - port: 8081
      protocol: TCP
  - cidr: 10.146.53.40
    ports:
    - port: 8081
      protocol: TCP
  - cidr: 10.146.60.14
    ports:
    - port: 8081
      protocol: TCP
  podIsolation:
  - Ingress
  podSelector:
    matchLabels:
      app.kubernetes.io/instance: argocd
      app.kubernetes.io/name: argocd-repo-server
  podSelectorEndpoints:
  - hostIP: 10.146.60.223
    name: argocd-repo-server-85ccb7dbdd-8txcw
    namespace: argocd
    podIP: 10.146.60.213
  - hostIP: 10.146.52.181
    name: argocd-repo-server-85ccb7dbdd-cssn8
    namespace: argocd
    podIP: 10.146.54.47
  policyRef:
    name: argocd-repo-server
    namespace: argocd
```

The destination pods are

argocd-repo-server-85ccb7dbdd-8txcw                 1/1     Running     0               23h     10.146.60.213   ip-10-146-60-223.ec2.internal   <none>           <none>
argocd-repo-server-85ccb7dbdd-cssn8                 1/1     Running     0               85m     10.146.54.47    ip-10-146-52-181.ec2.internal   <none>           <none>

Using `/aws-eks-na-cli ebpf loaded-ebpfdata`, I found the eBPF map corresponding to the pod on node ip-10-146-60-223.ec2.internal:

bash-4.2# /aws-eks-na-cli ebpf loaded-ebpfdata | grep -A9 "repo-server"
PinPath:  /sys/fs/bpf/globals/aws/programs/argocd-repo-server-85ccb7dbdd-argocd_handle_ingress
Pod Identifier : argocd-repo-server-85ccb7dbdd-argocd  Direction : ingress
Prog ID:  211
Associated Maps ->
Map Name:  aws_conntrack_map
Map ID:  17
Map Name:  ingress_map
Map ID:  57
Map Name:  policy_events
Map ID:  18
========================================================================================
--
PinPath:  /sys/fs/bpf/globals/aws/programs/argocd-repo-server-85ccb7dbdd-argocd_handle_egress
Pod Identifier : argocd-repo-server-85ccb7dbdd-argocd  Direction : egress
Prog ID:  212
Associated Maps ->
Map Name:  aws_conntrack_map
Map ID:  17
Map Name:  egress_map
Map ID:  58
Map Name:  policy_events
Map ID:  18
========================================================================================
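
When several nodes need checking, the map IDs can be pulled out of this listing programmatically instead of by eye. The following is a minimal sketch that assumes the text layout shown above (`Pod Identifier : ... Direction : ...` followed by `Map Name:` / `Map ID:` pairs); the function name `parse_loaded_ebpfdata` is hypothetical and not part of the CLI.

```python
import re

# Sample copied from the ingress listing above (node ip-10-146-60-223).
SAMPLE = """\
PinPath:  /sys/fs/bpf/globals/aws/programs/argocd-repo-server-85ccb7dbdd-argocd_handle_ingress
Pod Identifier : argocd-repo-server-85ccb7dbdd-argocd  Direction : ingress
Prog ID:  211
Associated Maps ->
Map Name:  aws_conntrack_map
Map ID:  17
Map Name:  ingress_map
Map ID:  57
Map Name:  policy_events
Map ID:  18
"""


def parse_loaded_ebpfdata(text):
    """Return {(pod_identifier, direction): {map_name: map_id}}.

    Assumes each "Map Name:" line is followed by its "Map ID:" line,
    as in the output pasted in this issue.
    """
    programs = {}
    current = None       # map dict of the program being read
    pending_name = None  # last "Map Name:" seen, waiting for its ID
    for raw in text.splitlines():
        line = raw.strip()
        m = re.match(r"Pod Identifier\s*:\s*(\S+)\s+Direction\s*:\s*(\S+)", line)
        if m:
            current = programs.setdefault((m.group(1), m.group(2)), {})
            pending_name = None
            continue
        m = re.match(r"Map Name:\s*(\S+)", line)
        if m and current is not None:
            pending_name = m.group(1)
            continue
        m = re.match(r"Map ID:\s*(\d+)", line)
        if m and current is not None and pending_name is not None:
            current[pending_name] = int(m.group(1))
            pending_name = None
    return programs


if __name__ == "__main__":
    for key, maps in parse_loaded_ebpfdata(SAMPLE).items():
        print(key, maps)
```

The `ingress_map` ID returned for a pod identifier is the value to pass to `dump-maps`.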
Here's the ebpf map dump from map `57` (good)

```
bash-4.2# /aws-eks-na-cli ebpf dump-maps 57
Key : IP/Prefixlen - 10.146.53.40/32
-------------------
Value Entry :  0
Protocol -  TCP
StartPort -  8081
Endport -  0
-------------------
*******************************
Key : IP/Prefixlen - 10.146.53.126/32
-------------------
Value Entry :  0
Protocol -  TCP
StartPort -  8081
Endport -  0
-------------------
*******************************
Key : IP/Prefixlen - 10.146.54.31/32
-------------------
Value Entry :  0
Protocol -  TCP
StartPort -  8081
Endport -  0
-------------------
*******************************
Key : IP/Prefixlen - 10.146.54.253/32
-------------------
Value Entry :  0
Protocol -  TCP
StartPort -  8081
Endport -  0
-------------------
*******************************
Key : IP/Prefixlen - 10.146.60.14/32
-------------------
Value Entry :  0
Protocol -  TCP
StartPort -  8081
Endport -  0
-------------------
*******************************
Key : IP/Prefixlen - 10.146.60.223/32
-------------------
Value Entry :  0
Protocol -  ANY PROTOCOL
StartPort -  0
Endport -  0
-------------------
*******************************
Done reading all entries
```

Doing the same for the other node `ip-10-146-52-181.ec2.internal` (bad)

```
bash-4.2# /aws-eks-na-cli ebpf loaded-ebpfdata | grep -A9 "repo-server"
PinPath:  /sys/fs/bpf/globals/aws/programs/argocd-repo-server-85ccb7dbdd-argocd_handle_ingress
Pod Identifier : argocd-repo-server-85ccb7dbdd-argocd  Direction : ingress
Prog ID:  14411
Associated Maps ->
Map Name:  aws_conntrack_map
Map ID:  9
Map Name:  ingress_map
Map ID:  4214
Map Name:  policy_events
Map ID:  10
========================================================================================
--
PinPath:  /sys/fs/bpf/globals/aws/programs/argocd-repo-server-85ccb7dbdd-argocd_handle_egress
Pod Identifier : argocd-repo-server-85ccb7dbdd-argocd  Direction : egress
Prog ID:  14412
Associated Maps ->
Map Name:  policy_events
Map ID:  10
Map Name:  aws_conntrack_map
Map ID:  9
Map Name:  egress_map
Map ID:  4215
========================================================================================
```
bash-4.2# /aws-eks-na-cli ebpf dump-maps 4214
Key : IP/Prefixlen - 10.146.52.181/32
-------------------
Value Entry :  0
Protocol -  ANY PROTOCOL
StartPort -  0
Endport -  0
-------------------
*******************************
Done reading all entries

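One of the two pods seems to have an improperly built eBPF map relative to the policy endpoint: map 4214 on ip-10-146-52-181.ec2.internal contains only the node's own IP (10.146.52.181/32) and none of the five ingress CIDRs. Here's a snippet of the most recent logs I could find referencing map 4214: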

{"level":"info","ts":"2024-02-07T22:00:08.024Z","logger":"ebpf-client","caller":"controllers/policyendpoints_controller.go:436","msg":"ID of map to update: ","ID: ":4214}
{"level":"info","ts":"2024-02-07T22:00:08.024Z","logger":"ebpf-client","caller":"controllers/policyendpoints_controller.go:278","msg":"Pod has an Egress hook attached. Update the corresponding map","progFD: ":45,"mapName: ":"egress_map"}
{"level":"info","ts":"2024-02-07T22:00:08.024Z","logger":"ebpf-client","caller":"ebpf/bpf_client.go:707","msg":"L4 values: ","protocol: ":254,"startPort: ":0,"endPort: ":0}
{"level":"info","ts":"2024-02-07T22:00:08.024Z","logger":"ebpf-client","caller":"ebpf/bpf_client.go:707","msg":"Current L4 entry count for catch all entry: ","count: ":0}
{"level":"info","ts":"2024-02-07T22:00:08.024Z","logger":"ebpf-client","caller":"ebpf/bpf_client.go:707","msg":"Total L4 entry count for catch all entry: ","count: ":0}
{"level":"info","ts":"2024-02-07T22:00:08.024Z","logger":"ebpf-client","caller":"ebpf/bpf_client.go:707","msg":"L4 values: ","protocol: ":254,"startPort: ":0,"endPort: ":0}
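
Since the agent log is one JSON object per line, the entries for a given map can be filtered with a few lines of Python. This is a rough sketch based only on the lines pasted above; note that the field keys carry a trailing colon and space (for example `"ID: "`), which is easy to miss. The helper name `records_for_map` is hypothetical.

```python
import json

# Two lines copied from the snippet above.
SAMPLE_LOG = '''\
{"level":"info","ts":"2024-02-07T22:00:08.024Z","logger":"ebpf-client","caller":"controllers/policyendpoints_controller.go:436","msg":"ID of map to update: ","ID: ":4214}
{"level":"info","ts":"2024-02-07T22:00:08.024Z","logger":"ebpf-client","caller":"ebpf/bpf_client.go:707","msg":"L4 values: ","protocol: ":254,"startPort: ":0,"endPort: ":0}
'''


def records_for_map(log_text, map_id):
    """Yield parsed log records whose "ID: " field equals map_id.

    Lines that are not valid JSON objects are skipped silently.
    """
    for line in log_text.splitlines():
        line = line.strip()
        if not line.startswith("{"):
            continue
        try:
            record = json.loads(line)
        except json.JSONDecodeError:
            continue
        # The key really is "ID: " (with colon and space) in these logs.
        if record.get("ID: ") == map_id:
            yield record


if __name__ == "__main__":
    for rec in records_for_map(SAMPLE_LOG, 4214):
        print(rec["ts"], rec["caller"], rec["msg"])
```

Only the "ID of map to update" lines carry the map ID, so the surrounding L4 lines have to be correlated separately, for example by timestamp.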

I am able to resolve this issue if I restart the aws-node pod on the problem node. The timing on this is a bit odd. If I remove all the network policies and recreate, it takes several hours for this issue to manifest. However, the problem pod here at the time of investigation was only ~90m old.
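
For reference, the restart workaround can be scripted. The sketch below shells out to `kubectl` and assumes the aws-node pods live in `kube-system` and carry the `k8s-app=aws-node` label, which should be verified on the cluster first; the function names and the pod name used in the comments are hypothetical.

```python
import subprocess


def aws_node_pod_lookup_cmd(node_name):
    """Build the kubectl command listing aws-node pods on one node.

    Assumption: the DaemonSet pods are labelled k8s-app=aws-node.
    """
    return [
        "kubectl", "get", "pods",
        "-n", "kube-system",
        "-l", "k8s-app=aws-node",
        "--field-selector", f"spec.nodeName={node_name}",
        "-o", "name",
    ]


def restart_aws_node(node_name, run=subprocess.run):
    """Delete the aws-node pod(s) on node_name so the DaemonSet recreates them.

    `run` is injectable so the function can be exercised without a cluster.
    Returns the pod references that were deleted (e.g. "pod/aws-node-xxxxx").
    """
    lookup = run(
        aws_node_pod_lookup_cmd(node_name),
        check=True, capture_output=True, text=True,
    )
    pods = lookup.stdout.split()
    for pod in pods:
        run(["kubectl", "delete", "-n", "kube-system", pod], check=True)
    return pods


if __name__ == "__main__":
    print(restart_aws_node("ip-10-146-52-181.ec2.internal"))
```

This only clears the symptom on one node; it does not address why the map went stale.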


Attach logs: Log snippet attached; will provide more if requested.

What you expected to happen: Expected eBPF map to match rules from Policy Endpoint for all destination pods
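
That expectation can be checked mechanically by diffing the keys printed by `dump-maps` against the `spec.ingress` CIDRs of the PolicyEndpoint. The following is a minimal sketch, assuming the PolicyEndpoint has already been loaded into a dict from its JSON form and that the dump uses the `Key : IP/Prefixlen - ...` layout shown in this issue. The helper names are hypothetical, and ports and protocols are not compared.

```python
import ipaddress
import re

# Dump of map 4214 from the bad node, as pasted above.
BAD_DUMP = """\
Key : IP/Prefixlen - 10.146.52.181/32
-------------------
Value Entry :  0
Protocol -  ANY PROTOCOL
StartPort -  0
Endport -  0
-------------------
*******************************
Done reading all entries
"""

# Only the fields needed for the comparison, copied from the PolicyEndpoint.
POLICY_ENDPOINT = {
    "spec": {
        "ingress": [
            {"cidr": "10.146.54.253"},
            {"cidr": "10.146.53.126"},
            {"cidr": "10.146.54.31"},
            {"cidr": "10.146.53.40"},
            {"cidr": "10.146.60.14"},
        ]
    }
}


def map_keys(dump_text):
    """Collect the IP/prefix keys printed by dump-maps."""
    return {
        ipaddress.ip_network(m.group(1), strict=False)
        for m in re.finditer(r"Key : IP/Prefixlen - (\S+)", dump_text)
    }


def expected_keys(policy_endpoint):
    """CIDRs from spec.ingress; a bare IP is treated as a single-host network."""
    rules = policy_endpoint.get("spec", {}).get("ingress", [])
    return {ipaddress.ip_network(rule["cidr"], strict=False) for rule in rules}


def diff_map(dump_text, policy_endpoint):
    """Return CIDRs missing from the map and keys the policy does not list."""
    have, want = map_keys(dump_text), expected_keys(policy_endpoint)
    return {
        "missing": [str(n) for n in sorted(want - have)],
        "extra": [str(n) for n in sorted(have - want)],
    }


if __name__ == "__main__":
    print(diff_map(BAD_DUMP, POLICY_ENDPOINT))
```

The node's own IP appears in both the good and the bad dumps above, so seeing it under `extra` is apparently normal; the `missing` list is the part that indicates a stale map.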

How to reproduce it (as minimally and precisely as possible):

helm repo add argo https://argoproj.github.io/argo-helm && helm repo update
helm upgrade argocd -n argocd argo/argo-cd --version 5.53.11 \
    --set global.networkPolicy.create=true \
    --create-namespace --install

Anything else we need to know?:

Environment:

aballman commented 5 months ago

https://github.com/aws/aws-network-policy-agent/issues/183 seems similar to my issue, but I'm already using the release candidate version that is referenced there and reported as having fixed that particular issue.

jayanthvn commented 5 months ago

@aballman - v1.0.8-rc3 is the latest. We hit a similar issue where the maps got wrongly updated. Can you please try v1.0.8-rc3?

aballman commented 5 months ago

> @aballman - v1.0.8-rc3 is the latest. We hit a similar issue where the maps got wrongly updated. Can you please try v1.0.8-rc3?

Thanks! I'll give it a shot

aballman commented 5 months ago

Unfortunately, this did not resolve my issue. The same problem is present. I've confirmed that I'm on v1.0.8-rc3 on the problem node. I had also rolled all nodes in my cluster ~16h ago when the Bottlerocket 1.19.1 fix was released.

Curiously, it seems to be a similar scenario, where the problem pod was on the node for ~90m.

NAME                                                READY   STATUS      RESTARTS       AGE     IP              NODE                            NOMINATED NODE   READINESS GATES
argocd-application-controller-0                     1/1     Running     36 (65m ago)   16h     10.146.63.42    ip-10-146-62-155.ec2.internal   <none>           <none>
argocd-applicationset-controller-7974ff9cf9-vjppv   1/1     Running     0              16h     10.146.63.228   ip-10-146-62-155.ec2.internal   <none>           <none>
argocd-dex-server-5c6dfff575-wrl7v                  1/1     Running     0              16h     10.146.53.41    ip-10-146-53-188.ec2.internal   <none>           <none>
argocd-notifications-controller-778866f977-9nhdd    1/1     Running     0              16h     10.146.60.229   ip-10-146-62-155.ec2.internal   <none>           <none>
argocd-redis-5bcdf48d96-x8bqp                       1/1     Running     0              16h     10.146.62.162   ip-10-146-62-155.ec2.internal   <none>           <none>
argocd-redis-ha-haproxy-7f84459cf-pmdfv             1/1     Running     0              16h     10.146.56.174   ip-10-146-57-151.ec2.internal   <none>           <none>
argocd-redis-ha-haproxy-7f84459cf-tcdsr             1/1     Running     0              19h     10.146.54.58    ip-10-146-55-43.ec2.internal    <none>           <none>
argocd-redis-ha-haproxy-7f84459cf-xs6dp             1/1     Running     0              16h     10.146.53.99    ip-10-146-53-188.ec2.internal   <none>           <none>
argocd-redis-ha-server-0                            3/3     Running     0              16h     10.146.58.63    ip-10-146-57-151.ec2.internal   <none>           <none>
argocd-redis-ha-server-1                            3/3     Running     0              16h     10.146.63.193   ip-10-146-62-155.ec2.internal   <none>           <none>
argocd-redis-ha-server-2                            3/3     Running     0              16h     10.146.52.251   ip-10-146-55-43.ec2.internal    <none>           <none>
argocd-repo-server-85ccb7dbdd-mkd4k                 1/1     Running     0              16h     10.146.54.42    ip-10-146-53-188.ec2.internal   <none>           <none>
argocd-repo-server-85ccb7dbdd-rvhlm                 1/1     Running     0              86m     10.146.62.175   ip-10-146-62-155.ec2.internal   <none>           <none>
argocd-server-6d6cd7bc6b-mccvn                      1/1     Running     0              16h     10.146.54.27    ip-10-146-53-188.ec2.internal   <none>           <none>
argocd-server-6d6cd7bc6b-pbbkd                      1/1     Running     0              19h     10.146.53.5     ip-10-146-55-43.ec2.internal    <none>           <none>
apiVersion: networking.k8s.aws/v1alpha1
kind: PolicyEndpoint
metadata:
  creationTimestamp: "2024-02-02T00:46:35Z"
  generateName: argocd-repo-server-
  generation: 243
  name: argocd-repo-server-sxvj2
  namespace: argocd
  ownerReferences:
  - apiVersion: networking.k8s.io/v1
    blockOwnerDeletion: true
    controller: true
    kind: NetworkPolicy
    name: argocd-repo-server
    uid: a57fcdb4-d425-4aa4-b818-61c9168debbf
  resourceVersion: "150208318"
  uid: df6dadb8-e619-4f72-ba98-82618b9f8256
spec:
  ingress:
  - cidr: 10.146.53.5
    ports:
    - port: 8081
      protocol: TCP
  - cidr: 10.146.54.27
    ports:
    - port: 8081
      protocol: TCP
  - cidr: 10.146.60.229
    ports:
    - port: 8081
      protocol: TCP
  - cidr: 10.146.63.228
    ports:
    - port: 8081
      protocol: TCP
  - cidr: 10.146.63.42
    ports:
    - port: 8081
      protocol: TCP
  podIsolation:
  - Ingress
  podSelector:
    matchLabels:
      app.kubernetes.io/instance: argocd
      app.kubernetes.io/name: argocd-repo-server
  podSelectorEndpoints:
  - hostIP: 10.146.53.188
    name: argocd-repo-server-85ccb7dbdd-mkd4k
    namespace: argocd
    podIP: 10.146.54.42
  - hostIP: 10.146.62.155
    name: argocd-repo-server-85ccb7dbdd-rvhlm
    namespace: argocd
    podIP: 10.146.62.175
  policyRef:
    name: argocd-repo-server
    namespace: argocd

ip-10-146-53-188.ec2.internal / aws-node-hntbh

bash-4.2# /aws-eks-na-cli ebpf loaded-ebpfdata | grep -A9 "repo-server"
PinPath:  /sys/fs/bpf/globals/aws/programs/argocd-repo-server-85ccb7dbdd-argocd_handle_egress
Pod Identifier : argocd-repo-server-85ccb7dbdd-argocd  Direction : egress
Prog ID:  108
Associated Maps ->
Map Name:  aws_conntrack_map
Map ID:  19
Map Name:  egress_map
Map ID:  29
Map Name:  policy_events
Map ID:  20
========================================================================================
PinPath:  /sys/fs/bpf/globals/aws/programs/argocd-repo-server-85ccb7dbdd-argocd_handle_ingress
Pod Identifier : argocd-repo-server-85ccb7dbdd-argocd  Direction : ingress
Prog ID:  107
Associated Maps ->
Map Name:  aws_conntrack_map
Map ID:  19
Map Name:  ingress_map
Map ID:  28
Map Name:  policy_events
Map ID:  20
========================================================================================
bash-4.2# /aws-eks-na-cli ebpf dump-maps 28
Key : IP/Prefixlen - 10.146.53.5/32
-------------------
Value Entry :  0
Protocol -  TCP
StartPort -  8081
Endport -  0
-------------------
*******************************
Key : IP/Prefixlen - 10.146.53.188/32
-------------------
Value Entry :  0
Protocol -  ANY PROTOCOL
StartPort -  0
Endport -  0
-------------------
*******************************
Key : IP/Prefixlen - 10.146.54.27/32
-------------------
Value Entry :  0
Protocol -  TCP
StartPort -  8081
Endport -  0
-------------------
*******************************
Key : IP/Prefixlen - 10.146.60.229/32
-------------------
Value Entry :  0
Protocol -  TCP
StartPort -  8081
Endport -  0
-------------------
*******************************
Key : IP/Prefixlen - 10.146.63.42/32
-------------------
Value Entry :  0
Protocol -  TCP
StartPort -  8081
Endport -  0
-------------------
*******************************
Key : IP/Prefixlen - 10.146.63.228/32
-------------------
Value Entry :  0
Protocol -  TCP
StartPort -  8081
Endport -  0
-------------------
*******************************
Done reading all entries

ip-10-146-62-155.ec2.internal / aws-node-2756k

bash-4.2# /aws-eks-na-cli ebpf loaded-ebpfdata | grep -A9 "repo-server"
PinPath:  /sys/fs/bpf/globals/aws/programs/argocd-repo-server-85ccb7dbdd-argocd_handle_ingress
Pod Identifier : argocd-repo-server-85ccb7dbdd-argocd  Direction : ingress
Prog ID:  4022
Associated Maps ->
Map Name:  policy_events
Map ID:  31
Map Name:  aws_conntrack_map
Map ID:  30
Map Name:  ingress_map
Map ID:  1125
========================================================================================
--
PinPath:  /sys/fs/bpf/globals/aws/programs/argocd-repo-server-85ccb7dbdd-argocd_handle_egress
Pod Identifier : argocd-repo-server-85ccb7dbdd-argocd  Direction : egress
Prog ID:  4023
Associated Maps ->
Map Name:  aws_conntrack_map
Map ID:  30
Map Name:  egress_map
Map ID:  1126
Map Name:  policy_events
Map ID:  31
========================================================================================

========================================================================================
bash-4.2# /aws-eks-na-cli ebpf dump-maps 1125
Key : IP/Prefixlen - 10.146.62.155/32
-------------------
Value Entry :  0
Protocol -  ANY PROTOCOL
StartPort -  0
Endport -  0
-------------------
*******************************
Done reading all entries
┃ ❯ k images aws-node-2756k -n kube-system
[Summary]: 1 namespaces, 1 pods, 3 containers and 3 different images
+----------------+-------------------------+-----------------------------------------------------------------------------------------+
|      Pod       |        Container        |                                          Image                                          |
+----------------+-------------------------+-----------------------------------------------------------------------------------------+
| aws-node-2756k | aws-node                | 602401143452.dkr.ecr.us-east-1.amazonaws.com/amazon-k8s-cni:v1.16.2                     |
+                +-------------------------+-----------------------------------------------------------------------------------------+
|                | aws-eks-nodeagent       | 602401143452.dkr.ecr.us-east-1.amazonaws.com/amazon/aws-network-policy-agent:v1.0.8-rc3 |
+                +-------------------------+-----------------------------------------------------------------------------------------+
|                | (init) aws-vpc-cni-init | 602401143452.dkr.ecr.us-east-1.amazonaws.com/amazon-k8s-cni-init:v1.16.2                |
+----------------+-------------------------+-----------------------------------------------------------------------------------------+
jayanthvn commented 5 months ago

@aballman - Are these existing pods or did you delete and re-create new pods?

  podSelectorEndpoints:
  - hostIP: 10.146.53.188
    name: argocd-repo-server-85ccb7dbdd-mkd4k
    namespace: argocd
    podIP: 10.146.54.42
  - hostIP: 10.146.62.155
    name: argocd-repo-server-85ccb7dbdd-rvhlm
    namespace: argocd
    podIP: 10.146.62.175

Can you also email us the network policy agent logs - /var/log/aws-routed-eni/network-policy-agent.log? You can mail them to k8s-awscni-triage@amazon.com

aballman commented 5 months ago

> @aballman - Are these existing pods or did you delete and re-create new pods?
>
>   podSelectorEndpoints:
>   - hostIP: 10.146.53.188
>     name: argocd-repo-server-85ccb7dbdd-mkd4k
>     namespace: argocd
>     podIP: 10.146.54.42
>   - hostIP: 10.146.62.155
>     name: argocd-repo-server-85ccb7dbdd-rvhlm
>     namespace: argocd
>     podIP: 10.146.62.175
>
> Can you also email us the network policy agent logs - /var/log/aws-routed-eni/network-policy-agent.log? You can mail them to k8s-awscni-triage@amazon.com

They were pre-existing at the time of the fault. I'm not sure why that pod might be a little younger. The node itself is ~17h old. There is an HPA configured on it, so that could be the reason. I'll send the logs over when the issue comes up again in a few hours.

jayanthvn commented 5 months ago

Sorry, I meant did you re-create the pods post upgrade to v1.0.8-rc3?

aballman commented 5 months ago

> Sorry, I meant did you re-create the pods post upgrade to v1.0.8-rc3?

I think I had updated the DaemonSet before Karpenter rolled all my nodes for the Bottlerocket update. I will restart all the pods now just to be explicit about it.

aballman commented 5 months ago

The symptoms are still occurring with the updated rc3 image. I noticed my alerts for this triggered over the weekend but it resolved before I had a chance to collect logs. I'll follow up again when I can do that

jayanthvn commented 5 months ago

@aballman - We tried the repro steps, but the issue isn't happening, and the pods have been running for 3 days. Do you have any pod or node churn in your cluster? Logs would be helpful.

aballman commented 5 months ago

There is pretty significant churn of both pods and nodes in the cluster. It has GitHub Actions runners in the same cluster / node pool. It scales up and down during the day to run jobs, and Karpenter also does some consolidation.

I'll post logs as soon as I can gather them. Thanks for investigating!

jayanthvn commented 5 months ago

Thanks @aballman. Are you on the K8s Slack? We can get on a call and understand your cluster config. If so, can you please share your Slack handle?

aballman commented 5 months ago

I'm not sure if I can say this is resolved because of the issue that I saw two weekends ago. I can say that I haven't had any more issues since that time period. So if it's not fixed, it's considerably improved.

I'm willing to work under the assumption that it is fixed with 1.0.8-rc3 and can open a new issue referencing this one if it returns.

jayanthvn commented 5 months ago

Thanks @aballman. Please keep us updated. v1.0.8 release is available - https://github.com/aws/amazon-vpc-cni-k8s/releases/tag/v1.16.3

aballman commented 5 months ago

@jayanthvn This is still an issue for me. It's a lot less frequent, but it still occurs. This most recent one looks like this:

┃ ❯ kgpo -owide | grep -E "(argocd-repo-server|argocd-server)"
argocd-repo-server-67974b6df-pnpls                  1/1     Running     0          127m    10.146.18.74    ip-10-146-17-182.ec2.internal   <none>           <none>
argocd-repo-server-67974b6df-s4d5c                  1/1     Running     0          102m    10.146.27.6     ip-10-146-27-54.ec2.internal    <none>           <none>
argocd-server-665597f9d8-7pff6                      1/1     Running     0          127m    10.146.16.12    ip-10-146-17-182.ec2.internal   <none>           <none>
argocd-server-665597f9d8-wgr84                      1/1     Running     0          116m    10.146.26.137   ip-10-146-27-54.ec2.internal    <none>           <none>

ip-10-146-17-182.ec2.internal - argocd-repo-server-67974b6df-pnpls

bash-4.2# /aws-eks-na-cli ebpf loaded-ebpfdata | grep -A9 "repo-server"
PinPath:  /sys/fs/bpf/globals/aws/programs/argocd-repo-server-67974b6df-argocd_handle_ingress
Pod Identifier : argocd-repo-server-67974b6df-argocd  Direction : ingress
Prog ID:  302
Associated Maps ->
Map Name:  aws_conntrack_map
Map ID:  33
Map Name:  ingress_map
Map ID:  88
Map Name:  policy_events
Map ID:  34
========================================================================================
--
PinPath:  /sys/fs/bpf/globals/aws/programs/argocd-repo-server-67974b6df-argocd_handle_egress
Pod Identifier : argocd-repo-server-67974b6df-argocd  Direction : egress
Prog ID:  303
Associated Maps ->
Map Name:  aws_conntrack_map
Map ID:  33
Map Name:  egress_map
Map ID:  89
Map Name:  policy_events
Map ID:  34
========================================================================================
Full Ingress Map

```
bash-4.2# /aws-eks-na-cli ebpf dump-maps 88
Key : IP/Prefixlen - 10.146.16.5/32
-------------------
Value Entry :  0
Protocol -  ANY PROTOCOL
StartPort -  0
Endport -  0
-------------------
*******************************
Key : IP/Prefixlen - 10.146.16.12/32
-------------------
Value Entry :  0
Protocol -  TCP
StartPort -  8081
Endport -  0
-------------------
*******************************
Key : IP/Prefixlen - 10.146.16.21/32
-------------------
Value Entry :  0
Protocol -  ANY PROTOCOL
StartPort -  0
Endport -  0
-------------------
*******************************
Key : IP/Prefixlen - 10.146.16.22/32
-------------------
Value Entry :  0
Protocol -  ANY PROTOCOL
StartPort -  0
Endport -  0
-------------------
*******************************
Key : IP/Prefixlen - 10.146.16.43/32
-------------------
Value Entry :  0
Protocol -  ANY PROTOCOL
StartPort -  0
Endport -  0
-------------------
*******************************
Key : IP/Prefixlen - 10.146.16.47/32
-------------------
Value Entry :  0
Protocol -  ANY PROTOCOL
StartPort -  0
Endport -  0
-------------------
*******************************
Key : IP/Prefixlen - 10.146.16.99/32
-------------------
Value Entry :  0
Protocol -  TCP
StartPort -  8081
Endport -  0
-------------------
*******************************
Key : IP/Prefixlen - 10.146.16.116/32
-------------------
Value Entry :  0
Protocol -  ANY PROTOCOL
StartPort -  0
Endport -  0
-------------------
*******************************
Key : IP/Prefixlen - 10.146.16.157/32
-------------------
Value Entry :  0
Protocol -  ANY PROTOCOL
StartPort -  0
Endport -  0
-------------------
*******************************
Key : IP/Prefixlen - 10.146.16.191/32
-------------------
Value Entry :  0
Protocol -  ANY PROTOCOL
StartPort -  0
Endport -  0
-------------------
*******************************
Key : IP/Prefixlen - 10.146.17.52/32
-------------------
Value Entry :  0
Protocol -  TCP
StartPort -  8081
Endport -  0
-------------------
*******************************
Key : IP/Prefixlen - 10.146.17.149/32
-------------------
Value Entry :  0
Protocol -  ANY PROTOCOL
StartPort -  0
Endport -  0
-------------------
*******************************
Key : IP/Prefixlen - 10.146.17.182/32
-------------------
Value Entry :  0
Protocol -  ANY PROTOCOL
StartPort -  0
Endport -  0
-------------------
*******************************
Key : IP/Prefixlen - 10.146.18.149/32
-------------------
Value Entry :  0
Protocol -  ANY PROTOCOL
StartPort -  0
Endport -  0
-------------------
*******************************
Key : IP/Prefixlen - 10.146.18.150/32
-------------------
Value Entry :  0
Protocol -  ANY PROTOCOL
StartPort -  0
Endport -  0
-------------------
*******************************
Key : IP/Prefixlen - 10.146.18.157/32
-------------------
Value Entry :  0
Protocol -  ANY PROTOCOL
StartPort -  0
Endport -  0
-------------------
*******************************
Key : IP/Prefixlen - 10.146.18.162/32
-------------------
Value Entry :  0
Protocol -  ANY PROTOCOL
StartPort -  0
Endport -  0
-------------------
*******************************
Key : IP/Prefixlen - 10.146.18.166/32
-------------------
Value Entry :  0
Protocol -  ANY PROTOCOL
StartPort -  0
Endport -  0
-------------------
*******************************
Key : IP/Prefixlen - 10.146.18.182/32
-------------------
Value Entry :  0
Protocol -  ANY PROTOCOL
StartPort -  0
Endport -  0
-------------------
*******************************
Key : IP/Prefixlen - 10.146.18.196/32
-------------------
Value Entry :  0
Protocol -  ANY PROTOCOL
StartPort -  0
Endport -  0
-------------------
*******************************
Key : IP/Prefixlen - 10.146.18.199/32
-------------------
Value Entry :  0
Protocol -  ANY PROTOCOL
StartPort -  0
Endport -  0
-------------------
*******************************
Key : IP/Prefixlen - 10.146.19.19/32
-------------------
Value Entry :  0
Protocol -  ANY PROTOCOL
StartPort -  0
Endport -  0
-------------------
*******************************
Key : IP/Prefixlen - 10.146.19.54/32
-------------------
Value Entry :  0
Protocol -  ANY PROTOCOL
StartPort -  0
Endport -  0
-------------------
*******************************
Key : IP/Prefixlen - 10.146.19.123/32
-------------------
Value Entry :  0
Protocol -  ANY PROTOCOL
StartPort -  0
Endport -  0
-------------------
*******************************
Key : IP/Prefixlen - 10.146.19.180/32
-------------------
Value Entry :  0
Protocol -  TCP
StartPort -  8081
Endport -  0
-------------------
*******************************
Key : IP/Prefixlen - 10.146.19.193/32
-------------------
Value Entry :  0
Protocol -  ANY PROTOCOL
StartPort -  0
Endport -  0
-------------------
*******************************
Key : IP/Prefixlen - 10.146.22.225/32
-------------------
Value Entry :  0
Protocol -  ANY PROTOCOL
StartPort -  0
Endport -  0
-------------------
*******************************
Key : IP/Prefixlen - 10.146.23.126/32
-------------------
Value Entry :  0
Protocol -  ANY PROTOCOL
StartPort -  0
Endport -  0
-------------------
*******************************
Key : IP/Prefixlen - 10.146.24.84/32
-------------------
Value Entry :  0
Protocol -  ANY PROTOCOL
StartPort -  0
Endport -  0
-------------------
*******************************
Key : IP/Prefixlen - 10.146.24.98/32
-------------------
Value Entry :  0
Protocol -  ANY PROTOCOL
StartPort -  0
Endport -  0
-------------------
*******************************
Key : IP/Prefixlen - 10.146.24.125/32
-------------------
Value Entry :  0
Protocol -  ANY PROTOCOL
StartPort -  0
Endport -  0
-------------------
*******************************
Key : IP/Prefixlen - 10.146.24.173/32
-------------------
Value Entry :  0
Protocol -  ANY PROTOCOL
StartPort -  0
Endport -  0
-------------------
*******************************
Key : IP/Prefixlen - 10.146.26.56/32
-------------------
Value Entry :  0
Protocol -  ANY PROTOCOL
StartPort -  0
Endport -  0
-------------------
*******************************
Key : IP/Prefixlen - 10.146.26.112/32
-------------------
Value Entry :  0
Protocol -  ANY PROTOCOL
StartPort -  0
Endport -  0
-------------------
*******************************
Key : IP/Prefixlen - 10.146.26.137/32
-------------------
Value Entry :  0
Protocol -  TCP
StartPort -  8081
Endport -  0
-------------------
*******************************
Key : IP/Prefixlen - 10.146.26.146/32
-------------------
Value Entry :  0
Protocol -  ANY PROTOCOL
StartPort -  0
Endport -  0
-------------------
*******************************
Key : IP/Prefixlen - 10.146.26.226/32
-------------------
Value Entry :  0
Protocol -  ANY PROTOCOL
StartPort -  0
Endport -  0
-------------------
*******************************
Key : IP/Prefixlen - 10.146.27.54/32
-------------------
Value Entry :  0
Protocol -  ANY PROTOCOL
StartPort -  0
Endport -  0
-------------------
*******************************
Key : IP/Prefixlen - 10.146.27.209/32
-------------------
Value Entry :  0
Protocol -  ANY PROTOCOL
StartPort -  0
Endport -  0
-------------------
*******************************
Key : IP/Prefixlen - 10.146.29.209/32
-------------------
Value Entry :  0
Protocol -  ANY PROTOCOL
StartPort -  0
Endport -  0
-------------------
*******************************
Key : IP/Prefixlen - 10.146.30.250/32
-------------------
Value Entry :  0
Protocol -  ANY PROTOCOL
StartPort -  0
Endport -  0
-------------------
*******************************
Done reading all entries
```
bash-4.2# /aws-eks-na-cli ebpf dump-maps 88 | grep "Key"
Key : IP/Prefixlen - 10.146.16.5/32
Key : IP/Prefixlen - 10.146.16.12/32
Key : IP/Prefixlen - 10.146.16.21/32
Key : IP/Prefixlen - 10.146.16.22/32
Key : IP/Prefixlen - 10.146.16.43/32
Key : IP/Prefixlen - 10.146.16.47/32
Key : IP/Prefixlen - 10.146.16.99/32
Key : IP/Prefixlen - 10.146.16.116/32
Key : IP/Prefixlen - 10.146.16.157/32
Key : IP/Prefixlen - 10.146.16.191/32
Key : IP/Prefixlen - 10.146.17.52/32
Key : IP/Prefixlen - 10.146.17.149/32
Key : IP/Prefixlen - 10.146.17.182/32
Key : IP/Prefixlen - 10.146.18.149/32
Key : IP/Prefixlen - 10.146.18.150/32
Key : IP/Prefixlen - 10.146.18.157/32
Key : IP/Prefixlen - 10.146.18.162/32
Key : IP/Prefixlen - 10.146.18.166/32
Key : IP/Prefixlen - 10.146.18.199/32
Key : IP/Prefixlen - 10.146.19.19/32
Key : IP/Prefixlen - 10.146.19.54/32
Key : IP/Prefixlen - 10.146.19.123/32
Key : IP/Prefixlen - 10.146.19.180/32
Key : IP/Prefixlen - 10.146.19.193/32
Key : IP/Prefixlen - 10.146.22.225/32
Key : IP/Prefixlen - 10.146.23.126/32
Key : IP/Prefixlen - 10.146.24.84/32
Key : IP/Prefixlen - 10.146.24.98/32
Key : IP/Prefixlen - 10.146.24.125/32
Key : IP/Prefixlen - 10.146.24.173/32
Key : IP/Prefixlen - 10.146.26.56/32
Key : IP/Prefixlen - 10.146.26.112/32
Key : IP/Prefixlen - 10.146.26.137/32
Key : IP/Prefixlen - 10.146.26.146/32
Key : IP/Prefixlen - 10.146.26.226/32
Key : IP/Prefixlen - 10.146.27.54/32
Key : IP/Prefixlen - 10.146.27.209/32
Key : IP/Prefixlen - 10.146.29.209/32
Key : IP/Prefixlen - 10.146.30.250/32

ip-10-146-27-54.ec2.internal - argocd-repo-server-67974b6df-s4d5c

bash-4.2# /aws-eks-na-cli ebpf loaded-ebpfdata | grep -A9 "repo-server"
PinPath:  /sys/fs/bpf/globals/aws/programs/argocd-repo-server-67974b6df-argocd_handle_egress
Pod Identifier : argocd-repo-server-67974b6df-argocd  Direction : egress
Prog ID:  399
Associated Maps ->
Map Name:  policy_events
Map ID:  20
Map Name:  aws_conntrack_map
Map ID:  19
Map Name:  egress_map
Map ID:  119
========================================================================================
--
PinPath:  /sys/fs/bpf/globals/aws/programs/argocd-repo-server-67974b6df-argocd_handle_ingress
Pod Identifier : argocd-repo-server-67974b6df-argocd  Direction : ingress
Prog ID:  398
Associated Maps ->
Map Name:  aws_conntrack_map
Map ID:  19
Map Name:  ingress_map
Map ID:  118
Map Name:  policy_events
Map ID:  20
========================================================================================
Full Ingress Map ``` bash-4.2# /aws-eks-na-cli ebpf dump-maps 118 Key : IP/Prefixlen - 10.146.16.5/32 ------------------- Value Entry : 0 Protocol - ANY PROTOCOL StartPort - 0 Endport - 0 ------------------- ******************************* Key : IP/Prefixlen - 10.146.16.21/32 ------------------- Value Entry : 0 Protocol - ANY PROTOCOL StartPort - 0 Endport - 0 ------------------- ******************************* Key : IP/Prefixlen - 10.146.16.22/32 ------------------- Value Entry : 0 Protocol - ANY PROTOCOL StartPort - 0 Endport - 0 ------------------- ******************************* Key : IP/Prefixlen - 10.146.16.43/32 ------------------- Value Entry : 0 Protocol - ANY PROTOCOL StartPort - 0 Endport - 0 ------------------- ******************************* Key : IP/Prefixlen - 10.146.16.47/32 ------------------- Value Entry : 0 Protocol - ANY PROTOCOL StartPort - 0 Endport - 0 ------------------- ******************************* Key : IP/Prefixlen - 10.146.16.116/32 ------------------- Value Entry : 0 Protocol - ANY PROTOCOL StartPort - 0 Endport - 0 ------------------- ******************************* Key : IP/Prefixlen - 10.146.16.157/32 ------------------- Value Entry : 0 Protocol - ANY PROTOCOL StartPort - 0 Endport - 0 ------------------- ******************************* Key : IP/Prefixlen - 10.146.16.191/32 ------------------- Value Entry : 0 Protocol - ANY PROTOCOL StartPort - 0 Endport - 0 ------------------- ******************************* Key : IP/Prefixlen - 10.146.17.149/32 ------------------- Value Entry : 0 Protocol - ANY PROTOCOL StartPort - 0 Endport - 0 ------------------- ******************************* Key : IP/Prefixlen - 10.146.17.182/32 ------------------- Value Entry : 0 Protocol - ANY PROTOCOL StartPort - 0 Endport - 0 ------------------- ******************************* Key : IP/Prefixlen - 10.146.18.149/32 ------------------- Value Entry : 0 Protocol - ANY PROTOCOL StartPort - 0 Endport - 0 ------------------- ******************************* Key 
: IP/Prefixlen - 10.146.18.150/32 ------------------- Value Entry : 0 Protocol - ANY PROTOCOL StartPort - 0 Endport - 0 ------------------- ******************************* Key : IP/Prefixlen - 10.146.18.157/32 ------------------- Value Entry : 0 Protocol - ANY PROTOCOL StartPort - 0 Endport - 0 ------------------- ******************************* Key : IP/Prefixlen - 10.146.18.162/32 ------------------- Value Entry : 0 Protocol - ANY PROTOCOL StartPort - 0 Endport - 0 ------------------- ******************************* Key : IP/Prefixlen - 10.146.18.166/32 ------------------- Value Entry : 0 Protocol - ANY PROTOCOL StartPort - 0 Endport - 0 ------------------- ******************************* Key : IP/Prefixlen - 10.146.18.199/32 ------------------- Value Entry : 0 Protocol - ANY PROTOCOL StartPort - 0 Endport - 0 ------------------- ******************************* Key : IP/Prefixlen - 10.146.19.19/32 ------------------- Value Entry : 0 Protocol - ANY PROTOCOL StartPort - 0 Endport - 0 ------------------- ******************************* Key : IP/Prefixlen - 10.146.19.54/32 ------------------- Value Entry : 0 Protocol - ANY PROTOCOL StartPort - 0 Endport - 0 ------------------- ******************************* Key : IP/Prefixlen - 10.146.19.123/32 ------------------- Value Entry : 0 Protocol - ANY PROTOCOL StartPort - 0 Endport - 0 ------------------- ******************************* Key : IP/Prefixlen - 10.146.19.193/32 ------------------- Value Entry : 0 Protocol - ANY PROTOCOL StartPort - 0 Endport - 0 ------------------- ******************************* Key : IP/Prefixlen - 10.146.22.225/32 ------------------- Value Entry : 0 Protocol - ANY PROTOCOL StartPort - 0 Endport - 0 ------------------- ******************************* Key : IP/Prefixlen - 10.146.23.126/32 ------------------- Value Entry : 0 Protocol - ANY PROTOCOL StartPort - 0 Endport - 0 ------------------- ******************************* Key : IP/Prefixlen - 10.146.24.84/32 ------------------- Value Entry 
: 0 Protocol - ANY PROTOCOL StartPort - 0 Endport - 0 ------------------- ******************************* Key : IP/Prefixlen - 10.146.24.98/32 ------------------- Value Entry : 0 Protocol - ANY PROTOCOL StartPort - 0 Endport - 0 ------------------- ******************************* Key : IP/Prefixlen - 10.146.24.125/32 ------------------- Value Entry : 0 Protocol - ANY PROTOCOL StartPort - 0 Endport - 0 ------------------- ******************************* Key : IP/Prefixlen - 10.146.24.173/32 ------------------- Value Entry : 0 Protocol - ANY PROTOCOL StartPort - 0 Endport - 0 ------------------- ******************************* Key : IP/Prefixlen - 10.146.26.56/32 ------------------- Value Entry : 0 Protocol - ANY PROTOCOL StartPort - 0 Endport - 0 ------------------- ******************************* Key : IP/Prefixlen - 10.146.26.112/32 ------------------- Value Entry : 0 Protocol - ANY PROTOCOL StartPort - 0 Endport - 0 ------------------- ******************************* Key : IP/Prefixlen - 10.146.26.146/32 ------------------- Value Entry : 0 Protocol - ANY PROTOCOL StartPort - 0 Endport - 0 ------------------- ******************************* Key : IP/Prefixlen - 10.146.26.226/32 ------------------- Value Entry : 0 Protocol - ANY PROTOCOL StartPort - 0 Endport - 0 ------------------- ******************************* Key : IP/Prefixlen - 10.146.27.54/32 ------------------- Value Entry : 0 Protocol - ANY PROTOCOL StartPort - 0 Endport - 0 ------------------- ******************************* Key : IP/Prefixlen - 10.146.27.209/32 ------------------- Value Entry : 0 Protocol - ANY PROTOCOL StartPort - 0 Endport - 0 ------------------- ******************************* Key : IP/Prefixlen - 10.146.29.209/32 ------------------- Value Entry : 0 Protocol - ANY PROTOCOL StartPort - 0 Endport - 0 ------------------- ******************************* Key : IP/Prefixlen - 10.146.30.250/32 ------------------- Value Entry : 0 Protocol - ANY PROTOCOL StartPort - 0 Endport - 0 
------------------- ******************************* Done reading all entries ```
bash-4.2# /aws-eks-na-cli ebpf dump-maps 118 | grep "Key"
Key : IP/Prefixlen - 10.146.16.5/32
Key : IP/Prefixlen - 10.146.16.21/32
Key : IP/Prefixlen - 10.146.16.22/32
Key : IP/Prefixlen - 10.146.16.43/32
Key : IP/Prefixlen - 10.146.16.47/32
Key : IP/Prefixlen - 10.146.16.116/32
Key : IP/Prefixlen - 10.146.16.157/32
Key : IP/Prefixlen - 10.146.16.191/32
Key : IP/Prefixlen - 10.146.17.149/32
Key : IP/Prefixlen - 10.146.17.182/32
Key : IP/Prefixlen - 10.146.18.149/32
Key : IP/Prefixlen - 10.146.18.150/32
Key : IP/Prefixlen - 10.146.18.157/32
Key : IP/Prefixlen - 10.146.18.162/32
Key : IP/Prefixlen - 10.146.18.166/32
Key : IP/Prefixlen - 10.146.18.199/32
Key : IP/Prefixlen - 10.146.19.19/32
Key : IP/Prefixlen - 10.146.19.54/32
Key : IP/Prefixlen - 10.146.19.123/32
Key : IP/Prefixlen - 10.146.19.193/32
Key : IP/Prefixlen - 10.146.22.225/32
Key : IP/Prefixlen - 10.146.23.126/32
Key : IP/Prefixlen - 10.146.24.84/32
Key : IP/Prefixlen - 10.146.24.98/32
Key : IP/Prefixlen - 10.146.24.125/32
Key : IP/Prefixlen - 10.146.24.173/32
Key : IP/Prefixlen - 10.146.26.56/32
Key : IP/Prefixlen - 10.146.26.112/32
Key : IP/Prefixlen - 10.146.26.146/32
Key : IP/Prefixlen - 10.146.26.226/32
Key : IP/Prefixlen - 10.146.27.54/32
Key : IP/Prefixlen - 10.146.27.209/32
Key : IP/Prefixlen - 10.146.29.209/32
Key : IP/Prefixlen - 10.146.30.250/32
Network Policy

```
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: argocd-repo-server
  namespace: argocd
spec:
  ingress:
  - from:
    - podSelector:
        matchLabels:
          app.kubernetes.io/instance: argocd
          app.kubernetes.io/name: argocd-server
    - podSelector:
        matchLabels:
          app.kubernetes.io/instance: argocd
          app.kubernetes.io/name: argocd-application-controller
    - podSelector:
        matchLabels:
          app.kubernetes.io/instance: argocd
          app.kubernetes.io/name: argocd-notifications-controller
    - podSelector:
        matchLabels:
          app.kubernetes.io/instance: argocd
          app.kubernetes.io/name: argocd-applicationset-controller
    ports:
    - port: 8081
      protocol: TCP
  podSelector:
    matchLabels:
      app.kubernetes.io/instance: argocd
      app.kubernetes.io/name: argocd-repo-server
  policyTypes:
  - Ingress
```
Policy Endpoint

```
apiVersion: networking.k8s.aws/v1alpha1
kind: PolicyEndpoint
metadata:
  name: argocd-repo-server-8nwzb
  namespace: argocd
spec:
  ingress:
  - cidr: 10.146.17.52
    ports:
    - port: 8081
      protocol: TCP
    - port: 8081
      protocol: TCP
  - cidr: 10.146.19.180
    ports:
    - port: 8081
      protocol: TCP
    - port: 8081
      protocol: TCP
  - cidr: 10.146.16.99
    ports:
    - port: 8081
      protocol: TCP
    - port: 8081
      protocol: TCP
  - cidr: 10.146.26.137
    ports:
    - port: 8081
      protocol: TCP
    - port: 8081
      protocol: TCP
  - cidr: 10.146.16.12
    ports:
    - port: 8081
      protocol: TCP
    - port: 8081
      protocol: TCP
  podIsolation:
  - Ingress
  podSelector:
    matchLabels:
      app.kubernetes.io/instance: argocd
      app.kubernetes.io/name: argocd-repo-server
  podSelectorEndpoints:
  - hostIP: 10.146.17.182
    name: argocd-repo-server-67974b6df-pnpls
    namespace: argocd
    podIP: 10.146.18.74
  - hostIP: 10.146.27.54
    name: argocd-repo-server-67974b6df-s4d5c
    namespace: argocd
    podIP: 10.146.27.6
  policyRef:
    name: argocd-repo-server-supplemental
    namespace: argocd
```

argocd-repo-server-67974b6df-pnpls has the rules I expected given the network policy, including access from 10.146.16.12 and 10.146.26.137. argocd-repo-server-67974b6df-s4d5c has rules from other network policies, but does not include access from 10.146.16.12 or 10.146.26.137.

Those pod IPs are in the PolicyEndpoint, so it seems like the map is being built incorrectly.
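For anyone who wants to repeat this comparison without eyeballing the dumps, here is a rough Python sketch. It assumes you have saved the text of `/aws-eks-na-cli ebpf dump-maps <map-id>` and the PolicyEndpoint YAML yourself. The function names are made up for illustration, and treating a bare `cidr:` IP as a `/32` is my assumption based on the output above, not documented agent behavior.

```python
import re

# Matches lines such as "Key : IP/Prefixlen - 10.146.16.5/32" in dump-maps output.
KEY_RE = re.compile(r"Key : IP/Prefixlen - (\d+\.\d+\.\d+\.\d+)/(\d+)")
# Matches "cidr: 10.146.17.52" or "cidr: 10.0.0.0/24" in a PolicyEndpoint manifest.
CIDR_RE = re.compile(r"cidr:\s*(\d+\.\d+\.\d+\.\d+)(?:/(\d+))?")


def map_keys(dump_text):
    """Return the set of 'ip/prefixlen' keys found in a dump-maps output."""
    return {f"{ip}/{plen}" for ip, plen in KEY_RE.findall(dump_text)}


def endpoint_cidrs(policy_endpoint_yaml):
    """Return the set of CIDRs in a PolicyEndpoint; bare IPs are assumed to be /32."""
    return {
        f"{ip}/{plen or '32'}"
        for ip, plen in CIDR_RE.findall(policy_endpoint_yaml)
    }


def missing_from_map(dump_text, policy_endpoint_yaml):
    """CIDRs listed in the PolicyEndpoint that have no key in the eBPF map."""
    return sorted(endpoint_cidrs(policy_endpoint_yaml) - map_keys(dump_text))


# Tiny excerpt taken from the outputs in this issue (not the full data).
SAMPLE_DUMP = """\
Key : IP/Prefixlen - 10.146.16.5/32
Key : IP/Prefixlen - 10.146.17.149/32
"""
SAMPLE_ENDPOINT = """\
spec:
  ingress:
  - cidr: 10.146.17.52
  - cidr: 10.146.16.12
"""

if __name__ == "__main__":
    for cidr in missing_from_map(SAMPLE_DUMP, SAMPLE_ENDPOINT):
        print("in PolicyEndpoint but not in map:", cidr)
```

This only compares keys. It does not check the protocol or port values attached to each key, and it ignores keys that other policies selecting the same pod add to the map.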

aballman commented 5 months ago

I've emailed my network-policy-agent.log file over to k8s-awscni-triage@amazon.com

jayanthvn commented 5 months ago

@aballman - Thanks for checking. Wondering if there is some corner case here, since none of the CIDRs in argocd-repo-server-8nwzb are in the ingress map... Do you have the logs for argocd-repo-server-67974b6df-s4d5c?

jayanthvn commented 5 months ago

Thanks, got the logs. Will get back.

DomantasVar commented 4 months ago

@jayanthvn Any updates on this? I believe we are still hitting the same issue, even after upgrading VPC CNI to v1.16.4-eksbuild.2 (so network policy agent is at v1.0.8-eksbuild.1). I must say that the frequency of the issue has dropped, but it is not fully resolved.

jayanthvn commented 4 months ago

@DomantasVar - We have identified a fix for this and are currently testing the image. /cc @achevuru

DomantasVar commented 4 months ago

@jayanthvn are there any updates on the progress regarding this issue? Since this is blocking an important production migration for us, we would like to know whether it's feasible to wait for the issue to be resolved, or whether an alternative migration path needs to be found.

ThibaultLengagne commented 3 months ago

Any update on this @jayanthvn :) ?

jayanthvn commented 3 months ago

Sorry for the delay, we ran into a few corner cases and had to rework a few things. We will be running our regression suite, and if things look green we will have the RC image probably by next week. Thanks for waiting!

DomantasVar commented 2 months ago

Hello @jayanthvn, any new updates on the progress towards resolving this?

jayanthvn commented 1 month ago

The issue is resolved in network policy agent version 1.1.2: https://github.com/aws/amazon-vpc-cni-k8s/releases/tag/v1.18.2