aws / amazon-vpc-cni-k8s

Networking plugin repository for pod networking in Kubernetes using Elastic Network Interfaces on AWS
Apache License 2.0

aws eniconfig not being honored and pods using trunk instead of ENI #2792

Closed sstarcher closed 6 months ago

sstarcher commented 6 months ago

What happened: Upgraded from aws-vpc-cni v1.12.6 to v1.16.0. Pods sometimes get assigned to the trunk interface instead of to the ENI, so they do not get the correct security groups from the ENIConfig. Based on a small sample, this seems to affect pods that were assigned to the node just as it was coming up.

Attach logs

Snippet of logs below; the remainder was sent separately.

aws-node-f5x7z aws-node {"level":"info","ts":"2024-02-13T13:19:12.381Z","caller":"ipamd/ipamd.go:822","msg":"Found ENI Config Name: eni-config-ds-subnet-0318d75ae06a34052"}
aws-node-f5x7z aws-node {"level":"info","ts":"2024-02-13T13:19:12.483Z","caller":"ipamd/ipamd.go:793","msg":"ipamd: using custom network config: [sg-066233d33bbd94a21 sg-03d0fde3a6a691a6d], subnet-0318d75ae06a34052"}
aws-node-f5x7z aws-node {"level":"info","ts":"2024-02-13T13:19:12.483Z","caller":"awsutils/awsutils.go:728","msg":"Trying to allocate 10 IP addresses on new ENI"}
aws-node-f5x7z aws-node {"level":"info","ts":"2024-02-13T13:19:12.483Z","caller":"awsutils/awsutils.go:728","msg":"Using a custom network config for the new ENI"}
aws-node-f5x7z aws-node {"level":"info","ts":"2024-02-13T13:19:12.483Z","caller":"awsutils/awsutils.go:728","msg":"Creating ENI with security groups: [sg-066233d33bbd94a21 sg-03d0fde3a6a691a6d] in subnet: subnet-0318d75ae06a34052"}
aws-node-f5x7z aws-node {"level":"info","ts":"2024-02-13T13:19:12.910Z","caller":"awsutils/awsutils.go:728","msg":"Created a new ENI: eni-0a9379b23fe4ae3e1"}
aws-node-f5x7z aws-node {"level":"info","ts":"2024-02-13T13:19:14.216Z","caller":"ipamd/ipamd.go:838","msg":"Successfully created and attached a new ENI eni-0a9379b23fe4ae3e1 to instance"}

aws-node-f5x7z aws-node {"level":"info","ts":"2024-02-13T13:19:15.006Z","caller":"ipamd/ipamd.go:1097","msg":"Added ENI(eni-0a9379b23fe4ae3e1)'s IP/Prefix 10.110.130.178/32 to datastore"}
aws-node-f5x7z aws-node {"level":"info","ts":"2024-02-13T13:19:15.006Z","caller":"aws-k8s-agent/main.go:91","msg":"Serving RPC Handler version on 127.0.0.1:50051"}
aws-node-f5x7z aws-node {"level":"info","ts":"2024-02-13T13:19:15.006Z","caller":"runtime/asm_amd64.s:1650","msg":"Serving metrics on port 61678"}
aws-node-f5x7z aws-node {"level":"info","ts":"2024-02-13T13:19:15.006Z","caller":"ipamd/introspect.go:54","msg":"Serving introspection endpoints on 127.0.0.1:61679"}
aws-node-f5x7z aws-node {"level":"info","ts":"2024-02-13T13:19:15.006Z","caller":"runtime/asm_amd64.s:1650","msg":"Setting up shutdown hook."}
aws-node-f5x7z aws-node time="2024-02-13T13:19:15Z" level=info msg="Copying config file... "
aws-node-f5x7z aws-node time="2024-02-13T13:19:15Z" level=info msg="Successfully copied CNI plugin binary and config file."
aws-node-f5x7z aws-node {"level":"info","ts":"2024-02-13T13:19:16.286Z","caller":"rpc/rpc.pb.go:713","msg":"Received AddNetwork for NS /var/run/netns/cni-2b9d3041-7418-d61e-ac01-8fd27033c5c1, Sandbox cae024af5e5ae0dcb7c76f9496f018620d59c5671e33ed684e02040e6b40628d, ifname eth0"}

aws-node-f5x7z aws-node {"level":"info","ts":"2024-02-13T13:20:38.624Z","caller":"ipamd/ipamd.go:1097","msg":"Adding 10.110.140.234/32 to DS for eni-03928b37b5d4f3d56"}
aws-node-f5x7z aws-node {"level":"info","ts":"2024-02-13T13:20:38.624Z","caller":"ipamd/ipamd.go:1097","msg":"Added ENI(eni-03928b37b5d4f3d56)'s IP/Prefix 10.110.140.234/32 to datastore"}
aws-node-f5x7z aws-node {"level":"info","ts":"2024-02-13T13:20:43.953Z","caller":"datastore/data_store.go:714","msg":"assignPodIPAddressUnsafe: Assign IP 10.110.140.234 to sandbox aws-cni/f05f22e23f215ad08041ffd6c25663eeaca757df843f0571920e15714ce7f683/eth0"}
aws-node-f5x7z aws-node {"level":"info","ts":"2024-02-13T13:20:43.974Z","caller":"rpc/rpc.pb.go:713","msg":"Send AddNetworkReply: IPv4Addr 10.110.140.234, IPv6Addr: , DeviceNumber: 7, err: "}

What you expected to happen:

Pod to be correctly assigned to an ENI with the correct security groups

How to reproduce it (as minimally and precisely as possible):

We set the following Helm settings and, in addition, use Calico:

    --set env.MAX_ENI=${MAX_ENI_PER_WORKER:-3} \
    --set env.WARM_IP_TARGET=${WARM_IPS_PER_WORKER:-1} \
    --set env.MINIMUM_IP_TARGET=${MIN_IPS_PER_WORKER:-3} \
    --set env.WARM_ENI_TARGET=${WARM_ENI_PER_WORKER:-1} \
    --set eniConfig.region=${region} \
    --set image.region=${region} \
    --set init.image.region=${region} \
    --set nodeAgent.image.region=${region}

priorityClassName: "system-node-critical"
env:
  # see https://github.com/aws/amazon-vpc-cni-k8s/blob/7ab227ecbd14623456ea794e893696c2bd66f2b9/README.md
  AWS_VPC_K8S_CNI_CUSTOM_NETWORK_CFG: "true"
  AWS_VPC_K8S_CNI_EXTERNALSNAT: "true"
  ANNOTATE_POD_IP: "true"
  ENABLE_POD_ENI: "true"
  POD_SECURITY_GROUP_ENFORCING_MODE: "standard"

  AWS_VPC_K8S_CNI_LOGLEVEL: "INFO"
  AWS_VPC_K8S_PLUGIN_LOG_LEVEL: "INFO"

  AWS_VPC_K8S_CNI_LOG_FILE: "stdout"

readinessProbe:
  initialDelaySeconds: 5

Anything else we need to know?:

Environment:

jdn5126 commented 6 months ago

@sstarcher it looks like you are using Security Groups for Pods with Custom Networking. In this case, it is the VPC Resource Controller which will handle ENI allocation and pod placement.

If the pod matches a Security Group policy, then it will be annotated by the VPC Resource Controller and placed behind the trunk ENI. If the pod does not match a Security Group policy, then it will be placed behind a regular ENI, which was allocated based on the Custom Networking spec.

Did you terminate your nodes after setting ENABLE_POD_ENI? This is a required step for Security Groups for Pods, as the controller needs to be able to build its internal state properly. I am not aware of any race conditions. If the pod matches a Security Group policy, it should match every time.

Also I see that you are using Calico. You are just using Calico for network policy, right? As we may need controller logs to debug this further, I suggest opening up an AWS support case.

sstarcher commented 6 months ago

Thanks I'll open a support ticket. All of the settings have been in place for months. The only change here is the chart version.

sstarcher commented 6 months ago

Is the VPC Resource Controller embedded somehow? We are using Security Groups for Pods and have been for a while, but only have this amazon-vpc-cni-k8s helm chart installed.

jdn5126 commented 6 months ago

The VPC Resource Controller runs in the EKS-managed control plane. The major change between v1.12.6 and v1.16.0 is the VPC CNI using the CNINode CRD to communicate with the controller instead of the vpc.amazonaws.com/has-trunk-attached node label.

If no Security Group policy matches this pod, then it should not be annotated and placed behind the trunk ENI. When you describe the pod, do you see an annotation with vpc.amazonaws.com/pod-eni? If not, how do you know that it is behind the trunk ENI?
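
The annotation check suggested above can be scripted against the JSON form of a pod object. The following is a minimal sketch, not part of the VPC CNI: the helper names `has_pod_eni_annotation` and `expected_eni_kind` are hypothetical, the sample pod objects are made up, and only the annotation key `vpc.amazonaws.com/pod-eni` comes from this thread.

```python
import json

# Annotation key named in this thread; everything else here is illustrative.
POD_ENI_ANNOTATION = "vpc.amazonaws.com/pod-eni"


def has_pod_eni_annotation(pod: dict) -> bool:
    """Return True if the pod object carries the pod-eni annotation."""
    annotations = (pod.get("metadata") or {}).get("annotations") or {}
    return POD_ENI_ANNOTATION in annotations


def expected_eni_kind(pod: dict) -> str:
    """Restate the rule from the thread: annotated pods sit behind the trunk ENI."""
    return "trunk" if has_pod_eni_annotation(pod) else "regular"


# Made-up pod objects, shaped like a pod serialized to JSON.
annotated = json.loads(
    '{"metadata": {"annotations": {"vpc.amazonaws.com/pod-eni": "[]"}}}'
)
plain = json.loads('{"metadata": {"name": "example"}}')

print(expected_eni_kind(annotated), expected_eni_kind(plain))
```

A pod that reports `regular` here but whose IP sits on the trunk ENI would be the mismatch this issue is about.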

sstarcher commented 6 months ago

I'll have to recreate it to check the annotation. I determined it was using the trunk ENI by taking the pod IP and searching the interfaces. The working pods were on a regular ENI, whereas the non-working pods were on the trunk ENI, and I noticed the security groups were wrong.

TRUNK - Does not work: aws-k8s-trunk-eni eni-02eb9c28496b9765d - 5 IPs
ENI - Does work: aws-K8S-i-03c758f943b4ca6d3 eni-03edf562fcac0125d - 14 IPs

^ those interfaces were both defined for the same node.
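
The manual lookup described here (take the pod IP, search the node's interfaces) might be sketched as follows. This assumes interface records shaped like EC2 network interface descriptions (`NetworkInterfaceId`, `Description`, `PrivateIpAddresses`); the helper names and the sample IDs and addresses are made up for illustration. It only inspects individually assigned addresses, so delegated prefixes are out of scope.

```python
from typing import Optional

# Description observed on the trunk ENI in this thread.
TRUNK_DESCRIPTION = "aws-k8s-trunk-eni"


def find_eni_for_ip(interfaces: list, pod_ip: str) -> Optional[dict]:
    """Return the first interface whose private IPs include pod_ip, else None."""
    for eni in interfaces:
        addresses = eni.get("PrivateIpAddresses", [])
        if any(a.get("PrivateIpAddress") == pod_ip for a in addresses):
            return eni
    return None


def is_trunk(eni: dict) -> bool:
    """Classify an interface as the trunk ENI by its description."""
    return eni.get("Description") == TRUNK_DESCRIPTION


# Made-up sample data using documentation-range addresses.
interfaces = [
    {
        "NetworkInterfaceId": "eni-trunk-example",
        "Description": "aws-k8s-trunk-eni",
        "PrivateIpAddresses": [{"PrivateIpAddress": "192.0.2.10"}],
    },
    {
        "NetworkInterfaceId": "eni-regular-example",
        "Description": "aws-K8S-i-example",
        "PrivateIpAddresses": [{"PrivateIpAddress": "192.0.2.20"}],
    },
]

owner = find_eni_for_ip(interfaces, "192.0.2.10")
print(owner["NetworkInterfaceId"], is_trunk(owner))
```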

stanvit commented 6 months ago

We are facing a similar problem: our nodes are launched in public subnets, with ENIConfigs defining private subnets and Security Groups for pods. As per AWS EKS addon config:

AWS_VPC_K8S_CNI_CUSTOM_NETWORK_CFG = "true"
ENI_CONFIG_LABEL_DEF               = "topology.kubernetes.io/zone"
ENABLE_PREFIX_DELEGATION           = "true"
ENABLE_POD_ENI                     = "true"
POD_SECURITY_GROUP_ENFORCING_MODE  = "strict"
AWS_VPC_K8S_CNI_EXTERNALSNAT       = "true"

With this configuration, ipamd v1.15.5 works as expected - all ENIs except for the first one are placed in the subnets and get SG assigned as specified by ENIConfig.

In v1.16.2, the second ENI described as aws-k8s-trunk-eni is always created in the same subnet as the node itself and gets the node's SG attached (i.e. ENIConfig is ignored), but the third ENI named aws-K8S-${instance_id} is configured as expected. Since we have AWS_VPC_K8S_CNI_EXTERNALSNAT=true, pods allocated to the second interface (the one not configured as ENIConfig prescribes) are not SNATed and effectively don't have Internet access.
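
One way to check whether an interface is "configured as ENIConfig prescribes" is to compare its subnet and security groups with the ENIConfig spec. This is a rough sketch under assumptions: the function name is hypothetical, the ENI record is assumed to look like an EC2 network interface description (`SubnetId`, `Groups[].GroupId`), the spec is assumed to carry `subnet` and `securityGroups`, and all IDs below are placeholders.

```python
def eni_matches_eniconfig(eni: dict, eniconfig_spec: dict) -> bool:
    """Return True if the ENI's subnet and security groups equal the ENIConfig's."""
    same_subnet = eni.get("SubnetId") == eniconfig_spec.get("subnet")
    eni_groups = {g["GroupId"] for g in eni.get("Groups", [])}
    wanted_groups = set(eniconfig_spec.get("securityGroups", []))
    return same_subnet and eni_groups == wanted_groups


# Placeholder IDs, not taken from any real cluster.
eniconfig_spec = {"subnet": "subnet-private", "securityGroups": ["sg-pods"]}

trunk_eni = {"SubnetId": "subnet-public", "Groups": [{"GroupId": "sg-node"}]}
regular_eni = {"SubnetId": "subnet-private", "Groups": [{"GroupId": "sg-pods"}]}

print(eni_matches_eniconfig(trunk_eni, eniconfig_spec))
print(eni_matches_eniconfig(regular_eni, eniconfig_spec))
```

In the v1.16.2 behaviour described above, the trunk ENI would fail this check while the third ENI would pass it.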

jdn5126 commented 6 months ago

@stanvit when Custom Networking is configured, the primary ENI is unused. When Security Groups for Pods is configured, the trunk ENI, aws-k8s-trunk-eni, will always have the same Security Group as the primary ENI, as this is required for trunking to be set up properly. Subsequent ENIs will use the ENIConfig CRD, so they will have the Security Group defined in the CRD.

As an aside, when Security Groups for Pods and Custom Networking are configured, ENIs are attached by the VPC Resource Controller.

Regarding AWS_VPC_K8S_CNI_EXTERNALSNAT=true, configuring this means that pod traffic external to the VPC does not get SNAT'ed on the node, as the expectation is that you have configured VPC routes to force it through an egress gateway or NAT.

Everything you described sounds to me like it is working correctly, so I am wondering if you did not mean to configure AWS_VPC_K8S_CNI_EXTERNALSNAT=true? With that set to false, pod traffic from ENIs attached by ENIConfig destined to the Internet would SNAT through the node's primary IP.
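
The effect of AWS_VPC_K8S_CNI_EXTERNALSNAT on a single destination can be modelled roughly as below. This is a simplified sketch, not the CNI's actual rule set: the function name and the VPC CIDR are assumptions, and it only captures the two statements made here (no node-level SNAT when the setting is true; SNAT for traffic leaving the VPC when it is false).

```python
import ipaddress


def node_snat_applies(dest_ip: str, vpc_cidrs: list, external_snat: bool) -> bool:
    """Return True if pod traffic to dest_ip would be SNATed on the node."""
    dest = ipaddress.ip_address(dest_ip)
    in_vpc = any(dest in ipaddress.ip_network(cidr) for cidr in vpc_cidrs)
    if in_vpc:
        # Traffic that stays inside the VPC keeps the pod IP in this model.
        return False
    # Outside the VPC: SNAT on the node only when external SNAT is NOT enabled.
    return not external_snat


vpc_cidrs = ["10.0.0.0/16"]  # assumed VPC range for illustration

print(node_snat_applies("198.51.100.7", vpc_cidrs, external_snat=True))
print(node_snat_applies("198.51.100.7", vpc_cidrs, external_snat=False))
```

With the setting at true, reaching the Internet therefore depends entirely on the routes of the subnet the pod's ENI lives in.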

sstarcher commented 6 months ago

I have also verified that 1.15.5 also works for us where 1.16.0 does not. I will be testing 1.16.2 soon.

jdn5126 commented 6 months ago

I have also verified that 1.15.5 also works for us where 1.16.0 does not. I will be testing 1.16.2 soon.

Hmm.. that is very strange. v1.16.0 did add IPv6 Security Groups for Pods support, but IPv4 should not have been affected. I see all tests passing without issue. Lmk what you find, and I think we will need controller logs, so we will definitely want to go through the support case.

stanvit commented 6 months ago

@jdn5126, thanks for your answer

I ran a few tests on our cluster where we have both Prefix Delegation and Pod ENI enabled, and the problem boils down to this:

The changed behaviour is problematic for us, as pods that are not using dedicated ENIs are assigned different security groups and subnets depending on the instance type they are launched on. We have been using this configuration for over a year now; the issue was introduced in v1.16.0.

AWS_VPC_K8S_CNI_EXTERNALSNAT=true is intentional as our pods are running in separate private subnets routed through NAT Instances/Gateways

I collected logs for my four test cases (v1.15.5/r7a.medium, v1.15.5/r7i.large, v1.16.2/r7a.medium, v1.16.2/r7i.large) with aws-cni-support.sh and can provide them if there's interest.

sstarcher commented 6 months ago

I have opened a support ticket and am waiting for it to be escalated.

jdn5126 commented 6 months ago

@stanvit Custom Networking + Security Groups for Pods cannot work properly on instances that support only 2 ENIs, so this makes sense, but the prefixes being assigned to the trunk ENI part should not happen. Did you terminate the nodes after enabling prefix delegation?

I spun up a cluster using v1.15.5, and I do not see the behavior you described, i.e. the trunk ENI has the same security group as the primary ENI, so we are missing something here. Still digging...

jdn5126 commented 6 months ago

@stanvit if you email the node logs to k8s-awscni-triage@amazon.com, we can take a look at them

stanvit commented 6 months ago

@jdn5126 thanks for the email, I just sent all setup details and logs

the prefixes being assigned to the trunk ENI part should not happen

Thinking about this, you're right, but our setup worked like that up until recently.

Did you terminate the nodes after enabling prefix delegation?

I never disabled it, but yes, I was draining the nodes and letting them be recreated after every vpc-cni version update.

I spun up a cluster using v1.15.5, and I do not see the behavior you described, i.e. the trunk ENI has the same security group as the primary ENI, so we are missing something here

I sent my logs, so hopefully it sheds some light on the issue.

While we're at it: if prefix delegation on trunk interfaces is problematic, is it possible to prevent the trunk interface from being created on the instances with only two ENIs and custom networking enabled, or have some other way to disable trunking on certain nodes by, say, setting vpc.amazonaws.com/has-trunk-attached: false upon node creation?

We would like to keep using Custom networking for pods, Prefix Delegation, have the ability to use Security groups for pods occasionally, and use smaller instances where possible for cost savings.

jdn5126 commented 6 months ago

@stanvit sorry for the delay, I will share my findings here:

r7a.medium - In v1.15.5 and v1.16.2, we are not properly skipping trunk ENIs when determining which ENIs we can allocate new IPs/prefixes to: https://github.com/aws/amazon-vpc-cni-k8s/blob/v1.15.5/pkg/ipamd/datastore/data_store.go#L1059 . Since we allocate prefixes on the trunk ENI and add them to the datastore, we start placing pods behind the trunk ENI, which seems like a bug. I think this is a general issue, but it manifests quickly when Security Groups for Pods and Custom Networking are configured and there are only two ENIs.

On the Security Group front, looking at https://github.com/aws/amazon-vpc-cni-k8s/blob/v1.15.5/pkg/awsutils/awsutils.go#L528, we do not touch the Security Groups of ENIs when Custom Networking is enabled. This makes sense to me, as we are relying on the ENIConfig to control the SGs for attached ENIs, and we are relying on the VPC Resource Controller to control the SG for the trunk ENI: https://github.com/aws/amazon-vpc-resource-controller-k8s/blob/master/pkg/provider/branch/trunk/trunk.go#L221. The VPC Resource Controller also looks for the ENIConfig, so the only thing that would make sense to me here is that you did not terminate the node after enabling Custom Networking, hence the race condition on what SG was used for the trunk ENI.

Can you try terminating the nodes and try validating the SG on the trunk ENI afterward?

I see a similar story for r7i.large, so that leads me to the following conclusions:

  1. We need to fix https://github.com/aws/amazon-vpc-cni-k8s/blob/master/pkg/ipamd/datastore/data_store.go#L977 to properly skip trunk interfaces.
  2. When Custom Networking and Security Groups for Pods are configured, an instance with 2 ENIs can only support pods matching security group policies. Since no "normal" (not primary ENI or trunk ENI) ENIs can be attached, the instance has no IPs available for normal pods.
  3. Whenever Custom Networking and/or Security Groups for Pods is configured, the instances need to be terminated, otherwise there is a race condition on what Security Group will be assigned to the trunk ENI.
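
Conclusions 1 and 2 can be illustrated with a small model. This is a sketch under stated assumptions, not the datastore code: the `Eni` class and both function names are hypothetical, and the EFA exclusion is taken only from the title of PR #2801 mentioned in this thread.

```python
from dataclasses import dataclass


@dataclass
class Eni:
    """Hypothetical, simplified view of an attached ENI."""
    eni_id: str
    is_primary: bool = False
    is_trunk: bool = False
    is_efa: bool = False


def enis_eligible_for_allocation(enis: list, custom_networking: bool) -> list:
    """Return ENIs that may receive new IPs/prefixes under the rules in this thread."""
    eligible = []
    for eni in enis:
        if eni.is_trunk or eni.is_efa:
            continue  # conclusion 1: never allocate on trunk (or EFA) ENIs
        if custom_networking and eni.is_primary:
            continue  # the primary ENI is unused with Custom Networking
        eligible.append(eni)
    return eligible


def normal_eni_slots(max_enis: int, custom_networking: bool, sg_for_pods: bool) -> int:
    """Count ENI slots left for normal pods (conclusion 2)."""
    slots = max_enis
    if custom_networking:
        slots -= 1  # primary ENI is not used for pods
    if sg_for_pods:
        slots -= 1  # one slot is taken by the trunk ENI
    return max(slots, 0)


attached = [Eni("eni-primary", is_primary=True), Eni("eni-trunk", is_trunk=True)]
print(enis_eligible_for_allocation(attached, custom_networking=True))
print(normal_eni_slots(2, custom_networking=True, sg_for_pods=True))
print(normal_eni_slots(3, custom_networking=True, sg_for_pods=True))
```

On a two-ENI instance with both features enabled the model leaves no slot for normal pods, which matches conclusion 2.
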

jdn5126 commented 6 months ago

Internally, I am working with the EKS Networking team to determine what to do about number 1

jdn5126 commented 6 months ago

@stanvit sorry, I forgot to address the other threads:

I never disabled it, but yes, I was draining the nodes and letting them be recreated after every vpc-cni version update

Terminating is definitely a requirement, as draining will not detach the trunk ENI.

is it possible to prevent the trunk interface from being created on the instances with only two ENIs and custom networking enabled, or have some other way to disable trunking on certain nodes by, say, setting vpc.amazonaws.com/has-trunk-attached: false upon node creation?

It is possible, but we would need to track this as a new feature request. The request would get more visibility if added at https://github.com/aws/containers-roadmap/issues.

stanvit commented 6 months ago

@jdn5126, thanks for your answers

The VPC Resource also looks for the ENIConfig, so the only thing that would make sense to me here is that you did not terminate the node after enabling Custom Networking, hence the race condition on what SG was used for the trunk ENI.

I never disabled Custom Networking, if by enabling/disabling you mean changing the AWS_VPC_K8S_CNI_CUSTOM_NETWORK_CFG value and restarting the aws-node DaemonSet. It was always enabled during the testing.

Can you try terminating the nodes and try validating the SG on the trunk ENI afterward?

Done multiple times, and the behaviour is reproducible: with v1.15.5, the trunk interface is configured according to the Custom Networking settings, both SGs and subnets. I'll collect more data and send it tomorrow (aws describe instances output etc).

I think I might have just gotten onto something, though: I tried to label new nodes with vpc.amazonaws.com/has-trunk-attached: "true", hoping to trick the VPC Resource Controller into not attaching a trunk ENI. That didn't help, but instead made v1.15.5 behave like v1.16.2: the trunk interface still got created, though this time it ignored Custom Networking.

jdn5126 commented 6 months ago

@stanvit ok, I think I finally have more of the story. So first, the desired behavior:

This happens in v1.15.5, but in v1.16.2, the Security Group for the trunk ENI may be the same as the one for the primary ENI.

What changed?

This seems like a general reconciliation issue with VPC Resource Controller, so I am engaging that team now.

For the other problem, where prefixes were assigned to the trunk ENI, https://github.com/aws/amazon-vpc-cni-k8s/pull/2801 should fix that. I spun up a cluster with Custom Networking and Security Groups for Pods and validated it.

stanvit commented 6 months ago

@jdn5126, thanks for the update

jdn5126 commented 6 months ago

@jdn5126, thanks for the update

  • Do you need any more experiments/details from my side?
  • So, as I understand it, after #2801 (Do not allocate IPs or prefixes to trunk ENIs or EFA ENIs) is merged, the prefixes won't be delegated to the trunk interfaces, and instances that are limited to 2 ENIs won't be able to use Custom Networking and Security Groups for Pods at the same time, preventing "normal" pods from launching?

I spoke to the VPC Resource Controller team and finally have the full story here. To fix the regression between v1.15.5 and v1.16.x, I am going to revert the order in which IPAMD enables features (Custom Networking before Security Groups for Pods). In parallel, the VPC Resource Controller team is going to explore options for reconciling and updating the Security Group for the trunk ENI, as today the trunk ENI Security Group cannot be changed after creation. So if you change the ENIConfig, the trunk Security Group will not be updated on existing nodes.

For the second part, instances with only 2 ENIs, I am discussing with our Project Manager whether we can mark these instances as invalid for trunk ENIs, so that only "normal" pods will be scheduled on them. If approved, we would treat this as an enhancement to an existing feature.

The VPC CNI fix will go in soon, and will target v1.16.4, which is scheduled to release in early to mid March. We do not need any more details from your end, as we are able to reproduce and understand the issue now. Thank you so much for your patience and help!

jdn5126 commented 6 months ago

https://github.com/aws/amazon-vpc-cni-k8s/pull/2801 contains fixes for two of the issues mentioned here:

I filed https://github.com/aws/amazon-vpc-resource-controller-k8s/issues/373 to cover updating the trunk ENI Security Group when the ENIConfig object changes.

For instances that can support only two ENIs, we are still determining whether it is ok to mark these instances as not eligible for Security Groups for Pods when Custom Networking is enabled.

jdn5126 commented 6 months ago

Closing this issue as the fix has merged and will ship in v1.16.4 early next week. https://github.com/aws/amazon-vpc-cni-k8s/pull/2818 also provides integration test coverage to prevent regressions.
