aws / amazon-vpc-resource-controller-k8s

Controller for managing Trunk & Branch Network Interfaces on EKS Cluster using Security Group For Pod feature and IPv4 Addresses for Windows Node.
Apache License 2.0

VPC Resource Controller deleting branch ENIs erroneously #412

Closed GnatorX closed 1 week ago

GnatorX commented 4 months ago

Describe the Bug:

Observed Behavior:

At 8:53 PM 
 eni: eni-<id>
   level: info
   logger: branch eni provider
   msg: pushing eni to delete queue as no pod owns it
   node name: Node A
   timestamp: 2024-04-17T20:53:31.278Z

And then at 9:45 PM

{
   eni: {
     eniId: eni-<id>
     ifAddress: <hidden>
     ipv6Addr:
     privateIp:  <hidden>
     subnetCidr:  <hidden>
     subnetV6Cidr:
     vlanId: 1
   }
   error: eni allocated to pod not found in ec2
   level: error
   logger: branch eni provider
   msg: eni not found
   node name: Node A
}

During this time our control plane had higher than normal load, causing slow responses.

Diving deep into the code, what happened is that InitTrunk ran, where it called GetBranchNetworkInterface and then compared the result to all the pods it found on the node:

https://github.com/aws/amazon-vpc-resource-controller-k8s/blob/master/pkg/provider/branch/trunk/trunk.go#L271

If it finds any associated branch ENIs without an owning pod, it deletes them.

The pod lookup is backed by a cached client: https://github.com/aws/amazon-vpc-resource-controller-k8s/blob/master/main.go#L292
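Roughly, the cleanup pattern is the following (a simplified sketch, not the actual trunk.go code; the BranchENI/PodInfo types and the deleteQueue are illustrative):

// Simplified sketch of the InitTrunk cleanup described above; types and names
// are illustrative, not the actual trunk.go implementation.
type BranchENI struct {
	ENIID  string
	PodUID string // UID of the pod the branch ENI was created for
}

type PodInfo struct {
	UID string
}

// queueOrphanedENIs queues every branch ENI whose owning pod is not present in
// the pod list obtained from the cached client.
func queueOrphanedENIs(enis []BranchENI, cachedPods []PodInfo, deleteQueue chan<- BranchENI) {
	known := make(map[string]struct{}, len(cachedPods))
	for _, p := range cachedPods {
		known[p.UID] = struct{}{}
	}
	for _, eni := range enis {
		if _, ok := known[eni.PodUID]; !ok {
			// If the pod cache is only partially hydrated, a live pod can be
			// missing from `known`, so its ENI is wrongly queued for deletion.
			deleteQueue <- eni
		}
	}
}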

The issue here is that during cache hydration (paginated list calls), the API server returned a 410. This happens when the list calls take longer than the etcd compaction interval.

{
   URI: /api/v1/pods?continue=eyJ2IjoibWV0YS5rOHMuaW8vdjEiLCJydiI6MzQyMjE0MDQxMDQsInN0YXJ0IjoidmVyaWZpY2F0aW9uYm94L3ZlcmlmaWNhdGlvbi1zcnYtYmx1ZS01ODU3NGNiNWZjLThzdzZ0XHUwMDAwIn0&limit=100
   caller: httplog/httplog.go:129
   latency: 2.863053ms
   msg: HTTP
   resp: 410
   userAgent: kube-vpc-resource-controller
   v: 0
   verb: LIST
}
{
   attempt: 0
   caller: v3/retry_interceptor.go:62
   error: rpc error: code = OutOfRange desc = etcdserver: mvcc: required revision has been compacted
   level: warn
   logger: etcd-client
   msg: retrying of unary invoker failed
   target: etcd-endpoints://0xc001d19340/#initially=[https://etcdbox.northwest.prod.stripe.io:2403]

}

This stops the list call, but controller-runtime moves forward thinking the cache is fully hydrated. This means VPC RC had a partial cache of pods, which caused VPC RC to delete branch ENIs for pods it didn't find.

Expected Behavior: This poses 2 major issues.

  1. VPC RC should check whether the pod exists via the API server before deleting.
  2. The pod's branch ENI is deleted without any recovery path https://github.com/aws/amazon-vpc-resource-controller-k8s/blob/8fe171a008095145d463a91513ac277d9b639b4a/pkg/provider/branch/trunk/trunk.go#L257

Lastly, if pods are found in this state, why isn't VPC RC attempting to re-associate a new branch ENI instead of leaving the pod running in a broken state?

How to reproduce it (as minimally and precisely as possible): This is likely difficult to reproduce and requires a lot of pressure on the API server to show. We were load testing scaling from 0 to around 4k nodes with VPC CNI and VPC RC, running on a 5-node control plane of m5.8xlarge instances where we saw 100% CPU and memory, on k8s 1.23.

Additional Context:

Environment:

sushrk commented 4 months ago

Hey @GnatorX, can you share the cluster ARN to ravsushm@amazon.com? Also, if you have the pod name or ENI ID of the branch ENI that was deleted, please share it so I can look into the logs and investigate this further.

GnatorX commented 4 months ago

This isn't on EKS so we don't really have an ARN. However, looking at the code, I no longer know for sure why the pod wasn't found, though I do believe the API server being slow made this issue show up. I am not familiar enough with how the DeltaFIFO and the optimized ListWatch work. I see that the optimized list watch does pagination but doesn't explicitly perform the pagination in the list func definition https://github.com/aws/amazon-vpc-resource-controller-k8s/blob/master/controllers/custom/custom_controller.go#L184. Does the informer or the client automatically do this?
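My current understanding is that the reflector/pager layer in client-go drives the page loop: it calls the ListFunc once per page and carries the Continue token forward. A rough sketch of that shape (illustrative only, not the controller's actual OptimizedListWatcher; clientSet is an assumed kubernetes.Interface):

import (
	"context"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/apimachinery/pkg/runtime"
	"k8s.io/apimachinery/pkg/watch"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/tools/cache"
)

// buildPodListWatch forces a page size in the ListFunc; the reflector/pager in
// client-go calls ListFunc once per page, carrying opts.Continue forward, so
// the pagination loop itself lives in client-go, not here.
func buildPodListWatch(clientSet kubernetes.Interface) *cache.ListWatch {
	return &cache.ListWatch{
		ListFunc: func(opts metav1.ListOptions) (runtime.Object, error) {
			opts.Limit = 100 // page size; Continue is supplied by the pager
			return clientSet.CoreV1().Pods(metav1.NamespaceAll).List(context.TODO(), opts)
		},
		WatchFunc: func(opts metav1.ListOptions) (watch.Interface, error) {
			return clientSet.CoreV1().Pods(metav1.NamespaceAll).Watch(context.TODO(), opts)
		},
	}
}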

I still believe there was an inconsistency in the cache when this occurred and that self-healing should be done, but I don't know the full mechanism of why.

sushrk commented 4 months ago

Oh right, are you able to share the controller logs?

garvinp-stripe commented 4 months ago

I created a support ticket with the information

sushrk commented 4 months ago

This shows that it was started from the pod controller and not the node controller.

The nodes are added to be managed by the Node manager, called from node reconciler, see comment

I tried to reproduce the issue with ~9k pods, and can confirm that node reconciliation does not happen until the pod datastore has completed syncing:

{"level":"info","timestamp":"2024-04-30T07:47:51.403Z","logger":"controllers.Node","msg":"waiting for pod datastore to sync"}
{"level":"info","timestamp":"2024-04-30T07:47:51.926Z","logger":"controllers.Pod Reconciler.custom pod controller","msg":"cache has synced successfully"}
{"level":"info","timestamp":"2024-04-30T07:47:51.926Z","logger":"controllers.Pod Reconciler.custom pod controller","msg":"Starting Workers","worker count":10}
{"level":"info","timestamp":"2024-04-30T07:47:52.395Z","logger":"controllers.Node","msg":"adding node","node":{"name":"ip-192-168-201-249.us-west-2.compute.internal"}}

I see there were 4k nodes on this cluster. What is the scale of pods using security groups for pods on the cluster? Are all nodes enabled for this feature? Could you share the full controller logs, as it is not clear from the provided partial info whether the node reconciliation was triggered before the pod datastore was synced?

GnatorX commented 4 months ago

I believe you are correct; looking at the logs, I believe the pod datastore is synced prior to the node controller starting. I do believe something is wrong but I can't tell what. From what I can tell, VPC RC did not find pods on nodes that had existed for a while. It is worth noting that during the incident we were scaling up from 1.5k nodes to 3k nodes; our k8s API server is under-provisioned, so it saw high CPU and memory, and some VPC RCs were restarted.

Let me see if I can get the full log during that time.

GnatorX commented 4 months ago

However, even without fully finding the root cause, I am wondering if it makes sense for VPC RC to perform destructive actions based purely on the cached state of the world. Specifically, it's likely fine to identify candidates for a certain action via the cache, but VPC RC should verify directly with the API server prior to taking the action.

In addition, when VPC RC finds a pod whose branch ENI is now missing, shouldn't VPC RC re-attempt attaching a branch ENI? When a pod is alive without a branch ENI in this way, it still thinks it is running fine. I believe kubelet thinks it is healthy because the local health check still works, but all incoming calls to this pod are now broken because the IP the pod has is now used elsewhere or not used at all.

sushrk commented 4 months ago

We have previously considered querying the API server directly to list pods and bypass the cache. However, at large scale, list calls are quite expensive on the API server, etcd, and the client. It would likely cause API throttling with API Priority and Fairness.

when VPC RC finds a pod whose branch ENI is now missing, shouldn't VPC RC re-attempt attaching a branch ENI

Was there a controller restart during the scale operations? The local cache will be rebuilt on controller start by comparing the list of running pods on node and branch ENIs from the EC2 API.

pushing eni to delete queue as no pod owns it -- this suggests that at the time the pod was not in a running/pending state, so the controller pushed the ENI to the delete queue and the branch ENI was deleted after the cooldown period.

eni allocated to pod not found in ec2 -- at this point, the branch ENI has already been deleted and does not exist in EC2.
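For clarity, that flow is essentially a cooldown-gated delete queue, along these lines (an illustrative sketch only; the real implementation lives in the branch ENI provider and these names are made up):

import "time"

// Illustrative cooldown-gated delete queue; not the provider's actual code.
type pendingDelete struct {
	eniID    string
	queuedAt time.Time
}

// processDeleteQueue deletes entries whose cooldown has elapsed and returns
// the entries that still need to wait (or that failed and should be retried).
func processDeleteQueue(queue []pendingDelete, cooldown time.Duration, deleteENI func(string) error) []pendingDelete {
	var remaining []pendingDelete
	for _, item := range queue {
		if time.Since(item.queuedAt) < cooldown {
			remaining = append(remaining, item) // still cooling down
			continue
		}
		if err := deleteENI(item.eniID); err != nil {
			remaining = append(remaining, item) // keep it for a retry
		}
	}
	return remaining
}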

Is this the same ENI? In the logs shared, I do not see this log for any of the ENIs. Can you share the controller logs from your deployment at the time of the issue? Also, what is the controller version deployed in this cluster?

GnatorX commented 4 months ago

We have previously considered querying the API server directly to list pods and bypass the cache. However, at large scale, list calls are quite expensive on the API server, etcd, and the client. It would likely cause API throttling with API Priority and Fairness.

Not a list call; during branch ENI deletion, rather than relying on the cache, VPC RC could make a GetPod call against the API server to verify the presence of the pod.
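Concretely, something like this check before deletion (a sketch of the suggestion, assuming a non-cached reader such as controller-runtime's mgr.GetAPIReader(); this is not the controller's current code path):

import (
	"context"

	corev1 "k8s.io/api/core/v1"
	apierrors "k8s.io/apimachinery/pkg/api/errors"
	"k8s.io/apimachinery/pkg/types"
	"sigs.k8s.io/controller-runtime/pkg/client"
)

// podExists double-checks with the API server (not the cache) whether the pod
// that owns a branch ENI is really gone before the ENI is deleted.
func podExists(ctx context.Context, apiReader client.Reader, namespace, name string) (bool, error) {
	var pod corev1.Pod
	err := apiReader.Get(ctx, types.NamespacedName{Namespace: namespace, Name: name}, &pod)
	if apierrors.IsNotFound(err) {
		return false, nil // pod is truly gone; deleting the branch ENI is safe
	}
	if err != nil {
		return false, err // API error: do not treat it as "pod missing"
	}
	return true, nil
}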

Was there a controller restart during the scale operations? The local cache will be rebuilt on controller start by comparing the list of running pods on node and branch ENIs from the EC2 API.

Yes

I also updated the ticket with more logs. I was running version 1.4.6 of the controller.

GnatorX commented 4 months ago

I noticed this: https://github.com/aws/amazon-vpc-resource-controller-k8s/blob/7acb50c6d8bb1ed11d151a38ed361e1128743f34/pkg/k8s/pod/converter.go#L59, which wasn't there in v1.4.6: https://github.com/aws/amazon-vpc-resource-controller-k8s/blob/v1.4.6/pkg/k8s/pod/converter.go#L59

Is this the issue? Are we skipping pods because we keep taking the address of the same loop variable, whose value changes every iteration? That would mean we could be saving only the last pod's information for every "page" of the pod list https://github.com/aws/amazon-vpc-resource-controller-k8s/blob/master/controllers/custom/custom_controller.go#L201

sushrk commented 4 months ago

That's a good point @GnatorX, can you upgrade to v1.4.7 where the issue is fixed? I understand this is difficult to reproduce, but using the latest version is recommended.

I'll need to dig into the code a bit to understand the consequence of this during high API server load, will post an update this week.

GnatorX commented 4 months ago

Updated the AWS ticket with what we believe is the issue

haouc commented 4 months ago

Is this the issue? Are we skipping pods because we keep taking the address of the same loop variable, whose value changes every iteration?

This shouldn't be an issue even with Go < 1.22. For older versions of Go, the for loop reuses the loop variable (its address doesn't change), but that should only impact code that tracks pointers to it. In this particular code snippet, a new variable is created and appended to the slice, which makes the caveat harmless.
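A standalone illustration of the caveat and of why copying into a new variable makes it harmless (this is not the converter code):

package main

import "fmt"

type podInfo struct{ name string }

func main() {
	pods := []podInfo{{"a"}, {"b"}, {"c"}}

	var aliased []*podInfo
	for _, pod := range pods {
		aliased = append(aliased, &pod) // pre-Go 1.22: &pod is the same address every iteration
	}

	var copied []*podInfo
	for _, pod := range pods {
		p := pod // new variable per iteration, as in the converter code
		copied = append(copied, &p)
	}

	// With Go < 1.22 this prints "c c c" then "a b c"; with Go 1.22+ both print "a b c".
	for _, p := range aliased {
		fmt.Print(p.name, " ")
	}
	fmt.Println()
	for _, p := range copied {
		fmt.Print(p.name, " ")
	}
	fmt.Println()
}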

haouc commented 4 months ago

I don't think initTrunk gets called before the node controller is activated. The controller housekeeps branch interfaces and pods during node events, not pod events.

GnatorX commented 4 months ago

Sorry, I updated the issue with the root cause we found. The theory that the node controller started before the cache sync was incorrect.

GnatorX commented 4 months ago

You can test this by forcing the page limit to 1 and attempting to start VPC RC against a large cluster. If you haven't changed the etcd compaction interval (the default is 5 minutes), you might hit it. You could also test by changing the compaction interval to 1 minute or something smaller.
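For reference, the compaction interval in question is driven by the kube-apiserver --etcd-compaction-interval flag (default 5m0s), so on a test control plane you could lower it, for example:

   kube-apiserver ... --etcd-compaction-interval=1m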

GnatorX commented 4 months ago

The issue is caused by the optimized list watcher in the custom controller. First, let's establish that the lowest-level list call is going to use the listFunc.

Then I am going to skip most of the controller-runtime code because it's hard to go through all that detail, but eventually we call into the reflector's Run on startup, which runs ListAndWatch, which falls back to list.

client-go then wraps the listFunc in a SimplePageFunc.

Then it starts listing with ListWithAlloc.

That calls into the pager's List.

Now let's say we hit a 410 (ResourceExpired): client-go falls back to a full list against the API server at a resource version.

This is where the bug occurs: PageFn is the listFunc, which overrides the limit, BUT because this is a fallback, the pager immediately returns the paginated result https://github.com/kubernetes/client-go/blob/master/tools/pager/pager.go#L124. This results in partial results in the cache.
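A condensed paraphrase of that interaction (not the exact client-go or controller code; pageFn stands in for the wrapped listFunc):

import (
	"context"

	corev1 "k8s.io/api/core/v1"
	apierrors "k8s.io/apimachinery/pkg/api/errors"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

// listAllPods mimics the pager loop: fetch pages until Continue is empty, and
// on a 410 mid-pagination fall back to ONE call with Limit=0 that is expected
// to return the full list. If pageFn (the custom listFunc) unconditionally
// resets Limit to its page size, that fallback returns a single page, and the
// caller accepts it as the complete pod list -- the partial cache.
func listAllPods(ctx context.Context, pageFn func(context.Context, metav1.ListOptions) (*corev1.PodList, error)) (*corev1.PodList, error) {
	opts := metav1.ListOptions{Limit: 100}
	all := &corev1.PodList{}
	for {
		page, err := pageFn(ctx, opts)
		if err != nil {
			if !apierrors.IsResourceExpired(err) || opts.Continue == "" {
				return nil, err
			}
			// 410 fallback: request everything in one shot...
			opts.Limit = 0
			opts.Continue = ""
			// ...but a pageFn that forces its own limit returns only the first
			// page here, and there is no further error to signal truncation.
			return pageFn(ctx, opts)
		}
		all.Items = append(all.Items, page.Items...)
		if page.Continue == "" {
			return all, nil
		}
		opts.Continue = page.Continue
	}
}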

This matches what we saw:


5/7/24 9:55:15.873 PM | {
   URI: /api/v1/pods?limit=100
   audit-ID: 
   caller: httplog/httplog.go:129
   latency: 187.149689ms
   msg: HTTP
   resp: 200
   srcIP: 10.68.51.85:42784
   ts: 1715118915873.3264
   userAgent: kube-vpc-resource-controller
   v: 0
   verb: LIST
}

5/7/24 9:55:15.685 PM | {
   URI: /api/v1/pods?continue={continuetoken}&limit=100
   audit-ID: 
   caller: httplog/httplog.go:129
   latency: 2.863053ms
   msg: HTTP
   resp: 410
   srcIP: 10.68.51.85:42784
   ts: 1715118915685.3071
   userAgent: kube-vpc-resource-controller
   v: 0
   verb: LIST
}
sushrk commented 4 months ago

Thanks, I'm looking into this, will reproduce with the steps you shared above.

IIUC, is this the issue: that we should not return the partial list if the error is not nil, specifically when we get a 410?

GnatorX commented 3 months ago

The issue is actually https://github.com/aws/amazon-vpc-resource-controller-k8s/blob/f61320b5401f0fe0c02c25e5bd3ad0066ef8394f/controllers/custom/custom_controller.go#L191

What happens during a 410 is that client-go has a fallback that performs a full list call against the API server, but because of the optimized list watcher, the listFunc overrides the limit, causing the fallback to only list up to the limit you set.

A 410 occurs when etcd compaction removes the resource version that you started the list call from: https://kubernetes.io/docs/reference/using-api/api-concepts/#410-gone-responses

GnatorX commented 3 months ago

Let me know if it would be better if we jump on a call to go over this

sushrk commented 3 months ago

Sorry for the late response, I haven't had a chance to take a look at this yet. Did you encounter the issue again? I plan to reproduce the issue over the weekend and will post an update ASAP.

GnatorX commented 3 months ago

Yeah, we had it reoccur. One way to test is to change the optimized list watcher to sleep for 5+ minutes between pages (assuming your etcd compaction interval is the default). This should trigger the 410 from the API server. Then you can inspect your cache by listing all pods in the cache to see what is missing.
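Roughly, the test tweak inside the list func would look like this (illustrative; it assumes a plain client-go clientSet rather than the controller's actual wiring):

import (
	"context"
	"time"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/apimachinery/pkg/runtime"
	"k8s.io/client-go/kubernetes"
)

// sleepyPodListFunc pauses between pages so the continue token's resource
// version is compacted away and the next page request returns 410 Gone.
func sleepyPodListFunc(clientSet kubernetes.Interface) func(metav1.ListOptions) (runtime.Object, error) {
	return func(opts metav1.ListOptions) (runtime.Object, error) {
		opts.Limit = 1 // tiny pages to force many LIST round trips
		if opts.Continue != "" {
			time.Sleep(6 * time.Minute) // longer than the default 5m compaction interval
		}
		return clientSet.CoreV1().Pods(metav1.NamespaceAll).List(context.TODO(), opts)
	}
}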

sushrk commented 3 months ago

Hi @GnatorX, sorry for the delay in getting back on this. I tried to reproduce this by adding a 6-minute sleep in the ListFunc here. Initially I tried in a test cluster with only a few pods and wasn't able to reproduce. I scaled the workload up to 100 pods and still don't see the issue. Via EC2, I can see there are still 100 branch ENIs in the account belonging to the cluster:

kubectl get pods -A | grep sgpp | wc -l
     100

[Screenshot 2024-06-06 at 12:54:50: EC2 console showing the branch ENIs]

Did you also have to update the page size to reproduce this? What is the pod/node scale on your cluster? Is it always reproducible? Can you also share the 410 error you are observing?

GnatorX commented 3 months ago

If this is a test cluster, it might be hard to reproduce. One thing you might have to do is make writes happen on etcd. I would add the sleep and, on the side, create new pods on the cluster so the resource version gets increased, since etcd compaction compacts all versions prior to the latest. I don't think changing the page size here makes a difference.

yash97 commented 1 month ago

@GnatorX, thank you for your valuable insights in reproducing this bug. I've successfully replicated the issue and submitted a PR to address it. Given your deep understanding of the problem, I'd appreciate your review of the PR.

GnatorX commented 1 month ago

For sure let me take a look

mwos-sl commented 1 month ago

Most probably we're hitting the same issue for pods using the Security Groups for Pods feature. We see plenty of running pods without an ENI. I mean the annotation in the pod manifest points to some ENI ID, but when checking in the AWS Console, such an ENI does not exist. Currently it affects 2% of our pods.

haouc commented 1 month ago

@mwos-sl thanks for reporting a similar case. We are working on the PR to close on this issue.

GnatorX commented 1 week ago

We can close this issue against https://github.com/aws/amazon-vpc-resource-controller-k8s/pull/452